From d76c26421d421189ee7aad3a4e750665faa2768f Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Thu, 11 Jul 2024 16:29:31 +0800
Subject: [PATCH 01/32] propose new build system; 2 examples

---
 Makefile                                      | 1940 +----------------
 {tests => envs/common}/inc/hal.h              |    0
 {tests => envs/common}/inc/misc.h             |    0
 {tests => envs/common}/inc/poly.h             |    0
 .../src/test_common => common/src}/misc.c     |    0
 .../src/test_common => common/src}/poly.c     |    0
 envs/m55-an547/Makefile                       |   66 +-
 envs/m55-an547/build/test_common/dummy        |    0
 envs/m55-an547/build/test_src/auto/dummy      |    0
 envs/m55-an547/build/test_src/external/dummy  |    0
 envs/m55-an547/build/test_src/manual/dummy    |    0
 envs/m55-an547/build/test_src/mve_test.d      |    1 -
 envs/m55-an547/inc/test_inc                   |    1 -
 envs/m55-an547/src/test_src                   |    1 -
 envs/m55-an547/test.bin                       |  Bin 37296 -> 0 bytes
 envs/m55-an547/test.elf                       |  Bin 218076 -> 0 bytes
 envs/m55-an547/test_loaded_sqmag              |    0
 envs/m85-an555/Makefile                       |   63 +-
 envs/m85-an555/build/test_common/dummy        |    0
 envs/m85-an555/build/test_src/auto/dummy      |    0
 envs/m85-an555/build/test_src/external/dummy  |    0
 envs/m85-an555/build/test_src/manual/dummy    |    0
 envs/m85-an555/build/test_src/mve_test.d      |    1 -
 envs/m85-an555/inc/test_inc                   |    1 -
 .../m85-an555/src/platform/gcc_arm_sse_310.ld |   12 +-
 envs/m85-an555/src/platform/mps3-an555.mk     |   55 -
 envs/m85-an555/src/test_common/misc.c         |  137 --
 envs/m85-an555/src/test_common/poly.c         |  264 ---
 envs/m85-an555/src/test_src                   |    1 -
 envs/m85-an555/test.bin                       |  Bin 36324 -> 0 bytes
 envs/m85-an555/test.elf                       |  Bin 159928 -> 0 bytes
 envs/m85-an555/test_loaded_sqmag              |    0
 tests/common/misc.c                           |  137 --
 tests/common/poly.c                           |  264 ---
 .../manual/ntt_dilithium_123_456_78.s         |  322 ---
 .../ntt_dilithium_123_456_78_opt_size_m55.s   |  717 ------
 .../ntt_dilithium_123_456_78_opt_size_m85.s   |  718 ------
 .../ntt_dilithium_123_456_78_twiddles.s       |  537 -----
 .../manual/ntt_dilithium_12_34_56_78.s        |  210 --
 .../ntt_dilithium_12_34_56_78_no_trans_vld4.s |  209 --
 ...ithium_12_34_56_78_no_trans_vld4_opt_m55.s |  668 ------
 ...ithium_12_34_56_78_no_trans_vld4_opt_m85.s |  600 -----
 .../ntt_dilithium_12_34_56_78_opt_m55.s       |  670 ------
 .../ntt_dilithium_12_34_56_78_opt_m85.s       |  668 ------
 .../ntt_dilithium_12_34_56_78_twiddles.s      |  538 -----
 tests/ntt_dilithium/ntt_dilithium.mk          |   17 +
 tests/ntt_kyber/manual/ntt_kyber_12_345_67.s  |  258 ---
 .../manual/ntt_kyber_12_345_67_opt_size_m55.s |  642 ------
 .../manual/ntt_kyber_12_345_67_opt_size_m85.s |  642 ------
 .../manual/ntt_kyber_12_345_67_twiddles.s     |  474 ----
 .../manual/ntt_kyber_1_23_45_67_no_trans.s    |  217 --
 .../ntt_kyber_1_23_45_67_no_trans_opt_m55.s   |  620 ------
 .../ntt_kyber_1_23_45_67_no_trans_opt_m85.s   |  618 ------
 .../ntt_kyber_1_23_45_67_no_trans_vld4.s      |  217 --
 ...t_kyber_1_23_45_67_no_trans_vld4_opt_m55.s |  622 ------
 ...t_kyber_1_23_45_67_no_trans_vld4_opt_m85.s |  545 -----
 .../manual/ntt_kyber_1_23_45_67_twiddles.s    |  473 ----
 tests/ntt_kyber/ntt_kyber.mk                  |   17 +
 58 files changed, 132 insertions(+), 14031 deletions(-)
 rename {tests => envs/common}/inc/hal.h (100%)
 rename {tests => envs/common}/inc/misc.h (100%)
 rename {tests => envs/common}/inc/poly.h (100%)
 rename envs/{m55-an547/src/test_common => common/src}/misc.c (100%)
 rename envs/{m55-an547/src/test_common => common/src}/poly.c (100%)
 delete mode 100644 envs/m55-an547/build/test_common/dummy
 delete mode 100644 envs/m55-an547/build/test_src/auto/dummy
 delete mode 100644 envs/m55-an547/build/test_src/external/dummy
 delete mode 100644 envs/m55-an547/build/test_src/manual/dummy
 delete mode 100644 envs/m55-an547/build/test_src/mve_test.d
 delete mode 120000 envs/m55-an547/inc/test_inc
 delete mode 120000 envs/m55-an547/src/test_src
 delete mode 100755 envs/m55-an547/test.bin
 delete mode 100755 envs/m55-an547/test.elf
 delete mode 100644 envs/m55-an547/test_loaded_sqmag
 delete mode 100644 envs/m85-an555/build/test_common/dummy
 delete mode 100644 envs/m85-an555/build/test_src/auto/dummy
 delete mode 100644 envs/m85-an555/build/test_src/external/dummy
 delete mode 100644 envs/m85-an555/build/test_src/manual/dummy
 delete mode 100644 envs/m85-an555/build/test_src/mve_test.d
 delete mode 120000 envs/m85-an555/inc/test_inc
 delete mode 100644 envs/m85-an555/src/platform/mps3-an555.mk
 delete mode 100644 envs/m85-an555/src/test_common/misc.c
 delete mode 100644 envs/m85-an555/src/test_common/poly.c
 delete mode 120000 envs/m85-an555/src/test_src
 delete mode 100755 envs/m85-an555/test.bin
 delete mode 100755 envs/m85-an555/test.elf
 delete mode 100644 envs/m85-an555/test_loaded_sqmag
 delete mode 100644 tests/common/misc.c
 delete mode 100644 tests/common/poly.c
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_123_456_78.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m55.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m85.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_twiddles.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m55.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m85.s
 delete mode 100644 tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_twiddles.s
 create mode 100644 tests/ntt_dilithium/ntt_dilithium.mk
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_12_345_67.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m55.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m85.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_12_345_67_twiddles.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m55.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m85.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85.s
 delete mode 100644 tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_twiddles.s
 create mode 100644 tests/ntt_kyber/ntt_kyber.mk

diff --git a/Makefile b/Makefile
index 3045aa7..aa532c8 100644
--- a/Makefile
+++ b/Makefile
@@ -1,1926 +1,36 @@
+# Tests
+include tests/ntt_dilithium/ntt_dilithium.mk
+include tests/ntt_kyber/ntt_kyber.mk
-CODEGEN_DIR=asm
+totestname = $(shell echo $(1) | tr '[a-z]' '[A-Z]')
+totestsources = $(addprefix ../../tests/$(1)/,$($(call totestname,$(1))_SOURCES))
+totestasm = $(addprefix ../../tests/$(1)/,$($(call totestname,$(1))_ASMS))
+toplatform = $(addsuffix --$(1),$($(call totestname,$(1))_PLATFORMS))
-AUTOGEN_SRCS_DIR=$(CODEGEN_DIR)/auto
-AUTOGEN_SRCS_POLY_TOOM4_DIR=$(AUTOGEN_SRCS_DIR)/poly/toom4
-AUTOGEN_SRCS_POLY_TOOM3_DIR=$(AUTOGEN_SRCS_DIR)/poly/toom3
-AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR=$(AUTOGEN_SRCS_DIR)/poly/schoolbook
-AUTOGEN_SRCS_POLY_SIMD_DIR=$(AUTOGEN_SRCS_DIR)/poly/simd
-AUTOGEN_SRCS_PERMUTE_DIR=$(AUTOGEN_SRCS_DIR)/permute
-AUTOGEN_SRCS_TRANSPOSE_DIR=$(AUTOGEN_SRCS_DIR)/transpose
-AUTOGEN_SRCS_NTT_N256_DIR=$(AUTOGEN_SRCS_DIR)/ntt_n256
-AUTOGEN_SRCS_NTT_256_DIR=$(AUTOGEN_SRCS_DIR)/ntt_256
-AUTOGEN_SRCS_NTT_1024_DIR=$(AUTOGEN_SRCS_DIR)/ntt_1024
-AUTOGEN_SRCS_NTT_512_DIR=$(AUTOGEN_SRCS_DIR)/ntt_512
-AUTOGEN_SRCS_NTT_768_DIR=$(AUTOGEN_SRCS_DIR)/ntt_768
-AUTOGEN_SRCS_NTT_192_DIR=$(AUTOGEN_SRCS_DIR)/ntt_192
-AUTOGEN_SRCS_NTT_384_DIR=$(AUTOGEN_SRCS_DIR)/ntt_384
-AUTOGEN_SRCS_UNPACK_DIR=$(AUTOGEN_SRCS_DIR)/unpack
+platformtests := $(foreach test,$(TESTS), $(call toplatform,$(test)))
-MANUAL_SRCS_DIR=$(CODEGEN_DIR)/manual
-
-MANUAL_SRCS_MONTGOMERY_DIR=$(MANUAL_SRCS_DIR)/montgomery
-MANUAL_SRCS_MONTGOMERY_ALL=$(wildcard $(MANUAL_SRCS_MONTGOMERY_DIR)/*.s) $(wildcard $(MANUAL_SRCS_MONTGOMERY_DIR)/*.h)
-
-MANUAL_SRCS_NTT_KYBER_DIR=$(MANUAL_SRCS_DIR)/ntt_kyber
-MANUAL_SRCS_NTT_KYBER_ALL=$(wildcard $(MANUAL_SRCS_NTT_KYBER_DIR)/*.s) $(wildcard $(MANUAL_SRCS_NTT_KYBER_DIR)/*.h)
-
-MANUAL_SRCS_NTT_DILITHIUM_DIR=$(MANUAL_SRCS_DIR)/ntt_dilithium
-MANUAL_SRCS_NTT_DILITHIUM_ALL=$(wildcard $(MANUAL_SRCS_NTT_DILITHIUM_DIR)/*.s) $(wildcard $(MANUAL_SRCS_NTT_DILITHIUM_DIR)/*.h)
-
-MANUAL_SRCS_NTT_N256_DIR=$(MANUAL_SRCS_DIR)/ntt_n256
-MANUAL_SRCS_NTT_N256_ALL=$(wildcard $(MANUAL_SRCS_NTT_N256_DIR)/*.s) $(wildcard $(MANUAL_SRCS_NTT_N256_DIR)/*.h)
-
-MANUAL_SRCS_CRT_DIR=$(MANUAL_SRCS_DIR)/crt
-MANUAL_SRCS_CRT_ALL=$(wildcard $(MANUAL_SRCS_CRT_DIR)/*.s) $(wildcard $(MANUAL_SRCS_CRT_DIR)/*.h)
-
-MANUAL_SRCS_SQMAG_DIR=$(MANUAL_SRCS_DIR)/sqmag
-MANUAL_SRCS_SQMAG_ALL=$(wildcard $(MANUAL_SRCS_SQMAG_DIR)/*.s) $(wildcard $(MANUAL_SRCS_SQMAG_DIR)/*.h)
-
-MANUAL_SRCS_FX_FFT_DIR=$(MANUAL_SRCS_DIR)/fx_fft
-MANUAL_SRCS_FX_FFT_ALL=$(wildcard $(MANUAL_SRCS_FX_FFT_DIR)/*.s) $(wildcard $(MANUAL_SRCS_FX_FFT_DIR)/*.h)
-
-MANUAL_SRCS_FLT_FFT_DIR=$(MANUAL_SRCS_DIR)/flt_fft
-MANUAL_SRCS_FLT_FFT_ALL=$(wildcard $(MANUAL_SRCS_FLT_FFT_DIR)/*.s) $(wildcard $(MANUAL_SRCS_FLT_FFT_DIR)/*.h)
-
-MANUAL_SRCS_CT_DIR=$(MANUAL_SRCS_DIR)/ct
-MANUAL_SRCS_CT_ALL=$(wildcard $(MANUAL_SRCS_CT_DIR)/*.s) $(wildcard $(MANUAL_SRCS_CT_DIR)/*.h)
-
-MANUAL_SRCS_CHUNK_DIR=$(MANUAL_SRCS_DIR)/chunk
-MANUAL_SRCS_CHUNK_ALL=$(wildcard $(MANUAL_SRCS_CHUNK_DIR)/*.s) $(wildcard $(MANUAL_SRCS_CHUNK_DIR)/*.h)
-
-MANUAL_SRCS_KARATSUBA_DIR=$(MANUAL_SRCS_DIR)/karatsuba
-MANUAL_SRCS_KARATSUBA_ALL=$(wildcard $(MANUAL_SRCS_KARATSUBA_DIR)/*.s) $(wildcard $(MANUAL_SRCS_KARATSUBA_DIR)/*.h)
-
-MANUAL_SRCS_SCHOOLBOOK_DIR=$(MANUAL_SRCS_DIR)/schoolbook
-MANUAL_SRCS_SCHOOLBOOK_ALL=$(wildcard $(MANUAL_SRCS_SCHOOLBOOK_DIR)/*.s) $(wildcard $(MANUAL_SRCS_SCHOOLBOOK_DIR)/*.h)
-
-MANUAL_SRCS_POLY_ALL=$(MANUAL_SRCS_KARATSUBA_ALL) $(MANUAL_SRCS_MONTGOMERY_ALL) $(MANUAL_SRCS_SCHOOLBOOK_ALL)
-
-MANUAL_SRCS_NTT_1024_ALL=$(MANUAL_SRCS_MONTGOMERY_ALL)
-MANUAL_SRCS_NTT_768_ALL=$(MANUAL_SRCS_MONTGOMERY_ALL)
-MANUAL_SRCS_NTT_192_ALL=$(MANUAL_SRCS_MONTGOMERY_ALL)
-MANUAL_SRCS_NTT_384_ALL=$(MANUAL_SRCS_MONTGOMERY_ALL)
-
-
-AUTOGEN_SRCS_ALL=$(wildcard $(AUTOGEN_SRCS_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_DIR)/*/*/*/*/*.s)
-AUTOGEN_SRCS_POLY_TOOM4_ALL=$(wildcard $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_POLY_TOOM3_ALL=$(wildcard $(AUTOGEN_SRCS_POLY_TOOM3_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_TOOM3_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_TOOM3_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_TOOM3_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_POLY_SCHOOLBOOK_ALL=$(wildcard $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_POLY_SIMD_ALL=$(wildcard $(AUTOGEN_SRCS_POLY_SIMD_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_SIMD_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_SIMD_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_POLY_SIMD_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_PERMUTE_ALL=$(wildcard $(AUTOGEN_SRCS_PERMUTE_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_PERMUTE_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_PERMUTE_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_PERMUTE_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_TRANSPOSE_ALL=$(wildcard $(AUTOGEN_SRCS_TRANSPOSE_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_TRANSPOSE_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_TRANSPOSE_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_TRANSPOSE_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_N256_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_N256_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_N256_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_N256_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_N256_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_256_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_256_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_256_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_256_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_256_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_1024_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_1024_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_1024_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_1024_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_1024_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_512_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_512_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_512_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_512_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_512_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_768_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_768_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_768_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_768_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_768_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_192_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_192_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_192_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_192_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_192_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_NTT_384_ALL=$(wildcard $(AUTOGEN_SRCS_NTT_384_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_384_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_384_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_NTT_384_DIR)/*/*/*/*.s)
-AUTOGEN_SRCS_UNPACK_ALL=$(wildcard $(AUTOGEN_SRCS_UNPACK_DIR)/*.s) $(wildcard $(AUTOGEN_SRCS_UNPACK_DIR)/*/*.s) $(wildcard $(AUTOGEN_SRCS_UNPACK_DIR)/*/*/*.s) $(wildcard $(AUTOGEN_SRCS_UNPACK_DIR)/*/*/*/*.s)
-
-TEST_BASE_DIR=tests
-
-TEST_POLY_SCHOOLBOOK_DIR=$(TEST_BASE_DIR)/schoolbook
-TEST_POLY_SCHOOLBOOK_SOURCES_AUTO_DIR=$(TEST_POLY_SCHOOLBOOK_DIR)/auto
-TEST_POLY_SCHOOLBOOK_SRC_C=$(wildcard $(TEST_POLY_SCHOOLBOOK_DIR)/*.c) $(wildcard $(TEST_POLY_SCHOOLBOOK_DIR)/*/*.c)
-TEST_POLY_SCHOOLBOOK_SRC_AUTO_SCHOOLBOOK=$(patsubst $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR)/%.s, $(TEST_POLY_SCHOOLBOOK_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_ALL))
-TEST_POLY_SCHOOLBOOK_SRC_AUTO_SIMD=$(patsubst $(AUTOGEN_SRCS_POLY_SIMD_DIR)/%.s, $(TEST_POLY_SCHOOLBOOK_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_SIMD_ALL))
-TEST_POLY_SCHOOLBOOK_SRC_AUTO=$(TEST_POLY_SCHOOLBOOK_SRC_AUTO_SIMD) $(TEST_POLY_SCHOOLBOOK_SRC_AUTO_SCHOOLBOOK)
-TEST_POLY_SCHOOLBOOK_SRC_ALL=$(TEST_POLY_SCHOOLBOOK_SRC_C) $(TEST_POLY_SCHOOLBOOK_SRC_AUTO)
-
-TEST_POLY_TOOM_DIR=$(TEST_BASE_DIR)/toom
-TEST_POLY_TOOM_SOURCES_AUTO_DIR=$(TEST_POLY_TOOM_DIR)/auto
-TEST_POLY_TOOM_SRC_C=$(wildcard $(TEST_POLY_TOOM_DIR)/*.c) $(wildcard $(TEST_POLY_TOOM_DIR)/*/*.c)
-TEST_POLY_TOOM_SRC_AUTO_TOOM3=$(patsubst $(AUTOGEN_SRCS_POLY_TOOM3_DIR)/%.s, $(TEST_POLY_TOOM_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_TOOM3_ALL))
-TEST_POLY_TOOM_SRC_AUTO_TOOM4=$(patsubst $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/%.s, $(TEST_POLY_TOOM_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_TOOM4_ALL))
-TEST_POLY_TOOM_SRC_AUTO=$(TEST_POLY_TOOM_SRC_AUTO_TOOM3) $(TEST_POLY_TOOM_SRC_AUTO_TOOM4)
-TEST_POLY_TOOM_SRC_ALL=$(TEST_POLY_TOOM_SRC_C) $(TEST_POLY_TOOM_SRC_AUTO)
-
-TEST_PERMUTE_DIR=$(TEST_BASE_DIR)/permute
-TEST_PERMUTE_SOURCES_AUTO_DIR=$(TEST_PERMUTE_DIR)/auto
-TEST_PERMUTE_SRC_C=$(wildcard $(TEST_PERMUTE_DIR)/*.c) $(wildcard $(TEST_PERMUTE_DIR)/*/*.c)
-TEST_PERMUTE_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_PERMUTE_DIR)/%.s, $(TEST_PERMUTE_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_PERMUTE_ALL))
-TEST_PERMUTE_SRC_ALL=$(TEST_PERMUTE_SRC_C) $(TEST_PERMUTE_SRC_AUTO)
-
-TEST_UNPACK_DIR=$(TEST_BASE_DIR)/unpack
-TEST_UNPACK_SOURCES_AUTO_DIR=$(TEST_UNPACK_DIR)/auto
-TEST_UNPACK_SRC_C=$(wildcard $(TEST_UNPACK_DIR)/*.c) $(wildcard $(TEST_UNPACK_DIR)/*/*.c)
-TEST_UNPACK_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_UNPACK_DIR)/%.s, $(TEST_UNPACK_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_UNPACK_ALL))
-TEST_UNPACK_SRC_ALL=$(TEST_UNPACK_SRC_C) $(TEST_UNPACK_SRC_AUTO)
-
-TEST_NTT_N256_DIR=$(TEST_BASE_DIR)/ntt_n256
-TEST_NTT_N256_SOURCES_AUTO_DIR=$(TEST_NTT_N256_DIR)/auto
-TEST_NTT_N256_SOURCES_MANUAL_DIR=$(TEST_NTT_N256_DIR)/manual
-TEST_NTT_N256_SRC_C=$(wildcard $(TEST_NTT_N256_DIR)/*.c) $(wildcard $(TEST_NTT_N256_DIR)/*/*.c)
-TEST_NTT_N256_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_N256_DIR)/%.s, $(TEST_NTT_N256_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_N256_ALL))
-TEST_NTT_N256_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_NTT_N256_DIR)/%.s, $(TEST_NTT_N256_SOURCES_MANUAL_DIR)/%.s, $(MANUAL_SRCS_NTT_N256_ALL))
-TEST_NTT_N256_SRC_ALL=$(TEST_NTT_N256_SRC_C) $(TEST_NTT_N256_SRC_AUTO) $(TEST_NTT_N256_SRC_MANUAL)
-
-TEST_NTT_256_DIR=$(TEST_BASE_DIR)/ntt_256
-TEST_NTT_256_SOURCES_AUTO_DIR=$(TEST_NTT_256_DIR)/auto
-TEST_NTT_256_SRC_C=$(wildcard $(TEST_NTT_256_DIR)/*.c) $(wildcard $(TEST_NTT_256_DIR)/*/*.c)
-TEST_NTT_256_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_256_DIR)/%.s, $(TEST_NTT_256_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_256_ALL))
-TEST_NTT_256_SRC_ALL=$(TEST_NTT_256_SRC_C) $(TEST_NTT_256_SRC_AUTO)
-
-TEST_NTT_1024_DIR=$(TEST_BASE_DIR)/ntt_1024
-TEST_NTT_1024_SOURCES_AUTO_DIR=$(TEST_NTT_1024_DIR)/auto
-TEST_NTT_1024_SRC_C=$(wildcard $(TEST_NTT_1024_DIR)/*.c) $(wildcard $(TEST_NTT_1024_DIR)/*/*.c)
-TEST_NTT_1024_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_1024_DIR)/%.s, $(TEST_NTT_1024_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_1024_ALL))
-TEST_NTT_1024_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%,$(TEST_NTT_1024_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_NTT_1024_SRC_ALL=$(TEST_NTT_1024_SRC_C) $(TEST_NTT_1024_SRC_AUTO) $(TEST_NTT_1024_SRC_MANUAL)
-
-TEST_NTT_512_DIR=$(TEST_BASE_DIR)/ntt_512
-TEST_NTT_512_SOURCES_AUTO_DIR=$(TEST_NTT_512_DIR)/auto
-TEST_NTT_512_SRC_C=$(wildcard $(TEST_NTT_512_DIR)/*.c) $(wildcard $(TEST_NTT_512_DIR)/*/*.c)
-TEST_NTT_512_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_512_DIR)/%.s, $(TEST_NTT_512_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_512_ALL))
-TEST_NTT_512_SRC_ALL=$(TEST_NTT_512_SRC_C) $(TEST_NTT_512_SRC_AUTO)
-
-TEST_NTT_768_DIR=$(TEST_BASE_DIR)/ntt_768
-TEST_NTT_768_SOURCES_AUTO_DIR=$(TEST_NTT_768_DIR)/auto
-TEST_NTT_768_SRC_C=$(wildcard $(TEST_NTT_768_DIR)/*.c) $(wildcard $(TEST_NTT_768_DIR)/*/*.c)
-TEST_NTT_768_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_768_DIR)/%.s, $(TEST_NTT_768_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_768_ALL))
-TEST_NTT_768_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%,$(TEST_NTT_768_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_NTT_768_SRC_ALL=$(TEST_NTT_768_SRC_C) $(TEST_NTT_768_SRC_AUTO) $(TEST_NTT_768_SRC_MANUAL)
-
-TEST_NTT_192_DIR=$(TEST_BASE_DIR)/ntt_192
-TEST_NTT_192_SOURCES_AUTO_DIR=$(TEST_NTT_192_DIR)/auto
-TEST_NTT_192_SRC_C=$(wildcard $(TEST_NTT_192_DIR)/*.c) $(wildcard $(TEST_NTT_192_DIR)/*/*.c)
-TEST_NTT_192_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_192_DIR)/%.s, $(TEST_NTT_192_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_192_ALL))
-TEST_NTT_192_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%,$(TEST_NTT_192_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_NTT_192_SRC_ALL=$(TEST_NTT_192_SRC_C) $(TEST_NTT_192_SRC_AUTO) $(TEST_NTT_192_SRC_MANUAL)
-
-TEST_NTT_384_DIR=$(TEST_BASE_DIR)/ntt_384
-TEST_NTT_384_SOURCES_AUTO_DIR=$(TEST_NTT_384_DIR)/auto
-TEST_NTT_384_SRC_C=$(wildcard $(TEST_NTT_384_DIR)/*.c) $(wildcard $(TEST_NTT_384_DIR)/*/*.c)
-TEST_NTT_384_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_NTT_384_DIR)/%.s, $(TEST_NTT_384_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_384_ALL))
-TEST_NTT_384_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%,$(TEST_NTT_384_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_NTT_384_SRC_ALL=$(TEST_NTT_384_SRC_C) $(TEST_NTT_384_SRC_AUTO) $(TEST_NTT_384_SRC_MANUAL)
-
-TEST_TRANSPOSE_DIR=$(TEST_BASE_DIR)/transpose
-TEST_TRANSPOSE_SOURCES_AUTO_DIR=$(TEST_TRANSPOSE_DIR)/auto
-TEST_TRANSPOSE_SRC_C=$(wildcard $(TEST_TRANSPOSE_DIR)/*.c) $(wildcard $(TEST_TRANSPOSE_DIR)/*/*.c)
-TEST_TRANSPOSE_SRC_AUTO=$(patsubst $(AUTOGEN_SRCS_TRANSPOSE_DIR)/%.s, $(TEST_TRANSPOSE_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_TRANSPOSE_ALL))
-TEST_TRANSPOSE_SRC_ALL=$(TEST_TRANSPOSE_SRC_C) $(TEST_TRANSPOSE_SRC_AUTO)
-
-TEST_HELLOWORLD_DIR=$(TEST_BASE_DIR)/helloworld
-TEST_HELLOWORLD_SOURCES_AUTO_DIR=$(TEST_HELLOWORLD_DIR)/auto
-TEST_HELLOWORLD_SRC_C=$(wildcard $(TEST_HELLOWORLD_DIR)/*.c) $(wildcard $(TEST_HELLOWORLD_DIR)/*/*.c)
-TEST_HELLOWORLD_SRC_ALL=$(TEST_HELLOWORLD_SRC_C)
-
-TEST_PROFILING_DIR=$(TEST_BASE_DIR)/profiling
-TEST_PROFILING_SOURCES_AUTO_DIR=$(TEST_PROFILING_DIR)/auto
-TEST_PROFILING_SRC_C=$(wildcard $(TEST_PROFILING_DIR)/*.c) $(wildcard $(TEST_PROFILING_DIR)/*/*.c)
-TEST_PROFILING_SRC_ALL=$(TEST_PROFILING_SRC_C)
-
-
-TEST_SABER_DIR=$(TEST_BASE_DIR)/saber
-TEST_SABER_SOURCES_AUTO_DIR=$(TEST_SABER_DIR)/auto
-TEST_SABER_SOURCES_MANUAL_DIR=$(TEST_SABER_DIR)/manual
-TEST_SABER_SRC_C=$(wildcard $(TEST_SABER_DIR)/*.c) $(wildcard $(TEST_SABER_DIR)/*/*.c)
-TEST_SABER_SRC_MANUAL_KARATSUBA=$(patsubst $(MANUAL_SRCS_KARATSUBA_DIR)/%, $(TEST_SABER_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_KARATSUBA_ALL))
-TEST_SABER_SRC_MANUAL_MONTGOMERY=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%, $(TEST_SABER_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_SABER_SRC_AUTO_TOOM4=$(patsubst $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/%.s, $(TEST_SABER_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_TOOM4_ALL))
-TEST_SABER_SRC_AUTO_NTT_N256=$(patsubst $(AUTOGEN_SRCS_NTT_N256_DIR)/%.s, $(TEST_SABER_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_N256_ALL))
-TEST_SABER_SRC_AUTO_SCHOOLBOOK=$(patsubst $(AUTOGEN_SRCS_POLY_SIMD_DIR)/%.s, $(TEST_SABER_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_SIMD_ALL))
-TEST_SABER_SRC_ALL=$(TEST_SABER_SRC_C) $(TEST_SABER_SRC_MANUAL_MONTGOMERY) $(TEST_SABER_SRC_MANUAL_KARATSUBA) $(TEST_SABER_SRC_AUTO_TOOM4) $(TEST_SABER_SRC_AUTO_SCHOOLBOOK) $(TEST_SABER_SRC_AUTO_NTT_N256)
-
-TEST_INTMULNTT_DIR=$(TEST_BASE_DIR)/intmulntt
-TEST_INTMULNTT_SOURCES_AUTO_DIR=$(TEST_INTMULNTT_DIR)/
-TEST_INTMULNTT_SOURCES_MANUAL_DIR=$(TEST_INTMULNTT_DIR)/
-TEST_INTMULNTT_SRC_C=$(wildcard $(TEST_INTMULNTT_DIR)/*.c) $(wildcard $(TEST_INTMULNTT_DIR)/*/*.c)
-TEST_INTMULNTT_SRC_MANUAL_MONTGOMERY=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%, $(TEST_INTMULNTT_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_INTMULNTT_SRC_MANUAL_CRT=$(patsubst $(MANUAL_SRCS_CRT_DIR)/%, $(TEST_INTMULNTT_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_CRT_ALL))
-TEST_INTMULNTT_SRC_AUTO_NTT_384=$(patsubst $(AUTOGEN_SRCS_NTT_384_DIR)/%.s, $(TEST_INTMULNTT_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_384_ALL))
-TEST_INTMULNTT_SRC_AUTO_NTT_192=$(patsubst $(AUTOGEN_SRCS_NTT_192_DIR)/%.s, $(TEST_INTMULNTT_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_192_ALL))
-TEST_INTMULNTT_SRC_ALL=$(TEST_INTMULNTT_SRC_C) $(TEST_INTMULNTT_SRC_MANUAL_MONTGOMERY) $(TEST_INTMULNTT_SRC_MANUAL_CRT) $(TEST_INTMULNTT_SRC_AUTO_NTT_384) $(TEST_INTMULNTT_SRC_AUTO_NTT_192)
-
-TEST_MONTGOMERY_DIR=$(TEST_BASE_DIR)/montgomery
-TEST_MONTGOMERY_SRC_C=$(wildcard $(TEST_MONTGOMERY_DIR)/*.c) $(wildcard $(TEST_MONTGOMERY_DIR)/*/*.c) $(wildcard $(TEST_MONTGOMERY_DIR)/*.s)
-TEST_MONTGOMERY_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%, $(TEST_MONTGOMERY_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_MONTGOMERY_SRC_ALL=$(TEST_MONTGOMERY_SRC_C) $(TEST_MONTGOMERY_SRC_MANUAL)
-
-TEST_NTT_KYBER_DIR=$(TEST_BASE_DIR)/ntt_kyber
-TEST_NTT_KYBER_SRC_C=$(wildcard $(TEST_NTT_KYBER_DIR)/*.c) $(wildcard $(TEST_NTT_KYBER_DIR)/*/*.c) $(wildcard $(TEST_NTT_KYBER_DIR)/*.s)
-TEST_NTT_KYBER_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_NTT_KYBER_DIR)/%, $(TEST_NTT_KYBER_DIR)/manual/%, $(MANUAL_SRCS_NTT_KYBER_ALL))
-TEST_NTT_KYBER_SRC_ALL=$(TEST_NTT_KYBER_SRC_C) $(TEST_NTT_KYBER_SRC_MANUAL)
-
-TEST_NTT_DILITHIUM_DIR=$(TEST_BASE_DIR)/ntt_dilithium
-TEST_NTT_DILITHIUM_SRC_C=$(wildcard $(TEST_NTT_DILITHIUM_DIR)/*.c) $(wildcard $(TEST_NTT_DILITHIUM_DIR)/*/*.c) $(wildcard $(TEST_NTT_DILITHIUM_DIR)/*.s)
-TEST_NTT_DILITHIUM_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_NTT_DILITHIUM_DIR)/%, $(TEST_NTT_DILITHIUM_DIR)/manual/%, $(MANUAL_SRCS_NTT_DILITHIUM_ALL))
-TEST_NTT_DILITHIUM_SRC_ALL=$(TEST_NTT_DILITHIUM_SRC_C) $(TEST_NTT_DILITHIUM_SRC_MANUAL)
-
-TEST_CRT_DIR=$(TEST_BASE_DIR)/crt
-TEST_CRT_SRC_C=$(wildcard $(TEST_CRT_DIR)/*.c) $(wildcard $(TEST_CRT_DIR)/*/*.c) $(wildcard $(TEST_CRT_DIR)/*.s)
-TEST_CRT_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_CRT_DIR)/%, $(TEST_CRT_DIR)/%, $(MANUAL_SRCS_CRT_ALL))
-TEST_CRT_SRC_ALL=$(TEST_CRT_SRC_C) $(TEST_CRT_SRC_MANUAL)
-
-TEST_SQMAG_DIR=$(TEST_BASE_DIR)/sqmag
-TEST_SQMAG_SRC_C=$(wildcard $(TEST_SQMAG_DIR)/*.c) $(wildcard $(TEST_SQMAG_DIR)/*/*.c) $(wildcard $(TEST_SQMAG_DIR)/*.s)
-TEST_SQMAG_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_SQMAG_DIR)/%, $(TEST_SQMAG_DIR)/%, $(MANUAL_SRCS_SQMAG_ALL))
-TEST_SQMAG_SRC_ALL=$(TEST_SQMAG_SRC_C) $(TEST_SQMAG_SRC_MANUAL)
-
-TEST_FX_FFT_DIR=$(TEST_BASE_DIR)/fx_fft
-TEST_FX_FFT_SRC_C=$(wildcard $(TEST_FX_FFT_DIR)/*.c) $(wildcard $(TEST_FX_FFT_DIR)/*/*.c) $(wildcard $(TEST_FX_FFT_DIR)/*.s)
-TEST_FX_FFT_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_FX_FFT_DIR)/%, $(TEST_FX_FFT_DIR)/%, $(MANUAL_SRCS_FX_FFT_ALL))
-TEST_FX_FFT_SRC_ALL=$(TEST_FX_FFT_SRC_C) $(TEST_FX_FFT_SRC_MANUAL)
-
-TEST_FLT_FFT_DIR=$(TEST_BASE_DIR)/flt_fft
-TEST_FLT_FFT_SRC_C=$(wildcard $(TEST_FLT_FFT_DIR)/*.c) $(wildcard $(TEST_FLT_FFT_DIR)/*/*.c) $(wildcard $(TEST_FLT_FFT_DIR)/*.s)
-TEST_FLT_FFT_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_FLT_FFT_DIR)/%, $(TEST_FLT_FFT_DIR)/%, $(MANUAL_SRCS_FLT_FFT_ALL))
-TEST_FLT_FFT_SRC_ALL=$(TEST_FLT_FFT_SRC_C) $(TEST_FLT_FFT_SRC_MANUAL)
-
-TEST_CT_DIR=$(TEST_BASE_DIR)/ct
-TEST_CT_SRC_C=$(wildcard $(TEST_CT_DIR)/*.c) $(wildcard $(TEST_CT_DIR)/*/*.c) $(wildcard $(TEST_CT_DIR)/*.s)
-TEST_CT_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_CT_DIR)/%, $(TEST_CT_DIR)/%, $(MANUAL_SRCS_CT_ALL))
-TEST_CT_SRC_ALL=$(TEST_CT_SRC_C) $(TEST_CT_SRC_MANUAL)
-
-
-TEST_CHUNK_DIR=$(TEST_BASE_DIR)/chunk
-TEST_CHUNK_SRC_C=$(wildcard $(TEST_CHUNK_DIR)/*.c) $(wildcard $(TEST_CHUNK_DIR)/*/*.c) $(wildcard $(TEST_CHUNK_DIR)/*.s)
-TEST_CHUNK_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_CHUNK_DIR)/%, $(TEST_CHUNK_DIR)/%, $(MANUAL_SRCS_CHUNK_ALL))
-TEST_CHUNK_SRC_ALL=$(TEST_CHUNK_SRC_C) $(TEST_CHUNK_SRC_MANUAL)
-
-TEST_KARATSUBA_DIR=$(TEST_BASE_DIR)/karatsuba
-TEST_KARATSUBA_SRC_C=$(wildcard $(TEST_KARATSUBA_DIR)/*.c) $(wildcard $(TEST_KARATSUBA_DIR)/*/*.c) $(wildcard $(TEST_KARATSUBA_DIR)/*.s)
-TEST_KARATSUBA_SRC_MANUAL=$(patsubst $(MANUAL_SRCS_KARATSUBA_DIR)/%, $(TEST_KARATSUBA_DIR)/%, $(MANUAL_SRCS_KARATSUBA_ALL))
-TEST_KARATSUBA_SRC_ALL=$(TEST_KARATSUBA_SRC_C) $(TEST_KARATSUBA_SRC_MANUAL)
-
-
-TEST_POLY_DIR=$(TEST_BASE_DIR)/poly
-TEST_POLY_SOURCES_AUTO_DIR=$(TEST_POLY_DIR)/auto
-TEST_POLY_SOURCES_MANUAL_DIR=$(TEST_POLY_DIR)/manual
-TEST_POLY_SRC_C=$(wildcard $(TEST_POLY_DIR)/*.c) $(wildcard $(TEST_POLY_DIR)/*/*.c) $(wildcard $(TEST_POLY_DIR)/*.s)
-TEST_POLY_SRC_MANUAL_KARATSUBA=$(patsubst $(MANUAL_SRCS_KARATSUBA_DIR)/%, $(TEST_POLY_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_KARATSUBA_ALL))
-TEST_POLY_SRC_MANUAL_SCHOOLBOOK=$(patsubst $(MANUAL_SRCS_SCHOOLBOOK_DIR)/%, $(TEST_POLY_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_SCHOOLBOOK_ALL))
-TEST_POLY_SRC_MANUAL_MONTGOMERY=$(patsubst $(MANUAL_SRCS_MONTGOMERY_DIR)/%, $(TEST_POLY_SOURCES_MANUAL_DIR)/%, $(MANUAL_SRCS_MONTGOMERY_ALL))
-TEST_POLY_SRC_AUTO_TOOM4=$(patsubst $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/%.s, $(TEST_POLY_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_TOOM4_ALL))
-TEST_POLY_SRC_AUTO_NTT_N256=$(patsubst $(AUTOGEN_SRCS_NTT_N256_DIR)/%.s, $(TEST_POLY_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_NTT_N256_ALL))
-TEST_POLY_SRC_AUTO_SCHOOLBOOK=$(patsubst $(AUTOGEN_SRCS_POLY_SIMD_DIR)/%.s, $(TEST_POLY_SOURCES_AUTO_DIR)/%.s, $(AUTOGEN_SRCS_POLY_SIMD_ALL))
-TEST_POLY_SRC_ALL=$(TEST_POLY_SRC_C) $(TEST_POLY_SRC_MANUAL_MONTGOMERY) $(TEST_POLY_SRC_MANUAL_SCHOOLBOOK) $(TEST_POLY_SRC_MANUAL_KARATSUBA) $(TEST_POLY_SRC_AUTO_TOOM4) $(TEST_POLY_SRC_AUTO_SCHOOLBOOK) $(TEST_POLY_SRC_AUTO_NTT_N256) $(TEST_POLY_SRC_AUTO_MONTGOMERY)
-
-TEST_SRC_AUTO_ALL=$(TEST_POLY_SCHOOLBOOK_SRC_AUTO) $(TEST_POLY_TOOM_SRC_AUTO) $(TEST_PERMUTE_SRC_AUTO) $(TEST_UNPACK_SRC_AUTO) $(TEST_TRANSPOSE_SRC_AUTO) $(TEST_HELLOWORLD_SRC_AUTO) $(TEST_NTT_N256_SRC_AUTO) $(TEST_POLY_SRC_AUTO_ALL) $(TEST_SABER_SRC_AUTO)
-
-TEST_ENVS_BASE_DIR=envs
-TEST_ENV_M55_MPS3_FVP_BASE=$(TEST_ENVS_BASE_DIR)/fvp-corstone300-mps3
-TEST_ENV_M55_MPS3_FVP_SRC=$(TEST_ENV_M55_MPS3_FVP_BASE)/src
-TEST_ENV_M55_MPS3_FVP_SYMLINK=$(TEST_ENV_M55_MPS3_FVP_SRC)/test_src
-
-TEST_ENV_M55_MPS2_FVP_BASE=$(TEST_ENVS_BASE_DIR)/fvp-corstone300-mps2
-TEST_ENV_M55_MPS2_FVP_SRC=$(TEST_ENV_M55_MPS2_FVP_BASE)/src
-TEST_ENV_M55_MPS2_FVP_SYMLINK=$(TEST_ENV_M55_MPS2_FVP_SRC)/test_src
-
-TEST_ENV_M85_AN555_BASE=$(TEST_ENVS_BASE_DIR)/m85-an555
-TEST_ENV_M85_AN555_SRC=$(TEST_ENV_M85_AN555_BASE)/src
-TEST_ENV_M85_AN555_SYMLINK=$(TEST_ENV_M85_AN555_SRC)/test_src
-
-TEST_ENV_M55_AN547_BASE=$(TEST_ENVS_BASE_DIR)/m55-an547
-TEST_ENV_M55_AN547_SRC=$(TEST_ENV_M55_AN547_BASE)/src
-TEST_ENV_M55_AN547_SYMLINK=$(TEST_ENV_M55_AN547_SRC)/test_src
-
-TEST_ENV_M55_CORE_BASE=$(TEST_ENVS_BASE_DIR)/core
-TEST_ENV_M55_CORE_SRC=$(TEST_ENV_M55_CORE_BASE)/src
-TEST_ENV_M55_CORE_SYMLINK=$(TEST_ENV_M55_CORE_SRC)/test_src
-
-PYTHON_SRCS=$(wildcard $(CODEGEN_DIR)/*.py) $(wildcard $(CODEGEN_DIR)/*/*.py) $(wildcard $(CODEGEN_DIR)/*/*/*.py) $(wildcard $(CODEGEN_DIR)/*/*/*/*/*.py)
+builds := $(addprefix build-, $(platformtests))
+runs := $(addprefix run-, $(platformtests))
+cleans := $(addprefix clean-, $(platformtests))
 
 .PHONY: all
-all: codegen $(TEST_SRC_AUTO_ALL)
-
-.PHONY: clean
-
-ifndef PQMX_STATIC_BUILD
-clean:
-	make clean -C $(TEST_ENV_M55_MPS3_FVP_BASE)
-	make clean -C $(TEST_ENV_M55_MPS2_FVP_BASE)
-	make clean -C $(TEST_ENV_M55_CORE_BASE)
-	rm -f $(TEST_SRC_AUTO_ALL)
-	rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-else
-clean:
-	make clean -C $(TEST_ENV_M55_MPS3_FVP_BASE)
-	make clean -C $(TEST_ENV_M55_MPS2_FVP_BASE)
-	rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-endif
-
-.PHONY: cleanasm
-cleanasm:
-	make clean -C $(CODEGEN_DIR)
-
-.PHONY: cleanall
-cleanall: clean cleanasm
-
-ifndef PQMX_STATIC_BUILD
-
-$(AUTOGEN_SRCS_ALL): $(PYTHON_SRCS)
-	make -C $(CODEGEN_DIR)
-
-$(TEST_POLY_SCHOOLBOOK_SRC_AUTO_SCHOOLBOOK): $(TEST_POLY_SCHOOLBOOK_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_SCHOOLBOOK_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_POLY_SCHOOLBOOK_SRC_AUTO_SIMD): $(TEST_POLY_SCHOOLBOOK_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_SIMD_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_POLY_TOOM_SRC_AUTO_TOOM3): $(TEST_POLY_TOOM_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_TOOM3_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_POLY_TOOM_SRC_AUTO_TOOM4): $(TEST_POLY_TOOM_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_PERMUTE_SRC_AUTO): $(TEST_PERMUTE_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_PERMUTE_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_UNPACK_SRC_AUTO): $(TEST_UNPACK_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_UNPACK_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_TRANSPOSE_SRC_AUTO): $(TEST_TRANSPOSE_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_TRANSPOSE_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_N256_SRC_AUTO): $(TEST_NTT_N256_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_N256_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_256_SRC_AUTO): $(TEST_NTT_256_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_256_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_1024_SRC_AUTO): $(TEST_NTT_1024_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_1024_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_1024_SRC_MANUAL): $(TEST_NTT_1024_DIR)/%: $(MANUAL_SRCS_MONTGOMERY_DIR)/%
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_512_SRC_AUTO): $(TEST_NTT_512_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_512_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_768_SRC_AUTO): $(TEST_NTT_768_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_768_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_768_SRC_MANUAL): $(TEST_NTT_768_DIR)/%: $(MANUAL_SRCS_MONTGOMERY_DIR)/%
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_192_SRC_AUTO): $(TEST_NTT_192_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_192_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_192_SRC_MANUAL): $(TEST_NTT_192_DIR)/%: $(MANUAL_SRCS_MONTGOMERY_DIR)/%
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_384_SRC_AUTO): $(TEST_NTT_384_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_384_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_384_SRC_MANUAL): $(TEST_NTT_384_DIR)/%: $(MANUAL_SRCS_MONTGOMERY_DIR)/%
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_MONTGOMERY_SRC_MANUAL): $(TEST_MONTGOMERY_DIR)/%.s: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_MONTGOMERY_SRC_MANUAL): $(TEST_MONTGOMERY_DIR)/%.h: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_KYBER_SRC_MANUAL): $(TEST_NTT_KYBER_DIR)/manual/%.s: $(MANUAL_SRCS_NTT_KYBER_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_KYBER_SRC_MANUAL): $(TEST_NTT_KYBER_DIR)/manual/%.h: $(MANUAL_SRCS_NTT_KYBER_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_DILITHIUM_SRC_MANUAL): $(TEST_NTT_DILITHIUM_DIR)/manual/%.s: $(MANUAL_SRCS_NTT_DILITHIUM_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_DILITHIUM_SRC_MANUAL): $(TEST_NTT_DILITHIUM_DIR)/manual/%.h: $(MANUAL_SRCS_NTT_DILITHIUM_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_NTT_N256_SRC_MANUAL): $(TEST_NTT_N256_DIR)/manual/%.s: $(MANUAL_SRCS_NTT_N256_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_NTT_N256_SRC_MANUAL): $(TEST_NTT_N256_DIR)/manual/%.h: $(MANUAL_SRCS_NTT_N256_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_CRT_SRC_MANUAL): $(TEST_CRT_DIR)/%.s: $(MANUAL_SRCS_CRT_DIR)/%.s
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-$(TEST_CRT_SRC_MANUAL): $(TEST_CRT_DIR)/%.h: $(MANUAL_SRCS_CRT_DIR)/%.h
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-
-$(TEST_SQMAG_SRC_MANUAL): $(TEST_SQMAG_DIR)/%.s: $(MANUAL_SRCS_SQMAG_DIR)/%.s
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-$(TEST_SQMAG_SRC_MANUAL): $(TEST_SQMAG_DIR)/%.h: $(MANUAL_SRCS_SQMAG_DIR)/%.h
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-
-$(TEST_FX_FFT_SRC_MANUAL): $(TEST_FX_FFT_DIR)/%.s: $(MANUAL_SRCS_FX_FFT_DIR)/%.s
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-$(TEST_FX_FFT_SRC_MANUAL): $(TEST_FX_FFT_DIR)/%.h: $(MANUAL_SRCS_FX_FFT_DIR)/%.h
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-
-$(TEST_FLT_FFT_SRC_MANUAL): $(TEST_FLT_FFT_DIR)/%.s: $(MANUAL_SRCS_FLT_FFT_DIR)/%.s
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-$(TEST_FLT_FFT_SRC_MANUAL): $(TEST_FLT_FFT_DIR)/%.h: $(MANUAL_SRCS_FLT_FFT_DIR)/%.h
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-
-$(TEST_CT_SRC_MANUAL): $(TEST_CT_DIR)/%.s: $(MANUAL_SRCS_CT_DIR)/%.s
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-$(TEST_CT_SRC_MANUAL): $(TEST_CT_DIR)/%.h: $(MANUAL_SRCS_CT_DIR)/%.h
-	mkdir -p $(@D)
-	echo "A $< B $@"
-	cp $< $@
-
-$(TEST_CHUNK_SRC_MANUAL): $(TEST_CHUNK_DIR)/%.s: $(MANUAL_SRCS_CHUNK_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_CHUNK_SRC_MANUAL): $(TEST_CHUNK_DIR)/%.h: $(MANUAL_SRCS_CHUNK_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_KARATSUBA_SRC_MANUAL): $(TEST_KARATSUBA_DIR)/%.s: $(MANUAL_SRCS_KARATSUBA_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_KARATSUBA_SRC_MANUAL): $(TEST_KARATSUBA_DIR)/%.h: $(MANUAL_SRCS_KARATSUBA_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_POLY_SRC_AUTO_TOOM4): $(TEST_POLY_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_AUTO_NTT_N256): $(TEST_POLY_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_N256_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_AUTO_SCHOOLBOOK): $(TEST_POLY_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_SIMD_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_MANUAL_KARATSUBA): $(TEST_POLY_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_KARATSUBA_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_MANUAL_KARATSUBA): $(TEST_POLY_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_KARATSUBA_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_MANUAL_SCHOOLBOOK): $(TEST_POLY_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_SCHOOLBOOK_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_MANUAL_SCHOOLBOOK): $(TEST_POLY_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_SCHOOLBOOK_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_MANUAL_MONTGOMERY): $(TEST_POLY_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_POLY_SRC_MANUAL_MONTGOMERY): $(TEST_POLY_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_SABER_SRC_AUTO_TOOM4): $(TEST_SABER_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_TOOM4_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_SABER_SRC_AUTO_NTT_N256): $(TEST_SABER_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_N256_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_SABER_SRC_AUTO_SCHOOLBOOK): $(TEST_SABER_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_POLY_SIMD_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_SABER_SRC_MANUAL_KARATSUBA): $(TEST_SABER_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_KARATSUBA_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_SABER_SRC_MANUAL_KARATSUBA): $(TEST_SABER_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_KARATSUBA_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_SABER_SRC_MANUAL_MONTGOMERY): $(TEST_SABER_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_SABER_SRC_MANUAL_MONTGOMERY): $(TEST_SABER_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-
-$(TEST_INTMULNTT_SRC_AUTO_NTT_384): $(TEST_INTMULNTT_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_384_DIR)/%.s
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_INTMULNTT_SRC_AUTO_NTT_384): $(TEST_INTMULNTT_SOURCES_AUTO_DIR)/%.h: $(AUTOGEN_SRCS_NTT_384_DIR)/%.h
-	mkdir -p $(@D)
-	cp $< $@
-$(TEST_INTMULNTT_SRC_AUTO_NTT_192):
$(TEST_INTMULNTT_SOURCES_AUTO_DIR)/%.s: $(AUTOGEN_SRCS_NTT_192_DIR)/%.s - mkdir -p $(@D) - cp $< $@ -$(TEST_INTMULNTT_SRC_AUTO_NTT_192): $(TEST_INTMULNTT_SOURCES_AUTO_DIR)/%.h: $(AUTOGEN_SRCS_NTT_192_DIR)/%.h - mkdir -p $(@D) - cp $< $@ -$(TEST_INTMULNTT_SRC_MANUAL_CRT): $(TEST_INTMULNTT_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_CRT_DIR)/%.s - mkdir -p $(@D) - cp $< $@ -$(TEST_INTMULNTT_SRC_MANUAL_CRT): $(TEST_INTMULNTT_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_CRT_DIR)/%.h - mkdir -p $(@D) - cp $< $@ -$(TEST_INTMULNTT_SRC_MANUAL_MONTGOMERY): $(TEST_INTMULNTT_SOURCES_MANUAL_DIR)/%.s: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.s - mkdir -p $(@D) - cp $< $@ -$(TEST_INTMULNTT_SRC_MANUAL_MONTGOMERY): $(TEST_INTMULNTT_SOURCES_MANUAL_DIR)/%.h: $(MANUAL_SRCS_MONTGOMERY_DIR)/%.h - mkdir -p $(@D) - cp $< $@ - -endif - -.PHONY: codegen -codegen: - make codegen -C $(CODEGEN_DIR) - -### -### M55-MPS3-FVP test environment -### - -# Toom test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_TOOM = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_toom -$(TEST_ENV_M55_MPS3_FVP_LINK_TOOM): $(TEST_POLY_TOOM_SRC_AUTO) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_POLY_TOOM_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-toom -build-m55-mps3-fvp-toom: $(TEST_ENV_M55_MPS3_FVP_LINK_TOOM) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-toom -run-m55-mps3-fvp-toom: $(TEST_ENV_M55_MPS3_FVP_LINK_TOOM) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-toom -debug-m55-mps3-fvp-toom: $(TEST_ENV_M55_MPS3_FVP_LINK_TOOM) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Schoolbook test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_SCHOOLBOOK = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_schoolbook -$(TEST_ENV_M55_MPS3_FVP_LINK_SCHOOLBOOK): $(TEST_POLY_SCHOOLBOOK_SRC_AUTO) - rm -f 
$(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_POLY_SCHOOLBOOK_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-schoolbook -build-m55-mps3-fvp-schoolbook: $(TEST_ENV_M55_MPS3_FVP_LINK_SCHOOLBOOK) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-schoolbook -run-m55-mps3-fvp-schoolbook: $(TEST_ENV_M55_MPS3_FVP_LINK_SCHOOLBOOK) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-schoolbook -debug-m55-mps3-fvp-schoolbook: $(TEST_ENV_M55_MPS3_FVP_LINK_SCHOOLBOOK) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Permute test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_PERMUTE = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_permute -$(TEST_ENV_M55_MPS3_FVP_LINK_PERMUTE): $(TEST_PERMUTE_SRC_AUTO) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_PERMUTE_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-permute -build-m55-mps3-fvp-permute: $(TEST_ENV_M55_MPS3_FVP_LINK_PERMUTE) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-permute -run-m55-mps3-fvp-permute: $(TEST_ENV_M55_MPS3_FVP_LINK_PERMUTE) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-permute -debug-m55-mps3-fvp-permute: $(TEST_ENV_M55_MPS3_FVP_LINK_PERMUTE) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Unpack test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_UNPACK = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_unpack -$(TEST_ENV_M55_MPS3_FVP_LINK_UNPACK): $(TEST_UNPACK_SRC_AUTO) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_UNPACK_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - 
touch $@ - -.PHONY: build-m55-mps3-fvp-unpack -build-m55-mps3-fvp-unpack: $(TEST_ENV_M55_MPS3_FVP_LINK_UNPACK) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-unpack -run-m55-mps3-fvp-unpack: $(TEST_ENV_M55_MPS3_FVP_LINK_UNPACK) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-unpack -debug-m55-mps3-fvp-unpack: $(TEST_ENV_M55_MPS3_FVP_LINK_UNPACK) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Transpose test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_TRANSPOSE = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_transpose -$(TEST_ENV_M55_MPS3_FVP_LINK_TRANSPOSE): $(TEST_TRANSPOSE_SRC_AUTO) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_TRANSPOSE_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-transpose -build-m55-mps3-fvp-transpose: $(TEST_ENV_M55_MPS3_FVP_LINK_TRANSPOSE) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-transpose -run-m55-mps3-fvp-transpose: $(TEST_ENV_M55_MPS3_FVP_LINK_TRANSPOSE) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-transpose -debug-m55-mps3-fvp-transpose: $(TEST_ENV_M55_MPS3_FVP_LINK_TRANSPOSE) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_N256 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_N256 = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_n256 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_N256): $(TEST_NTT_N256_SRC_AUTO) $(TEST_NTT_N256_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_N256_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_n256 -build-m55-mps3-fvp-ntt_n256: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_N256) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: 
run-m55-mps3-fvp-ntt_n256 -run-m55-mps3-fvp-ntt_n256: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_N256) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_n256 -debug-m55-mps3-fvp-ntt_n256: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_N256) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_256 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_256 = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_256 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_256): $(TEST_NTT_256_SRC_AUTO) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_256_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_256 -build-m55-mps3-fvp-ntt_256: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_256) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_256 -run-m55-mps3-fvp-ntt_256: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_256) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_256 -debug-m55-mps3-fvp-ntt_256: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_256) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_1024 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_1024 = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_1024 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_1024): $(TEST_NTT_1024_SRC_AUTO) $(TEST_NTT_1024_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_1024_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_1024 -build-m55-mps3-fvp-ntt_1024: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_1024) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_1024 -run-m55-mps3-fvp-ntt_1024: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_1024) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_1024 
-debug-m55-mps3-fvp-ntt_1024: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_1024) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_512 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_512 = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_512 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_512): $(TEST_NTT_512_SRC_AUTO) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_512_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_512 -build-m55-mps3-fvp-ntt_512: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_512) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_512 -run-m55-mps3-fvp-ntt_512: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_512) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_512 -debug-m55-mps3-fvp-ntt_512: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_512) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_768 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_768 = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_768 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_768): $(TEST_NTT_768_SRC_AUTO) $(TEST_NTT_768_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_768_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_768 -build-m55-mps3-fvp-ntt_768: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_768) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_768 -run-m55-mps3-fvp-ntt_768: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_768) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_768 -debug-m55-mps3-fvp-ntt_768: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_768) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_192 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_192 = 
$(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_192 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_192): $(TEST_NTT_192_SRC_AUTO) $(TEST_NTT_192_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_192_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_192 -build-m55-mps3-fvp-ntt_192: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_192) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_192 -run-m55-mps3-fvp-ntt_192: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_192) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_192 -debug-m55-mps3-fvp-ntt_192: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_192) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_384 test on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_384 = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_384 -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_384): $(TEST_NTT_384_SRC_AUTO) $(TEST_NTT_384_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_384_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_384 -build-m55-mps3-fvp-ntt_384: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_384) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_384 -run-m55-mps3-fvp-ntt_384: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_384) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_384 -debug-m55-mps3-fvp-ntt_384: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_384) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Montgomery multiplication on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_MONTGOMERY = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_montgomery -$(TEST_ENV_M55_MPS3_FVP_LINK_MONTGOMERY): $(TEST_MONTGOMERY_SRC_MANUAL) - rm -f 
$(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_MONTGOMERY_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-montgomery -build-m55-mps3-fvp-montgomery: $(TEST_ENV_M55_MPS3_FVP_LINK_MONTGOMERY) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-montgomery -run-m55-mps3-fvp-montgomery: $(TEST_ENV_M55_MPS3_FVP_LINK_MONTGOMERY) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-montgomery -debug-m55-mps3-fvp-montgomery: $(TEST_ENV_M55_MPS3_FVP_LINK_MONTGOMERY) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_KYBER on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_KYBER = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_kyber -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_KYBER): $(TEST_NTT_KYBER_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_KYBER_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_kyber -build-m55-mps3-fvp-ntt_kyber: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_KYBER) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_kyber -run-m55-mps3-fvp-ntt_kyber: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_KYBER) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_kyber -debug-m55-mps3-fvp-ntt_kyber: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_KYBER) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_DILITHIUM on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_NTT_DILITHIUM = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ntt_dilithium -$(TEST_ENV_M55_MPS3_FVP_LINK_NTT_DILITHIUM): $(TEST_NTT_DILITHIUM_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_NTT_DILITHIUM_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f 
$(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-ntt_dilithium -build-m55-mps3-fvp-ntt_dilithium: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_DILITHIUM) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ntt_dilithium -run-m55-mps3-fvp-ntt_dilithium: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_DILITHIUM) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ntt_dilithium -debug-m55-mps3-fvp-ntt_dilithium: $(TEST_ENV_M55_MPS3_FVP_LINK_NTT_DILITHIUM) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# NTT_KYBER on M55-CORE - -TEST_ENV_M55_CORE_LINK_NTT_KYBER = $(TEST_ENV_M55_CORE_BASE)/test_loaded_ntt_kyber -$(TEST_ENV_M55_CORE_LINK_NTT_KYBER): $(TEST_NTT_KYBER_SRC_MANUAL) - rm -f $(TEST_ENV_M55_CORE_SYMLINK) - ln -s ../../../$(TEST_NTT_KYBER_DIR) $(TEST_ENV_M55_CORE_SYMLINK) - rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_CORE_BASE) clean - touch $@ - -.PHONY: build-m55-core-ntt_kyber -build-m55-core-ntt_kyber: $(TEST_ENV_M55_CORE_LINK_NTT_KYBER) - make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc) - -.PHONY: run-m55-core-ntt_kyber -run-m55-core-ntt_kyber: $(TEST_ENV_M55_CORE_LINK_NTT_KYBER) - make run -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc) - -.PHONY: debug-m55-core-ntt_kyber -debug-m55-core-ntt_kyber: $(TEST_ENV_M55_CORE_LINK_NTT_KYBER) - make debug -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc) - -# NTT Kyber on M85-MPS3-FVP - -TEST_ENV_M85_AN555_LINK_NTT_KYBER = $(TEST_ENV_M85_AN555_BASE)/test_loaded_ntt_kyber -$(TEST_ENV_M85_AN555_LINK_NTT_KYBER): $(TEST_NTT_KYBER_SRC_MANUAL) - rm -f $(TEST_ENV_M85_AN555_SYMLINK) - ln -s ../../../$(TEST_NTT_KYBER_DIR) $(TEST_ENV_M85_AN555_SYMLINK) - rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_* - make -C $(TEST_ENV_M85_AN555_BASE) clean - touch $@ - -.PHONY: build-m85-an555-ntt_kyber -build-m85-an555-ntt_kyber: $(TEST_ENV_M85_AN555_LINK_NTT_KYBER) - make -C $(TEST_ENV_M85_AN555_BASE) 
-j$(nproc) - -.PHONY: run-m85-an555-ntt_kyber -run-m85-an555-ntt_kyber: $(TEST_ENV_M85_AN555_LINK_NTT_KYBER) - make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc) - -.PHONY: debug-m85-an555-ntt_kyber -debug-m85-an555-ntt_kyber: $(TEST_ENV_M85_AN555_LINK_NTT_KYBER) - make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc) - -# NTT Kyber on M55-MPS3-FVP - -TEST_ENV_M55_AN547_LINK_NTT_KYBER = $(TEST_ENV_M55_AN547_BASE)/test_loaded_ntt_kyber -$(TEST_ENV_M55_AN547_LINK_NTT_KYBER): $(TEST_NTT_KYBER_SRC_MANUAL) - rm -f $(TEST_ENV_M55_AN547_SYMLINK) - ln -s ../../../$(TEST_NTT_KYBER_DIR) $(TEST_ENV_M55_AN547_SYMLINK) - rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_AN547_BASE) clean - touch $@ - -.PHONY: build-m55-an547-ntt_kyber -build-m55-an547-ntt_kyber: $(TEST_ENV_M55_AN547_LINK_NTT_KYBER) - make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc) - -.PHONY: run-m55-an547-ntt_kyber -run-m55-an547-ntt_kyber: $(TEST_ENV_M55_AN547_LINK_NTT_KYBER) - make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc) - -.PHONY: debug-m55-an547-ntt_kyber -debug-m55-an547-ntt_kyber: $(TEST_ENV_M55_AN547_LINK_NTT_KYBER) - make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc) - -# NTT_DILITHIUM on M55-CORE - -TEST_ENV_M55_CORE_LINK_NTT_DILITHIUM = $(TEST_ENV_M55_CORE_BASE)/test_loaded_ntt_dilithium -$(TEST_ENV_M55_CORE_LINK_NTT_DILITHIUM): $(TEST_NTT_DILITHIUM_SRC_MANUAL) - rm -f $(TEST_ENV_M55_CORE_SYMLINK) - ln -s ../../../$(TEST_NTT_DILITHIUM_DIR) $(TEST_ENV_M55_CORE_SYMLINK) - rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_CORE_BASE) clean - touch $@ - -.PHONY: build-m55-core-ntt_dilithium -build-m55-core-ntt_dilithium: $(TEST_ENV_M55_CORE_LINK_NTT_DILITHIUM) - make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc) - -.PHONY: run-m55-core-ntt_dilithium -run-m55-core-ntt_dilithium: $(TEST_ENV_M55_CORE_LINK_NTT_DILITHIUM) - make run -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc) - -.PHONY: debug-m55-core-ntt_dilithium -debug-m55-core-ntt_dilithium: 
$(TEST_ENV_M55_CORE_LINK_NTT_DILITHIUM) - make debug -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc) - -# NTT_DILITHIUM on M85-MPS3-FVP - -TEST_ENV_M85_AN555_LINK_NTT_DILITHIUM = $(TEST_ENV_M85_AN555_BASE)/test_loaded_ntt_dilithium -$(TEST_ENV_M85_AN555_LINK_NTT_DILITHIUM): $(TEST_NTT_DILITHIUM_SRC_MANUAL) - rm -f $(TEST_ENV_M85_AN555_SYMLINK) - ln -s ../../../$(TEST_NTT_DILITHIUM_DIR) $(TEST_ENV_M85_AN555_SYMLINK) - rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_* - make -C $(TEST_ENV_M85_AN555_BASE) clean - touch $@ - -.PHONY: build-m85-an555-ntt_dilithium -build-m85-an555-ntt_dilithium: $(TEST_ENV_M85_AN555_LINK_NTT_DILITHIUM) - make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc) - -.PHONY: run-m85-an555-ntt_dilithium -run-m85-an555-ntt_dilithium: $(TEST_ENV_M85_AN555_LINK_NTT_DILITHIUM) - make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc) - -.PHONY: debug-m85-an555-ntt_dilithium -debug-m85-an555-ntt_dilithium: $(TEST_ENV_M85_AN555_LINK_NTT_DILITHIUM) - make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc) - -# NTT Dilithium on M55-MPS3-FVP - -TEST_ENV_M55_AN547_LINK_NTT_DILITHIUM = $(TEST_ENV_M55_AN547_BASE)/test_loaded_ntt_dilithium -$(TEST_ENV_M55_AN547_LINK_NTT_DILITHIUM): $(TEST_NTT_DILITHIUM_SRC_MANUAL) - rm -f $(TEST_ENV_M55_AN547_SYMLINK) - ln -s ../../../$(TEST_NTT_DILITHIUM_DIR) $(TEST_ENV_M55_AN547_SYMLINK) - rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_AN547_BASE) clean - touch $@ - -.PHONY: build-m55-an547-ntt_dilithium -build-m55-an547-ntt_dilithium: $(TEST_ENV_M55_AN547_LINK_NTT_DILITHIUM) - make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc) - -.PHONY: run-m55-an547-ntt_dilithium -run-m55-an547-ntt_dilithium: $(TEST_ENV_M55_AN547_LINK_NTT_DILITHIUM) - make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc) - -.PHONY: debug-m55-an547-ntt_dilithium -debug-m55-an547-ntt_dilithium: $(TEST_ENV_M55_AN547_LINK_NTT_DILITHIUM) - make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc) - -# CRT on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_CRT = 
$(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_crt -$(TEST_ENV_M55_MPS3_FVP_LINK_CRT): $(TEST_CRT_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_CRT_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ +all: ${builds} -.PHONY: build-m55-mps3-fvp-crt -build-m55-mps3-fvp-crt: $(TEST_ENV_M55_MPS3_FVP_LINK_CRT) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) +platform = $(firstword $(subst --, ,$*)) +test = $(lastword $(subst --, ,$*)) -.PHONY: run-m55-mps3-fvp-crt -run-m55-mps3-fvp-crt: $(TEST_ENV_M55_MPS3_FVP_LINK_CRT) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) -.PHONY: debug-m55-mps3-fvp-crt -debug-m55-mps3-fvp-crt: $(TEST_ENV_M55_MPS3_FVP_LINK_CRT) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) +.PHONY: ${builds} +${builds}: build-%: + make -j$(nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' -# CT on M55-MPS3-FVP +.PHONY: ${runs} +${runs}: run-%: + make -C envs/$(platform) run SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' -TEST_ENV_M55_MPS3_FVP_LINK_CT = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_ct -$(TEST_ENV_M55_MPS3_FVP_LINK_CT): $(TEST_CT_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_CT_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ +.PHONY: ${cleans} +${cleans}: clean-%: + make -C envs/$(platform) clean -.PHONY: build-m55-mps3-fvp-ct -build-m55-mps3-fvp-ct: $(TEST_ENV_M55_MPS3_FVP_LINK_CT) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-ct -run-m55-mps3-fvp-ct: $(TEST_ENV_M55_MPS3_FVP_LINK_CT) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-ct -debug-m55-mps3-fvp-ct: $(TEST_ENV_M55_MPS3_FVP_LINK_CT) - make debug -C 
$(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Chunk on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_CHUNK = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_chunk -$(TEST_ENV_M55_MPS3_FVP_LINK_CHUNK): $(TEST_CHUNK_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_CHUNK_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-chunk -build-m55-mps3-fvp-chunk: $(TEST_ENV_M55_MPS3_FVP_LINK_CHUNK) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-chunk -run-m55-mps3-fvp-chunk: $(TEST_ENV_M55_MPS3_FVP_LINK_CHUNK) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-chunk -debug-m55-mps3-fvp-chunk: $(TEST_ENV_M55_MPS3_FVP_LINK_CHUNK) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Karatsuba multiplication on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_KARATSUBA = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_karatsuba -$(TEST_ENV_M55_MPS3_FVP_LINK_KARATSUBA): $(TEST_KARATSUBA_SRC_MANUAL) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_KARATSUBA_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-karatsuba -build-m55-mps3-fvp-karatsuba: $(TEST_ENV_M55_MPS3_FVP_LINK_KARATSUBA) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-karatsuba -run-m55-mps3-fvp-karatsuba: $(TEST_ENV_M55_MPS3_FVP_LINK_KARATSUBA) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-karatsuba -debug-m55-mps3-fvp-karatsuba: $(TEST_ENV_M55_MPS3_FVP_LINK_KARATSUBA) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Poly multiplication on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_POLY = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_poly -$(TEST_ENV_M55_MPS3_FVP_LINK_POLY): 
$(TEST_POLY_SRC_MANUAL_KARATSUBA) $(TEST_POLY_SRC_MANUAL_SCHOOLBOOK) $(TEST_POLY_SRC_MANUAL_MONTGOMERY) $(TEST_POLY_SRC_AUTO_TOOM4) $(TEST_POLY_SRC_AUTO_SCHOOLBOOK) $(TEST_POLY_SRC_AUTO_NTT_N256) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_POLY_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - -.PHONY: build-m55-mps3-fvp-poly -build-m55-mps3-fvp-poly: $(TEST_ENV_M55_MPS3_FVP_LINK_POLY) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-poly -run-m55-mps3-fvp-poly: $(TEST_ENV_M55_MPS3_FVP_LINK_POLY) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-poly -debug-m55-mps3-fvp-poly: $(TEST_ENV_M55_MPS3_FVP_LINK_POLY) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Saber on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_SABER = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_saber -$(TEST_ENV_M55_MPS3_FVP_LINK_SABER): $(TEST_SABER_SRC_MANUAL_KARATSUBA) $(TEST_SABER_SRC_MANUAL_MONTGOMERY) $(TEST_SABER_SRC_AUTO_TOOM4) $(TEST_SABER_SRC_AUTO_SCHOOLBOOK) $(TEST_SABER_SRC_AUTO_NTT_N256) - rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - ln -s ../../../$(TEST_SABER_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK) - rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_* - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean - touch $@ - - -.PHONY: build-m55-mps3-fvp-saber -build-m55-mps3-fvp-saber: $(TEST_ENV_M55_MPS3_FVP_LINK_SABER) - make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: run-m55-mps3-fvp-saber -run-m55-mps3-fvp-saber: $(TEST_ENV_M55_MPS3_FVP_LINK_SABER) - make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -.PHONY: debug-m55-mps3-fvp-saber -debug-m55-mps3-fvp-saber: $(TEST_ENV_M55_MPS3_FVP_LINK_SABER) - make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc) - -# Intmulntt on M55-MPS3-FVP - -TEST_ENV_M55_MPS3_FVP_LINK_INTMULNTT = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_intmulntt 
-$(TEST_ENV_M55_MPS3_FVP_LINK_INTMULNTT): $(TEST_INTMULNTT_SRC_MANUAL_MONTGOMERY) $(TEST_INTMULNTT_SRC_MANUAL_CRT) $(TEST_INTMULNTT_SRC_AUTO_NTT_384) $(TEST_INTMULNTT_SRC_AUTO_NTT_192)
-	rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK)
-	ln -s ../../../$(TEST_INTMULNTT_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean
-	touch $@
-
-
-.PHONY: build-m55-mps3-fvp-intmulntt
-build-m55-mps3-fvp-intmulntt: $(TEST_ENV_M55_MPS3_FVP_LINK_INTMULNTT)
-	make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps3-fvp-intmulntt
-run-m55-mps3-fvp-intmulntt: $(TEST_ENV_M55_MPS3_FVP_LINK_INTMULNTT)
-	make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps3-fvp-intmulntt
-debug-m55-mps3-fvp-intmulntt: $(TEST_ENV_M55_MPS3_FVP_LINK_INTMULNTT)
-	make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc)
-
-# Template on M55-MPS3-FVP
-
-TEST_ENV_M55_MPS3_FVP_LINK_HELLOWORLD = $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_helloworld
-$(TEST_ENV_M55_MPS3_FVP_LINK_HELLOWORLD):
-	rm -f $(TEST_ENV_M55_MPS3_FVP_SYMLINK)
-	ln -s ../../../$(TEST_HELLOWORLD_DIR) $(TEST_ENV_M55_MPS3_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS3_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS3_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps3-fvp-helloworld
-build-m55-mps3-fvp-helloworld: $(TEST_ENV_M55_MPS3_FVP_LINK_HELLOWORLD)
-	make -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps3-fvp-helloworld
-run-m55-mps3-fvp-helloworld: $(TEST_ENV_M55_MPS3_FVP_LINK_HELLOWORLD)
-	make run -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps3-fvp-helloworld
-debug-m55-mps3-fvp-helloworld: $(TEST_ENV_M55_MPS3_FVP_LINK_HELLOWORLD)
-	make debug -C $(TEST_ENV_M55_MPS3_FVP_BASE) -j$(nproc)
-
-# Template on M85-AN555
-
-TEST_ENV_M85_AN555_LINK_HELLOWORLD = $(TEST_ENV_M85_AN555_BASE)/test_loaded_helloworld
-$(TEST_ENV_M85_AN555_LINK_HELLOWORLD):
-	rm -f $(TEST_ENV_M85_AN555_SYMLINK)
-	ln -s ../../../$(TEST_HELLOWORLD_DIR) $(TEST_ENV_M85_AN555_SYMLINK)
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M85_AN555_BASE) clean
-	touch $@
-
-.PHONY: build-m85-an555-helloworld
-build-m85-an555-helloworld: $(TEST_ENV_M85_AN555_LINK_HELLOWORLD)
-	make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: run-m85-an555-helloworld
-run-m85-an555-helloworld: $(TEST_ENV_M85_AN555_LINK_HELLOWORLD)
-	make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: debug-m85-an555-helloworld
-debug-m85-an555-helloworld: $(TEST_ENV_M85_AN555_LINK_HELLOWORLD)
-	make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-# Template on M55-AN547
-
-TEST_ENV_M55_AN547_LINK_HELLOWORLD = $(TEST_ENV_M55_AN547_BASE)/test_loaded_helloworld
-$(TEST_ENV_M55_AN547_LINK_HELLOWORLD):
-	rm -f $(TEST_ENV_M55_AN547_SYMLINK)
-	ln -s ../../../$(TEST_HELLOWORLD_DIR) $(TEST_ENV_M55_AN547_SYMLINK)
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_AN547_BASE) clean
-	touch $@
-
-.PHONY: build-m55-an547-helloworld
-build-m55-an547-helloworld: $(TEST_ENV_M55_AN547_LINK_HELLOWORLD)
-	make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: run-m55-an547-helloworld
-run-m55-an547-helloworld: $(TEST_ENV_M55_AN547_LINK_HELLOWORLD)
-	make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: debug-m55-an547-helloworld
-debug-m55-an547-helloworld: $(TEST_ENV_M55_AN547_LINK_HELLOWORLD)
-	make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-# Profiler on M85-AN555
-
-TEST_ENV_M85_AN555_LINK_PROFILING = $(TEST_ENV_M85_AN555_BASE)/test_loaded_profiling
-$(TEST_ENV_M85_AN555_LINK_PROFILING):
-	rm -f $(TEST_ENV_M85_AN555_SYMLINK)
-	ln -s ../../../$(TEST_PROFILING_DIR) $(TEST_ENV_M85_AN555_SYMLINK)
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M85_AN555_BASE) clean
-	touch $@
-
-.PHONY: build-m85-an555-profiling
-build-m85-an555-profiling: $(TEST_ENV_M85_AN555_LINK_PROFILING)
-	make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: run-m85-an555-profiling
-run-m85-an555-profiling: $(TEST_ENV_M85_AN555_LINK_PROFILING)
-	make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: debug-m85-an555-profiling
-debug-m85-an555-profiling: $(TEST_ENV_M85_AN555_LINK_PROFILING)
-	make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-# Profiler on M55-AN547
-
-TEST_ENV_M55_AN547_LINK_PROFILING = $(TEST_ENV_M55_AN547_BASE)/test_loaded_profiling
-$(TEST_ENV_M55_AN547_LINK_PROFILING):
-	rm -f $(TEST_ENV_M55_AN547_SYMLINK)
-	ln -s ../../../$(TEST_PROFILING_DIR) $(TEST_ENV_M55_AN547_SYMLINK)
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_AN547_BASE) clean
-	touch $@
-
-.PHONY: build-m55-an547-profiling
-build-m55-an547-profiling: $(TEST_ENV_M55_AN547_LINK_PROFILING)
-	make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: run-m55-an547-profiling
-run-m55-an547-profiling: $(TEST_ENV_M55_AN547_LINK_PROFILING)
-	make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: debug-m55-an547-profiling
-debug-m55-an547-profiling: $(TEST_ENV_M55_AN547_LINK_PROFILING)
-	make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-# SQMAG on M55-MPS3-FVP
-
-TEST_ENV_M85_AN555_LINK_SQMAG = $(TEST_ENV_M85_AN555_BASE)/test_loaded_sqmag
-$(TEST_ENV_M85_AN555_LINK_SQMAG): $(TEST_SQMAG_SRC_MANUAL)
-	rm -f $(TEST_ENV_M85_AN555_SYMLINK)
-	ln -s ../../../$(TEST_SQMAG_DIR) $(TEST_ENV_M85_AN555_SYMLINK)
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M85_AN555_BASE) clean
-	touch $@
-
-.PHONY: build-m85-an555-sqmag
-build-m85-an555-sqmag: $(TEST_ENV_M85_AN555_LINK_SQMAG)
-	make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: run-m85-an555-sqmag
-run-m85-an555-sqmag: $(TEST_ENV_M85_AN555_LINK_SQMAG)
-	make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: debug-m85-an555-sqmag
-debug-m85-an555-sqmag: $(TEST_ENV_M85_AN555_LINK_SQMAG)
-	make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-# SQMAG on M55-MPS3-FVP
-
-TEST_ENV_M55_AN547_LINK_SQMAG = $(TEST_ENV_M55_AN547_BASE)/test_loaded_sqmag
-$(TEST_ENV_M55_AN547_LINK_SQMAG): $(TEST_SQMAG_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_AN547_SYMLINK)
-	ln -s ../../../$(TEST_SQMAG_DIR) $(TEST_ENV_M55_AN547_SYMLINK)
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_AN547_BASE) clean
-	touch $@
-
-.PHONY: build-m55-an547-sqmag
-build-m55-an547-sqmag: $(TEST_ENV_M55_AN547_LINK_SQMAG)
-	make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: run-m55-an547-sqmag
-run-m55-an547-sqmag: $(TEST_ENV_M55_AN547_LINK_SQMAG)
-	make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: debug-m55-an547-sqmag
-debug-m55-an547-sqmag: $(TEST_ENV_M55_AN547_LINK_SQMAG)
-	make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-
-# FX_FFT on M55-MPS3-FVP
-
-TEST_ENV_M55_AN547_LINK_FX_FFT = $(TEST_ENV_M55_AN547_BASE)/test_loaded_fx_fft
-$(TEST_ENV_M55_AN547_LINK_FX_FFT): $(TEST_FX_FFT_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_AN547_SYMLINK)
-	ln -s ../../../$(TEST_FX_FFT_DIR) $(TEST_ENV_M55_AN547_SYMLINK)
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_AN547_BASE) clean
-	touch $@
-
-.PHONY: build-m55-an547-fx_fft
-build-m55-an547-fx_fft: $(TEST_ENV_M55_AN547_LINK_FX_FFT)
-	make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: run-m55-an547-fx_fft
-run-m55-an547-fx_fft: $(TEST_ENV_M55_AN547_LINK_FX_FFT)
-	make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: debug-m55-an547-fx_fft
-debug-m55-an547-fx_fft: $(TEST_ENV_M55_AN547_LINK_FX_FFT)
-	make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-# FX_FFT on M85-MPS3-FVP
-
-TEST_ENV_M85_AN555_LINK_FX_FFT = $(TEST_ENV_M85_AN555_BASE)/test_loaded_fx_fft
-$(TEST_ENV_M85_AN555_LINK_FX_FFT): $(TEST_FX_FFT_SRC_MANUAL)
-	rm -f $(TEST_ENV_M85_AN555_SYMLINK)
-	ln -s ../../../$(TEST_FX_FFT_DIR) $(TEST_ENV_M85_AN555_SYMLINK)
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M85_AN555_BASE) clean
-	touch $@
-
-.PHONY: build-m85-an555-fx_fft
-build-m85-an555-fx_fft: $(TEST_ENV_M85_AN555_LINK_FX_FFT)
-	make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: run-m85-an555-fx_fft
-run-m85-an555-fx_fft: $(TEST_ENV_M85_AN555_LINK_FX_FFT)
-	make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: debug-m85-an555-fx_fft
-debug-m85-an555-fx_fft: $(TEST_ENV_M85_AN555_LINK_FX_FFT)
-	make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-# FLT_FFT on M55-MPS3-FVP
-
-TEST_ENV_M55_AN547_LINK_FLT_FFT = $(TEST_ENV_M55_AN547_BASE)/test_loaded_flt_fft
-$(TEST_ENV_M55_AN547_LINK_FLT_FFT): $(TEST_FLT_FFT_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_AN547_SYMLINK)
-	ln -s ../../../$(TEST_FLT_FFT_DIR) $(TEST_ENV_M55_AN547_SYMLINK)
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_AN547_BASE) clean
-	touch $@
-
-.PHONY: build-m55-an547-flt_fft
-build-m55-an547-flt_fft: $(TEST_ENV_M55_AN547_LINK_FLT_FFT)
-	make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: run-m55-an547-flt_fft
-run-m55-an547-flt_fft: $(TEST_ENV_M55_AN547_LINK_FLT_FFT)
-	make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: debug-m55-an547-flt_fft
-debug-m55-an547-flt_fft: $(TEST_ENV_M55_AN547_LINK_FLT_FFT)
-	make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-# FLT_FFT on M85-MPS3-FVP
-
-TEST_ENV_M85_AN555_LINK_FLT_FFT = $(TEST_ENV_M85_AN555_BASE)/test_loaded_flt_fft
-$(TEST_ENV_M85_AN555_LINK_FLT_FFT): $(TEST_FLT_FFT_SRC_MANUAL)
-	rm -f $(TEST_ENV_M85_AN555_SYMLINK)
-	ln -s ../../../$(TEST_FLT_FFT_DIR) $(TEST_ENV_M85_AN555_SYMLINK)
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M85_AN555_BASE) clean
-	touch $@
-
-.PHONY: build-m85-an555-flt_fft
-build-m85-an555-flt_fft: $(TEST_ENV_M85_AN555_LINK_FLT_FFT)
-	make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: run-m85-an555-flt_fft
-run-m85-an555-flt_fft: $(TEST_ENV_M85_AN555_LINK_FLT_FFT)
-	make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: debug-m85-an555-flt_fft
-debug-m85-an555-flt_fft: $(TEST_ENV_M85_AN555_LINK_FLT_FFT)
-	make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-# FLT_FFT on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_FLT_FFT = $(TEST_ENV_M55_CORE_BASE)/test_loaded_flt_fft
-$(TEST_ENV_M55_CORE_LINK_FLT_FFT): $(TEST_FLT_FFT_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_FLT_FFT_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-flt_fft
-build-m55-core-flt_fft: $(TEST_ENV_M55_CORE_LINK_FLT_FFT)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-.PHONY: run-m55-core-flt_fft
-run-m55-core-flt_fft: $(TEST_ENV_M55_CORE_LINK_FLT_FFT)
-	make run -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-.PHONY: debug-m55-core-flt_fft
-debug-m55-core-flt_fft: $(TEST_ENV_M55_CORE_LINK_FLT_FFT)
-	make debug -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# FX_FFT on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_FX_FFT = $(TEST_ENV_M55_CORE_BASE)/test_loaded_fx_fft
-$(TEST_ENV_M55_CORE_LINK_FX_FFT): $(TEST_FX_FFT_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_FX_FFT_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-fx_fft
-build-m55-core-fx_fft: $(TEST_ENV_M55_CORE_LINK_FX_FFT)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-.PHONY: run-m55-core-fx_fft
-run-m55-core-fx_fft: $(TEST_ENV_M55_CORE_LINK_FX_FFT)
-	make run -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-.PHONY: debug-m55-core-fx_fft
-debug-m55-core-fx_fft: $(TEST_ENV_M55_CORE_LINK_FX_FFT)
-	make debug -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-
-# NTT_N256 on M55-MPS3-AN547
-TEST_ENV_M55_AN547_LINK_NTT_N256 = $(TEST_ENV_M55_AN547_BASE)/test_loaded_ntt_n256
-$(TEST_ENV_M55_AN547_LINK_NTT_N256): $(TEST_NTT_N256_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_AN547_SYMLINK)
-	ln -s ../../../$(TEST_NTT_N256_DIR) $(TEST_ENV_M55_AN547_SYMLINK)
-	rm -f $(TEST_ENV_M55_AN547_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_AN547_BASE) clean
-	touch $@
-
-.PHONY: build-m55-an547-ntt_n256
-build-m55-an547-ntt_n256: $(TEST_ENV_M55_AN547_LINK_NTT_N256)
-	make -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: run-m55-an547-ntt_n256
-run-m55-an547-ntt_n256: $(TEST_ENV_M55_AN547_LINK_NTT_N256)
-	make run -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-.PHONY: debug-m55-an547-ntt_n256
-debug-m55-an547-ntt_n256: $(TEST_ENV_M55_AN547_LINK_NTT_N256)
-	make debug -C $(TEST_ENV_M55_AN547_BASE) -j$(nproc)
-
-# NTT_N256 on M85-MPS3-AN555
-TEST_ENV_M85_AN555_LINK_NTT_N256 = $(TEST_ENV_M85_AN555_BASE)/test_loaded_ntt_n256
-$(TEST_ENV_M85_AN555_LINK_NTT_N256): $(TEST_NTT_N256_SRC_MANUAL)
-	rm -f $(TEST_ENV_M85_AN555_SYMLINK)
-	ln -s ../../../$(TEST_NTT_N256_DIR) $(TEST_ENV_M85_AN555_SYMLINK)
-	rm -f $(TEST_ENV_M85_AN555_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M85_AN555_BASE) clean
-	touch $@
-
-.PHONY: build-m85-an555-ntt_n256
-build-m85-an555-ntt_n256: $(TEST_ENV_M85_AN555_LINK_NTT_N256)
-	make -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: run-m85-an555-ntt_n256
-run-m85-an555-ntt_n256: $(TEST_ENV_M85_AN555_LINK_NTT_N256)
-	make run -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-
-.PHONY: debug-m85-an555-ntt_n256
-debug-m85-an555-ntt_n256: $(TEST_ENV_M85_AN555_LINK_NTT_N256)
-	make debug -C $(TEST_ENV_M85_AN555_BASE) -j$(nproc)
-###
-### M55-MPS2-FVP test environment
-###
-
-# Toom test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_TOOM = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_toom
-$(TEST_ENV_M55_MPS2_FVP_LINK_TOOM): $(TEST_POLY_TOOM_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_POLY_TOOM_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-toom
-build-m55-mps2-fvp-toom: $(TEST_ENV_M55_MPS2_FVP_LINK_TOOM)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-toom
-run-m55-mps2-fvp-toom: $(TEST_ENV_M55_MPS2_FVP_LINK_TOOM)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-toom
-debug-m55-mps2-fvp-toom: $(TEST_ENV_M55_MPS2_FVP_LINK_TOOM)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Schoolbook test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_SCHOOLBOOK = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_schoolbook
-$(TEST_ENV_M55_MPS2_FVP_LINK_SCHOOLBOOK): $(TEST_POLY_SCHOOLBOOK_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_POLY_SCHOOLBOOK_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-schoolbook
-build-m55-mps2-fvp-schoolbook: $(TEST_ENV_M55_MPS2_FVP_LINK_SCHOOLBOOK)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-schoolbook
-run-m55-mps2-fvp-schoolbook: $(TEST_ENV_M55_MPS2_FVP_LINK_SCHOOLBOOK)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-schoolbook
-debug-m55-mps2-fvp-schoolbook: $(TEST_ENV_M55_MPS2_FVP_LINK_SCHOOLBOOK)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Permute test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_PERMUTE = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_permute
-$(TEST_ENV_M55_MPS2_FVP_LINK_PERMUTE): $(TEST_PERMUTE_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_PERMUTE_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-permute
-build-m55-mps2-fvp-permute: $(TEST_ENV_M55_MPS2_FVP_LINK_PERMUTE)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-permute
-run-m55-mps2-fvp-permute: $(TEST_ENV_M55_MPS2_FVP_LINK_PERMUTE)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-permute
-debug-m55-mps2-fvp-permute: $(TEST_ENV_M55_MPS2_FVP_LINK_PERMUTE)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Transpose test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_TRANSPOSE = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_transpose
-$(TEST_ENV_M55_MPS2_FVP_LINK_TRANSPOSE): $(TEST_TRANSPOSE_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_TRANSPOSE_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-transpose
-build-m55-mps2-fvp-transpose: $(TEST_ENV_M55_MPS2_FVP_LINK_TRANSPOSE)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-transpose
-run-m55-mps2-fvp-transpose: $(TEST_ENV_M55_MPS2_FVP_LINK_TRANSPOSE)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-transpose
-debug-m55-mps2-fvp-transpose: $(TEST_ENV_M55_MPS2_FVP_LINK_TRANSPOSE)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# NTT_N256 test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_NTT_N256 = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_ntt_n256
-$(TEST_ENV_M55_MPS2_FVP_LINK_NTT_N256): $(TEST_NTT_N256_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_NTT_N256_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-ntt_n256
-build-m55-mps2-fvp-ntt_n256: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_N256)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-ntt_n256
-run-m55-mps2-fvp-ntt_n256: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_N256)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-ntt_n256
-debug-m55-mps2-fvp-ntt_n256: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_N256)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# NTT_256 test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_NTT_256 = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_ntt_256
-$(TEST_ENV_M55_MPS2_FVP_LINK_NTT_256): $(TEST_NTT_256_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_NTT_256_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-ntt_256
-build-m55-mps2-fvp-ntt_256: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_256)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-ntt_256
-run-m55-mps2-fvp-ntt_256: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_256)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-ntt_256
-debug-m55-mps2-fvp-ntt_256: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_256)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# NTT_1024 test on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_NTT_1024 = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_ntt_1024
-$(TEST_ENV_M55_MPS2_FVP_LINK_NTT_1024): $(TEST_NTT_1024_SRC_AUTO) $(TEST_NTT_1024_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_NTT_1024_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-ntt_1024
-build-m55-mps2-fvp-ntt_1024: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_1024)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-ntt_1024
-run-m55-mps2-fvp-ntt_1024: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_1024)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-ntt_1024
-debug-m55-mps2-fvp-ntt_1024: $(TEST_ENV_M55_MPS2_FVP_LINK_NTT_1024)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-
-# Montgomery multiplication on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_MONTGOMERY = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_montgomery
-$(TEST_ENV_M55_MPS2_FVP_LINK_MONTGOMERY): $(TEST_MONTGOMERY_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_MONTGOMERY_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-montgomery
-build-m55-mps2-fvp-montgomery: $(TEST_ENV_M55_MPS2_FVP_LINK_MONTGOMERY)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-montgomery
-run-m55-mps2-fvp-montgomery: $(TEST_ENV_M55_MPS2_FVP_LINK_MONTGOMERY)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-montgomery
-debug-m55-mps2-fvp-montgomery: $(TEST_ENV_M55_MPS2_FVP_LINK_MONTGOMERY)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Karatsuba multiplication on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_KARATSUBA = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_karatsuba
-$(TEST_ENV_M55_MPS2_FVP_LINK_KARATSUBA): $(TEST_KARATSUBA_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_KARATSUBA_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-karatsuba
-build-m55-mps2-fvp-karatsuba: $(TEST_ENV_M55_MPS2_FVP_LINK_KARATSUBA)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-karatsuba
-run-m55-mps2-fvp-karatsuba: $(TEST_ENV_M55_MPS2_FVP_LINK_KARATSUBA)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-karatsuba
-debug-m55-mps2-fvp-karatsuba: $(TEST_ENV_M55_MPS2_FVP_LINK_KARATSUBA)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Poly multiplication on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_POLY = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_poly
-$(TEST_ENV_M55_MPS2_FVP_LINK_POLY): $(TEST_POLY_SRC_MANUAL_KARATSUBA) $(TEST_POLY_SRC_MANUAL_SCHOOLBOOK) $(TEST_POLY_SRC_MANUAL_MONTGOMERY) $(TEST_POLY_SRC_AUTO_TOOM4) $(TEST_POLY_SRC_AUTO_SCHOOLBOOK) $(TEST_POLY_SRC_AUTO_NTT_N256)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_POLY_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-poly
-build-m55-mps2-fvp-poly: $(TEST_ENV_M55_MPS2_FVP_LINK_POLY)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-poly
-run-m55-mps2-fvp-poly: $(TEST_ENV_M55_MPS2_FVP_LINK_POLY)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-poly
-debug-m55-mps2-fvp-poly: $(TEST_ENV_M55_MPS2_FVP_LINK_POLY)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Saber on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_SABER = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_saber
-$(TEST_ENV_M55_MPS2_FVP_LINK_SABER): $(TEST_SABER_SRC_MANUAL_KARATSUBA) $(TEST_SABER_SRC_MANUAL_MONTGOMERY) $(TEST_SABER_SRC_AUTO_TOOM4) $(TEST_SABER_SRC_AUTO_SCHOOLBOOK) $(TEST_SABER_SRC_AUTO_NTT_N256)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_SABER_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-
-.PHONY: build-m55-mps2-fvp-saber
-build-m55-mps2-fvp-saber: $(TEST_ENV_M55_MPS2_FVP_LINK_SABER)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-saber
-run-m55-mps2-fvp-saber: $(TEST_ENV_M55_MPS2_FVP_LINK_SABER)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-saber
-debug-m55-mps2-fvp-saber: $(TEST_ENV_M55_MPS2_FVP_LINK_SABER)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-# Template on M55-MPS2-FVP
-
-TEST_ENV_M55_MPS2_FVP_LINK_HELLOWORLD = $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_helloworld
-$(TEST_ENV_M55_MPS2_FVP_LINK_HELLOWORLD):
-	rm -f $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	ln -s ../../../$(TEST_HELLOWORLD_DIR) $(TEST_ENV_M55_MPS2_FVP_SYMLINK)
-	rm -f $(TEST_ENV_M55_MPS2_FVP_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) clean
-	touch $@
-
-.PHONY: build-m55-mps2-fvp-helloworld
-build-m55-mps2-fvp-helloworld: $(TEST_ENV_M55_MPS2_FVP_LINK_HELLOWORLD)
-	make -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: run-m55-mps2-fvp-helloworld
-run-m55-mps2-fvp-helloworld: $(TEST_ENV_M55_MPS2_FVP_LINK_HELLOWORLD)
-	make run -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-.PHONY: debug-m55-mps2-fvp-helloworld
-debug-m55-mps2-fvp-helloworld: $(TEST_ENV_M55_MPS2_FVP_LINK_HELLOWORLD)
-	make debug -C $(TEST_ENV_M55_MPS2_FVP_BASE) -j$(nproc)
-
-###
-### Core test environment
-###
-
-# Toom test on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_TOOM = $(TEST_ENV_M55_CORE_BASE)/test_loaded_toom
-$(TEST_ENV_M55_CORE_LINK_TOOM): $(TEST_POLY_TOOM_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_POLY_TOOM_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-toom
-build-m55-core-toom: $(TEST_ENV_M55_CORE_LINK_TOOM)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Schoolbook test on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_SCHOOLBOOK = $(TEST_ENV_M55_CORE_BASE)/test_loaded_schoolbook
-$(TEST_ENV_M55_CORE_LINK_SCHOOLBOOK): $(TEST_POLY_SCHOOLBOOK_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_POLY_SCHOOLBOOK_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-schoolbook
-build-m55-core-schoolbook: $(TEST_ENV_M55_CORE_LINK_SCHOOLBOOK)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Permute test on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_PERMUTE = $(TEST_ENV_M55_CORE_BASE)/test_loaded_permute
-$(TEST_ENV_M55_CORE_LINK_PERMUTE): $(TEST_PERMUTE_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_PERMUTE_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-permute
-build-m55-core-permute: $(TEST_ENV_M55_CORE_LINK_PERMUTE)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Transpose test on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_TRANSPOSE = $(TEST_ENV_M55_CORE_BASE)/test_loaded_transpose
-$(TEST_ENV_M55_CORE_LINK_TRANSPOSE): $(TEST_TRANSPOSE_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_TRANSPOSE_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-transpose
-build-m55-core-transpose: $(TEST_ENV_M55_CORE_LINK_TRANSPOSE)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# NTT test on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_NTT = $(TEST_ENV_M55_CORE_BASE)/test_loaded_ntt
-$(TEST_ENV_M55_CORE_LINK_NTT): $(TEST_NTT_SRC_AUTO)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_NTT_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-ntt
-build-m55-core-ntt: $(TEST_ENV_M55_CORE_LINK_NTT)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Montgomery multiplication on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_MONTGOMERY = $(TEST_ENV_M55_CORE_BASE)/test_loaded_montgomery
-$(TEST_ENV_M55_CORE_LINK_MONTGOMERY): $(TEST_MONTGOMERY_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_MONTGOMERY_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-montgomery
-build-m55-core-montgomery: $(TEST_ENV_M55_CORE_LINK_MONTGOMERY)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Karatsuba multiplication on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_KARATSUBA = $(TEST_ENV_M55_CORE_BASE)/test_loaded_karatsuba
-$(TEST_ENV_M55_CORE_LINK_KARATSUBA): $(TEST_KARATSUBA_SRC_MANUAL)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_KARATSUBA_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-karatsuba
-build-m55-core-karatsuba: $(TEST_ENV_M55_CORE_LINK_KARATSUBA)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Template on M55-CORE
-
-TEST_ENV_M55_CORE_LINK_HELLOWORLD = $(TEST_ENV_M55_CORE_BASE)/test_loaded_helloworld
-$(TEST_ENV_M55_CORE_LINK_HELLOWORLD):
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_HELLOWORLD_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-helloworld
-build-m55-core-helloworld: $(TEST_ENV_M55_CORE_LINK_HELLOWORLD)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Poly multiplication on M55-Core
-
-TEST_ENV_M55_CORE_LINK_POLY = $(TEST_ENV_M55_CORE_BASE)/test_loaded_poly
-$(TEST_ENV_M55_CORE_LINK_POLY): $(TEST_POLY_SRC_MANUAL_KARATSUBA) $(TEST_POLY_SRC_MANUAL_SCHOOLBOOK) $(TEST_POLY_SRC_MANUAL_MONTGOMERY) $(TEST_POLY_SRC_AUTO_TOOM4) $(TEST_POLY_SRC_AUTO_SCHOOLBOOK) $(TEST_POLY_SRC_AUTO_NTT)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_POLY_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-.PHONY: build-m55-core-poly
-build-m55-core-poly: $(TEST_ENV_M55_CORE_LINK_POLY)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
-
-# Saber on M55-Core
-
-TEST_ENV_M55_CORE_LINK_SABER = $(TEST_ENV_M55_CORE_BASE)/test_loaded_saber
-$(TEST_ENV_M55_CORE_LINK_SABER): $(TEST_SABER_SRC_MANUAL_KARATSUBA) $(TEST_SABER_SRC_MANUAL_MONTGOMERY) $(TEST_SABER_SRC_AUTO_TOOM4) $(TEST_SABER_SRC_AUTO_SCHOOLBOOK) $(TEST_SABER_SRC_AUTO_NTT)
-	rm -f $(TEST_ENV_M55_CORE_SYMLINK)
-	ln -s ../../../$(TEST_SABER_DIR) $(TEST_ENV_M55_CORE_SYMLINK)
-	rm -f $(TEST_ENV_M55_CORE_BASE)/test_loaded_*
-	make -C $(TEST_ENV_M55_CORE_BASE) clean
-	touch $@
-
-
-.PHONY: build-m55-core-saber
-build-m55-core-saber: $(TEST_ENV_M55_CORE_LINK_SABER)
-	make -C $(TEST_ENV_M55_CORE_BASE) -j$(nproc)
+.PHONY: clean
+clean: ${cleans}
diff --git a/tests/inc/hal.h b/envs/common/inc/hal.h
similarity index 100%
rename from tests/inc/hal.h
rename to envs/common/inc/hal.h
diff --git a/tests/inc/misc.h b/envs/common/inc/misc.h
similarity index 100%
rename from tests/inc/misc.h
rename to envs/common/inc/misc.h
diff --git a/tests/inc/poly.h b/envs/common/inc/poly.h
similarity index 100%
rename from tests/inc/poly.h
rename to envs/common/inc/poly.h
diff --git a/envs/m55-an547/src/test_common/misc.c b/envs/common/src/misc.c
similarity index 100%
rename from envs/m55-an547/src/test_common/misc.c
rename to envs/common/src/misc.c
diff --git a/envs/m55-an547/src/test_common/poly.c b/envs/common/src/poly.c
similarity index 100%
rename from envs/m55-an547/src/test_common/poly.c
rename to envs/common/src/poly.c
diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile
index 4bc7711..adaca36 100644
--- a/envs/m55-an547/Makefile
+++ b/envs/m55-an547/Makefile
@@ -1,17 +1,15 @@
-# Makefile for images for AN555
+# Makefile for images for AN547
+
+CC = arm-none-eabi-gcc
+LD := $(CC)
 
 TARGET=test.elf
-INC_DIR=./inc
-INC_DIR_TEST=$(INC_DIR)/test_inc -I$(SRC_DIR)/test_src/manual -I$(SRC_DIR)/test_src/auto -I$(SRC_DIR)/platform/
-BUILD_DIR=./build
 SRC_DIR=./src
+BUILD_DIR=./build
 
-.phony: all clean run
-
-CC=arm-none-eabi-gcc-12.2.1
-LD := $(CC)
-
+COMMON_INC=../common/inc/
+ENV_INC=./inc/
 SYSROOT := $(shell $(CC) --print-sysroot)
 
 CFLAGS += \
@@ -22,7 +20,10 @@ CFLAGS += \
 	-fdata-sections \
 	--sysroot=$(SYSROOT) \
 	-DARMCM55 \
-	-I$(INC_DIR) -I$(INC_DIR_TEST)
+	-I$(COMMON_INC) \
+	-I$(ENV_INC) \
+	-I$(SRC_DIR) \
+	-I$(SRC_DIR)/platform
 
 ARCH_FLAGS += \
 	-march=armv8.1-m.main+mve.fp \
@@ -41,40 +42,41 @@ LDFLAGS += \
 
 LDFLAGS += \
 	--specs=nosys.specs \
-	-Wl,--wrap=_write \
+	-Wl,--wrap=_open \
 	-Wl,--wrap=_read \
+	-Wl,--wrap=_write \
 	-ffreestanding \
 	-T$(LDSCRIPT) \
 	$(ARCH_FLAGS)
 
 all: $(TARGET)
 
-C_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard $(SRC_DIR)/*/*/*.c)
-C_SRC_FILES=$(patsubst $(SRC_DIR)/%.c, %.c, $(C_SRC_FILES_PRE))
+HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c)
+OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES)))
+OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES)))
+OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL)
+OBJECTS_ASM = $(patsubst %.s, $(BUILD_DIR)/%.s.o, $(abspath $(ASMS)))
 
-ASM_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*/*.s) $(wildcard $(SRC_DIR)/*.s) $(wildcard $(SRC_DIR)/*/*/*.s)
-ASM_SRC_FILES=$(patsubst $(SRC_DIR)/%.s, %.s, $(ASM_SRC_FILES_PRE))
+OBJECTS = $(OBJECTS_C) $(OBJECTS_ASM)
 
-ASM_OBJ_FILES=$(patsubst %.s, $(BUILD_DIR)/%.o, $(ASM_SRC_FILES))
-C_OBJ_FILES=$(patsubst %.c, $(BUILD_DIR)/%.o, $(C_SRC_FILES))
-OBJ_FILES=$(ASM_OBJ_FILES) $(C_OBJ_FILES) $(CMSIS_OBJ_FILES)
-
-$(C_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.c
+$(OBJECTS_C): $(BUILD_DIR)/%.o: %
+	mkdir -p $(@D)
 	$(CC) $(CFLAGS) -c -o $@ $<
 
-$(ASM_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.s
+$(OBJECTS_ASM): $(BUILD_DIR)/%.o: %
+	mkdir -p $(@D)
	$(CC) -x assembler-with-cpp $(CFLAGS) -c -o $@ $<
+
+.PHONY: test.elf
+test.elf: $(OBJECTS) $(LDSCRIPT)
+	$(LD) $(LDFLAGS) -o $@ $(OBJECTS)
 
-test.elf: $(OBJS_DIR) $(OBJ_FILES) $(LDSCRIPT)
-	$(LD) $(LDFLAGS) -o $@ $(OBJ_FILES)
-
-%.bin: %.elf
-	arm-none-eabi-objcopy -S $< -O binary $@
-
+.PHONY: build
 build: $(TARGET)
-run: $(TARGET)
+
+run: $(TARGET)
 	qemu-system-arm -M mps3-an547 -nographic -semihosting -kernel $(TARGET)
-clean:
-	rm -rf $(OBJ_FILES)
-	rm -rf $(TARGET)
-	rm -rf $(LIBDEPS)
+
+clean:
+	rm -f $(TARGET)
+	rm -rf $(BUILD_DIR)
\ No newline at end of file
diff --git a/envs/m55-an547/build/test_common/dummy b/envs/m55-an547/build/test_common/dummy
deleted file mode 100644
index e69de29..0000000
diff --git a/envs/m55-an547/build/test_src/auto/dummy b/envs/m55-an547/build/test_src/auto/dummy
deleted file mode 100644
index e69de29..0000000
diff --git a/envs/m55-an547/build/test_src/external/dummy b/envs/m55-an547/build/test_src/external/dummy
deleted file mode 100644
index e69de29..0000000
diff --git a/envs/m55-an547/build/test_src/manual/dummy b/envs/m55-an547/build/test_src/manual/dummy
deleted file mode 100644
index e69de29..0000000
diff --git a/envs/m55-an547/build/test_src/mve_test.d b/envs/m55-an547/build/test_src/mve_test.d
deleted file mode 100644
index 285897b..0000000
--- a/envs/m55-an547/build/test_src/mve_test.d
+++ /dev/null
@@ -1 +0,0 @@
-build/test_src/mve_test.o: src/test_src/mve_test.s
diff --git a/envs/m55-an547/inc/test_inc b/envs/m55-an547/inc/test_inc
deleted file mode 120000
index 31da609..0000000
--- a/envs/m55-an547/inc/test_inc
+++ /dev/null
@@ -1 +0,0 @@
-../../../tests/inc
\ No newline at end of file
diff --git a/envs/m55-an547/src/test_src b/envs/m55-an547/src/test_src
deleted file mode 120000
index 6142ff4..0000000
--- a/envs/m55-an547/src/test_src
+++ /dev/null
@@ -1 +0,0 @@
-../../../tests/sqmag
\ No newline at end of file
diff --git a/envs/m55-an547/test.bin b/envs/m55-an547/test.bin
deleted file mode 100755
index cf28a7a8cf795474021d555ae54de82a974702a1..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 37296
z6uS(sBfRQTp9%XGsv$0Ky4Rwcd^NZ^KlPGsY4ciWlu7~@enM;X=?D5+umo=GQt{N>ZAQOmQNgt^uECP z8j`%e&IBqysE_a;Vohn_r_;w(o{Nij-QT2e!<%Hv6wI8<*%wg1AG0;t4O?(9XY@rF~-f+L!uJSLaYGX$Lb;)XQ7*rKd6>A4`IF?p7b zFqMFFk1MwPPU5)+#>7ktr7FBG zi|o?bxNO)`Z`7&RGZ~$}=%Oh2CWb5=;*R#nN@J`EZnlSep5jn=jlXE5&=_f@t%g%n z&HRL!oevyH_Rw9x+Q!j2|I-xau{&;2%LmwNyx5st8D8DK09yk+W!L;h3TsInGxkxV z@^zi7sRK?B=u6@w(5G;6d`mDVduaKeu=g#$L`_C94S%&ncvJXgLCHw;kj@C7K~L_iBgRi!n6#7M2mf*b zQ4A>pr&6MJMVpHI1b>rM?DF(K_!yzE`{$B~4>??FPpGCbVxhpv8a;ovEgUaKwgc3gxPmJy0?roan3=)DCz zd40eKdkQ|-V~#~H_i=}#?Uek8lZ z?-!-NE%hU*WxhiR`d`m}$2^U<3?sPC{z3W`OaCfjo+2B>n!E7(6nWEc;)sE5%V*&y ziRYj>(iCOtjf={XKf?o?EqY+TDNJ0y$;WyU*HbSR>vH(GHx}@dfRjT--u>;Szx~P^ z%l*n5dl}x?*J0n!+hpVc^>V;m2=pn66lJq-d1I-Bn%T=3hg&D~!#&z7^CY-GYazGA zdnQ|%w1g6QYCL?gYQI7aud7wzBI97WJ&+)3qJQ2*48RG|7n{&Pz#Y>Iiam+Bg?@fTL%n#!?6sW3FoGPG0I9w@~2*6 z9KZLq;9(ub0>&tTpPv}jpj)=z!zyvJlO-iA>~V#xrbJq&bW6(bi6i5wy{x_^UcWE_ z5M8jb8lF~@Jo&Sr1d<~gQGW5d&k)r?Ixgg^csf63(&g%yN>(gG5c|h+;W84Fx#Qb> zI`;1i2Iiu`T-|2$&n%#i()YnL8}Fh}9L7^9PV1v} z=Qq3I0bPE8g9nrm2K7=8=zTub8&p$+B_f+b4{Zux7cPdkAhD1_PbW1C358P6(gY%G z*%${;i5?gc>`{w*@W(cC@qnC8x>fEnwtlxxwN2tuZzahyJ1qw!;N=q69lp$DSt;jb zJ>S4TRbQamF@7)Iv&ZgcgqWV#%6nh|NyHc#?AbY9Uc{tPZ=>9!+DhA#U>)hGh62ZB zqWAMX2Vwz4ukwKRQ`PVN%yLT0nMsIsa!}7+zXwza59sIurF)U(OpDSOrYH$0>g%S&^{mGZc-&Sr&qA8r~!1pIj z20j|}nSy-lMoOWBcqKTG1YYa^M7z?GD%6%yzaMH-_)=fw2l)eY8t_G`irI2$8N87z z7!z9zosa<(bY|gfKEF5YxzXS9+`zg=JvUP7xe?(E3*Pm7zLF%y$ID&7Hta8sVt+wl zsdkw}NDlN*eTerRU_~F|C;6=CL;QDvO7tPBL?7b9f$~Hrj&<387%PEg4mcn}|M@~D zQ7v<4wDR6f?INRd| zDvzSybE$HD-bmN-39Qw91z`w$-$*gFnv;8Kc~l zM>K@FsTUkSuXE)-%57LbqtV@=8$YyuVS>)Y`dN)US;d2AQwgl2t^Qaaw}bcIKTPD zIGe6Wx39n`MeaE;Mjh$6;Rti-MNR9g!+baj+IObd>z8|JJ5io6H;!wf(HajsS$#|; zlSp$$6b*L`aQUPr3g^)USK({cjfpCyFhRZAmHgm_u+5Xk_@qOpwBn&NixT^&fOqI; z=*RwFq0g0v=SAVq!Uv2M8{d2-lV6EdjoOa6rThZ^kU_QL+!iLh*T{gsUf$0p=XD16 zv1R<;2Hn?IrAWmq7x%1A`A~nJxicNeC;5kC8cHjICucYL(6A`4}-H3^zZ-_IL)F5V7pVo z4|c~^4tN0MrIPKRHiTD6;#HN2cE_SOed2maa8kvgDhvBhv7SX+eQY`8WhCM~cSYQI 
z^^Xi>Xnk@&=9N5eLInY4$w!S%@Qd9#OGAvH!=Ww6@+9#(rUUgF_f|M`1m?31_b_Mdv) zu-j9c{nJ^e>IV9C2wyq)%6g3E48|1?tNplLVQnM{(g+@qo={-UG3tL>v7Sk1fkoSn zQSE-kl=K8!`7z~d*7BUV zq9-PHHxZKhfMj@7peT=On-t~Ck83@;@nz z7BNRuQ+)Z=pu@pTda^0ZOZo!s@PW_leEhvoVDmBV7Z$2epx_t>OV~wN!Z=vM+^~dk zu!IpIqmNp`6ts_Huflen?idALD#?_)(m<&z)(_$-rCM75{fr8`;!xn(qued~NGOnV zjKYk{um3lj2oc(_UmviEd>tCFiNG`Tf3b-W;q1);n~3R%lz+KO!6PO}_)DK5n=wdA zWiG^)$HFt`VhwB7?6pga&a|+JBz2;9)!G<=(b9bjQ&%QU{IXmkJHK)*Xl1bhk>M52 zm5;*T0?Z=<*S%^jNz&A)MpL`YCG2!7UiL`YFOzs>mFeUzK(G{J82EnmCT;U;UlM9o zb$y0&5O)JMY&OHk+CzbTI44PC+N&XjU^AJe;iKdZKKuL%JL1t-k>m=et^S?Ip(}bx zVsxit(S;zfeFDl~S;rW^U2W!f&(2I>`NzGqk9Qtlj{Vyz``I|-8g=z|YtF`z{xVq! z^XpJkKRo4*3zM^BW?`n>l)A*#t8BDhqRrSzR$P|6>0Ed|aSd9x&8n98%`n2SUNb0j zZipC-tB4*^)Xp)h$%HO%v3#_Bj7?LsYBe$Li?6V$8&dhnJRQ$tj~|^~vgb;@kM;(bi8j&-CvDI0cV$Q_7qHM{q z&yBV{^62RnF5`3ybI<7(={-bvwnyS$LYh~eZjsMD-J*Pn2%~#PXVbD0|7o)7vC)s6 z4t~DibkMQjbW0rSj6j{GsAK0O{#tT6SZF%kVkkS^GV1q4Q1lK(P05XIxeCMS;1$#9 z;04r7UwAtBJ!+?;_7v2fj@n*Ma-%f&bWkWd9qhH84*q~v?sy3nNksiMAs&F95utN? 
zI+Kv8&k?c$kCHGrY{ilBs##%k#t3LzqVd?FNh zF2wf#ReRzn+NABij5Zn6JA!^_8vU;B|JEIr%%?FtdONjkI_$+U0n^ zT}>&5BrUGL2rFy8^|rY9xKBlUYYXhHoMBE0gK0?TZuj{^SA=?xgXMK3TtH`ME(J?@G=-W)Gte<`o`?qE#2+T_YwpWbdk>Z~t8);m1bwNi!fp_uv&dRuGXO$ubzw-Hg5|8q zyWV#QSTUKu(>GOw@scjaU14|ep12B|(H>(p+P3RR~&fKTvw60RY zAIGIx7x9a;n2T;i_GC)OeHjH9Gmsr5X}WhQpSYSSBEre;w_avEhR&ju7Qw;3<>cDb z#JFM)pfZ_1#Geih+EroG+al0ER$YkgLT_kg8G6RKF0?eW^PU5veP(crf58SulD+}g zr@br}Ep!@#5i#}Ps`|mcXA+9}F+Poft}ji|_(*RkkMa-TuaZWU4dK6PVeV$`&_UiY zcQC1ni-&RkCXsp?(^zlq?76K(xh4TzKytnAHanT%tD6J96fBr>WI{IO(((3~>ZNN) zN|q?CV&y}=G^}o~hAd*_-W98fey(Mg?adX_s)>He$ujwpniZ?xw2_oR-7wjd^Ez}PS6kLUqq%_oVo%*M%Z;-sp-b5M?A^bB$$8T7D&X>4mA(dimG590R zY3`vp#9&xO5)G@3!YB^j!o-WdSm1Tz%_Bagf!g3wY!s_2oKsHHzk)tW=ebAnt7Pz} z4+4ar>SB`Ex-}#EQpa2bC*@Lj<~z4-b&)vLDv4L_wLTQoJ%H2gk-BtiQ2)R~!Gs5_ zcpeJI;vI|kpa+_>toUuMW747pjTX?+Fl}5EY2#v&y?gx>~lA>tN7Gv-zQ$C2WMC(Y$}w{Ioiiuvar*(9Imy?X>n zndSZ0L5^n{m>Dz9VP2^_-^l4M{(|a;HkwoTCMrzv%K5c>y%r)9KUdc z;h3_)EmwM(snTzBov2=K2DYmAr%nH{1;L{z^ zv}s1MyrfNHWg8?@x#ZZ3I_lS*zRFzTXMQ)Kl@sUe_b#j+XhHgm763QB?xc?Y(HF&k z3fw5dX~eRu`=JZ1t%X>c_d2 z*^g(g-12zA%FiA@wNkx?Ta~>gbJdnL1*<+=b83~@WS(T6cKNJnvidh|=@5CNo7Qnw zGg*3$?l;YNB-6UiN+v73XXObs^t=-yyQVYe)SEenO+2@+NQpVOMPBXIVTAj=r+{B) z<1YWA?Azxi1=;C(JUe}oX&28i(*|1UI;&pKXVsMAWmb412IU zzx_l1NS3snCAqV6p`#a8Evru8S64@7C-LWk4K^cRg`JDmi@P~ak{kJm#q-z6=iuZf z_^B89Nn-wwh`M>jXt1T)XgP|LI`Cr2c0x}9rZI@d4D_;9RObyIxf*=9HW5;ujFs@K zvLYa@hT8O?l*O$YO@jAoa851S#`pX1J}c3#I-3>V3Oyxc$e}Uvv}XL^iuS&=>DDxx zG&{zeDdgRg)ky}^^MS#Wu*7DDLRtqs`-Pot9*g^MSm=n490hLTAH8R0d^Xht3&5$N z?K!mA-yyQ73g@-^=&o&BHQ=YDy5O_+M@Fd6GwHInLdo;h=YwR@35&|h+##8L zgzQ+ms8!{J0=XZogZ{xxue4}to`XhAc9E1;ud-X(@l|kh)K_R-ZBf@1&0c3;mK9BR zc&0;^OWqOCEI+TiqWwNhe$;hFN?m#w<~*?IXJ#eFrRV+Ozqs zwn%oRwYJS*E;&frV$7CSg5LD=XQ4OWSvoh&ihKMR_?*}ae#Ks3`$&dXBKE@EX5yuy zj!yL9H-(#X&5o)f=)%c6I9L}XFVC(1a1A`iyvsb#w#5eDWG^vRtU899a%$W6!*Z2s z`+fEa>~X3cG(Kv_HOYpBKMYmBc1^Z{6OILCyuF$+Cb?s5DV~_>C$dxUbv1a{Y37HE 
zJ;zmYW%VtxO2!pm37yro3^R`osfzB1Fbg{!^(}FaWLTx`lQZ>*teTR43j6wPmNOvz*s-w$2R_@+Pj^kAGoy$4jWymx-ZIh<-@h3LNQ{;&Ib9sJACGFah_n{)cp zxIVfoGDBZxh1p_0C%hLVh60`#X(}f?ES>`poG1K4w4~kbyc#42J*uDn2i2qe5^(WF(`|B>LJ#-JCc75h2@Q6nq8sh;Csg6-J7VOPcI~<( z_kFDjd5mqCHw9+^pK#CL|EI--^tqR{iy*vojBUV%w7R%11MM{;+dFjP=}1$}6H^?;#{Olb$jA zcCSo_Yj=$NB`do(!a;gVQb#1vk7W}3;k zC}M>2CcSN9d8=JPql2#3O@@W89(7n@4C+QRgFLvm#IO_RucO8D*Q~G-_D8d@q>$o{ zqZjGGtLEO9u&>;>kf!83ug9u|W_Br`P|Jqzd+i3;v z0%&1TxBVvVlNB^QXzOLflQDCFgRJmGNRHo}UY0lAK7f_o-H3wTKu1h5wH{?zVTw2& zZy`Mg158q!2|m2*O`2Dk+CXFLKgn{mUoKq0-;a|kxuAxkUSe3AozHK9&OQ}AZ0@C} zV^|?QB*E`c^rz-c>EXN&yBG1xc=VYq_9--i1DOSPAssQ%-#?cXs>Kql@HKKi_A1KR z23UZE7YkGP!9nJZso-AVrp1>JrsmSKsEU$5ef#@(#u%oHyc^H%{|Ng=Dnrz+r-^{b9oKZY1t5Au zP4h0$7ibQ2mtck^AMa6k4>D7_1NSKq@inL~0!A?lTi}rrc2NjEM(?YlXPj7JPM;LN zsXgMl(sNH4tngJAGmfsPGRQ0=g0^rpukZkW9K)@arlq=78I~u|^tBqw8BY>g2{atdgcdRBrH zx~rV@dj>}u5{zn`eRW&otZLkmK~kml)UQcf`U~2mThQ&yzsm5AWj+AS-W710utgJV20_o7yg zQF_e6NBz{&oliMuZF@&P;Lq=|D5%aZe4P)Ufl+0k_4L7evm}V20|kSMRVrtmdcN4&W{ggQ}dF&Xpl1#YluNLK=^% zh_2U;e7aKRj+Y&=bgeoz;;nQUd<<+&bX{h4jq%8N1|^ytR3=YfEpIE}Vpro#K&TdU z|4s7_=rHW)*Fc@;8zpEXuk-wVc#@(&gj}V6o!Ng3ns#b6JYtZ945F_YV|ev)@;&^? 
zX2z&-jm62I7;CJHNn=unV;zpxFr$Yef}$`V{v6{n5`~k@oPQA6Z3$lFJQr6z)flct zBgtc(!}(a368EwYv~*GY_4`iw6Sy~;r}9z*S4M<;AvaWIkZ|F1;eo&J7dbB?%oHx% zn~(#)0X7q1b7&T5hk7yJ>C=KUCOIz0Yh5!tz570CnndG2F}r+d zvXh#*G7pdW3&nkuRuq>d9?&H_#xtjtQ!Hv#u%@~JXguBzjZ0LA?68BXr{bQl45!jR zN&Oxl>B)dK9Xu5$G&lH&u&yr)6qL_LH-~BP1?QOZIjuM;SH7@Sp&-IDuomV8()QEd z!!7BY?;-Az;riQ|PT_g44;mxz1(xm;l+I>$-np0JeqJDVZ={>b_ynGr7Llz(%ot=X zW~Alq36MjNL01ljdH8y8D_J&T4*SBAI20&@AA#zDVLD7OQyHGmL+3+3k`sePNn7aP z+S4-CZnsstb1+Z&YPye%rEB;*{%A8~7f7Mz$sk@ zejx>T7qpP+xSb|-1N;w+Qn_;ipU^p!_9W0F7X8T$5-AAtOOmU&ZXEowrK$83{6~JJaosr&K>)I#$8it z&`Y6Q(G0GJ`xz)@#pG^p25&D8{{^4)i8#+G|83kKe6%ga)|-t_+bM1?zA1(E`WbK9 zYMEbygAq7jPWl29LdiU(9pr~_vLF4=sj7!QM`4~J>U@mQeUr{QtdC9WUU2MZII1y& zonETrZpXM3ghWEOcB*ZBg#c3_OT4JWAJljd`8h_lIFrvVR+B zR3~!E=nbL3gU4BZXtufi$Hbv?7~z{DB65X2kqJ1ejrMV{4aV* zBISFbKnHH=qhrYnJhgvWGY1%t-RfQ?T1pJDE=>u=YmJDDqs)h~|M&zCweQQ|q}VAyck88|;$XoF3(dX1hcRirGvjshXekpv8@Jgkx5e1w<6~?@SlP>nWy39H z(w6iEaoW^N^_h^?+}y=BdP`hwoWfQIIaUiB#oMq|RN*1XfAR*8XsFZ_ruM#&b9YcO z?rq$pa@1~sKSA2YXs#gY*_K5uuHa9Ov+~dJ_vG_;<3>ffUqSZ8MOpXQ^DS$RlLh_A>=zn5$&cUNQf{uFAYGn5KKdCT}~% zDyIFjg^ivaOj~fqW}@B>Gh$LspiOp>{j7W(PMI?tZ2`9G8ML)a=c;e$pUP=nXk2o3 zi<0wCC7gc>dx`{OdzL7slv%jCM;ko$GWaZ(VkR8>iY0L2f*pOBHlqD{TeydhPUOxW zW~K9>H%=BID_P@EE-E<>%GbD;VRyY{fP}m3`dSw%|RV+FFOW z4+@QaaWU2}f0aLW-bq`u-EG-+VGK0eVTe-rp|8j{%=(FM5PUv&rw+fMR&)NRF!Hq+ z`I|lt9%atI4Bs9f?f;Gye_FBI_PuS5ZS{g@eN1ABj_GFGDcoMLmHQ6e*pnx_@EKZI zk6O>8mcwUmzk3nR3tiX(+xnl7`bVTb=gaAQaovlCr^PnIYjLNKuKjL20`<}Q)!dL& zYH^YC8>@^LbjhUsxC4>DJ-X4cMmF(-yV9%=*eh^~u@rl(+nlTHQ*btiH0BPKh0;P0)3$;@_>DpMOLS2pm#R>;Ao_Oc~{wKuS?^#DIN7IKx^y&oxb{6$n{Tg zc_dxq#qnBbSBsn^^`jQabw_)q^B^$BA$;dptjuXWt6>b(m*Fz>LECf0sx3C-^oMAp~+oEIiW#%G_^@7eKtxJS0*Gwdnbe0~ktnDAl&HQ}+d{9&SaF(L3 zrGf76`W(O){7r;^<4d9+7==);X_GQj_$+~u0%x8D%`E7({gvX41Y?VuZ`0t8nSTTo z4gUyg8s-OC6K){sUg^^qr`yP=hp=K~ojEZ_n5p@_@338*xAMIfg&+baq*` zLVq>*$76n|%l*iiP1({^F7V@z+vk-K|wKCBAI~|``zX~dL&B40<3v)AP-whmg z;)d2(+wZamzA$6O?*5VO&XS&`9_0A$?Y?L1OMEMb(fLyjq=2_d^1u}SplvZQMBP1r 
zAwOf#6#kbu)ZH~u;)=b(lGpjses+?~cTlAGMaFgK1_5uRPD(G!(rSqex4xQ|?DZ=& zg;`u(LxaO!m?iOMirC@;-FvY{&)Bibn$Lp@tg;&P_WkhrP?_aKg#(g~#z$}@k!^F5xUNTd97rS)MR>#v)_FSgTJF1CRW%)AKOB*A`j zLvxD)cT#c7n+?_}ydr&|49;kDJ=wfMgv}KG5sDL1iW39aaHp+e0q#G-nl5d2QOY1w zFvnMYhb;#0Vf#vZ7UdHd@iL6~vF7`OimBOIICbEN!l>yw-l$t=H^Set&SoZ-5xDF0 zTz#{dDEtydtfkvKi!X0p9$xdQ3c&4DkkkTq2xD67-P6Xxl8YN`1Ub&tutp2~5DJ#@ zOyNtcF;=ft(aLD3zXF~K_5F}Z+Mp$Y^E~O@)utk(O|6KvQ(QT0`3P9E#7c1{);6>A zyZy)fTCL^c3j2G04U3x>GzPTKK)OU_YExF&ByBA7kD#O$^RLaha0#_OZT9FG1Etfr z%x7!{pQ>ZR3ds$^z?-u-s{=SdnBc!PRG;lrv!%3`+gA-V*+brF(>f79po0WY9&`C|(BF@+9F4rp?8w307~mw2Njr11-nl{U zInurH#*@lf&`UxA>CNwHJeCQO4~|U4F6eFg6&21qPE60liOBvoH70mArNf~2niXWU z3AZy$M9uH9EB=Aa!kepD>7D4jYV)i4Y`OjyMZ@D4t{s_K!S+3e$qEc{?wLZpW3l;fy<`-WpC$H$)KBs%>ezax)1 zt1!j0_*%}Km6+lgeC2TGNUH2IW6GWFA=2~9@aXl_%iHW@k;dU6H5{xkF`zlQ_mX_% z`7(Lpm&+84UTw>rU3PF+wT{1ZP<0F~tK#HL9okj8@n%gP#U(j+ewl*!86xx2((xMI z|5x3$M@Lni`+d$lG9+U%d0`BZ%p`y@3CZv@g;3^sLL|wAmtxD68Ipk+&11|%vAw+v zf=aDj4T8Nq)D^(q>g84wFcw?GpwcVXRy*Ng!>cQZW%NwDI)Pw@gpm9D&YXav)!sjD z@49EL-+brnz0bG5+56l3?8mqF)_)SB8DxrMr-qk!umI&1^<@@zF86dGr_c(>;d1S+ zx3A|Fi_9BU2u>-l#z}2%d$I=qHf=O3hr`}8EHV|N$^Aht2epoKgY%D}u{5AoNl#Sg zf4n2znbF$(u@l~t`h8?qd$YqR(yYY{QLMCzapg1Rcw`Xdd4z) zVPGf57L*B(397zv+QOa=dtARVuHq!t;Z6AQJ~nkA_&cE#{S)q|E27E-gXf8T7wpn( zoJlpT&LJ(zw5f(!G8axUKZCh4nu+5q_!1Ue=;dB0izd*Xq*>#%M}u=cEN6i*58Aj@ z;5Y5u#|`su^WiiLc73zjp>t$kUFacwe8Rp5zFMrnPLcF8)Z1YvP9)jB-fE1$^B=Q9 z{)`It7m6UPJLa*1D?z0z)*>T4a}0DoM4EQbwXrl#L3yO|kcK_o3VUu5Uesj;zg_pl z{_}CD`!wp~wYs#R(eqNLS~wpUJuw|Us1noz(?@4@e!tEOoi^z6q0@%yTAfM|d-9~R zFdgF=6e@ev=u}HPN0 z7LqczBC2So@mtn2PiU@N>H@;n^0FgNtQS<9Hx>UU!}a+G&1%- zHwp<}<~@`-9+og^+O%of@CKg4=%DU3u5YVda}-t?mEd#x)0V-Y>L}Sj#nz4EOQMh6 zNp+ng)w`i1$xAz5j%Kpr^d8iC>a^)RxT(R7>wCfS@KNxo1V4$F`e=cGeRZ*>itD>_ zn%&_=?`0NJvNc%bRrRMOXi#VNMs-G{Y5coRRlkl)3~v~h5Pkw*c&4TO50o9yzJ;)8 zSeZSvak$@3v(jlcnu9zXPPf(Al;K?4PPk2~u!>4+88rMt@5^D0O4FaNIg7W`=lw$C z&H5pG%Z~IkB!oHTBHM0D(a9uXGR;EI%^uni{dd&WsZ}+$!{C${E(9m4rwc2H+n;;6 
zIVlsUMV@TSeFJjL=BRebwwBnR>~~k&v7VUaLhEv4uD_7t!`LF)@8^*xYSKW1 z8X9;IGgJyZ-6hr2zr=YtV|f(m;tU%)#=~NBbZ`&$wP$&E2j9IEzZ7$%3T-F${z+v; z+!mzyOWc&v!IBYf2g9gtk?R3h*8UeS&al1HKZ|AAG&{4Ln3aUZh64M?dj$5=(ZQS% zg`nBb?Z7B7#}9G+Z=2XQ?itRC-GBq!6n@HYK2@a@xme-po;{eERw4B<`r=K1&Eh%i zE_Wy_=*IY+#aRmUe|Y?Zq-j$5zd9zruZ!Jj3aat7)xrLo70iPT|Jk@1uuj~jnsdT{ z_QJ^ElIZydRKkqTb6%sc4R*E%q`egO{4i5J7q;#!;!3gZ>eBIP8Uu*vi^GWgb&gRu1;8QdaS{Ii1Y zo*JyK`0H_U`_Ty5nU4%UIdUdsfGpPrUl=`zHWKaIa4T9sXnFg!n2}3R`;i;LVH(dE zUxX2janJc3XE$B(__vdF%!anrp{Hgfe2Rn}(A=z`AbCKsh z=QExiPO?Cyu{q-N&~OQjiu`0sBu+g|xRmsK=yoB9Hu^E=4(DF0b;b6K}`0!Jz=S=qqjG!fK9MXgu2UCP6c8x)mJt zu;?fq<9coIqia3rufoFB$nFS+(ECM4Ns*xL*Rh1K3OzADdi+PR@8|blrO}a>!#t&y z)I+A!T5g{|=zTRm`(EJ`dQkU9HA$25-*RqpvWp!^4b#QFeuHO1J}j1XF3)=i!7gHC zZDxGGjyHIcoi5KJ=Pu7?=iacIkB6)#>3pN@&bytIBa{j=tkljQ9Xv7gSO{OwQmUJ8 zw|QYjJRx7km9u0ktkH}a@d_$Q%9m z>EDLf!uzFhpk0s*ymjY@H_kDWc=LCMa$+3xREEGo4-Wi%lr`7~!b(3;hkf4A%2u4t ztykD`-MVhH><9cy_N*s6GHOpyD6K9VC!PFOHC{+KZov5;b1`3@C+w1XEcgzZF~@a) z`Ure3M#Lk(ilQ#ikv-{+I6BbFq#6P}dT`gfiT{mwVsg|TPzdFnqxpOWlv!@ zvDld>8rbATyJ!!WG`V)Wo2;hwd%U{1J8@8DPRD*?Za;1pMxiPo%Ms1*vp@xb35_R8Qh*!kwma(*%HZT|y)ZC}X`r_p$* zKSQ?U8d#5`k6G<2({znuZQl4C1zKs8=bG-4uDHs$ioyvm>Qq6l6m5lh_+k}a>FNHY z#KaEWJL zwyAbzdX&>Mow3-C^eDEgc5t0Kj2({7W;sS-j#K7z`H4qJ^4(+uKO=3+u84}E^lQgLro9jdt%fZ$m9LE*stWZs)`tBI+OV3+!(*S;hewsL0@cAjl+u{m*jNSG zu0|^`3H3B%n^Bh?xpLdWEYHmSZcp~qDQE6~Z+V z*-{Pw?HFr|uS|nYr9vq3+#;~9i0APRTc^65J&Qa^Q<~5kMBVtwF{MVCgp2U&nKOnkgs&woGB+K2Uf^ zTm>hn_Uh3;qVso#2Q!DMg{dvNZt=Dw12-b+}vF8l>Hzi%!d8rF;_e|_P5zWBq zKC3zxNA<>qom_z`cpq$hnZXj(>!8aR?Vz=BcSdsr4z+1aueArp=W_Z2$TvQh(vR+9 zW#udD3#o*CxxQeL7vEeTVP+m4s zt1JH{%veL1<2o<m4Gy(mLJ(y+Lk1we9 zlyL>*W^;7`W>W%H^+l%E*^%F#Q6H}YE*r2JOm%oZKq+bk-QCI*;+>FJ*@ zU7_kRv2t@>zznT1cBftTD@z01Wq-1wql|r(zYg&&2$(Ds=GzEkK`ws8 zImBrI@5>IPMAliiMAkXhN1Tg%Wz!?;EE=qzAQxqT;*G`SY$3S~_%~VL0qJ-O1t}1F zeBX^{Xt13>5z|_SEa6IV^NRTcy{|Y>yQ~z?Xfpze>0^`+pW%_a3O1y;svn{}QCu^L ziO4bJysi5Jt)W4=)m9Aj)n$pvOK%TkiK)xBirS=S0T%gVw9tm~q&A2Zi*m(^C6HuR zhLmn*jsJgd(QEADv?DVh<$Iha_tGOBzXGh?-v 
zO)bi!E5)kBdb1E#t1aSx8qDIdMGo;ddXsp6wL|=%=(YY4?WVw+m-`wg_8uJ5|HLu< z_jMx{K|tv?;Zy=h$GsL_p?o`V9IKYiqH>Y1n`DCoiCihWwAY{3KRfA<0gUt_b$!rK zrqq|tf-X-O(`6MyNsZ~V>LT$(NQ(iXb>U+^XzphgwD#{8t%)d6;xQlbkhC1{De)fm zXC6|Hy(hgx^}`p#>nV!@L9?jlo|V!TM67eTe`J?)n=J*|=0@l+0Wp*(&)hedcp!DR}#pHCmsB=3+HYE|V zK992VQ;V&CM5%-p@khU_afs9T*E8K=g(`OM81U;b`%w6Lm``jCD-v@8uV=OjCQ-}e zus6A5E-%L8Cimz5hWi<|OjIXk1RbIZl8tY5i0bhU@g0R#T-@vs-$wb%$~x3T`IOLu z^+!?-gcPoLJw$G(o2^WLL%ZjH_UbELVFhYA$gBN)9q;SL|E;RRH!AIK>UP@&J7oFs z1Cj2|=y{?`yTB zWm>9RdFkto{zI7#Ba`H^Jg^=4*k&mV^xoPip7)tB5DrkvR0TQ?K#j-L06kL}q8fnw;$%wh6pkqQ5f6H2>|J^$zcK%@ z@MG^2&#KDul?z-AwRP?$w!qa?E4W(gTCTe=R_U&Gna7+cniFgLj_ zao4mq-Q>9B9_w{ZHutL5HIfFuxJlDf(?u-X(W^Js;E5xF5 z)~un3mGP2PGE2wB7` ztShaKiVE{md+dJcebu;=?kgQt_NC^<_feRtWtH~D<0E=|-UvZJYo&^3szRDX`dy(Ofk{DI4DrRxj z)iu`Wn_FvY-0oU;Ex8xp(^%J9?`|%xb+x#PA8$s`eWUyPc55~G{miO~mOy@<=dvB*p`m{KFrjeq!8iMro^pQwIi zJ8uE~SxrxOt-bx#bX)abKlFHq^AB`kA9goXpkS>~5k*ea__O9Q|sSx6WNVN52|Gaw*i;c^lmN$+eT4CpSwq z0b~FCgVA*y-lcQ0ap^Ih#^`)&dZ0R=_Lf0$``YXOyT2D2ahoUh7u%m?Y&k$M#5m2iqq6LnVks z_8Qt}l=QLI$_)WJFt$h_n?AxJ8Am_~K^eWEggzOcfD$4yqM!s600|mUf=)&TC_yhH z50rqlQ4&f(2__i}Knc|{mV*+UGCZIJpNzGj1iy?;poBIV+dv8JGF}2DbjUaWN;o9r z2q+;aqZgFWC*u=PLPSOslz>K|1Pv%bCnE!tpqG&cN-)YO0VSAZEC3}`%UBLdaLVw2 z5_~e&f)e~PHh~h_WNZT^w99x2l+YpL04U**j3c0gpp0HnLZ6IJKnW2UQBVRJ&Jr}B z1f7fwPy$9DB;L6ulS!P(oGQx~H%6wD6k*!k4wy~|?O(Yf#&C( zf0ZHUUnSzD!h7_R^LZ<-NzBTYSt zZuI5T-_kz`0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K 
z0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SN&K0SSTszY$Ow2j~e2 z#t;%0OUQU!3vfM)s|43i|5f96ZEso-_nT^zLD~N|*H7z_o+Sh%1SA9`1SA9`1pdDv zz|E577TQ=pm&;Wr%*>nliGmpEMQc^$DrbM1_=;gx4KseWf%D69wbjfl`WC7A>3fF5 zin->s*+(Ox3lUjve6?)W{&wPG%;r4934d>0{#OZ69hT=NRm*3~#rzZBs;rUFnHo|< z+d7Z;=sOiRTvKTMgK|lRswO^nJ0A(%h>(;`yq=Gr)qhUEM42%8W-M}vziehq?;ZV3 zk&ze@GmE6Me%&mYu(wMl{Pm`hq!A)~+D(KnyJGCFGQ2a zGz-hdTr+CtwQP>PN6?=v=_bRfyN$zAt~2+TuX9h2{NC=^`JF6ZkGvD#8MP_jP?=NGW7Fb=;t5jwLBc#+t#qu@8`PE#*Pu!<$t8_p)dcUWk!EL%SGR( zd9*Aorx%h=;=Nc}hQ3GhMt^S=%m1`bhC}kX=bMRnesXq@FsGXd^H5lpn>SMcrt^gQ z(dZ|qrEk1)C7duHcV>2us>Yggqc$PPumkzoL883BMgMqS(&H!(6uD?wSqeE^SQXRC zVBRWjZV$@{SB09p@(!a`GcFw&t}tou(cb;=^7Uc!$!*IMmnW`bc$>hmiOZFm(Yf!kaWJe6H>qlBwyo4InMz_Dz3qP?Icyduu- zrYPb0*%co9CP5}XcP-p(f3-Kg(^g`CMUac>H&+a>uNTzf^KVx4w-*Vl_}sD3V0Q@$ zmMuxNGvx&>f{oT2EfZW(YJUwa(l&m-Vvzmi-t^GIf%Xl8N=*O3#*`G>^}>B8&opc9 zJ=3gX&NM4NKGUpnYAQX~$T9`}o6mX8xg>Hg&NHs1ODKZL_+cfV3K&bcl*!(l$Y+R0?Dc94u;`Jd#CkW$xs81p-P$~_s5 z55CnC{Xe$h?*?2nufFFU-fBw9O`e+=%{@(Xi*Tu{gcH&7|H3={?)kASdnYYD&Oyf2 z)%LY@p|%pOR-)B&wXJw+b^K4DFXeb|%3lLw-_idtc+rS^d3E@(jMv5zq)j6JqcE#5m336(@LooxYPq_T{b131P6GdCxJkK-#pXZcyb$luwU-xDA;s>9~o!glpWVlHggY%FZoZY09Y@UXeVaw}|;YPRwH z&t3_uEUT-pG?RYBESoHw0!gH2GHv*a98zfe;hZk0)4h|^at=B#x6GDwmJH{Dk!gEc ziH-yf+-O0mQ(3mz=Z9s}7F9zEhgG9xQ&4^$Qj(d>O}6kkmBo>By*3hB(haIMehb}t z1sA>3OP_0)ZJ1N;I2Yb%n9-5PoWecMnIc%2I7ggwhG4*b$tqh1F)dkM(?0J{HSN#* zNuH~iljtDA-AADBAWy+2gnYx$yV1J#4@~%Q{ZW#xq9gCbViTK7%i(z@Eq5`6Gcmb?mwWM>SS+(ua~3%- zD!b2XQdz$|Hvy?JW^Q)8BiDHusf?IH-;?1zpR)T4gY)Epm{e7bhH?VZXID=G@8l#z zLI=8O-)ZZUPC!>3-h0D!x4SBPRwPv2P5++np4Z`8-~_ifMk1jSl(UaDN%~P3$&~ zP{4+o*Fm29vAzpk%w4p;k=-t_KBd@`fhhCT&+1dw&V_aNY_}0(aNW&7X2s&2rKq2d zfxZtx$3f+MqVlW5Z;A3)*n*MrW#!3lvTQ&i^8%D#zO>6_(v^HYN(zEU5a}9 zYE?2{$g4_Xf=ASv&7AP(Zq6B7n!peCaKan6RHlStk}?Zz#dIGJ61~iSIhR;raAfg>(~0P)4W-ldoJ8y z<%D~?C-5Af{{CYAOqdk!oJmqSo(LbhPPQF#rF%!DGyb9WZvz^ioDU^hg}^_xbD^+?TR1;qO{@ zDf!?3qwn_pUB}!aUoxY(bqxLAmzu)Sr&GQ4UPpS2n6gpX_eS46T7Co@E&l{Ng002d 
z%TaO>w_j=H=&$Qc9w`({9>m`~QW3PU#5|a1THOpde=6HLgQ02a^cxp-C9emzW4sjc z85_)N*9TPF7xQVD!LM$T)ER=g=rCq0IurRFQ}?UNt8d2Q?Z^)k_vd0h+4IHn8NmSs zW?th}l`rNQA?ud5SX;KNirNz@QHVmg+?nN@YR~dy`j|{* zMy4n3ifV##vleMu4;ho+amW5FAIX>@TyYbV%9pwM;7y%naQ%ec(RO}+%XM^OrP9+B z=u261i=+}^yDP?*R2_r9$b|Q9seH7Q%DeFC^PsLK*RK^!@b87U`$uz)#q#Ml{GVW{f6GNnUC2#aPyhw zIFrH`FR&czSC_Nn)g_AZIH5=;uho^S%VHDBSkhiwn6>@IYGoN?VkX7}cQ7_brZ;N~ zGl6j^(=#`dcD6`mQ5LVM&h(}3R+h69nM{p?H7PT)wq$zdI7mBFq#Ca*ag8QILBZ}; zrGnOZszBlUN=|}93Y6Zv?0Vk=j(A6=54DPjlJ@C^6jO!oVWs>Folot3Q|w7sp}M@T zSv%oX+nyRnN|)HTTbn z^^6cmwy`wD^G0r6v+;2YKbcnub9zXsRhYV-mMejFp=l;DZTxy-B4#5fOd;rzGD8^V z2K^ZoK4N@%wpnm-F~thchFrroZd3t3uz2R^PihA9z_|nn6Oul?WA1)6=tb?B87RZN z4cf2TwiENVrSlNxSszgNvIQfj@R-*co3b;+SrbNYQ1~t>NHBFntdAL$vzJOa^&6PQ zlvocj>9^3CObW!lY48xUrj&C3{bAWuy@&Ed&w7Jz5$oL%JpTOu(&g(n4~g7LSc+g@2v~lckW1x-AczbswJ@IOnC9mGvQe~)t)uxGvTbA z29Gyj@F;-ic7-*laL4+hxF+(yb}kI7rp9j8@FeYI+r}c=4iQ!sD#}v?7o#XS2AOIV zmKH8*VVV9lObSV{3ilR1T=^=rk>0bfKGsLaM$&bgU2U_9blzCcLYoe=pDoBlOtvjOyT1oJJKP`Qwpi{ z(DCyGqEV+$|KX{!J`@M@OPmWRir~PM4%Z_jg%~N07!%gPE8W8d7f}=gK>{S*K$IlZFnEI^yU0z_ ztipQ*{RKw(DzwGn0t>KsVSG&63-Nl-u^fz;TfozKJyJ87kMRujs3~1$9srNf`9z;& z{l)&=C;+XcMCUaDJ%Uv^HMaPuNS`WkO#X)3WyEo8S4_XrxliG{0!l3?RQMKyHh#4s z);r;*3>0x`P#oXUr0}gZ#(HTxtZe2F7H;2cL-F45PRl~jYhnqT61$n0UPc?5(ru{a zS=4d?_pjG0e80v$@nwb2E4KzOu8;L;%PFPA<;8j)Z7}#A&9JslgOnv^@HC?kCUsLu zdd&sNJaOwB$ddA`EgFY1V;IsSx`jf zPrUEpp=4}+==fg4*i7Hmu&Th?p0VD}8%8O?Ek+ktZ}^pAthIgEdM48_QJH~SWU;ST|w!4A2cD6n2u-{0_t3iM#8N*BI!NXZfMEz4cbPQ|l7>Y)F?QykWR<`!Jpr zYaUd`OfZaBrvJT}Jdb))+%MMr%^und)SVKz63(ap&290u?YJ0TpggT6ALe6Yu z20S~7C_O{LnRklYNFaeT6?^TrUNGxwk^OAIo z!cAv*$ZWJ#2AvU|VJ6cv1KMWj%Tyc1`{J{S@Z!su-}1p_L+pQRCfrZwciGFT@j9NQ zpKQDSvfj&Nnovi)cYu9`h|{88j2YowSQGgj6ZTB3%-XK>{H39u^6|@r^v4zNV?ckb zJM&FuqSCAPF?VEbiFYW+g4(#ut&7l?wa{Yy`2^$)_71Tt#{x++kbIQ1FDq1K{YbZ0 zfTUGubJ035rm>k@&5n2?<8vTS)u00##NLa_Mj7*wK}=7PF)B3g(1UdY))7 zjJ+B*yi92zwazf&O!LsJY+kwT2>-t+2|f4?%BIk{yiG}%%kknkKdk(bk|tp;XYxaR 
z!?wA2W`eTpHwnsghCc~f;weTWNc(_7)p#mXlqHeS_s64DIn=`lTCuLO$|z<-eUVVG zH)sT}-2(}J#m z(n4||-nVQh8J8bQ*l!qjHSAb#ZNK5BoCjGl-7wDDe%;O70a-$6~$%;{RAs?h^sKRJduZ=VXE&yY{_R0 z{h^6s%dEm1uKt3RA;OxpF?CAc2z#O< zk$>=K*4mrvDSyYkNk;Fgxg&|%ciO@@a74Ua;avbL?~g@kZ^nDK+PKRK?^)Qh{fiR3 zlrINRzGS1;WYm`Y$9ifj;@u>J_shh;qU}3*GEP@Vtwd@eKHy+7v$jyHo3x)4TP5Rm z)%<`q^F!b6j?#&)$lAWR;FnT*Zo&UqN)w?fLMijE-F;FT8cmsh9jC%_UXFE(mSI<_)p*NWKoWD})W8T&qyJp4LcA#kmUMzObRX1#I-3f8{}2SFPUbON@u*!lW3yuqinG1cfKfclJP>hnbq@`OSYUX1$l4q$p*> z%rx+4@vy{5==aAprLbJKMMAG0H+Y`U)s!aQw!;48`lS^{ZB?ybQT~Zi^~DTrRykj$6fZ*kfF-4 zk5ei7XnrVuZ&dye;TiGm?p;ydRzjvQS=k+A4YevGp#ew4-@U7ZNGRq=U%5qMx%ypw zt-N#m0X~uU!p>8S8|aDMnp&e6MXkLlxpdtZ2^~2a`)1Bwy2iEaVujK!OC{C~8C0vY z!t`6&$x2v|`-_Y%^=o3cs`37(N29W2jK7cG^KOrDfY|oiy?x%&(AOumn7*|){SUqA zL*b+8#%ede;}!Il>JF+^sfPHUBYje;zx$_BYHEP?Dm~m2np)FW_hqpL=k4((AU;F& zatAI#ED&||(^b&V!*NkPZP}yu(a{?D3=b&l^*l}5h%Slq;MG9wCWyAaS!S6r3XCt`h!74@RnegzP%K8t38a*@=JLTLy z-o$^v4{c>3YvdYLkWuM^kvpweExgx7`VSX4hG{Kj<_Vu?Srb2OCCrK-k;}T64n>#T z{#KV_fzB?6jCxA-R+lC*r{(l-)CrHMEaP=yCYfY>=J$n(VMfl!wUM;H`}^8-36~2V zroOEz{Tp@Y&DEqj^b`IqS0wb&(c2cgBi7RsdhNrggq{XF`lf3pwRL}HM_+T@meADR z{h1w2gsZUbU+Q>r|Bn(nwmcp(nFw+>mBTE58G6lx)kdthTt2s^A66hgN@DnE!9jf` zuDX6K5ee}}qckzhJ#(zAmP#E}4P>=O;iWR!*5P`du4|txq}H2>8S3H35+Ur)fm|*x zuv9`i$>Fu}WK=R+I8Rz-{1(>3Vnt$FOFUCx)w6=;7BS@v2ECZ6l#>e$pkS-8yU^mR zv?O}%SjX7u$dmP}aP(+Y=H*jz$+VN9@^jSn+P6*frEVK-rgB_!?)?&Xp^ z@xCN4lfkvJN^AS&f&||%yTS*0e$I@QuX;1qzmQbKJdjd#=1_RRtilGDS6p@WgkGvS z@?U9uvFJOCguXpax#AcPiSx$5H)7sE9WV5RR@P7}Al~<8Qq&SqHJthU^Uo=*|x}J_UZh*%N9y(IhW`q!{mhjNtQM7<4OkJU*2E)Lk#>9# zC~T0YAO809d=voBmRZ{$D4^@gzSW_zj=P?nwKmZYpQhp2yA9H>N119*=&W zLruh-v71Wf!Qe^ZW{#-ornvRK3=bJMz2n1uv5kFm`@Y!j9eXEMrg^E3iiG}s#8OFp z^0J+j(C)9$mQ3H)Yt2aLfn$BWSs~_pzo!p{p||n5Pt5txo;0s$HB3Yl2vY8X@}(Qg z_x^X^#@C>R{+Q*WBYF>LsyvdI$WMZoq9Ni7Ge*5nQ>_g$T2QL$La)|_M^Q)-+u4V{ zrM*e<&_idBzTLlmV(yMT@jjaSW)k(kMnV&gQF%GM-%vb%i02qj4%SMm8vZwqR9X^P zuO>4q{H9ak(-&V>4*7!O6-DcA5v_Ef#kFeWk2J=M^G#>k+TQJ=bFAdcjm;7S{`Z 
zY#0D^mtCeZS>c;6`nRcUPk>E%PcHQM2W!j^fER}EED+^r>{f$kU~#|Vp3tiw!~_qp z8v}7W?r2KzVJAwfM)??KXiR|4)1Y$g+DOLpeljA?o62+bjD;B=$x!({HpQj0iMjBl z-l)~CWhQn6`W43^HZgF)KyN>vqBP!~>}9*S7bp!y=lEX@6pdkaT5B{#)5MRR-tpl6 zG#_0BJl!y2-v2a4W$gCb-0~sT8ZUL^Rz_#HC&bpkPT4uPfnqJIWkx+_R=;8JG&aIx zliAnvXziC9)!xl7E)K+cSFY9Zuqu*fw0!+y?B<@(>|?As$&=UcPgntITeM9oujiG% zc^!D1>cf5>U*H04(<`{B?h+0U<2Svsx9!Qh2k>UBUKISmT{?bmSZB$7da&oSSN4Wu zEuXDYdUBsodlD;&N9LIa4a<7uRF=WN@|&J?ek#8TtE`?-$xZoC|&&z$fnBOKF^M%LR&zXHN25UG-aP_a7 z+4N;CUgZG)+d-L@3}@>9Y>V-y^Gn0BVQ3+X5k7;R+)+!+pENU>C%z5;=>m2!qzjxz zjnWmZ8txPPCK*`e8NUlKLQ&_BZ(@JQp;BjZHQgf?37x1h^7lHT`^D(Ks*kG*_yN4i zG;lvxLQ;|;ku(@7e~W4;ng5a)jU~dt4IHAe7a;X5>uEH0E26R7|BA*MkbBa_{nuzL zETS^SmTic~CS?wR&1yY)CTu+E;}bp5{3p*e$9}HZei2bvMku+Jjwo#Z_viEEjlLM{ zDa2rpy84ST*dg{7_^W07b;MwgBL+(&Mo%WMTKjr9*P`~wH?tll>#d-XTarQAf{4M! z3r~>m@qZjg|9(&O;$6>9#66K6$9_-J{!0ut`tDEvo@6%S-CzEn)tg$(d{1_Y&o9Y; z`PTR3B=Zg4p#P2RH_WrRE6{`M?C<3NX6dhD<{7e1EV&bZpCNArEgUhitpzONBuN}B zN1CF}xN%Xv`HzUe=86&6uZmLEZV0fxl(jU9#d=%;?#=o9c+lh^QFed5;V(an#&SQ4 z#$HA=_6_*=^Ea4zpk5Blg;0;GSXDOjb~Ki1sOjB|d5C>%FYf*P3SYAKvu1Kfl5c{Y z$xPm?%t%5kRvT1l5p}hzJY+OHxBHVtOZ1N$hzT?y#$uD}38Z6IQQ~geRwVR7kLzem z@OkYwfxeWdv3e!KTP`gkvRb^u3Uxi47?I@=k*x)Vz3Ezt$b@@CUmq2vB<)i_F^}2v zdiaomQUPO@AnyXwnnT8cC=}sby?fxVm-re*F)DV z`5nLApIk=kVto%HGOPBUuE8H)&m{plgM5q9V{Z9ojb^LNquoN%rgzx(%OJ~Tp1T6sX^K+L&-%VX ze5$Tcvwh4Sx@J$<#R&0T36=N31CoM1GFY>7yt0_dq|rvDPqT&AC&N6_UJU|`E5zvM z`!4JS5TnX{(N9fp^fSjTFK5PMualbJx38xh+MicL^2iXi{%^c1-&}>#3L5uAX^Jn6MSfV&H>SQ=q^5)|mzN}SOo;wJ^H7(@I|p+<}$YQz}gg1+~OLG0_&|I$|i&m3q#ME`{%CdDYY zWNaz=B(nwTUK-OjsK31A_3)tayU><~`V|R{ zRlYZSZ+Na4sZ^9|5U0e7`zP%bx!II|ihc7z9yrkVuW)~^ko7%;yBxH;4U+ANLXA(= z8@beYK5w9N`6TA*fx;*P-!@Rnd#IDfoDI->_jKtIbEavFIt!hM_9Nzej~H{#-5c8* zbLRJo^DkNkpEN{ifx&Ek)R19L5%tiChA}BiZ4?vgbEddpUV3S=FQT!hz>-O6kdB^? 
zoiQstsR8WuFjGz)QaZFaxfmP_nImrkw~G=ZKlJrp!x(G3vgm5K$ZZ&z&F|!s`I!2s zH1$K`=Xb2wOQj7PWORnR3}Xh>El4(4*dVJ@rfGP{Y^s3`wASzI5_i*!X3R-*82x^uorq)loSd2kSdqZ1syhw4OL$R2nBX(%l-5xLIR-C6hvP zh87R;^hx>DMvC)@!mEfi8%D;JQcO_pfw)CIi4;^$;9NU8A_Zt|<*US6ZwET|nUbc+i z6GJ0;OpY;h;#O;Yhg$~m`nMot}V*D%`Je!Spn_;itBM3R?>S5xPT zt)BtfkAy0*szgLa#4C!HVVYs|fVxGXrIKW2Mk1x8M=?hhinf3`BB@{38m2KRtT9+0-qB+eAsskk14>dgD@>-ni6nk7>ksR#XnZ-k=d5F!VXRN4*5zMpOmQdXd790 z;%Gl;1XWR5+rXA7N=Gyv&ZRz>sGfOIqZ$^igjGUUOXmc%pST6Ll?~eD+Vxj>Xv^ca%|~=TvQ9yDs71F21r3y*Yz^CBbVydS}!cNrg5-1f(kxnst=MpO&v>vRF{j)}z$A z@B8(qqqlXMgVNZFqjKM`<|;BYG~NU)B14+%7}cad^~Fc2C=z<&IC4bq{r&zvPrL^E z2tKCiWz?(B`zgXE`1Cc!*P9Y52kb_#U^`$xsL(ImXQS#tjW^M_6ZJ{ccoUVuG*|31-uUbBAGu)^<*cX$ z;G$@KcI4N|)FM83Q^u3jS$=Kt+kuTw5#jY7L@I$~v{&FLpJuBZ@0TCb`wZm`@YNYH zlGq+Ee07>?c$zho%Iccf;sFgtUcN=Ml?dY^)T=GdWwCY(9oviWzR_C#awN$MnM6He zuDHge@~dH&iy8k^V^o&(gxU}TpWgAr`;pMbquftCRFP2OQ4XH4i|~YT@Pv8c3FF`i zBf`WU>IqZPHjcgq-*uL2IAp0TTj|LJr!HUHpQoH^Yx&P}8mx*Vq34fqxBVlL(7dA* zGpfJ-zxhOn(2DhXpHJk=NS{vxk)i*KPlO0(Z}s^^EQjU%D^)5UdxC^N^_X%QldM$X z!QS!&MCLrKY4z$oPI>W}W;TUnjPtL2I$mJ(blt+#mdR7TD3>YDuXq}~vdDy;;T7%` zk0IUy$|C~Tx$0;*+yqr1I)2%ZZ&puoQjhdw%T}t@F!3DoR$} z{0!$H?gch%F2g4{BcZ)GCrS6TS3?WIXEH;_$0=QW?)l|T>_=Zo(kk4Jx_6&|t>`By z{W@F=FNBHX6L9{DTE_hKDl5NhW_B{mKjEisy!*s5tlw5T&nB8zYpcImeKwKwzLS$Y zw-zOhL(=cMFd;X721d$Fs!dt7(m~6mILsYn`DNK#?gi&lR-<;ujA~iX3MUNfw}LZg zM~KCn}zTt&=#lPVnA`V4*oPy6%u6GvuN1&Qw6R-ODY zyP}Sa310a+@bzkv=DSiCpsfKj(MMY0ruCVgdPhtprWLdoE8cdSR+$`^WXHZRK6)g8 zKEoGL>>rZB_y=ZOt)+cC79i=Bwci8Bo=DBrNtk0Dud2>o%^gJTWMlQpr!Om+^KQ8w z^N|9JV_{_F({jFVmQ+ZcbV82U*(7nMPIR{h=!_ljrYVR=hA#)^ADnb}P_B$0;*6ec2e>)1*|J$gxbIkW&T(GN_`Z$c zNG%*UbGX?Y@@68BroWpPP1ohlaF*K)IIF_=S?oP{%c;(#^J0XY$?y5vslYE8ex3N% zU}c5herUj1i#(k1x`4}yGZpX)W{;mkMvR{;GmoErHGH917JQIg4WD<7n>X&BI7b@% zb0Zv&K6bj9n|QjJx$ktd{5~Q)-z5t!CQU0&H!EkKZdSicgc02%a_PIW;Ayh*@ez-o z4u3xXbl5fjbaNug3`LnGDC6X0!IR{4xX5z4*;IDAdH8RMpz0oklCm3H^HiqO;VYKY z;R`64wcvF4Ta?a1=}9O(6{Y>0>_%zc>9A0II^69z9sUlr-1Ra%lGydvi2VTcj0hdu 
zv)P19{tY3^aj6M|!&WT$;(2^`1Q$h00~~9+R<1T5Rp+YbBso{CG*`=KQG|%_@$pFL zHxah?*U95YP$#YT71YU~++nms)97<`@3Y~MY%blyV{}sArrX&aI{6{xS&x(2`ED2W zb-H+PFwyaaRg7{FKjI7Xs-%Mk_;BuWX80k^HCMw`2afXh^AG1_@(Xz@pIWmNQM-Bm zuUAoyA*qY%F2c)NV80_VDe+U$-`Wg+D<}HOQt&4e@`{BB0&29+o5M8<+5J>rJNk? zAw6F>lt0Fo^9k6UNVYt7kfcuV&*9&49~gDl@dTbw^vZ>QDWQ@e&YNW`D0$t3YqLe~Nfa?|5N98GE*tc<*<{^VsdW znT!qlhdC$knv_Yvhwir|!ZsIU+1aX8p9~nC=2d03ffrV+I$(>x@QgsRqw?t>BBGhR zr>m_U(TDaqyeO|Up1Sv1HHsGk2Xij*voSjdFiWtPZdN!=B03B06%G>+YOReTaSEQZ zM*rHtK~Tj6{_enJ5#z--8E=Kt!}}5|9A;;{-R#&BFgsILRbcEZSH-txb2xLKIj?1< z2Jtv9)4q^jl*3%~VrNgbeDoJ_z?gyVAgNRROZb#kOfeBobiVxx<1=*>udoR&_H8%U zswL*-yMf9C{vdxk+<#|kw$t&}@+sBCIO#;0a&gV_Rc|>+dZ>1=V$yj7 zr1M3^=r5Ki`<*wf$`({yg`1?;*K!@kXRB|J{LB6nx*tOLdxZAiwCY?SW%UB8#n=+a zN1D^rMRSPBw34KlR+)w29HNCO7Xt~P>xNs018NiX!KFJWRadwtouI$M0m|ojhYP9{ zh^O}l!q41fQrX(oLwhntUW6p&(s|~63({Zj%uY7?*H%|Jw})j0?B&R)aL6vby~<-# zUDl*2{ATBpWRqiXUMkO4<()7gw!`Ag31j)8mbORJUWYg2YM?K=(+>T%NVG_3Own?< zKO~EsXKKCjYJwcf+dDtKM(E_KK*RoAF z6!poL+QYJChj4BaeUh;_ozAruaZGCw_ANxF&8hycJ;e!6n#=LtzzV(v~N z?@*FH!~f3%9M9A<)25xnxH7gYZ`N|7$6Y*gKtEdEDj$I>XVS%Yf(9LjaU`NIdh{hZ z`7ibX4#NanSofeUtimOG@Hu=90j;zBF%B8Vf!#Iz;^`PTv%qn(%y>KATI$7~N1edQ zMpu7&AlZ%hwPBw`&|ABZAsILy>?>4Pp5@k}uCD`ovq z?B=<5C@q^-ZQ{4_E3x~%qZT&6=x%?hSm(-|F8ux@U&McfSirMOy%nT4~mpnrXC-eLXSm9~mjxkBmxivcmj zEcn)^md$(*dz&ghvFV1#sauzy1bDnZ&S8>a8S7k5Xw5o9VXRA&v z*FM3m$bBMv#il0;SA6!wsTJDQ+{)b5*(*1#E?oK9>QgJN7VCKHl*?x=6STjm%YdCX zhAHi5bra<0=z7zNOE%@^S=j`Y|Ew~(hMsrA&aSDq$- z(DG!Ij9-}(18p_PVFafvYSHPE{a3@YPNHr+e}L$-8ue;(S>f%-Gr~kAEJmJ|Od3$p z){{Bap6QV1##^(6{QGh`$N+jiFnj`@*z8C|@1keFu(Hi(@jVjE@KaM;@OkH>L$&9bEJbUP?1k#{VKV->P2*?o zl1*Ek{6lSWZh}*^s^&#Pc^|HU{lQGFwCQSo1B;mKBXvo3^%i z<{IbHoPKnLXE|uQM!gjK)*u_j^g8i`{r@zY0-xnTknb%UJTO1y4 zl(!ZsnM_q{k=AQF(AEBf?Hn=g=KleYn9jw)W;oN#{9Vp3fxiIq$54FWft-0Q{pKpV z$oy9i%xf{^xDF_~boag8O7IPlN~8?PIn`>)-h6;-jmx&RaF8hG?0ZnU?(7WFk6efP z7Go#GnxmE2&2S+{c1~H3zwoYcXR)JUGRX+}KMfnKv!L-$AWAy+GBM5IOVMYUR_h$r 
z>FU{U^U{{BZ@0#>E9@s*P1el^NNc>+)_bVau7{3WlE@fD@?qfh(4E&#gFzQ%$VUuP3+n>49z+ z{|c-OR(SK)tll)PhpvjOu$Nh3rkKwO?}v%0kSAuE$_bB%=Kuuv@!&u`Y4f_ThKXC% z`k4y-WiFrvPFRFDUiAMq3GqXmF8B+4U;RBK_?nx>6X~h_HJ7wLx(3jCJ_`~=#A6SR z^nr)8>Wrr^$T6bxZpf;UBRwOriDQ!sA81i2;~j(j={N)Uq>G;Tynnwc$Z_djD zvLX)t9zw#?=^3+6ULmAvBO&kL`fU;&FA0W<`WGYSZN6lO*S1QJv8XeOFM=CKo@pML zNhO^Qd-gdWtksojUxJ4#!=9Z^l^7-OT(9=2ys;Ut)sT_!7b`fQcO}m12np_3>^-%% z^T1qI2y2R~m8)7$ar0VC37Ff*t-JO$2@*n7@28<-=&y&ojz zq0I^Kh^hSa%rrCcqDT?yTl8xa%UYZ&x;yB4?F4w}>QIIiMxty#roRv0EivuD`Rjh- z`D<2K0so^_SX@MD$JLE=&{b3S8@Y&L3!}T#J_Y|)aQ+K)CH*c=Y#6W)E}7S1e+Lmp zq3^Q_i1GE=s@Bsgd<&qNMcKAnv`tpfb)l|Tu%C>X4H{&HCnHMyo!8CsmOJ~Xva=Jr zpx4nJQ%bGFyR0xt?2o^Qo`V4;sm_ES+4&aDtIVjUd+R^RakX78n$Q0lCs*>o4aMEW z^mJ|kzX>+`WVEoco1Ttgg{+7Se+QvGEpN$+=6%$;kYCE9&0Mifp#c)eDtL=%kBRZu zvss~9e1jFfM9#;4RXJM^50LOuQ93^$%-l5@(hJhGL@F;>;nS{6KHmVIQ?=L9 zJd%;$@%Me(uNm+afPU+0ns$P}z;occWGg)RxDUs@zm@VGq)(wpuVG^`D2iFojEI!5 zlOp&y{azJ4e2SNT9ct|9happrj} zUlm}5bzL#QnH5I$P)gK9Yq*nQ%?gKaqU^d)a%e4$wK>pbvmougm?Itfub?UnNax!?W zcUR3K{%qp4#NDl&M!Q)LuGB$$ka5^?Q=lo=d#Mar)~P9GQs~o*H%{|ITDgAP`!82KWzbHJkdc63QAJ&x9 z8rDIl7!YZ}Hz@L=dXP%%{5@8FaIc(>%8NVQbm!B@&^B}=-xM6c&UW=SIp|R4d(ckz z;35)Tb*5RHt=TqN{I=I?UE_JTxa;W8;LFLp3p?Ez;Z)~GPKH-w#|li=8WSyN z4Bww4!yYy z*8T(TYmO3=CYMSo@PB7i!ci%HP&_eT?ePAVVGb9q0EUlGN`b=vDgb%)XBG(=0$o#YAbA$)>IjqkD$)O1n&8w{KZCUb8#P3Gk$u8aux zLvLuxpy8tD!b5-ECrV!IFjIN(-GoeoXS{hMEbIMEQ5kK?==58#n^^6k?=d4lQ##N+?QtAde9lQV8WL(u+ET}e3xak3t&@n^Wl zR3F9{qVT;Oq zhh$83T~5+_rg!-Fe$qIe?*GK-3Sh~OZ{o^)JjO4S^iW<=U6%R4m+VO2tQJnSutmjM zYWvW5q8E)z)P`(#f~zOvJ7E*u>fm_o`+Tf#BE0F4sW_pzERY|}dgyW3%HgOCUkhoa$i<$+o~R~{gvt;{ zpmt!C50kA_hZpd$`OuE+_yAGU7P+{#%xtaKYuE1>z*D)J6(FPN9R7+w(gYp(IJGRI zR>re8abid4=yYKwzKqax@PD}PlrDp~kP5O3R>)L*ohD-);tz~ky<;q&TuZ5r# z)j>wh3bkS_hk9#C6l;m@!+cPzV?58MKG(oE2`wWz`;`6pOLAIAfe<3D)JkfV5 zcn9~p0If^4__{vs_J0R zQOqZbHXkE&-lC(9+GA5X=O6tMM=eIM!%uD8ZRnS(+<^I*igG?smZmhbX%|1)Jdsy< z@8_u`pVmZYCptHeHTn3L8!0UvZJL}x@89|<-AxZw?W=0Kn~!S%S3u^|o{hYf02>!t 
zThx2gLl5B$kJ|HoLw?7$eNl`y@7oFmIhfc0!`+aIS#{=d3O3YB}2P&>ZTNBfc=dS>6!CJq>n+TvX)dP+fcu&#=pmv$GbM1s7WsZ1fQc}Ew2rIf7@!e=m*|a8OVWK|cQe8H*H8*>agMKCMWTMJZ3q5uc zK8knXtEj?7(*EENAJ$Q=DNOEuao)XQ+30uhC6yyi6XFT-Rz`OPyPh36)Z+^Iu+OKQnWaq=O`tO`WK?&y_Z=+wYxMraJ)sD~Mv!Kz1 z;b%WnF{asSCi7M6@0^<#r-<#QfGsBtl&p0eJ+F`|>l;fzAUFbZk z9F0@v3`c8#uX-A-?b5mG8^&kmwJb0%KD$ZH1t$|OIEg()!UxKWfeiM?a_6w)MhDShj<) zOX0ho;=o}0CxQNm`P`c^KGG^Wq7S?ocD*Pk;H68>*t(m>T{0M%`C@f3l1z#ZPToKTjr#i^+hcRPh zjWuPSFkScCz#*qNZuR@Ea!#MY)V-gx@a)oDmGNr$_tp6FW6+>tgta#>!;+D;s@Yw3 z2K)2nern@n+gx@IT6+c}i|l4bxbtQezBf_8vtA-xaur}szHrUXWy%wrh=Po_+Kc!E zhgAr>5?dJ9{T;4P?Eem{cFw}Q{u5)f=H3e$cH;}J365Xo_B~<5s$IQ3+m$0fOC!ih zo!bJ>ITr_345s6!5=ez;mF&Su`~k-zP>8m(k3xQ=ph^5sX{hsN-y2t)6}J42f9zw& zD*^{Zj$deAbFM$=M((EkvNZD~QQ)humc@I5DqT?yS6g52au(&t{MjP4c)<65%+YgB z%(CwDunM!R1^+z={v&Ozt_;e7puML-dncQy-$Ip@AaKyQn*H}-A2H<<$|F=}euwK@ zTt|>b<>w0fBRuBcFo|E}q@!HqfE<{95xz-+_2#;!W);3k#Vu-5jy%bS)(=X|mX zxSa}zPtt!DuY2}dZAE27Wc1kk|j_DoW>^mCN>und8JKqoLSbTXwXF~l9 zv`bvJK7F}E*2*$}3(HPo{3qvKxP(%lHu(&UiSlV;_HzzXK+_uSg~sYljJAz^DUocn z{kPgW?AG`waBs%I3-O{XeHvK(&_re9zzeNZHUg;^fYjMW?9|>_0=pFbvQ&z1m@k|Y zrRNJMbEPRNJIxm=sq93l-=oxD8(A-1k*TH>+jq5URIdjn@%ZKt!@vT7&h=6LNN~PG zNyLvrG#8){KevDW$@F5E9Jg9QXgU_ch>6Re9d$=A#X53;~K$p-NiNtx!rq(5f)D zO&Tb)w0t?Bj7gfNB$6f}N%?TbwMtQR9JhYVEPjkHDzodX?g%O_?ubRj8OE7mTZ9=@ zm={OKV(PrxDoo2~-{1e-dy`8VTz8-MefE8y=bWdf=l<^hfBxs3-}$`vo_pimhdpT) zE3IaJao+km#o6EU-2D`X`pfv2f7K(+J3sJ0UPnJezoO$OmE-cA_8{$bSg>Qs;3U?z zL+7{SNdlYrj$7|3-(6T_D8&7L>~!@vvEsjAhfn!#sSvvpP8&|%e)%Iz>NDbM=c}$g zpG(!B_DtVl56;JTaEqA9Z67SM(lb-yGaKIz#=98-Jdv?58$DN%%6{A1?J6`d>P?dc&t9O%Lnzhn1%R?)czPs08Uc@`Q z1J4Y#IiKG#LtSCdcmyl#J5PNt7^OT|r&Au>PG>6i|LoB@bqoHrP>1JWKNw=2w5BYf z``ahmZqhsR59ke6+bjKTcS#3KW#e@?a!CgaW#c6r)$8A9Q?fpi$|7^0WX6NH!rN^0 zl=rh99nR_DS*IR%VaA}(Wp{m;J>lS%?7~C0nj~AUJ2H36%UgF?ogcoO^Exc&jLVj~ zVRyt62z6yonm)|dAH3DX`X!dtMcU)dR@?C&dV)-8JZaK#8?2M+imqiA>{#yKj+(+M zph%in?Rh&{SFCT&klyL$bv1&oxvNK+P};nqf+7juo?(4caW}cYODe*sMd^Fj)#98fp0AKOD!fEU^89w}68uJ|9pCax*LE8t5ZFs_# 
zo!#4RS%*0Z_l%3|cpCL|^_jzY*XH!+jmYZVY#PzK$>^;1ANVQuY~}Qi&G{W>5T4eU2k3$zWQdbMw_Gk)g^vvkM~wT2dNovV2|eQ zv*+;kYO+K*d--q!?!W6kWhJ^JM(n@PcVTzO95&@tm%$^i$f%vwpV+a6TH4L$RgOoa zKT?0l!#CYZx-W5VvleuHJ9zIapJZd~Gi4oYvgUP_`JdmB<@_Xj@X!ogL6u`1m@zbK z$M=E(wCO_IWVGqRh9;}gndmO&ePIUfXOPx}p)z+C|1#dO9_hxB^A`_JL@yp1xCwWM z-@}@gbVlF`zXH$ZD!yD(v5tyZr0GAYI&*|Zx({*3x1kG92}hz$m@P*~^c!_v zCo8N^++$&QcNZ;V{G-@0^7H;Y{ZkGD{uc)F`zDz5__Fepu6_Ns5uee0#+-3}-D%U8 zRf!xOJx8UT+mBZjogQ)e%fq?j4gF(vV{eq(&BItB879^a>bKJUTh=|t84EW0jUMXz zh1B;eFjm}0J8@(i*flck)Ztz}2GpPUcV0J}&#C2Xcjg9IV9&Tqu@ffGoHyOvyHO|M z?x6Km>DXh{rUTfOQ3pGhKUg);WjsK8pyc<)F_YNjBRoL{+({Z`4+yKzKr@4cAenVG(SVAzhWpXslrr_y;2dIq_- zcdWzfF!WAzJc)CYYV8SweJ)z~XNP~*YciVp$C{2J_SiZ9Yzh?I&)%vApM|--l3|hK z*{aeHMmn#gr_eV}IK6T3pD%4h$Xr-yrsSigUQI?19ISeS(t_$HoE>4C1} zf0$5Zrpw;c5Ds}JcsYvv*P((zxpa?DV~w)%{%bzpE0addK8>LFI_b> zP<~E&g5j=ispoc2!7JOoyvFfP|14JEFg;n|##2e`*wEwq@qQ=!@z6lgIla^Liu45T z0!zBnQvcf(Y>V`WWXIQlJEg01SH1hWadfGNm3-9w9G*-Y!9MQ31ah%w@nd{lZcne% zs@gk>-%_CR=um=@(|P|ttL8tllU-fWm4(^Xiv8bg${g(BKbn0Fb|-Ez-f*Z8>xFXz zO9nr=-RQh#$H##(=N9a<-NnD9@ZA05vTnp)_tZXj`tHIZzR%qgEZjk8KajWS+`x%b z*zI5RQTDTUUvX$_%ejGP&z&3S!+yHRZdwF%5FM11wYnfM9cj_mHor_vq#bO{_LOOi4az+W$5*~M;$LXR#aDf#qKut z>T>Dm_ifkXY3k6xqCtFj2s6@@yOERGTd|&=2)hd$kK#)DmA1(fCgD5uAGeLc|Ch1P zqsTeRf0=uce}(%I{}XQ7flBx0$j+KG<#bo%@2=|0&YJH0YUEFPE_ZfejsCv-3HJ-Q z?kzg}0rH}U?J4#;$FFg`nqQNwc2_w6hVr4r@5+98Snjv=j3V3h=Lb%A!iufVF`_q1 zq7|AB>-h?&Y5EL1EZVT6W43Db{J@{icjI~$J6y}y-*olh>K7ME`rNF3E6eRQ;!3Oy zSNsR?-Oum6M0ZDi)~h47_!Tm-wY>VKfxt_XCw#~GA+DglGiVw)Y4Tg{&2IMPcCbVa*)>w{`*FcyXU*9 zMu>%L?6l4w8aQn7J4)*l2>N!vVx0z{p;9~JT%}vJ*Q`4z|fzs(|f$wM~ANh z>p%Q#VjzF!dxQC9WbHlt3Q`WnoM(D&aNefm=^swB1-J40K+hm&uxjeDL$#0C8I&%*e;`cvp#PjlE=QI3@1@mAfo^kD>YXsdy+!62l*FlU0R@zURN@F{& zGHHas6+Numd}KeBj+~FCU3zEDj-koA!+7&65}Eq|z9uIB=7|Y=@&bjr1|D)`}gXP%ACXb5Oo?^csYkzrpVt{NMJxDu=J%MNMZVVOS2s?`XWbyY$ zNdN3W?fHt=u@7zHzo8BYJFr^)1Fo-zt~=OscA$D_M^E`qJoTtEI8DyUxQnp=yuE;R zmG7Q{Hzdw~#XJ{|E3A@K}joA>H zYuY)4ck?bS(qolIb#B^SzB9WnyS8NfHmk8qnvJ!>9L!iHr@D6^FRx&I!8i9|?=0<~ 
zrE7~HY`+&*5l25Cdk33#OV?OGyLx38J!KDc#jjZS3Rcfo9UW}IJE{7Ijriu0bX)E* z{gb)Bke&@GG1tbLPr-M~f{-4NqqOO!`T_yUACaWKl2rzf5OzOj@H^&?l6l>Z!7C%@2LAOFf5YuQL+>U@ zny4#%YjAUi9y=`$4sI^QPDI^agt*sp_n)wz^1;G!_PzLjq{moW;6BttcFqkveu`@H zNDu9$JORHI_u6vm^022;?=1CS;$%Df{F}Brc4XDCAE8e2j5VyUdJtcX7&O$FohSIa z1|1JDwiSLa-a#$j`OFzz-ppU%3tU*8mA!)bpoG?ih8reLHe;M0`N8IWsWs46QJ&75eRJ8_u$scCtRaBW}YtXK;N}zH>tH zm<5;nuQ+^YFdx6})3{^tpsj4d4rz+9YX$cBvcjrCn-y&tcRT3axT^!y}V^R>l2lR`h?y1Si*=e ze|98vBPVn&O)&GagwA|dqG{z_3ExV(|6QM0y7XZ0vSqtETdK@QOB`(po$hYz6u&3I z*v5n;srcHdn>)*PkLqLf^$BS@`fx#QV$^h9S4EWvzhN_mpVwbgn;3(4VHQ_;@vaZ0 z%=-SZ4g2j4hx%=nOFq(_wwx_GZTzw=zY;s8^`|X3vZIZysI(mWX}`^AftEBG=g$^* z*7oVlB-@j}sP~gKQXi|%K7?x&%MrX?`M14{g?c4v$H}${wBUNJD%%JovuWKxdFSoD|-!#DicPjY_}^>Yp$rQgdJ&iaVOdO z{J_I~_%>IklsIXq=~1v9AmL-P#NQ5~-4-p2h<>9iT&57hOWkKv5& zaOc~2>jO`D`faDpXODDNCd^Vt#i!$*?Uaz()%LR8;e^33VRpfX=9w0J-#*W<7Zz=& z4Sk67RYgM23Qrrps>XhrvfX?8&59JOIGiQv)y}fr(CDM^G30O~u1&H!?>^Bs3g0iA zv#^KiGryB+^y;oueH*)|z76=z7AxXCoY0TFqLURe8KTzLY|Fwt&1 zIDf1#G8ZinzxZM3pJ9LvBfQjrK9)&tMSAl(E2>E6+t8^(E&fTWPfUl`MJ9BP?z_wW zXy0A+clWs$Mdr@vyQ|8C_b2F#p_9^;({jft--Rem0c;?5I?_Tekb2Is6h;fK)*X^t zZM~91SK{H7H=Q{A^LmUfuEY^@ey4th8tQtF=uo?Qc3OYdcADx$f4(ANL5-p29lJlH zcWBUW&9x_v1q%|RmR{ajkQlSjX-ld-0g6>-A> zS7M>1rPA4(l~tAaWnpDv*`oTyyS9qNZ4LE_{iUz=pEGaneB_*T+ znZY)D7TWT7wJjSNdaB%>jbD;YXjToppXzJ>Z9jmPwizdoVIZt1pV)rm1C7s&(J=*Oz|2vsaH%4*4}}$N0!@ly~7m%u#t4 zZQC86RUdVAy`6m3f8lfDjy_t8NnE4Xb-(4rJe2Hqeq=!%z)#j|5PMrHW zoh#}|wF}yh3uL(K#3e7GB^d8I(fjk!Js2ToXJKx}UL-Tk(G=DV8h(da8nbb5C1z@j zI&?I?OWKb3W;3-iGmWh}K6~T(VXhT-Ch5%b&c{)Y9aSZrhrbn0d=ja^6Q{^r)fR7c zv$MO&CRzMF<#qP^UAV89OKX^R%mJDyVV&~1#L&!P=Z+0MHtgK~4CnS<5ck5o<|Ol~<9Wq30`;l6U`Z~QZuChu zdhCHwhr%b;$6o1eTs-=N`ovok?O5s96Aeo+rq@rz*qo19gkn)#)Loge8B~JO;>_5^s&euDwe(MgsV+`Wx=gA>>$8U1e?{d>`sSfUh zf5Y^~)BHL&b$n&aR$5wWWA&>Ub5_>W7BboC5lHX`B{U)&0gn-nWx*3B9STkGB+FJp 
zI!}5Ol8kIn95pUslBsf4UyV;T7uMCd=6OO*L0^>3^F*7Rp4MPId4{n%UrU`Qj6xj{8*aD$QYaFs1Hn+ZzhxO6GszAsSyuo&5FxYDIZtw!U>SoQF8ji#RErC0HO}4eZXeBR)qrb)+V1V?u*4KMaCTAkgqW323CX5Bm_duY+>a> z7V?Bx+4LFL%$#-Yb>-K48of=vX3|ZatE{p+s-1MMIMf=%SN(9a5I)V>(p;LjEKO{q z455lxt=(m>udS_IS}h+hUD1Feeq2|-uzG3b;uWN`a9LgT;urK`GPawkSy$ILVURp_E*Jv*f+yC(QJoNqUcNE%O zTXj_kkFT9~9fi+6Hm(I>#h;%ldlq5G6?@>7juKmQYseeN@NEPRHeDIJ zGR8*&#=d{&&|MPZ((h!$*>FEiIKhBmKrkQ}5DW+g1OtKr!GK^uFd!HZ3pwnu_M{chMyt+TKE6E zJb+)yz05BcaqE4IEr;KG?|?Mh1xY!|)aOjdsHS zFF4*PkP*u@zk6Zk=$6mE_@j~K{x@XoRd7Po%fI5D%D{FVvytDW_&o^8?^ApiLh^?d ze+(h{!A%N6fNNZzmb2txAf72k%C{AR^> zASAy<@mmp+e_rw15t84f_&o^8?^ApiLh^?de+(h{yse7WK)5R#v#_y&aJmn+_lki1{<5rpK|E4~dO`OS*&KuCUz;3L4gyat^{un~?#}(g)ko=(HSr*CU#@sJLh^paM-Y-irVhi7_Mm(MwX} zf;bWk2nGZLf&syRU_dY+7!V8y1_T3w0l|P^KrkQ}5DW+g1OtKr!GK^uFd!HZ3GP#=(s-#zx=JdQ%A~n=`wh0@%4myi?!slZrQvYU z>-Pjgw&^oUXV{9XS_8o*nwZ#{k*5rwM zrrquftqz1@)7FKft78$5*EbDTr^Z#Q(_~B28p-Tb0-wG+tJ^r4EN3o8@!U(3Sq(#h7Skltx69j+xP4 zhTtTo`@Y3|3CmMD+Vati72vqiV%n&CNE)VB0$H)5dj;1s->tt#w^6FJSRXP}TJq;v ztX2Ch=E`^8e#-)RnW7(uU5W>_HiI0a%98hxvC?9zvE(6!m169nD8tg6iE~fEZ-tEx zWUk6GseQyI)4kdEWotx7s$2*DQ+Y^U0lC3?(emuk>2sCxG$hMI@yI3ZYRW@0N5y}xJSPxmFX9jv zl@@cmeiO=HWhq;igF4Ku=KqNY(sNfbhU^ZjA5;mOtz#?#c{%hPmi+dtO~!i-_vtrN z8=}0D{P`-zmf{#~V531!iP;p{S;*ozu1d*nx3G8izPo>TdX%3RDLuQ)1Eo)RGjIit z9TrnN(p{XhpHt4;Hp$76uG+TqaGdEpwQ)MX5BB9Ms~FQoOF-tVH0-gnD+@MG5|-qBqNstpB}dX=O!88 z5%*J+$C;@gR$Ize$(^tsUD3k!zWd8x?fdn6zuAvQm7%Ysr(r`Hn~{cT9_kUt8Mq&o zFDl0yh*OI=dvL66U+MN~3_1>7bB)}Gq1$fQq`wz+ifU`l!8nv#!~dh4if%h}8*%JR z^^g2pH2s5f3Y+92@My!T&0%>hf!uFI9$@o4j)N&(T`gy8e|75A@H(Kho~QWp5#m^r z|6yrS`#uQWjs-{)$4b8#rbq4fEcD1Nhfi<6GjUFpZ-ZXO@z-u~&zI}{7d${5%BvcW z+K#qfP*q7;9@>&U^H{2@%h~^%_mW)=LwkMN3D}`JrgR2TPC8aG4l|BdB>y6dq!@qU zRcxq5yVGEjt61BIigCUqo1~1bgm^Kn=;5%9@O;Gu{J%wCVqO^93U(R`Hzt&pF9pXfr=b0 z=JSmGy~8!88Mq&B+eEoCJz!X@<4w;$HU=yup#*l*gNjRyouvX z^FW27v(=4$%*T)m=3U+Voc+oaU6mVF5A7FD&go?*o=xCvTO-yZ7*nb<^IYzg^FLBv zpw)AfZYv+QmckY}8jH1aEwwL_<@*_%%w;AFpBOi=003&=7RmuRrI-%x!{x&%B!UMr`UT*p#e 
zlC?vI%M|$<0C)@VHsGqXHL*?6N5HnjXKLT7&DpnOj)9Jf+ktNbR`c^I@Je8mKbfBr zV9HO$YQ~x_RyTI^;jO$XlKI^hWb9SMP2--~6no2C&?j+>{I;7mW#5~1pK-I{e*FU& zJCiyUA;_UqN7q{mF&kFVfgJ}-zV;!WpyR$+zA%pIHt~F6DTeDg#8bt3-P;ENt0}7$5PTInICZra70x=KbRDD6wvmq)x!wiy z5!WXO_L^$m61&^`MN^Y9>aZR?U{yJ#LLPrIO73tc)+&m*;!!;mkBoa)DUk^5RU$k-Un z`Ou^Gtl_?HzkHcL%r=byGZE)0#5s-QjBB1sOT!&}5R+F7m__-3MtPyQTh^;}t*k%u z7bMr8Bqt~5CpjP6>N2c9ZP42T{WoCmK^$iqpEGfD0&(VTkk@Z>sa>P<5@&RtTn@Z) zKVew;sa?&4ay#NV#Yl{6Yc?Gk|=|mbWgyp zQg-zoBL;H1%V?WV_YC)@ZZi%`kCN)bzk)j$qj^WUUouRO$}p(pi_+E9JF{tI3vpbf zx&Ftx0;?gp?$r2;xlXdruFR1e4xhiYwwj=4PxbAfdY_B>JlQHio|ofTlb7#mZE5q| zAlpjzeXXqgLK}SkQuFFhlcl!}vSYmzb&I@|$hyfiNS9>WAv54OJucO6E#$94evb}q z2(>HU?$h5ODf1}F%kO0D3h3q}Z>-ehQo1Bt4jIikDcORg?E8>ak?a=L?%Z7Bj^gcs z49g*wlT)n2wcc4^Td8Er$7s#BR-9ObdG|AIXnEZX z(`^urK>o$o$U7iE`!(|Ikek~tZodoig0GSHL0!cQuY?@U=4+jv zq2a(Cqy_<}S7IDVVm&{9Jc$h)=_&!RP3qt-_Y3q zC&*Py_9=gAewdOUH9p~A%ZB-XwZ0lO=}~;OJyHuM{GvwQl_7qYM*bq2;8_hmjN@Y( z{7HuR)?_o37{y<%!4!X{2A|ImKcbOSe7c7vJBd;Jtr~opq{?WHlH2Q`Mo#6m=BPRn zn*jYj@J&S^{&{=W$VtCegGt_?!6aXgu|RfmfAFWtw`nlNe^rAiep_1nt~B}KG~Ax1 zeT$@$A>Q!?;#T%&(IiW8VY$QDB;T)$)I#kyC#5CexSuqji&NQ7(U9n!bq^qTI>jOF^2x zHBDZehRf3Q%hKfY(y${!_NGsh$2WC6=op)lQ`Dz4n9?gqE^=gh#TrcVFX-ZeJBj@q_VezO z*RtyRXHX+2{q{+UI_m#lpg$>1f2&4L`oZFqKIcn=Nj?eXk)51B+6$v%l2d-Pn9@`6 zo-}^RKJAN9<-dh|2k=^)AihXW`(b1!*YAn6`lv{gUo}0YPw|gvFx5wO#f9~;o)&%F z$$ZY#@I!n#G??sd*634t2fr=Hrv628^GzvC^4S$BOmc??lbq6%om@UjPsM*seTU#_T0!$wr|tm;tcw<&qk|nn3uvcGU(Hu8CCy=4075Zqspm%O*JWO&!A6x zWMn6|584N#;@S-Qw5LUt-<(N4KP7i%kkdXDHGUw2oc4F9a>_ryXyYejK9_6y6Ui}M z*|-?D+V3GdrM@-zt_=ESjhyV)YVcKh)mG|ygZ632PUg>@hG`#$R!;jWR801_ zYuXdxSJUt=4Zb%+euEl&k7bY>>XhU#-fhbur@a}nli14{(ecr3S2#q||t}ZF4F<<&XAK$WG4Rn~PJJkrClP$>O zKlTqPSib&zK25$|gRL3zPkTRPC$aGvuR+zP__UWn#bp20wDw7R8?^G7x2G`0r+o{uQ|6B}O#2p8IoU5?ox&8K_Abay z&i_-v6eczt>`V zMz6*6Y+l9R$zYG3zpL`b403uduF3-$c6PLlz)0YtID%*@uE64|MYxTb`twm z201+wRprGQDo*%06*%``5&kj{N@o`*(mu0X=&l0uzgBtwZ z4EpqJQLAriPvPba`t+<(tDmpI;SBop>`|*+<;M1%h!O`qy_ 
ztLFOcGC3psKjn|)>hnP=(MyQ0Vw&G~XR!Buc}pT<%Imc@4gO^YeLVlN@dR1694P-O z{r%8apTCm5{|T%C;~2bHKRf#W{(6ANbu0`MoZ}?Z6)NSL*Vl|3hGMdr;I-kpDgK zPhp>@54;t)+roGbDET-PoIP3rXJyheU!hvaty_iN&J z5Wa*Z`FaZYdf2N~(%S~CKJT~U4%T*HI34|?`uzp)BGeB}S;Wr>V7lH!vqE`|>>c1b z@PjJg{QL%3-7jFp6a4pq>Hd?fpUsW~FVocLyTH`GRe3+~$7n41ial!be?j`Z{UCiS za4Nrgb|tkF)Cpy{2<0yamUn8(`L_cbalhaT#om12qp)Wq412c%tNmw`&d62)uRwb! zQsgnh(8nLw{~mC6PO7~31A8i!{2~2^fVKH~1{f~M*ACznC`>uogMKIQCBSNZ{1$i( z>X#hJ{|x*WP5=1xCZn6Y(@VP6UuWxJTUd2almUd?SY>4)Agas_qTy}puH2O z^yUF?9>?hc=O(rkxJ$V|VMS#$0$&e)t{{Ez7X~&W^9XqSJ8|%5TtBT;@J8S(tL69d z$bJX#PU`F=NBpOO&uPYwe*jLk9|Lz;r#W@^LA+zZ2zy3eVpp;Je$vfh6PUz;7Y{YjrZd0a)Fy zU?u*5r(%6Vd$y>2%YZLwPqmNl0taw?MEjp9eLu;O`36c4_zvI|$P+oj4*+kfk>B$p z{1`A@uN6a1^}7xDa^we9%=P~X_^0UaYWsQ(SX*Cvf#)DUHbwtE;L4CvUfAmfK85yJ zLFEOWU`G3fJ~_fiA#X*0r}PQ`JMdWazikRW2dr(Mw3k7PCj!qw`*14yQ-Obo`dUaB zd|d~;9rc0a`FQ#rU~T-{f$9D>-JhcIHVmw-kM9F(`zzKK?59DD(Mo!MNBXGW+m-a5 z0=@+8Qyp(;|Ax9>iQ4}*$OEV!R3|U*FM+lB`2biQKlpqCT&HP&CxIUZpZMePhX`x> z|F_5xmtd=gydzw4=<44GbLOOO!wEQ{Zji`4E*PCs=UjA)$)+N)xgs=<5L`1J3q7m zdumhdV=Hjoyp+8cfwkkuPGGwKW&%OP&)dLsKUd}}oBak@JANDl{x!%@+vgvF58-}D z5%N#v`3$%)kg8vLZ|G-L7=IM`dEiZ$AJE0Qz6oQ4uQrw6$-tk_m*3eXd((lZYucwB z_in`~M8=!TN`)kmRoc zcN0+XuYj+|{bt&eLGphC{%NE94jkbRfCGrHuFsBBeDv>IC_cuM6Tpkm=T!a-u#@Uf zCT0dc3OQ)v_6vYNSKjR-`%{2_g7NOOBEKG3J3lQ1ejVeV%J&Lj?fNr;e$s;WOpfgB zMM5Xhztr?R(0?55d6FUz0~aE_5(TdZ20J$TZ^eGejlgvMAoFize+Rs#65~DkC)s}j zcm&#?1Nwyj5%@B{{4O8i*MPP5-UL2(59z{#uipW~wvF;f?eh?@wm;C`mP=~n=jUXv zADG#)ZTQdgcOLi@+UK`{NuCXE=>D^s-xlO^1rk&1YXama(LRe6{bJzCGUNemhT_ix z-W0+!U!{C=fVK0*eBjI5Q{zPga2xosL7(#fU0@00yUKS2Samdheh-+g?zyFZe;4*qjXXaOL7u9AJ$nXt5z1^+?gZY6_NA7W-hb+^ORbOI2iC4{ zKLOURw|jx*-a%#TANnis)!;{M&k}gju2*O+sU4p$1%3v6)Avb;&m!PWx6AM85iS89 zi~P-2@_Rk-j)j0w2Zv zLeU7fcS` z*7eLAZixhaabHvE^voKr=0GUm@U z&Krues3+7EZfV>Q_r)NIwMLs+%;#%j;nujTv9+1PpszWOllTl%I2I3tR*_x`xqRyb z5N`l&EgHUCQ2|hF3Bk_< zKXye+BV@CTMOx#O*XHJ6Ys^mvflyO(1P9)r&l5te$Kp+aFjMM`1J0}(e{Irs^V)UM zK-@<*)_H=f5w45*&~E4iQd!5d?rUiQVK{|vC_##YSUeiS*OYJ+40}C6lF0}7AkkU? 
zszwC#%_V0|1g@y0D5 zka_f^7MG`~Dc0K9)O-yn=O3mq7R z0xgl&s9XaWIJo$Nv80Ix4e+E46nv<|DQa}dv4eC_t;5b~PN?xNR8!CL56`)pz)foi zww7R=bSQ zcaR$L%X3Ae8gEH1ze1afw@Ku*nU|gq<*wFpM3Squ@qHJEv>V=P**T9c*SaWr3;ts`MxrvJfn(U*Cr@nJ`SC|u30A?jUx zghQS}F}5t`S>?M(oyNlvR|F|4EtxlsdFaHoffn8@8e_2}c`C!LfftKy7STznQ_OK7kg*2-he5jiD~cTH;`>QmMy$OtuT)#Q?;mG%YdP(-7? z+!kp~!7EX2>Mkq;kOpd-52ebAg)3p zt2Ba_6dTuT*iagbJO%LC1$Kime*i01B3N0?)AUQ}qwB>z7_OCZo+>9032smqbcmwV zJjxcuYh(kyPTIwKTWM zJkeP!SkNw&Z4eG9?s;^E1pWqw6gX z6+`WhNWc^G1_GG9X(dXKJ~Zp1enlNgzL;^*)eK~ElhzEM&7J^;F~wLs8fYQgskNpH zLYHPp=3^iRWqGoyk=F(xxpQk`kqw%*INuoeysMXnBN#3DGR_r=LyozCu2o2bUud{u ztAeeRMY%LBe0ff46q`=-Qqkn?TSKS0cCztJ*H9-9uh8KMt20$hokz80jr*g%m_HnB z!fHXe@^U3Fw0P!QngY$5(QHQG4tZUbEDQ}klpl2@QN}EsqIH8jQpl@XoKjcPIMl2@ zncF7k245gKA z^1=qg^6-yKVw&(>@j#1@E}hn+?$qmjd9VYf<(XVOb&4Pkk}X?VR7O1k3{-UEfCB2c zaXuqd(dfxoX>3CamS70uQ91OhM|_D_in|r1d{M|siLp|?USp-yB-LYH1En$CN?@fJ zwpl4%WspT`INSg(zoLOgEOH^lU5XYAW2{tWG>ESOqm4yX%x^v57RonvWR6`i2_Gvo^4p%=j|K`uXt7qm?BIsBKO6H-X`C&`{3o}R}ZfgassQyt03mBZ69 zdAe`5FN^M^rJSngz$rQ=dvG?oR5FzF2-vO4G3n9$&V84n9LEUAj-SV2BRrCcD5u%Jzv|cx zDnO{}MU~?+MW(~X>k8$V0@5S@Jtb82s-X9lhC)^E2Z|m7?hL1OtU xbc~*DM`hA&#!n$6Ju3bPcnw|5P@!i(B7x$_4!KeATD`{`^lS^4$p2Km{|#{izBm8? 
diff --git a/envs/m55-an547/test_loaded_sqmag b/envs/m55-an547/test_loaded_sqmag deleted file mode 100644 index e69de29..0000000 diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile index de0f951..1c72a81 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -1,17 +1,15 @@ # Makefile for images for AN555 -TARGET=test.bin - -INC_DIR=./inc -INC_DIR_TEST=$(INC_DIR)/test_inc -I$(SRC_DIR)/test_src/manual -I$(SRC_DIR)/test_src/auto -I$(SRC_DIR)/platform/ -BUILD_DIR=./build -SRC_DIR=./src +CC = arm-none-eabi-gcc +LD := $(CC) -.phony: all clean run +TARGET=test.elf -CC=arm-none-eabi-gcc-12.2.1 -LD := $(CC) +SRC_DIR=./src +BUILD_DIR=./build +COMMON_INC=../common/inc/ +ENV_INC=./inc/ SYSROOT := $(shell $(CC) --print-sysroot) CFLAGS += \ @@ -22,7 +20,10 @@ CFLAGS += \ -fdata-sections \ --sysroot=$(SYSROOT) \ -DARMCM85 \ - -I$(INC_DIR) -I$(INC_DIR_TEST) + -I$(COMMON_INC) \ + -I$(ENV_INC) \ + -I$(SRC_DIR) \ + -I$(SRC_DIR)/platform ARCH_FLAGS += \ -march=armv8.1-m.main+mve.fp \ @@ -41,39 +42,41 @@ LDFLAGS += \ LDFLAGS += \ --specs=nosys.specs \ - -Wl,--wrap=_write \ + -Wl,--wrap=_open \ -Wl,--wrap=_read \ + -Wl,--wrap=_write \ -ffreestanding \ -T$(LDSCRIPT) \ $(ARCH_FLAGS) all: $(TARGET) -C_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard $(SRC_DIR)/*/*/*.c) -C_SRC_FILES=$(patsubst $(SRC_DIR)/%.c, %.c, $(C_SRC_FILES_PRE)) +HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c) +OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES))) +OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES))) +OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) +OBJECTS_ASM = $(patsubst %.s, $(BUILD_DIR)/%.s.o, $(abspath $(ASMS))) -ASM_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*/*.s) $(wildcard $(SRC_DIR)/*.s) $(wildcard $(SRC_DIR)/*/*/*.s) -ASM_SRC_FILES=$(patsubst $(SRC_DIR)/%.s, %.s, $(ASM_SRC_FILES_PRE)) +OBJECTS = $(OBJECTS_C) $(OBJECTS_ASM) 
-ASM_OBJ_FILES=$(patsubst %.s, $(BUILD_DIR)/%.o, $(ASM_SRC_FILES)) -C_OBJ_FILES=$(patsubst %.c, $(BUILD_DIR)/%.o, $(C_SRC_FILES)) -OBJ_FILES=$(ASM_OBJ_FILES) $(C_OBJ_FILES) $(CMSIS_OBJ_FILES) - -$(C_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.c +$(OBJECTS_C): $(BUILD_DIR)/%.o: % + mkdir -p $(@D) $(CC) $(CFLAGS) -c -o $@ $< -$(ASM_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.s +$(OBJECTS_ASM): $(BUILD_DIR)/%.o: % + mkdir -p $(@D) $(CC) -x assembler-with-cpp $(CFLAGS) -c -o $@ $< -test.elf: $(OBJS_DIR) $(OBJ_FILES) $(LDSCRIPT) - $(LD) $(LDFLAGS) -o $@ $(OBJ_FILES) - -%.bin: %.elf - arm-none-eabi-objcopy -S $< -O binary $@ +.PHONY: test.elf +test.elf: $(OBJECTS) $(LDSCRIPT) + $(LD) $(LDFLAGS) -o $@ $(OBJECTS) +.PHONY: build build: $(TARGET) -clean: - rm -rf $(OBJ_FILES) - rm -rf $(TARGET) - rm -rf $(LIBDEPS) +run: + @echo "WARNING: AN555 is not supported by qemu. Skipping" + +clean: + rm -f $(TARGET) + rm -rf $(BUILD_DIR) \ No newline at end of file diff --git a/envs/m85-an555/build/test_common/dummy b/envs/m85-an555/build/test_common/dummy deleted file mode 100644 index e69de29..0000000 diff --git a/envs/m85-an555/build/test_src/auto/dummy b/envs/m85-an555/build/test_src/auto/dummy deleted file mode 100644 index e69de29..0000000 diff --git a/envs/m85-an555/build/test_src/external/dummy b/envs/m85-an555/build/test_src/external/dummy deleted file mode 100644 index e69de29..0000000 diff --git a/envs/m85-an555/build/test_src/manual/dummy b/envs/m85-an555/build/test_src/manual/dummy deleted file mode 100644 index e69de29..0000000 diff --git a/envs/m85-an555/build/test_src/mve_test.d b/envs/m85-an555/build/test_src/mve_test.d deleted file mode 100644 index 285897b..0000000 --- a/envs/m85-an555/build/test_src/mve_test.d +++ /dev/null @@ -1 +0,0 @@ -build/test_src/mve_test.o: src/test_src/mve_test.s diff --git a/envs/m85-an555/inc/test_inc b/envs/m85-an555/inc/test_inc deleted file mode 120000 index 31da609..0000000 --- a/envs/m85-an555/inc/test_inc +++ /dev/null @@ -1 +0,0 @@ 
-../../../tests/inc \ No newline at end of file diff --git a/envs/m85-an555/src/platform/gcc_arm_sse_310.ld b/envs/m85-an555/src/platform/gcc_arm_sse_310.ld index 2eb4b9b..e82fddb 100644 --- a/envs/m85-an555/src/platform/gcc_arm_sse_310.ld +++ b/envs/m85-an555/src/platform/gcc_arm_sse_310.ld @@ -39,8 +39,8 @@ __ROM_SIZE = 0x00020000; RAM Size (in Bytes) <0x0-0xFFFFFFFF:8> -----------------------------------------------------------------------------*/ -__RAM_BASE = 0x30000000; -__RAM_SIZE = 0x00007F00; +__RAM_BASE = 0x31000000; +__RAM_SIZE = 0x00010000; /*--------------------- Stack / Heap Configuration ---------------------------- Stack / Heap Configuration Heap Size (in Bytes) <0x0-0xFFFFFFFF:8> -----------------------------------------------------------------------------*/ -__STACK_SIZE = 0x00006000; -__HEAP_SIZE = 0x00000000; +__STACK_SIZE = 0x00000400; +__HEAP_SIZE = 0x00000C00; /* *-------------------- <<< end of configuration section >>> ------------------- @@ -132,7 +132,7 @@ SECTIONS /* * SG veneers: * All SG veneers are placed in the special output section .gnu.sgstubs. Its start address - * must be set, either with the command line option ‘--section-start’ or in a linker script, + * must be set, either with the command line option '--section-start' or in a linker script, * to indicate where to place these veneers in memory.
*/ /* @@ -296,7 +296,7 @@ SECTIONS __StackTop = .; } > RAM PROVIDE(__stack = __StackTop); - + /* ARMv8-M stack sealing: to use ARMv8-M stack sealing uncomment '.stackseal' section */ diff --git a/envs/m85-an555/src/platform/mps3-an555.mk b/envs/m85-an555/src/platform/mps3-an555.mk deleted file mode 100644 index ae198a1..0000000 --- a/envs/m85-an555/src/platform/mps3-an555.mk +++ /dev/null @@ -1,55 +0,0 @@ -ifndef _HAL -_HAL := - -CROSS_PREFIX ?= arm-none-eabi -RETAINED_VARS += CROSS_PREFIX - -CC := $(CROSS_PREFIX)-gcc -AR := $(CROSS_PREFIX)-gcc-ar -LD := $(CC) -OBJCOPY := $(CROSS_PREFIX)-objcopy -SIZE := $(CROSS_PREFIX)-size - -SYSROOT := $(shell $(CC) --print-sysroot) - -CPPFLAGS += \ - --sysroot=$(SYSROOT) \ - -DARMCM85 - -ARCH_FLAGS += \ - -march=armv8.1-m.main \ - -mthumb \ - -mfloat-abi=hard -mfpu=fpv4-sp-d16 \ - -CPPFLAGS += \ - -Iplatform/ - -CFLAGS += \ - $(ARCH_FLAGS) \ - --specs=nosys.specs - -LDSCRIPT = platform/gcc_arm_sse_310.ld - -LDFLAGS += \ - --specs=nosys.specs \ - -Wl,--wrap=_write \ - -Wl,--wrap=_read \ - -ffreestanding \ - -T$(LDSCRIPT) \ - $(ARCH_FLAGS) - -HAL_SRC += \ - platform/startup_ARMCM85.c \ - platform/system_ARMCM85.c \ - platform/uart.c -HAL_OBJ = $(call objs,$(HAL_SRC)) - -OBJ += $(HAL_OBJ) - -libhal.a: $(HAL_OBJ) - -LDLIBS += -lhal -LIBDEPS += libhal.a -TARGETS += libhal.a - -endif diff --git a/envs/m85-an555/src/test_common/misc.c b/envs/m85-an555/src/test_common/misc.c deleted file mode 100644 index 80afd37..0000000 --- a/envs/m85-an555/src/test_common/misc.c +++ /dev/null @@ -1,137 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons 
to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - - #include - #include - - #define GEN_FILL_RANDOM( bits ) \ - void fill_random_u ## bits ( uint(bits) *buf, unsigned int len ) \ - { \ - unsigned byte_len = len * sizeof(*buf); \ - uint8_t *byte_buf = (uint8_t*) buf; \ - for( ; byte_len; byte_buf++, byte_len-- ) \ - { \ - uint8_t cur_byte; \ - cur_byte = get_random_byte(); \ - *byte_buf = cur_byte; \ - } \ - } - GEN_FILL_RANDOM(8) - GEN_FILL_RANDOM(16) - GEN_FILL_RANDOM(32) - #undef GEN_FILL_RANDOM - - #define GEN_COPY( bits ) \ - void copy_buf_u ## bits ( uint(bits) *dst, \ - uint(bits) const *src, unsigned int len ) \ - { \ - for( ; len; dst++, src++, len-- ) \ - *dst = *src; \ - } - GEN_COPY(8) - GEN_COPY(16) - GEN_COPY(32) - #undef GEN_COPY - - #define GEN_COMPARE_BUF( bits ) \ - int compare_buf_u ## bits ( uint(bits) const *src_a, \ - uint(bits) const *src_b, \ - unsigned len ) \ - { \ - uint(bits) res = 0; \ - for( ; len; src_a++, src_b++, len-- ) \ - res |= ( (*src_a) ^ (*src_b) ); \ - return( res ); \ - } - GEN_COMPARE_BUF(8) - GEN_COMPARE_BUF(16) - GEN_COMPARE_BUF(32) - GEN_COMPARE_BUF(64) - #undef GEN_COMPARE_BUF - - #define GEN_PRINT_BUF( bits ) \ - void debug_print_buf_u ## bits ( uint(bits) const *buf, \ - unsigned entries, \ - const char *prefix ) \ - { \ - unsigned idx; \ - for( idx = 
0; idx < entries; idx += 8 ) \ - { \ - debug_printf( "%s [%#04x-%#04x]: %#04x %#04x %#04x %#04x %#04x %#04x %#04x %#04x\n", \ - prefix, idx, idx+8, \ - buf[idx+0], buf[idx+1], buf[idx+2], buf[idx+3], \ - buf[idx+4], buf[idx+5], buf[idx+6], buf[idx+7] ); \ - } \ - } - GEN_PRINT_BUF(8) - GEN_PRINT_BUF(16) - GEN_PRINT_BUF(32) - GEN_PRINT_BUF(64) - #undef GEN_PRINT_BUF - - #define GEN_PRINT_BUF_S( bits ) \ - void debug_print_buf_s ## bits ( sint(bits) const *buf, \ - unsigned entries, \ - const char *prefix ) \ - { \ - unsigned idx; \ - for( idx = 0; idx < entries; idx += 8 ) \ - { \ - debug_printf( "%s [%u-%u]: %d %d %d %d %d %d %d %d\n", \ - prefix, idx, idx+8, \ - buf[idx+0], buf[idx+1], buf[idx+2], buf[idx+3], \ - buf[idx+4], buf[idx+5], buf[idx+6], buf[idx+7] ); \ - } \ - } -GEN_PRINT_BUF_S(8) -GEN_PRINT_BUF_S(16) -GEN_PRINT_BUF_S(32) -GEN_PRINT_BUF_S(64) -#undef GEN_PRINT_BUF_S - -/* Helper to transpose buffers in case this is needed for input preparation. */ -#define GEN_BUFFER_TRANSPOSE(bitwidth) \ -void CONCAT3(buffer_transpose_, u, bitwidth) \ - ( uint(bitwidth) *dst, uint(bitwidth) const *src, \ - unsigned block_length, unsigned dim_x, unsigned dim_y ) \ -{ \ - unsigned i,j,k,idx_load,idx_store; \ - \ - for( i=0; i -#include -#include - -void random_poly( uint16_t *poly, unsigned int len ) -{ - fill_random_u16( poly, len ); -} - -void zero_poly( uint16_t *poly, unsigned int len ) -{ - for( ; len; len--, poly++ ) - *poly = 0; -} - -int compare_poly( uint16_t const *a, uint16_t const *b, unsigned int len ) -{ - return( compare_buf_u16( a, b, len ) ); -} - -void mask_poly( uint16_t *poly, unsigned int len, unsigned bitwidth ) -{ - uint16_t mask = (1u << bitwidth) - 1; - for( ; len; len--, poly++ ) - *poly &= mask; -} - -void copy_poly( uint16_t *dst, uint16_t const *src, unsigned int len ) -{ - for( ; len; len--, dst++, src++ ) - *dst = *src; -} - -void debug_print_poly(uint16_t *poly, unsigned int len, const char *prefix ) -{ - unsigned idx; - for( idx=0; idx 
< len; idx += 16 ) - { - unsigned sub_idx; - debug_printf( "%s[%03u-%03u]: ", prefix, idx, idx+15 ); - for( sub_idx=0; sub_idx<16; sub_idx++ ) - debug_printf( "%02x ", (unsigned) poly[idx + sub_idx] ); - debug_printf( "\n" ); - } -} - -/* - * Things related to modular arithmetic - */ - -/* Scalar operations */ - -int32_t mod_red_s32( int64_t a, int32_t mod ) -{ - int32_t tmp = a % mod; - if( tmp < 0 ) - tmp += mod; - return( tmp ); -} - -int32_t mod_mul_s32( int32_t a, int32_t b, int32_t mod ) -{ - int64_t tmp = (int64_t) a * (int64_t) b; - int32_t res = (int32_t)( tmp % mod ); - return( res ); -} - -int32_t mod_add_s32( int32_t a, int32_t b, int32_t mod ) -{ - int64_t tmp = (int64_t) a + (int64_t) b; - int32_t res = tmp % mod; - return( res); -} - -int32_t mod_sub_s32( int32_t a, int32_t b, int32_t mod ) -{ - int64_t tmp = (int64_t) a - (int64_t) b; - int32_t res = tmp % mod; - return( res); -} - -int32_t mod_pow_s32( int32_t base, unsigned exp, int32_t mod ) -{ - int32_t base_pow = base; - int32_t tmp = 1; - while( exp != 0 ) - { - if( exp & 1 ) - tmp = mod_mul_s32( tmp, base_pow, mod ); - - base_pow = mod_mul_s32( base_pow, base_pow, mod ); - exp >>= 1; - } - - return( tmp ); -} - -/* Scalar operations */ - -int16_t mod_red_s16( int64_t a, int16_t mod ) -{ - int16_t tmp = a % mod; - if( tmp < 0 ) - tmp += mod; - return( tmp ); -} - -int16_t mod_mul_s16( int16_t a, int16_t b, int16_t mod ) -{ - int64_t tmp = (int64_t) a * (int64_t) b; - int16_t res = (int16_t)( tmp % mod ); - return( res ); -} - -int16_t mod_add_s16( int16_t a, int16_t b, int16_t mod ) -{ - int64_t tmp = (int64_t) a + (int64_t) b; - int16_t res = tmp % mod; - return( res); -} - -int16_t mod_sub_s16( int16_t a, int16_t b, int16_t mod ) -{ - int64_t tmp = (int64_t) a - (int64_t) b; - int16_t res = tmp % mod; - return( res); -} - -int16_t mod_pow_s16( int16_t base, unsigned exp, int16_t mod ) -{ - int16_t base_pow = base; - int16_t tmp = 1; - while( exp != 0 ) - { - if( exp & 1 ) - tmp = 
mod_mul_s16( tmp, base_pow, mod ); - - base_pow = mod_mul_s16( base_pow, base_pow, mod ); - exp >>= 1; - } - - return( tmp ); -} - -/* Buffer operations */ - -void mod_add_buf_u16( uint16_t *src_a, uint16_t *src_b, uint16_t *dst, - unsigned size ) -{ - for( unsigned i=0; i < size; i++ ) - dst[i] = src_a[i] + src_b[i]; -} - -void mod_add_buf_s32( int32_t *src_a, int32_t *src_b, int32_t *dst, - unsigned size, int32_t modulus ) -{ - for( unsigned i=0; i < size; i++ ) - dst[i] = mod_add_s32( src_a[i], src_b[i], modulus ); -} - -void mod_reduce_buf_s32( int32_t *src, unsigned size, int32_t mod ) -{ - for( unsigned i=0; i < size; i++ ) - { - src[i] = src[i] % mod; - if( src[i] < 0 ) - src[i] += mod; - } -} - -void mod_reduce_buf_s32_signed( int32_t *src, unsigned size, int32_t mod ) -{ - mod_reduce_buf_s32( src, size, mod ); - for( unsigned i=0; i < size; i++ ) - { - if( src[i] >= ( mod / 2 ) ) - src[i] -= mod; - } -} - -void mod_mul_buf_const_s32( int32_t *src, int32_t factor, int32_t *dst, - unsigned size, int32_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s32( src[idx], factor, mod ); -} - -void mod_mul_buf_s32( int32_t *src_a, int32_t *src_b, int32_t *dst, - unsigned size, int32_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s32( src_a[idx], src_b[idx], mod ); -} - -/* Buffer operations */ - -void mod_add_buf_s16( int16_t *src_a, int16_t *src_b, int16_t *dst, - unsigned size, int16_t modulus ) -{ - for( unsigned i=0; i < size; i++ ) - dst[i] = mod_add_s16( src_a[i], src_b[i], modulus ); -} - -void mod_reduce_buf_s16( int16_t *src, unsigned size, int16_t mod ) -{ - for( unsigned i=0; i < size; i++ ) - { - src[i] = src[i] % mod; - if( src[i] < 0 ) - src[i] += mod; - } -} - -void mod_reduce_buf_s16_signed( int16_t *src, unsigned size, int16_t mod ) -{ - mod_reduce_buf_s16( src, size, mod ); - for( unsigned i=0; i < size; i++ ) - { - if( src[i] >= ( mod / 2 ) ) - src[i] -= mod; - } -} - -void 
mod_mul_buf_const_s16( int16_t *src, int16_t factor, int16_t *dst, - unsigned size, int16_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s16( src[idx], factor, mod ); -} - -void mod_mul_buf_s16( int16_t *src_a, int16_t *src_b, int16_t *dst, - unsigned size, int16_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s16( src_a[idx], src_b[idx], mod ); -} diff --git a/envs/m85-an555/src/test_src b/envs/m85-an555/src/test_src deleted file mode 120000 index 6142ff4..0000000 --- a/envs/m85-an555/src/test_src +++ /dev/null @@ -1 +0,0 @@ -../../../tests/sqmag \ No newline at end of file diff --git a/envs/m85-an555/test.bin b/envs/m85-an555/test.bin deleted file mode 100755 index 7917ca55c81c704376a5aac6eef681af86f325ef..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 36324 zcmeEud0bOh`uDl_W>0_sE(oY0abZ(~NbA-yhHzc5b?ai))+k+K?Lup3>eQK%09sq? z3=p)CXtky7sJMXE8JDp*)#);=^9yP_Zd5u2QlQ!!z~%}_-tV~q+xgAB??3Oqqo2#Y zIrps3exBz!mk=f+D}qF4;2PI|r_Tcy`Psjpm(k~Y)j!o6c=vz1|E!z;KK>U1|AoMR zA@E-a{1*cMg~0zO1i0C<+#)OM<8ry$xLNtL&L~I*y=b?JT;-h4;$PFxu46Lh=sBM( zS6j=>rnN}j&+F-rDCU{gWgiO#FNS2fv9+?<2Rn(AF`4r9CqM1)OZ6&=RvnS&Ce+I3 z$ffruzf)O4!LxOwj*j&O>d`t?H=R>y|3h<0nyM}~cLyH|-VBkX&3qIeJA2UFn0QXO z9g4(T;;Wby(SIifk|ZM$Bw{v6W_`NZGU54dnXq0okQ73MH+zWS?V;nS=urLtm<&2M zr-o$w_FfWw`$ZBhy*M(zZBgWXLe#0!UNW+_*Dx~a26Lb32KVfkA8htr-^&W3@a_>& z){a@yfBp4`e~S7$QGaL34fTz*8`|*)0!$>Sqou|l?lTO$K@T(NTiOfCo%p@8zqdO5 z2l9CrT8U{vVs?NqXIcq^H@W$aBxEI2~yV=3K4RhG?GFR{Tqg4M14T} zmhD+6%Ct-=C(P`X`fiuiIQJa6aqC)Vt|{^a`d z>#7+ad@oxu?z(KVEcD>ao!1}J?!5kZLiu%4fcs|D^KRtX(2^uYXu-;7?)`mD)Y;4% z`pq#n^naUpL%%-uhJMZ8MLqeuI7q+q`XgyOuRl6%%XM}x?&xPl!lWDe6N7K)tFvzC zy+dy3m)>bp`%a+F5n99O7!AkoLw+XB&x4nT8=@`FRHPrTM ze|d1x5ZgvUC6)hZWlBqIQNjbK&bDgqKijHf&bBH(IoqmoXsX@U$qEJiTflitxg>Ny z@;Ir+V8Gic`bq>gg08aPtoqIV7#+kdAVB*UfX zFQa$*o4y~YFUK=&LoZsN-X|+b#wOst#QV~>`HyW%G~E5e$HI}6dta-W;8>rSW;dN! 
[GIT binary patch data omitted]

diff --git a/envs/m85-an555/test.elf b/envs/m85-an555/test.elf
deleted file mode 100755
index db6ee859a4f00432c29db04ef7bbde87eec0c4d0..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 159928
[GIT binary patch data omitted]
z@q>r_sWONopvS(1z6td8Uqaso`u$%*?*V=5m(T}5f9^}@uCEKR2Xyy2^e+Yd1JKcI z&UGkvdyC*`0XDc(-Lq^_W@})-1Xd?q-3iQIr5w}vYZ2}VZIbA7 z9ov<}`j16u6BF76i{jOFD&?c`NrwfsrE08Rw2EVAc{?aNPv3KG;;mAYrr;S8Mwb$M zQZOzl_TXDIFkRxHeD+_vR(=!cl)pPO$$vTU%NjO=PWgCuCLxdfha}9U;lnBXgA$$m zR(mpiEwH|)T$k^b=uM#O_BV+PCI8)8eITW116{`^V9GC&(mw#qed~x&`>Ot!D@*AA zD6YSu$p-+#V7_pwd zN21@3CirIwe-};e9SOgmCOw*ALO#=1OPJ~168<7h`dW!jef<))qQM@8z1Z?;KiWSi z(aFC>!lZANFzMpgB;X?2a zLZ5PlKSPuP4*=71>GsE5$`JeD8F&aLKiF}o{fY8qO#Ng`{ZwocoR`!Oz5It#2?7{Uu4v z`uLr;7eTKNae0zXdxj7^-ny@aXX zQ3*enPQFCng8S`BU<`Q^_XBIi(EPZ6Lf2;(N5WU34ff1pX=y4KkdhLa~X4;T*h4Y)-lWXgw&ph`OdmZSKHe*3D>5{ zpX=nhyqnU{@1CEe-;##TwR1iFsx)-2o$GY!mtUL2oR`rLz5Q`5T)k!z_oSh7?OLZ( z-ew7JPQ%Z&Yn}fI(E08z=Q8yB!E+Lw@^&W1f9mrS*Shueq{r%p;^!p$lD-A+^sD?f z@qqeBtw6neT<=!tCh>e4eYxJP(|3V>7cj?}y1s|jC+Tc2Tj#w30tjl3d>PI+^ill+wDmoVu+4kYQsJ5%cSNDBTS1#gw=pXraL(9s5`;W7^O z{;*BLzfYq-*XC7zlXx!;o$J&(y*~||>&iNv`fQUh%g=RYnf{`LKTRW#>&`kq({rs^ z$CUqZO8ey6vrKhzK{bgoP4be5-A!ewdrxi%^DTd}sLm@q*H%YiC4gdWTo%~xQJTnddR*6pjZ4#c9hW|N?h zN1~Jekc1b2|JzuaT=mFR3=$0Zy{!~d~FC;y;?!zujK zpWjVGgk$*n4e9#&4e6aouVdcd*QJs7ZS6Z~`gp8M!gr_P-zxbNMGftTw6pU{~4WVaX3GO5Gmm=gj%4`y2xZhl>FPF#^Zr~B{zmrHu zc_#z^9Q<6zCI1z`8(?3~ACg`HT!;CeA8U9X@JHycw`u%~fN|tt4Gcy%iKW0(Az#TK zH~>8DJaz4H0`Lmp*$#EDEah(mmgQ{%Ru*9qxnQp!6Kl`kE{unt1ilge4tOAe*1tA;`y^yqkjVYGR7YlYxop! 
z1?Ep?Y1oYUsq-*icuK<#;HgsiE(G3z@vlRp7X#OgNa}MHuzs+7^}x5zQQxtkzSjfG z_G(GN9l$R`zOL_W!20^E3r~P|0KcuR>C-;@pg*5C(M(WYo45z`gRqaT-;aRx^;{QL zSRVoA{FBO`A)W<3jPe#CJ@wlJ%=WF*vDPPYU|-JHk^WoY&MvL}ApduPllob>_A;b= z7G+6TeTbkvoWMVrFLFNCJ2QU2RCeh=^k zS;_J)0saW>S#RGr0?YOe1LH{8>)XJ0Iq`f=c_`oafM)~i_Wyg}PN}~<0xZkh4!j@Z z1HJup19QB-7ZLl*8!7z%3e4w0-M$0BSD<|pGyhY-UFWO*m;E^d3yJ#rstfHo7x-n^ zgZouzj{@KyVEzV?lD-5dt#}?UrvqQ0Q?#Jo!)+y0?YMt z3-GJZ$F1>4fM3J&qf%bTTM2v#`qxt${Z(v%_z3)fioWYWKLh(Sf9n4L@J#rlCp7#x zu-qP>Pr9Ffiv2b$@v#CA}+4{XQt?pXg7=0L%3;0a)&j zlYyt=`T035znSDm{oJDER|)(8+N17Ixc)BRzvhE}4E2NRR>~U&mdm#WSoaUSU)&9R zP-=e<1LK9?MF0P3O8OUozk~c8n*29_KSO>$((qfrNqsWJzXA((P8{@K9}`0*{t%x4 zz6A48T;FE>oCRJBdAj`3sLuk-A97qmx*PaqN#7FSU7+vQ=;gqCUfHDCdp>Xt+W&2o zkNUbEShok|Vat=?e3R-=Y~nj;|8jf2P2%VM`3JyqeLM`jL+US&0l$a(JF4mPBybJ< z+ua)O2EIn?FO>IN;Md@9{Lq*B{2uu0D1T7n|06Kx&kBLreg=VCF(0y9qkjRcmxuCb zxEF9gnxxUk0?YpYV&IRY_Td3OSd)~u2w3(X-vs9TnjHq>{V@p4`C8Rp8RAx8*?-&y zd=l+PZ=ZY*ubY10;Um3|8ODjpCtLSf%ijy+MDU;0L$_h1Am0_ z*fsi6@=N}&75L?a>bp}+-w8amQGExAcnvVWVXTk8?g5s^YYzec6#D7*`zPRJf65fk z0AC{AuU`Sa74yA2HGTF2U!?70W`6$)EZ>ig10ROH8a4j2!1DMKbuB)V{O2vO-*@Vg z{V@}cy8!K3Z(n18e^!j=g^zWd3>g{76@M}}yzc658`geiP3&Xxz`Q8VX?-v8W zKb8E&pMkqzA2;~fzbwe_F3G>14~#>(pV40y0Q33Hr0|<~?;G+wGXwZfkdJCq^tlo^ zS^pMM2Q1s?X5in!-iI{)<-jJiFTK30fLCMuzf_~I0q%tTmGL3`>E8j%gjI@m~lWkm}@^7H#D(O73o3odxQHFLe5MHO{5-pZOe^A>u&!XM}eEDJ?r zfevr1)!Q7#hhw7P^S68P`IjbN*z1qAcSOCu&Q+qhy)_aJ!~*`}Y3Vt=NH8Ihtj6m4dN&*C6j(GuwhwZ&8;Mw?V1ZVPm@Lwu=d z3j5lYM`4#0Eg_jvIrnxHM<}yYL^@+srllp^84dCz)aGxA;G{Vm@U@}tqA`D{U1*k9 zfMOp#{fdO_mK7^ILa_j4tn`JKBVHK|p!RtIxvW%F545(zBDjPv&wz@PXsn|R-=M)+ zxV_mICQUuTXIxeQurd(wOEtlcsc1G$tBhZTG5(snvwjxD)vvSrHLv_M} zytq7oisKoGEOKk4Bfx@2N@v2jEoi(^mZ&3;NUoR=msU1YM@7s^RffV$u`TE~F`k5? 
zzF2HE&%$UZ%XtDxe#X(J4h1VV=Y>E=M_W4ytE2n?4$fE_M6f|y!U*1EseRFCKxtMq zhFUaVQ@e5{8d}z-v|jau;?W;ky*|G`+S%l9xfCW>F0^)r{h<}o1#}UA%XFRG+1B3S zmvBpHmA@q#0$C{{#4ZcOA|XFc)S}{Kc__?&qZAl1%ZVr$@I}yp{p_n+t)Y@B2~iuP z*y>2Y3r?@*O==ca!Dxd4}`eB>SHprTbvQ+w_0pV2(?d`zDBu^Lw)hxox-l>pPyMpgJI4pQS;Z+O- zS~XhWvby#*?_7A7aG*o>GG4Uo0J99-idtaSH5Cos#q(~e!8t$Uh29M}06W17;dURo zy1zwuSp|@+wItBkxV?JQfzam8j%a8_z}wo1E+o8Ke^V4iqY5lXk|?din2j5hxxNm+ z$JZH-CG_;dld|D?JKHEm&Mf5D_w*%Xszf**QPWlZ%{LyS815gw?(mm;;_A)Wk{q} zP%~?hHRNq>kF55_d`;m%!ie+QaL?-KjKnTYD&$>MUzboHeom6qp^I8lSLIckt5ywB zr5e0pMQ=YV*o%8c%-gPNM7;BGsBNVJQTcsVak=KS=l{U~6jV5fQ z8?rMr9SNW+Ts}R$k|;F23AskKJBe~h>lYpHJHkt6pj_o$PBT8LnC+J(14%O0@ZJ|j zrK>I(oa!Bqg;ZN%t8bMeKj8PGq2W5(2xLcF)V8DP0dEHq{2+Xu!oyaUk`tVw-nA9I zlt#`zuH3bfUPB=3P+QfaC5zQz5|mT!lS&Cy(>H~dDb`mUv)r>GP3f?mZE9PgWpJ$( zO-eKc+M0tHqF|t+XPR_zVo?D1-9(KiTGHjiFHKKe@_K`z(xy<1!!XTDB83E0`$pPV&V=FEEmc!@hS5N@(FHDc1YjhzFE!VW07fU#K++W@ zTk{MlJH^Fm4Mfit7Fre)98oD=UM!|J8`iR;EfF;*U-PY0Nv{pr(6rX~osQ9BuM|(4ktqxH*KGitc z4XOhz=^d%kJ(_rDNf`eM$r@UNQGL8}`+(eVG#J8=lLpqt+`PRqe{{E)5B^wl}akxH&Oft0${c=VeLq(#+%q>@e@ zZX`4Jt>Dc|wG*#mK!TpBR4-Z!omf+_2&U;fU4y&^-0#-==nMDHUb#j-=Qdey@4?lJ*F^r7}$OMq;4D|MOXd z9Lj@)H@Ym`Nfp)7v?{}IvN1I;E2ZLnIdCg4E8L03H%&5~>R!R4WUSwvqWUdTE^92< z5r_uc!+s0|v?nWX;z3GLzSSRUk-XXT&~56#DNz_YAL^rf63u7f5{Cq;r%*?&xTKGi z)zK|3NW%-YYAQRUskh&#_WS{T)QZzIRiB0+L^X{vW>IXjOmnt$RoW;psyz=TxLbH% z_Nv2tfikNyl_H1R)tdmGY=h`?hY(|qETxD#tvVIqhGZl zD}6Sm+zcw|IE5&Vu5QJ^3URDMjf47`GH@%#bV9K*z7xeXQ?W7%6UA(e`nh8KV)$xN z%tr^xVrx;7;+d)=)P$ich?qcVh3^u@ssY3JFe(~WhmMt50nBcEwPPIEICp|&1|FD) z!nZ{4Q~`f6J)#^1_yu3&cEo%i(SeOToFipg9-L3(QG~#2+=jw+&s^|v?oK~+{xV$P zIRodf_GEBID|zUA)wslSZCeZG9nDiJ$P@o_$J4>#)Th16(>l<&$Rn+VT&>^a{ou2fHzUtL}pi07pm?F;agYv+t~ zojSF1o$soznZ?VJX>`5~8Xwat$5^vykY3i+RVBZaTr^y27H&+wVp>wX1S0uZ$5__V jZHzM%=v!uS8421!IXpN~DVbILL#tW*P@$+lI^X{U>1Iq_ diff --git a/envs/m85-an555/test_loaded_sqmag b/envs/m85-an555/test_loaded_sqmag deleted file mode 100644 index e69de29..0000000 diff --git a/tests/common/misc.c b/tests/common/misc.c 
deleted file mode 100644 index 80afd37..0000000 --- a/tests/common/misc.c +++ /dev/null @@ -1,137 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
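The file being removed here stamps out each buffer helper once per integer width via token pasting. A self-contained rendition of that pattern follows — the `uint(w)` width selector is assumed to come from the deleted `misc.h`, so it is defined locally:

```c
#include <stdint.h>

/* Width selector; assumed to live in misc.h in the original sources. */
#define uint(w) uint##w##_t

/* Generate one comparison helper per width, as misc.c does for
 * 8/16/32/64-bit buffers. Returns 0 iff the buffers are equal;
 * the XOR-accumulate avoids an early exit on the first mismatch. */
#define GEN_COMPARE_BUF(bits)                                  \
    static int compare_buf_u##bits(uint(bits) const *src_a,    \
                                   uint(bits) const *src_b,    \
                                   unsigned len)               \
    {                                                          \
        uint(bits) res = 0;                                    \
        for (; len; src_a++, src_b++, len--)                   \
            res |= (uint(bits))(*src_a ^ *src_b);              \
        return res != 0;                                       \
    }
GEN_COMPARE_BUF(8)
GEN_COMPARE_BUF(32)
#undef GEN_COMPARE_BUF
```

The deleted file returns the raw accumulator instead of `res != 0`; note that for the 64-bit instance a plain `return( res );` can truncate a nonzero `uint64_t` accumulator to an `int` that happens to be 0.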
- * - */ - - #include - #include - - #define GEN_FILL_RANDOM( bits ) \ - void fill_random_u ## bits ( uint(bits) *buf, unsigned int len ) \ - { \ - unsigned byte_len = len * sizeof(*buf); \ - uint8_t *byte_buf = (uint8_t*) buf; \ - for( ; byte_len; byte_buf++, byte_len-- ) \ - { \ - uint8_t cur_byte; \ - cur_byte = get_random_byte(); \ - *byte_buf = cur_byte; \ - } \ - } - GEN_FILL_RANDOM(8) - GEN_FILL_RANDOM(16) - GEN_FILL_RANDOM(32) - #undef GEN_FILL_RANDOM - - #define GEN_COPY( bits ) \ - void copy_buf_u ## bits ( uint(bits) *dst, \ - uint(bits) const *src, unsigned int len ) \ - { \ - for( ; len; dst++, src++, len-- ) \ - *dst = *src; \ - } - GEN_COPY(8) - GEN_COPY(16) - GEN_COPY(32) - #undef GEN_COPY - - #define GEN_COMPARE_BUF( bits ) \ - int compare_buf_u ## bits ( uint(bits) const *src_a, \ - uint(bits) const *src_b, \ - unsigned len ) \ - { \ - uint(bits) res = 0; \ - for( ; len; src_a++, src_b++, len-- ) \ - res |= ( (*src_a) ^ (*src_b) ); \ - return( res ); \ - } - GEN_COMPARE_BUF(8) - GEN_COMPARE_BUF(16) - GEN_COMPARE_BUF(32) - GEN_COMPARE_BUF(64) - #undef GEN_COMPARE_BUF - - #define GEN_PRINT_BUF( bits ) \ - void debug_print_buf_u ## bits ( uint(bits) const *buf, \ - unsigned entries, \ - const char *prefix ) \ - { \ - unsigned idx; \ - for( idx = 0; idx < entries; idx += 8 ) \ - { \ - debug_printf( "%s [%#04x-%#04x]: %#04x %#04x %#04x %#04x %#04x %#04x %#04x %#04x\n", \ - prefix, idx, idx+8, \ - buf[idx+0], buf[idx+1], buf[idx+2], buf[idx+3], \ - buf[idx+4], buf[idx+5], buf[idx+6], buf[idx+7] ); \ - } \ - } - GEN_PRINT_BUF(8) - GEN_PRINT_BUF(16) - GEN_PRINT_BUF(32) - GEN_PRINT_BUF(64) - #undef GEN_PRINT_BUF - - #define GEN_PRINT_BUF_S( bits ) \ - void debug_print_buf_s ## bits ( sint(bits) const *buf, \ - unsigned entries, \ - const char *prefix ) \ - { \ - unsigned idx; \ - for( idx = 0; idx < entries; idx += 8 ) \ - { \ - debug_printf( "%s [%u-%u]: %d %d %d %d %d %d %d %d\n", \ - prefix, idx, idx+8, \ - buf[idx+0], buf[idx+1], buf[idx+2], 
buf[idx+3], \ - buf[idx+4], buf[idx+5], buf[idx+6], buf[idx+7] ); \ - } \ - } -GEN_PRINT_BUF_S(8) -GEN_PRINT_BUF_S(16) -GEN_PRINT_BUF_S(32) -GEN_PRINT_BUF_S(64) -#undef GEN_PRINT_BUF_S - -/* Helper to transpose buffers in case this is needed for input preparation. */ -#define GEN_BUFFER_TRANSPOSE(bitwidth) \ -void CONCAT3(buffer_transpose_, u, bitwidth) \ - ( uint(bitwidth) *dst, uint(bitwidth) const *src, \ - unsigned block_length, unsigned dim_x, unsigned dim_y ) \ -{ \ - unsigned i,j,k,idx_load,idx_store; \ - \ - for( i=0; i -#include -#include - -void random_poly( uint16_t *poly, unsigned int len ) -{ - fill_random_u16( poly, len ); -} - -void zero_poly( uint16_t *poly, unsigned int len ) -{ - for( ; len; len--, poly++ ) - *poly = 0; -} - -int compare_poly( uint16_t const *a, uint16_t const *b, unsigned int len ) -{ - return( compare_buf_u16( a, b, len ) ); -} - -void mask_poly( uint16_t *poly, unsigned int len, unsigned bitwidth ) -{ - uint16_t mask = (1u << bitwidth) - 1; - for( ; len; len--, poly++ ) - *poly &= mask; -} - -void copy_poly( uint16_t *dst, uint16_t const *src, unsigned int len ) -{ - for( ; len; len--, dst++, src++ ) - *dst = *src; -} - -void debug_print_poly(uint16_t *poly, unsigned int len, const char *prefix ) -{ - unsigned idx; - for( idx=0; idx < len; idx += 16 ) - { - unsigned sub_idx; - debug_printf( "%s[%03u-%03u]: ", prefix, idx, idx+15 ); - for( sub_idx=0; sub_idx<16; sub_idx++ ) - debug_printf( "%02x ", (unsigned) poly[idx + sub_idx] ); - debug_printf( "\n" ); - } -} - -/* - * Things related to modular arithmetic - */ - -/* Scalar operations */ - -int32_t mod_red_s32( int64_t a, int32_t mod ) -{ - int32_t tmp = a % mod; - if( tmp < 0 ) - tmp += mod; - return( tmp ); -} - -int32_t mod_mul_s32( int32_t a, int32_t b, int32_t mod ) -{ - int64_t tmp = (int64_t) a * (int64_t) b; - int32_t res = (int32_t)( tmp % mod ); - return( res ); -} - -int32_t mod_add_s32( int32_t a, int32_t b, int32_t mod ) -{ - int64_t tmp = (int64_t) a + (int64_t) 
b; - int32_t res = tmp % mod; - return( res); -} - -int32_t mod_sub_s32( int32_t a, int32_t b, int32_t mod ) -{ - int64_t tmp = (int64_t) a - (int64_t) b; - int32_t res = tmp % mod; - return( res); -} - -int32_t mod_pow_s32( int32_t base, unsigned exp, int32_t mod ) -{ - int32_t base_pow = base; - int32_t tmp = 1; - while( exp != 0 ) - { - if( exp & 1 ) - tmp = mod_mul_s32( tmp, base_pow, mod ); - - base_pow = mod_mul_s32( base_pow, base_pow, mod ); - exp >>= 1; - } - - return( tmp ); -} - -/* Scalar operations */ - -int16_t mod_red_s16( int64_t a, int16_t mod ) -{ - int16_t tmp = a % mod; - if( tmp < 0 ) - tmp += mod; - return( tmp ); -} - -int16_t mod_mul_s16( int16_t a, int16_t b, int16_t mod ) -{ - int64_t tmp = (int64_t) a * (int64_t) b; - int16_t res = (int16_t)( tmp % mod ); - return( res ); -} - -int16_t mod_add_s16( int16_t a, int16_t b, int16_t mod ) -{ - int64_t tmp = (int64_t) a + (int64_t) b; - int16_t res = tmp % mod; - return( res); -} - -int16_t mod_sub_s16( int16_t a, int16_t b, int16_t mod ) -{ - int64_t tmp = (int64_t) a - (int64_t) b; - int16_t res = tmp % mod; - return( res); -} - -int16_t mod_pow_s16( int16_t base, unsigned exp, int16_t mod ) -{ - int16_t base_pow = base; - int16_t tmp = 1; - while( exp != 0 ) - { - if( exp & 1 ) - tmp = mod_mul_s16( tmp, base_pow, mod ); - - base_pow = mod_mul_s16( base_pow, base_pow, mod ); - exp >>= 1; - } - - return( tmp ); -} - -/* Buffer operations */ - -void mod_add_buf_u16( uint16_t *src_a, uint16_t *src_b, uint16_t *dst, - unsigned size ) -{ - for( unsigned i=0; i < size; i++ ) - dst[i] = src_a[i] + src_b[i]; -} - -void mod_add_buf_s32( int32_t *src_a, int32_t *src_b, int32_t *dst, - unsigned size, int32_t modulus ) -{ - for( unsigned i=0; i < size; i++ ) - dst[i] = mod_add_s32( src_a[i], src_b[i], modulus ); -} - -void mod_reduce_buf_s32( int32_t *src, unsigned size, int32_t mod ) -{ - for( unsigned i=0; i < size; i++ ) - { - src[i] = src[i] % mod; - if( src[i] < 0 ) - src[i] += mod; - } -} - -void 
mod_reduce_buf_s32_signed( int32_t *src, unsigned size, int32_t mod ) -{ - mod_reduce_buf_s32( src, size, mod ); - for( unsigned i=0; i < size; i++ ) - { - if( src[i] >= ( mod / 2 ) ) - src[i] -= mod; - } -} - -void mod_mul_buf_const_s32( int32_t *src, int32_t factor, int32_t *dst, - unsigned size, int32_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s32( src[idx], factor, mod ); -} - -void mod_mul_buf_s32( int32_t *src_a, int32_t *src_b, int32_t *dst, - unsigned size, int32_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s32( src_a[idx], src_b[idx], mod ); -} - -/* Buffer operations */ - -void mod_add_buf_s16( int16_t *src_a, int16_t *src_b, int16_t *dst, - unsigned size, int16_t modulus ) -{ - for( unsigned i=0; i < size; i++ ) - dst[i] = mod_add_s16( src_a[i], src_b[i], modulus ); -} - -void mod_reduce_buf_s16( int16_t *src, unsigned size, int16_t mod ) -{ - for( unsigned i=0; i < size; i++ ) - { - src[i] = src[i] % mod; - if( src[i] < 0 ) - src[i] += mod; - } -} - -void mod_reduce_buf_s16_signed( int16_t *src, unsigned size, int16_t mod ) -{ - mod_reduce_buf_s16( src, size, mod ); - for( unsigned i=0; i < size; i++ ) - { - if( src[i] >= ( mod / 2 ) ) - src[i] -= mod; - } -} - -void mod_mul_buf_const_s16( int16_t *src, int16_t factor, int16_t *dst, - unsigned size, int16_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s16( src[idx], factor, mod ); -} - -void mod_mul_buf_s16( int16_t *src_a, int16_t *src_b, int16_t *dst, - unsigned size, int16_t mod ) -{ - unsigned idx; - for( idx = 0; idx < size; idx++ ) - dst[idx] = mod_mul_s16( src_a[idx], src_b[idx], mod ); -} diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78.s b/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78.s deleted file mode 100644 index 87e6373..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78.s +++ /dev/null @@ -1,322 +0,0 @@ -/// -/// Copyright (c) 2022 Arm 
Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
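For orientation: this kernel computes the forward NTT on 256 signed 32-bit coefficients mod q = 8380417, with the eight butterfly layers merged into the passes 1-2-3, 4-5-6 and 7-8 that the filename encodes. An unmerged, iterative Cooley-Tukey NTT in C is a useful mental model; the sequential twiddle layout below is an illustrative assumption, not the layout of ntt_dilithium_123_456_78_twiddles.s:

```c
#include <stdint.h>

#define Q 8380417   /* Dilithium modulus */

/* Unmerged iterative Cooley-Tukey forward NTT on 256 coefficients.
 * zetas[] holds the 255 twiddles in the order this loop consumes
 * them (an assumption -- the deleted kernel uses its own merged
 * layout). Coefficients stay as possibly-negative representatives. */
static void ntt_ref(int32_t p[256], const int32_t zetas[255])
{
    int k = 0;
    for (int len = 128; len >= 1; len >>= 1)             /* layers 1..8 */
        for (int start = 0; start < 256; start += 2 * len) {
            int32_t z = zetas[k++];
            for (int j = start; j < start + len; j++) {
                int32_t t = (int32_t)(((int64_t)z * p[j + len]) % Q);
                p[j + len] = (p[j] - t) % Q;             /* b' = a - z*b */
                p[j]       = (p[j] + t) % Q;             /* a' = a + z*b */
            }
        }
}
```

255 twiddles are consumed in total (1 + 2 + ... + 128); the merged kernel below instead loads seven of them per layer-1-2-3 block and keeps them in GPRs and on the stack.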
-/// - -.data -.p2align 4 -roots: -#include "ntt_dilithium_123_456_78_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s32 \dst, \src, \const - vqrdmulh.s32 \src, \src, \const_twisted - vmla.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.macro qsave loc, a // slothy:no-unfold - vstrw.32 \a, [sp, #\loc\()] -.endm -.macro qrestore a, loc // slothy:no-unfold - vldrw.32 \a, [sp, #\loc\()] -.endm -.macro restored a, b, loc // slothy:no-unfold - ldrd \a, \b, [sp, #\loc\()] -.endm -.macro saved loc, a, b // slothy:no-unfold - strd \a, \b, [sp, #\loc\()] -.endm -.macro restore a, loc // slothy:no-unfold - ldr \a, [sp, #\loc\()] -.endm -.macro save loc, a // slothy:no-unfold - str \a, [sp, #\loc\()] -.endm - -// Aligns stack =0 mod 16 -.macro align_stack_do // slothy:no-unfold - mov r11, sp - and r12, r11, #0xC // 8 of ==8 mod 16, 0 otherwise - sub sp, sp, r12 // Align stack to 16 byte - sub sp, sp, #16 - str r12, [sp] -.endm - -// Reverts initial stack correction -.macro align_stack_undo // slothy:no-unfold - ldr r12, [sp] - add sp, sp, #16 - add sp, sp, r12 -.endm - -#define STACK_SIZE (5*16+8) // +8 is for alignment -#define QSTACK4 (0*16) -#define QSTACK5 (1*16) -#define QSTACK6 (2*16) - -#define ROOT0_STACK (3*16) -#define ROOT1_STACK (3*16 + 8) -#define ROOT4_STACK (4*16) -#define RPTR_STACK (4*16 + 8) - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_dilithium_123_456_78, %function -.global ntt_dilithium_123_456_78 -ntt_dilithium_123_456_78: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - align_stack_do - sub sp, sp, #STACK_SIZE - - modulus .req r12 - r_ptr .req r11 - - .equ modulus_const, -8380417 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr r_ptr, roots_addr - - in .req r0 - in_low .req in - in_high .req r1 - 
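The mulmod macro above is the Barrett multiplication by a fixed constant: vmul keeps the low 32 bits of a·const, vqrdmulh computes round(a·const_twisted / 2^31), and vmla adds that product times the modulus register — which this file loads with -8380417, so the step subtracts a multiple of q. A scalar C model (the `twist()` precomputation and its rounding are an illustrative assumption) shows why the result represents a·c mod q and stays inside (-q, q):

```c
#include <stdint.h>

#define Q 8380417  /* Dilithium modulus; the asm keeps -Q in a GPR */

/* Twisted constant round(c * 2^31 / Q), precomputed per twiddle
 * (illustrative -- the real tables ship in the *_twiddles.s files). */
static int32_t twist(int32_t c)
{
    return (int32_t)((((int64_t)c << 31) + Q / 2) / Q);
}

/* Scalar model of the three-instruction sequence:
 *   vmul.s32     lo, a, c      -> low 32 bits of a*c
 *   vqrdmulh.s32 hi, a, c_tw   -> round(a*c_tw / 2^31)
 *   vmla.s32     lo, hi, -Q    -> lo - hi*Q, wrapping mod 2^32   */
static int32_t mulmod_model(int32_t a, int32_t c, int32_t c_tw)
{
    int32_t lo = (int32_t)((uint32_t)a * (uint32_t)c);
    int32_t hi = (int32_t)(((int64_t)a * c_tw + (1LL << 30)) >> 31);
    return (int32_t)(uint32_t)((int64_t)lo - (int64_t)hi * Q);
}
```

Because hi ≈ round(a·c/q), the wrapped difference is the representative a·c − q·round(a·c/q): congruent to a·c mod q and bounded by q in absolute value, even though a·c itself may overflow 32 bits.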
- add in_high, in, #(4*128) - - root2 .req r2 - root2_tw .req r3 - root3 .req r4 - root3_tw .req r5 - root5 .req r6 - root5_tw .req r7 - root6 .req r8 - root6_tw .req r9 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - data4 .req q1 - data5 .req q2 - data6 .req q3 - data7 .req q4 - - tmp .req q7 - - /* Layers 1-3 */ - - rtmp .req root6 - rtmp_tw .req root6_tw - - ldrd rtmp, rtmp_tw, [r_ptr], #(7*8) - saved ROOT0_STACK, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #(1*8 - 7*8)] - saved ROOT1_STACK, rtmp, rtmp_tw - ldrd root2, root2_tw, [r_ptr, #(2*8 - 7*8)] - ldrd root3, root3_tw, [r_ptr, #(3*8 - 7*8)] - ldrd rtmp, rtmp_tw, [r_ptr, #(4*8 - 7*8)] - saved ROOT4_STACK, rtmp, rtmp_tw - ldrd root5, root5_tw, [r_ptr, #(5*8 - 7*8)] - ldrd root6, root6_tw, [r_ptr, #(6*8 - 7*8)] - save RPTR_STACK, r_ptr - - .unreq rtmp - .unreq rtmp_tw - rtmp .req r10 - rtmp_tw .req r11 - - mov lr, #8 - .p2align 2 -layer123_loop: - vldrw.32 data0, [in_low] - vldrw.32 data4, [in_high] - restored rtmp, rtmp_tw, ROOT0_STACK - ct_butterfly data0, data4, rtmp, rtmp_tw - qsave QSTACK4, data4 - vldrw.32 data1, [in_low, #128] - vldrw.32 data5, [in_high, #128] - ct_butterfly data1, data5, rtmp, rtmp_tw - qsave QSTACK5, data5 - vldrw.32 data2, [in_low, #256] - vldrw.32 data6, [in_high, #256] - ct_butterfly data2, data6, rtmp, rtmp_tw - qsave QSTACK6, data6 - vldrw.32 data3, [in_low, #384] - vldrw.32 data7, [in_high, #384] - ct_butterfly data3, data7, rtmp, rtmp_tw - - restored rtmp, rtmp_tw, ROOT1_STACK - ct_butterfly data0, data2, rtmp, rtmp_tw - ct_butterfly data1, data3, rtmp, rtmp_tw - ct_butterfly data0, data1, root2, root2_tw - ct_butterfly data2, data3, root3, root3_tw - vstrw.32 data0, [in_low], #16 - vstrw.32 data1, [in_low, #(128-16)] - vstrw.32 data2, [in_low, #(256-16)] - vstrw.32 data3, [in_low, #(384-16)] - - qrestore data4, QSTACK4 - qrestore data5, QSTACK5 - qrestore data6, QSTACK6 - - restored rtmp, rtmp_tw, ROOT4_STACK - ct_butterfly data4, data6, rtmp, rtmp_tw - 
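The layer123_loop here keeps eight vectors of NTT state live at once — spilling data4-data6 with qsave/qrestore — so that three butterfly layers are applied per trip through memory. Stripped of vectorization, the access pattern for one lane can be sketched in C; `ct()` is a plain 64-bit butterfly standing in for the mulmod-based one, and w[0..6] are the seven roots the loop holds in GPRs and on the stack:

```c
#include <stdint.h>

#define Q 8380417

/* CT butterfly: (a, b) -> (a + w*b, a - w*b) mod Q,
 * with coefficients kept as possibly-negative representatives. */
static void ct(int32_t *a, int32_t *b, int32_t w)
{
    int32_t t = (int32_t)(((int64_t)w * *b) % Q);
    *b = (*a - t) % Q;
    *a = (*a + t) % Q;
}

/* Layers 1-3 merged for lane i (0..31): eight coefficients at
 * stride 32 -- the scalar analogue of data0..data7 -- receive
 * seven butterflies before being stored back. */
static void layer123(int32_t p[256], int i, const int32_t w[7])
{
    int32_t *c = p + i;
    ct(c +   0, c + 128, w[0]);  ct(c +  32, c + 160, w[0]);  /* layer 1 */
    ct(c +  64, c + 192, w[0]);  ct(c +  96, c + 224, w[0]);
    ct(c +   0, c +  64, w[1]);  ct(c +  32, c +  96, w[1]);  /* layer 2 */
    ct(c + 128, c + 192, w[4]);  ct(c + 160, c + 224, w[4]);
    ct(c +   0, c +  32, w[2]);  ct(c +  64, c +  96, w[3]);  /* layer 3 */
    ct(c + 128, c + 160, w[5]);  ct(c + 192, c + 224, w[6]);
}
```

One such pass over all 32 lanes touches every coefficient exactly once, which is what the eight loop iterations over four-lane vectors achieve.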
ct_butterfly data5, data7, rtmp, rtmp_tw - ct_butterfly data4, data5, root5, root5_tw - ct_butterfly data6, data7, root6, root6_tw - - vstrw.32 data4, [in_high], #16 - vstrw.32 data5, [in_high, #(128-16)] - vstrw.32 data6, [in_high, #(256-16)] - vstrw.32 data7, [in_high, #(384-16)] - - le lr, layer123_loop - .unreq in_high - .unreq in_low - - sub in, in, #(128) - restore r_ptr, RPTR_STACK - - /* Layers 4,5,6 */ - - .unreq rtmp - .unreq rtmp_tw - rtmp .req r3 - rtmp_tw .req r4 - - mov lr, #8 - .p2align 2 -layer456_loop: - ldrd rtmp, rtmp_tw, [r_ptr], #(7*8) - - vldrw.32 data0, [in] - vldrw.32 data4, [in, #64] - ct_butterfly data0, data4, rtmp, rtmp_tw - qsave QSTACK4, data4 - vldrw.32 data1, [in, #16] - vldrw.32 data5, [in, #80] - ct_butterfly data1, data5, rtmp, rtmp_tw - qsave QSTACK5, data5 - vldrw.32 data2, [in, #32] - vldrw.32 data6, [in, #96] - ct_butterfly data2, data6, rtmp, rtmp_tw - qsave QSTACK6, data6 - vldrw.32 data3, [in, #48] - vldrw.32 data7, [in, #112] - ct_butterfly data3, data7, rtmp, rtmp_tw - - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + 1)*8)] - ct_butterfly data0, data2, rtmp, rtmp_tw - ct_butterfly data1, data3, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + 2)*8)] - ct_butterfly data0, data1, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + 3)*8)] - ct_butterfly data2, data3, rtmp, rtmp_tw - - vstrw.32 data0, [in], #128 - vstrw.32 data1, [in, #(-128+16)] - vstrw.32 data2, [in, #(-128+32)] - vstrw.32 data3, [in, #(-128+48)] - - qrestore data4, QSTACK4 - qrestore data5, QSTACK5 - qrestore data6, QSTACK6 - - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + 4)*8)] - ct_butterfly data4, data6, rtmp, rtmp_tw - ct_butterfly data5, data7, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + 5)*8)] - ct_butterfly data4, data5, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + 6)*8)] - ct_butterfly data6, data7, rtmp, rtmp_tw - - vstrw.32 data4, [in, #(-128+64)] - vstrw.32 data5, [in, #(-128+80)] - vstrw.32 data6, [in, #(-128+96)] - vstrw.32 data7, [in, 
#(-128+112)] - - le lr, layer456_loop - - sub in, in, #(4*256) - - .unreq rtmp - .unreq rtmp_tw - .unreq root2 - .unreq root2_tw - - /* Layers 7,8 */ - - root0 .req q5 - root0_tw .req q6 - root1 .req q5 - root1_tw .req q6 - root2 .req q5 - root2_tw .req q6 - - mov lr, #16 - .p2align 2 -layer78_loop: - vld40.32 {data0, data1, data2, data3}, [in] - vld41.32 {data0, data1, data2, data3}, [in] - vld42.32 {data0, data1, data2, data3}, [in] - vld43.32 {data0, data1, data2, data3}, [in]! - - vldrw.32 root0, [r_ptr], #+96 - vldrw.32 root0_tw, [r_ptr, #(+16-96)] - ct_butterfly data0, data2, root0, root0_tw - ct_butterfly data1, data3, root0, root0_tw - - vldrw.32 root1, [r_ptr, #(32 - 96)] - vldrw.32 root1_tw, [r_ptr, #(48 - 96)] - ct_butterfly data0, data1, root1, root1_tw - - vldrw.32 root2, [r_ptr, #(64-96)] - vldrw.32 root2_tw, [r_ptr, #(80-96)] - ct_butterfly data2, data3, root2, root2_tw - - vstrw.32 data0, [in, #( 0 - 64)] - vstrw.32 data1, [in, #(16 - 64)] - vstrw.32 data2, [in, #(32 - 64)] - vstrw.32 data3, [in, #(48 - 64)] - le lr, layer78_loop - - add sp, sp, #STACK_SIZE - align_stack_undo - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m55.s b/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m55.s deleted file mode 100644 index 21312ae..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m55.s +++ /dev/null @@ -1,717 +0,0 @@ -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, 
sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -.p2align 4 -roots: -#include "ntt_dilithium_123_456_78_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s32 \dst, \src, \const - vqrdmulh.s32 \src, \src, \const_twisted - vmla.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.macro qsave loc, a // slothy:no-unfold - vstrw.32 \a, [sp, #\loc\()] -.endm -.macro qrestore a, loc // slothy:no-unfold - vldrw.32 \a, [sp, #\loc\()] -.endm -.macro restored a, b, loc // slothy:no-unfold - ldrd \a, \b, [sp, #\loc\()] -.endm -.macro saved loc, a, b // slothy:no-unfold - strd \a, \b, [sp, #\loc\()] -.endm -.macro restore a, loc // slothy:no-unfold - ldr \a, [sp, #\loc\()] -.endm -.macro save loc, a // slothy:no-unfold - str \a, [sp, #\loc\()] -.endm - -// Aligns stack =0 mod 16 -.macro align_stack_do // slothy:no-unfold - mov r11, sp - and r12, r11, #0xC // 8 of ==8 mod 16, 0 otherwise - sub sp, sp, r12 // Align stack to 16 byte - sub sp, sp, #16 - str r12, [sp] -.endm - -// Reverts initial stack correction -.macro align_stack_undo // slothy:no-unfold 
- ldr r12, [sp]
- add sp, sp, #16
- add sp, sp, r12
-.endm
-
-#define STACK_SIZE (5*16+8) // +8 is for alignment
-#define QSTACK4 (0*16)
-#define QSTACK5 (1*16)
-#define QSTACK6 (2*16)
-
-#define ROOT0_STACK (3*16)
-#define ROOT1_STACK (3*16 + 8)
-#define ROOT4_STACK (4*16)
-#define RPTR_STACK (4*16 + 8)
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_dilithium_123_456_78_opt_size_m55, %function
-.global ntt_dilithium_123_456_78_opt_size_m55
-ntt_dilithium_123_456_78_opt_size_m55:
-
- push {r4-r11,lr}
- // Save MVE vector registers
- vpush {d8-d15}
-
- align_stack_do
- sub sp, sp, #STACK_SIZE
-
- modulus .req r12
- r_ptr .req r11
-
- .equ modulus_const, -8380417
- movw modulus, #:lower16:modulus_const
- movt modulus, #:upper16:modulus_const
- ldr r_ptr, roots_addr
-
- in .req r0
- in_low .req in
- in_high .req r1
-
- add in_high, in, #(4*128)
-
- root2 .req r2
- root2_tw .req r3
- root3 .req r4
- root3_tw .req r5
- root5 .req r6
- root5_tw .req r7
- root6 .req r8
- root6_tw .req r9
-
- data0 .req q0
- data1 .req q1
- data2 .req q2
- data3 .req q3
- data4 .req q1
- data5 .req q2
- data6 .req q3
- data7 .req q4
-
- tmp .req q7
-
- /* Layers 1-3 */
-
- rtmp .req root6
- rtmp_tw .req root6_tw
-
- ldrd rtmp, rtmp_tw, [r_ptr], #(7*8)
- saved ROOT0_STACK, rtmp, rtmp_tw
- ldrd rtmp, rtmp_tw, [r_ptr, #(1*8 - 7*8)]
- saved ROOT1_STACK, rtmp, rtmp_tw
- ldrd root2, root2_tw, [r_ptr, #(2*8 - 7*8)]
- ldrd root3, root3_tw, [r_ptr, #(3*8 - 7*8)]
- ldrd rtmp, rtmp_tw, [r_ptr, #(4*8 - 7*8)]
- saved ROOT4_STACK, rtmp, rtmp_tw
- ldrd root5, root5_tw, [r_ptr, #(5*8 - 7*8)]
- ldrd root6, root6_tw, [r_ptr, #(6*8 - 7*8)]
- save RPTR_STACK, r_ptr
-
- .unreq rtmp
- .unreq rtmp_tw
- rtmp .req r10
- rtmp_tw .req r11
-
- mov lr, #8
- .p2align 2
-.p2align 2
-layer123_loop:
- restored r10, r11, ROOT0_STACK // ..*..................................................................................
- vldrw.32 q4, [r1, #128] // ..........*..........................................................................
- vqrdmulh.s32 q1, q4, r11 // ............*........................................................................
- vldrw.32 q3, [r1] // .*...................................................................................
- vmul.s32 q4, q4, r10 // ...........*.........................................................................
- vldrw.32 q6, [r0, #128] // .........*...........................................................................
- vmla.s32 q4, q1, r12 // .............*.......................................................................
- vldrw.32 q2, [r1, #256] // ..................*..................................................................
- vsub.u32 q5, q6, q4 // ..............*......................................................................
- vqrdmulh.s32 q0, q2, r11 // ....................*................................................................
- vadd.u32 q4, q6, q4 // ...............*.....................................................................
- vmul.s32 q6, q2, r10 // ...................*.................................................................
- vldrw.32 q1, [r0, #256] // .................*...................................................................
- vmla.s32 q6, q0, r12 // .....................*...............................................................
- vldrw.32 q2, [r0] // *....................................................................................
- vadd.u32 q7, q1, q6 // .......................*.............................................................
- vmul.s32 q0, q3, r10 // ...*.................................................................................
- qsave QSTACK5, q5 // ................*....................................................................
- vqrdmulh.s32 q3, q3, r11 // ....*................................................................................
- vsub.u32 q5, q1, q6 // ......................*..............................................................
- vmla.s32 q0, q3, r12 // .....*...............................................................................
- vldrw.32 q1, [r1, #384] // ..........................*..........................................................
- vadd.u32 q6, q2, q0 // .......*.............................................................................
- vqrdmulh.s32 q3, q1, r11 // ............................*........................................................
- vsub.u32 q2, q2, q0 // ......*..............................................................................
- vmul.s32 q1, q1, r10 // ...........................*.........................................................
- restored r11, r10, ROOT1_STACK // ................................*....................................................
- vmla.s32 q1, q3, r12 // .............................*.......................................................
- vldrw.32 q0, [r0, #384] // .........................*...........................................................
- vqrdmulh.s32 q3, q7, r10 // ..................................*..................................................
- qsave QSTACK6, q5 // ........................*............................................................
- vmul.s32 q5, q7, r11 // .................................*...................................................
- vsub.u32 q7, q0, q1 // ..............................*......................................................
- vmla.s32 q5, q3, r12 // ...................................*.................................................
- vadd.u32 q0, q0, q1 // ...............................*.....................................................
- vmul.s32 q3, q0, r11 // ......................................*..............................................
- vsub.u32 q1, q6, q5 // ....................................*................................................
- vqrdmulh.s32 q0, q0, r10 // .......................................*.............................................
- vadd.u32 q6, q6, q5 // .....................................*...............................................
- vmla.s32 q3, q0, r12 // ........................................*............................................
- qsave QSTACK4, q2 // ........*............................................................................
- vsub.u32 q5, q4, q3 // .........................................*...........................................
- vqrdmulh.s32 q2, q5, r5 // .................................................*...................................
- vadd.u32 q0, q4, q3 // ..........................................*..........................................
- vmul.s32 q4, q5, r4 // ................................................*....................................
- qrestore q3, QSTACK5 // ..........................................................*..........................
- vmla.s32 q4, q2, r12 // ..................................................*..................................
- restored r10, r11, ROOT4_STACK // ............................................................*........................
- vsub.u32 q2, q1, q4 // ...................................................*.................................
- vmul.s32 q5, q0, r2 // ...........................................*.........................................
- vadd.u32 q4, q1, q4 // ....................................................*................................
- vqrdmulh.s32 q0, q0, r3 // ............................................*........................................
- qrestore q1, QSTACK6 // ...........................................................*.........................
- vmla.s32 q5, q0, r12 // .............................................*.......................................
- vstrw.u32 q4, [r0, #256] // .......................................................*.............................
- vmul.s32 q0, q1, r10 // .............................................................*.......................
- vsub.u32 q4, q6, q5 // ..............................................*......................................
- vqrdmulh.s32 q1, q1, r11 // ..............................................................*......................
- vadd.u32 q5, q6, q5 // ...............................................*.....................................
- vmul.s32 q6, q7, r10 // ..................................................................*..................
- vstrw.u32 q5, [r0] , #16 // .....................................................*...............................
- vqrdmulh.s32 q5, q7, r11 // ...................................................................*.................
- vstrw.u32 q2, [r0, #368] // ........................................................*............................
- vmla.s32 q6, q5, r12 // ....................................................................*................
- vstrw.u32 q4, [r0, #112] // ......................................................*..............................
- vmla.s32 q0, q1, r12 // ...............................................................*.....................
- vadd.u32 q7, q3, q6 // ......................................................................*..............
- vqrdmulh.s32 q2, q7, r7 // ........................................................................*............
- vsub.u32 q6, q3, q6 // .....................................................................*...............
- vmul.s32 q3, q6, r8 // ............................................................................*........
- qrestore q5, QSTACK4 // .........................................................*...........................
- vmul.s32 q4, q7, r6 // .......................................................................*.............
- vsub.u32 q7, q5, q0 // ................................................................*....................
- vmla.s32 q4, q2, r12 // .........................................................................*...........
- vadd.u32 q0, q5, q0 // .................................................................*...................
- vqrdmulh.s32 q5, q6, r9 // .............................................................................*.......
- vadd.u32 q1, q0, q4 // ...........................................................................*.........
- vstrw.u32 q1, [r1] , #16 // .................................................................................*...
- vsub.u32 q4, q0, q4 // ..........................................................................*..........
- vmla.s32 q3, q5, r12 // ..............................................................................*......
- vstrw.u32 q4, [r1, #112] // ..................................................................................*..
- vadd.u32 q5, q7, q3 // ................................................................................*....
- vstrw.u32 q5, [r1, #240] // ...................................................................................*.
- vsub.u32 q4, q7, q3 // ...............................................................................*.....
- vstrw.u32 q4, [r1, #368] // ....................................................................................*
-
- // original source code
- // vldrw.32 q0, [r0] // ..............*......................................................................
- // vldrw.32 q1, [r1] // ...*.................................................................................
- // restored r10, r11, ROOT0_STACK // *....................................................................................
- // vmul.s32 q7, q1, r10 // ................*....................................................................
- // vqrdmulh.s32 q1, q1, r11 // ..................*..................................................................
- // vmla.s32 q7, q1, r12 // ....................*................................................................
- // vsub.u32 q1, q0, q7 // ........................*............................................................
- // vadd.u32 q0, q0, q7 // ......................*..............................................................
- // qsave QSTACK4, q1 // ........................................*............................................
- // vldrw.32 q1, [r0, #128] // .....*...............................................................................
- // vldrw.32 q2, [r1, #128] // .*...................................................................................
- // vmul.s32 q7, q2, r10 // ....*................................................................................
- // vqrdmulh.s32 q2, q2, r11 // ..*..................................................................................
- // vmla.s32 q7, q2, r12 // ......*..............................................................................
- // vsub.u32 q2, q1, q7 // ........*............................................................................
- // vadd.u32 q1, q1, q7 // ..........*..........................................................................
- // qsave QSTACK5, q2 // .................*...................................................................
- // vldrw.32 q2, [r0, #256] // ............*........................................................................
- // vldrw.32 q3, [r1, #256] // .......*.............................................................................
- // vmul.s32 q7, q3, r10 // ...........*.........................................................................
- // vqrdmulh.s32 q3, q3, r11 // .........*...........................................................................
- // vmla.s32 q7, q3, r12 // .............*.......................................................................
- // vsub.u32 q3, q2, q7 // ...................*.................................................................
- // vadd.u32 q2, q2, q7 // ...............*.....................................................................
- // qsave QSTACK6, q3 // ..............................*......................................................
- // vldrw.32 q3, [r0, #384] // ............................*........................................................
- // vldrw.32 q4, [r1, #384] // .....................*...............................................................
- // vmul.s32 q7, q4, r10 // .........................*...........................................................
- // vqrdmulh.s32 q4, q4, r11 // .......................*.............................................................
- // vmla.s32 q7, q4, r12 // ...........................*.........................................................
- // vsub.u32 q4, q3, q7 // ................................*....................................................
- // vadd.u32 q3, q3, q7 // ..................................*..................................................
- // restored r10, r11, ROOT1_STACK // ..........................*..........................................................
- // vmul.s32 q7, q2, r10 // ...............................*.....................................................
- // vqrdmulh.s32 q2, q2, r11 // .............................*.......................................................
- // vmla.s32 q7, q2, r12 // .................................*...................................................
- // vsub.u32 q2, q0, q7 // ....................................*................................................
- // vadd.u32 q0, q0, q7 // ......................................*..............................................
- // vmul.s32 q7, q3, r10 // ...................................*.................................................
- // vqrdmulh.s32 q3, q3, r11 // .....................................*...............................................
- // vmla.s32 q7, q3, r12 // .......................................*.............................................
- // vsub.u32 q3, q1, q7 // .........................................*...........................................
- // vadd.u32 q1, q1, q7 // ...........................................*.........................................
- // vmul.s32 q7, q1, r2 // .................................................*...................................
- // vqrdmulh.s32 q1, q1, r3 // ...................................................*.................................
- // vmla.s32 q7, q1, r12 // .....................................................*...............................
- // vsub.u32 q1, q0, q7 // ........................................................*............................
- // vadd.u32 q0, q0, q7 // ..........................................................*..........................
- // vmul.s32 q7, q3, r4 // ............................................*........................................
- // vqrdmulh.s32 q3, q3, r5 // ..........................................*..........................................
- // vmla.s32 q7, q3, r12 // ..............................................*......................................
- // vsub.u32 q3, q2, q7 // ................................................*....................................
- // vadd.u32 q2, q2, q7 // ..................................................*..................................
- // vstrw.32 q0, [r0], #16 // ............................................................*........................
- // vstrw.32 q1, [r0, #(128-16)] // ................................................................*....................
- // vstrw.32 q2, [r0, #(256-16)] // ......................................................*..............................
- // vstrw.32 q3, [r0, #(384-16)] // ..............................................................*......................
- // qrestore q1, QSTACK4 // ......................................................................*..............
- // qrestore q2, QSTACK5 // .............................................*.......................................
- // qrestore q3, QSTACK6 // ....................................................*................................
- // restored r10, r11, ROOT4_STACK // ...............................................*.....................................
- // vmul.s32 q7, q3, r10 // .......................................................*.............................
- // vqrdmulh.s32 q3, q3, r11 // .........................................................*...........................
- // vmla.s32 q7, q3, r12 // .................................................................*...................
- // vsub.u32 q3, q1, q7 // ........................................................................*............
- // vadd.u32 q1, q1, q7 // ..........................................................................*..........
- // vmul.s32 q7, q4, r10 // ...........................................................*.........................
- // vqrdmulh.s32 q4, q4, r11 // .............................................................*.......................
- // vmla.s32 q7, q4, r12 // ...............................................................*.....................
- // vsub.u32 q4, q2, q7 // ....................................................................*................
- // vadd.u32 q2, q2, q7 // ..................................................................*..................
- // vmul.s32 q7, q2, r6 // .......................................................................*.............
- // vqrdmulh.s32 q2, q2, r7 // ...................................................................*.................
- // vmla.s32 q7, q2, r12 // .........................................................................*...........
- // vsub.u32 q2, q1, q7 // ..............................................................................*......
- // vadd.u32 q1, q1, q7 // ............................................................................*........
- // vmul.s32 q7, q4, r8 // .....................................................................*...............
- // vqrdmulh.s32 q4, q4, r9 // ...........................................................................*.........
- // vmla.s32 q7, q4, r12 // ...............................................................................*.....
- // vsub.u32 q4, q3, q7 // ...................................................................................*.
- // vadd.u32 q3, q3, q7 // .................................................................................*...
- // vstrw.32 q1, [r1], #16 // .............................................................................*.......
- // vstrw.32 q2, [r1, #(128-16)] // ................................................................................*....
- // vstrw.32 q3, [r1, #(256-16)] // ..................................................................................*..
- // vstrw.32 q4, [r1, #(384-16)] // ....................................................................................*
-
- le lr, layer123_loop
- .unreq in_high
- .unreq in_low
-
- sub in, in, #(128)
- restore r_ptr, RPTR_STACK
-
- /* Layers 4,5,6 */
-
- .unreq rtmp
- .unreq rtmp_tw
- rtmp .req r3
- rtmp_tw .req r4
-
- mov lr, #8
- .p2align 2
-.p2align 2
-layer456_loop:
- ldrd r6, r2, [r11] , #56 // *........................................................................................
- vldrw.32 q1, [r0, #96] // ..................*......................................................................
- vmul.s32 q4, q1, r6 // ...................*.....................................................................
- vldrw.32 q0, [r0, #112] // ..........................*..............................................................
- vqrdmulh.s32 q7, q0, r2 // ............................*............................................................
- vldrw.32 q5, [r0, #80] // ..........*..............................................................................
- vmul.s32 q3, q0, r6 // ...........................*.............................................................
- vldrw.32 q2, [r0, #32] // .................*.......................................................................
- vmla.s32 q3, q7, r12 // .............................*...........................................................
- vldrw.32 q0, [r0, #48] // .........................*...............................................................
- vqrdmulh.s32 q6, q1, r2 // ....................*....................................................................
- vadd.u32 q7, q0, q3 // ...............................*.........................................................
- vmla.s32 q4, q6, r12 // .....................*...................................................................
- ldrd r3, r1, [r11, #-48] // ................................*........................................................
- vsub.u32 q1, q2, q4 // ......................*..................................................................
- qsave QSTACK6, q1 // ........................*................................................................
- vmul.s32 q1, q5, r6 // ...........*.............................................................................
- vadd.u32 q6, q2, q4 // .......................*.................................................................
- vqrdmulh.s32 q5, q5, r2 // ............*............................................................................
- vsub.u32 q4, q0, q3 // ..............................*..........................................................
- vmla.s32 q1, q5, r12 // .............*...........................................................................
- vldrw.32 q0, [r0, #16] // .........*...............................................................................
- vmul.s32 q3, q7, r3 // ......................................*..................................................
- vadd.u32 q5, q0, q1 // ...............*.........................................................................
- vqrdmulh.s32 q7, q7, r1 // .......................................*.................................................
- vsub.u32 q0, q0, q1 // ..............*..........................................................................
- vmla.s32 q3, q7, r12 // ........................................*................................................
- vldrw.32 q7, [r0, #64] // ..*......................................................................................
- vsub.u32 q2, q5, q3 // .........................................*...............................................
- vqrdmulh.s32 q1, q7, r2 // ....*....................................................................................
- vadd.u32 q3, q5, q3 // ..........................................*..............................................
- vmul.s32 q5, q7, r6 // ...*.....................................................................................
- vldrw.32 q7, [r0] // .*.......................................................................................
- vmla.s32 q5, q1, r12 // .....*...................................................................................
- ldrd r8, r7, [r11, #-40] // ...........................................*.............................................
- vmul.s32 q1, q6, r3 // .................................*.......................................................
- qsave QSTACK5, q0 // ................*........................................................................
- vqrdmulh.s32 q0, q6, r1 // ..................................*......................................................
- vsub.u32 q6, q7, q5 // ......*..................................................................................
- vmla.s32 q1, q0, r12 // ...................................*.....................................................
- qsave QSTACK4, q6 // ........*................................................................................
- vmul.s32 q6, q3, r8 // ............................................*............................................
- vadd.u32 q7, q7, q5 // .......*.................................................................................
- vqrdmulh.s32 q0, q3, r7 // .............................................*...........................................
- ldrd r5, r10, [r11, #-32] // .................................................*.......................................
- vmla.s32 q6, q0, r12 // ..............................................*..........................................
- vsub.u32 q5, q7, q1 // ....................................*....................................................
- vqrdmulh.s32 q0, q2, r10 // ...................................................*.....................................
- ldrd r10, r9, [r11, #-24] // ..............................................................*..........................
- vmul.s32 q2, q2, r5 // ..................................................*......................................
- vadd.u32 q1, q7, q1 // .....................................*...................................................
- qrestore q7, QSTACK6 // .............................................................*...........................
- vsub.u32 q3, q1, q6 // ...............................................*.........................................
- vmla.s32 q2, q0, r12 // ....................................................*....................................
- vstrw.u32 q3, [r0, #16] // ........................................................*................................
- vmul.s32 q0, q4, r10 // ....................................................................*....................
- vadd.u32 q6, q1, q6 // ................................................*........................................
- vstrw.u32 q6, [r0] , #128 // .......................................................*.................................
- vadd.u32 q6, q5, q2 // ......................................................*..................................
- vqrdmulh.s32 q3, q7, r9 // ................................................................*........................
- vsub.u32 q5, q5, q2 // .....................................................*...................................
- vmul.s32 q7, q7, r10 // ...............................................................*.........................
- ldrd r1, r6, [r11, #-16] // .........................................................................*...............
- vmla.s32 q7, q3, r12 // .................................................................*.......................
- qrestore q3, QSTACK4 // ...........................................................*.............................
- vqrdmulh.s32 q2, q4, r9 // .....................................................................*...................
- ldrd r4, r2, [r11, #-8] // ...............................................................................*.........
- vmla.s32 q0, q2, r12 // ......................................................................*..................
- qrestore q1, QSTACK5 // ............................................................*............................
- vadd.u32 q2, q1, q0 // ........................................................................*................
- vmul.s32 q4, q2, r1 // ..........................................................................*..............
- vstrw.u32 q6, [r0, #-96] // .........................................................*...............................
- vqrdmulh.s32 q6, q2, r6 // ...........................................................................*.............
- vadd.u32 q2, q3, q7 // ...................................................................*.....................
- vmla.s32 q4, q6, r12 // ............................................................................*............
- vstrw.u32 q5, [r0, #-80] // ..........................................................*..............................
- vadd.u32 q6, q2, q4 // ..............................................................................*..........
- vstrw.u32 q6, [r0, #-64] // .....................................................................................*...
- vsub.u32 q5, q1, q0 // .......................................................................*.................
- vmul.s32 q6, q5, r4 // ................................................................................*........
- vsub.u32 q0, q3, q7 // ..................................................................*......................
- vqrdmulh.s32 q3, q5, r2 // .................................................................................*.......
- vsub.u32 q5, q2, q4 // .............................................................................*...........
- vmla.s32 q6, q3, r12 // ..................................................................................*......
- vstrw.u32 q5, [r0, #-48] // ......................................................................................*..
- vadd.u32 q1, q0, q6 // ....................................................................................*....
- vstrw.u32 q1, [r0, #-32] // .......................................................................................*.
- vsub.u32 q6, q0, q6 // ...................................................................................*.....
- vstrw.u32 q6, [r0, #-16] // ........................................................................................*
-
- // original source code
- // ldrd r3, r4, [r11], #(7*8) // *........................................................................................
- // vldrw.32 q0, [r0] // ................................*........................................................
- // vldrw.32 q1, [r0, #64] // ...........................*.............................................................
- // vmul.s32 q7, q1, r3 // ...............................*.........................................................
- // vqrdmulh.s32 q1, q1, r4 // .............................*...........................................................
- // vmla.s32 q7, q1, r12 // .................................*.......................................................
- // vsub.u32 q1, q0, q7 // ......................................*..................................................
- // vadd.u32 q0, q0, q7 // ..........................................*..............................................
- // qsave QSTACK4, q1 // ........................................*................................................
- // vldrw.32 q1, [r0, #16] // .....................*...................................................................
- // vldrw.32 q2, [r0, #80] // .....*...................................................................................
- // vmul.s32 q7, q2, r3 // ................*........................................................................
- // vqrdmulh.s32 q2, q2, r4 // ..................*......................................................................
- // vmla.s32 q7, q2, r12 // ....................*....................................................................
- // vsub.u32 q2, q1, q7 // .........................*...............................................................
- // vadd.u32 q1, q1, q7 // .......................*.................................................................
- // qsave QSTACK5, q2 // ....................................*....................................................
- // vldrw.32 q2, [r0, #32] // .......*.................................................................................
- // vldrw.32 q3, [r0, #96] // .*.......................................................................................
- // vmul.s32 q7, q3, r3 // ..*......................................................................................
- // vqrdmulh.s32 q3, q3, r4 // ..........*..............................................................................
- // vmla.s32 q7, q3, r12 // ............*............................................................................
- // vsub.u32 q3, q2, q7 // ..............*..........................................................................
- // vadd.u32 q2, q2, q7 // .................*.......................................................................
- // qsave QSTACK6, q3 // ...............*.........................................................................
- // vldrw.32 q3, [r0, #48] // .........*...............................................................................
- // vldrw.32 q4, [r0, #112] // ...*.....................................................................................
- // vmul.s32 q7, q4, r3 // ......*..................................................................................
- // vqrdmulh.s32 q4, q4, r4 // ....*....................................................................................
- // vmla.s32 q7, q4, r12 // ........*................................................................................
- // vsub.u32 q4, q3, q7 // ...................*.....................................................................
- // vadd.u32 q3, q3, q7 // ...........*.............................................................................
- // ldrd r3, r4, [r11, #((-7 + 1)*8)] // .............*...........................................................................
- // vmul.s32 q7, q2, r3 // ...................................*.....................................................
- // vqrdmulh.s32 q2, q2, r4 // .....................................*...................................................
- // vmla.s32 q7, q2, r12 // .......................................*.................................................
- // vsub.u32 q2, q0, q7 // ..............................................*..........................................
- // vadd.u32 q0, q0, q7 // ..................................................*......................................
- // vmul.s32 q7, q3, r3 // ......................*..................................................................
- // vqrdmulh.s32 q3, q3, r4 // ........................*................................................................
- // vmla.s32 q7, q3, r12 // ..........................*..............................................................
- // vsub.u32 q3, q1, q7 // ............................*............................................................
- // vadd.u32 q1, q1, q7 // ..............................*..........................................................
- // ldrd r3, r4, [r11, #((-7 + 2)*8)] // ..................................*......................................................
- // vmul.s32 q7, q1, r3 // .........................................*...............................................
- // vqrdmulh.s32 q1, q1, r4 // ...........................................*.............................................
- // vmla.s32 q7, q1, r12 // .............................................*...........................................
- // vsub.u32 q1, q0, q7 // ....................................................*....................................
- // vadd.u32 q0, q0, q7 // ........................................................*................................
- // ldrd r3, r4, [r11, #((-7 + 3)*8)] // ............................................*............................................
- // vmul.s32 q7, q3, r3 // .................................................*.......................................
- // vqrdmulh.s32 q3, q3, r4 // ...............................................*.........................................
- // vmla.s32 q7, q3, r12 // .....................................................*...................................
- // vsub.u32 q3, q2, q7 // ............................................................*............................
- // vadd.u32 q2, q2, q7 // ..........................................................*..............................
- // vstrw.32 q0, [r0], #128 // .........................................................*...............................
- // vstrw.32 q1, [r0, #(-128+16)] // ......................................................*..................................
- // vstrw.32 q2, [r0, #(-128+32)] // .......................................................................*.................
- // vstrw.32 q3, [r0, #(-128+48)] // ...........................................................................*.............
- // qrestore q1, QSTACK4 // ................................................................*........................
- // qrestore q2, QSTACK5 // ....................................................................*....................
- // qrestore q3, QSTACK6 // ...................................................*.....................................
- // ldrd r3, r4, [r11, #((-7 + 4)*8)] // ................................................*........................................
- // vmul.s32 q7, q3, r3 // .............................................................*...........................
- // vqrdmulh.s32 q3, q3, r4 // ...........................................................*.............................
- // vmla.s32 q7, q3, r12 // ...............................................................*.........................
- // vsub.u32 q3, q1, q7 // ................................................................................*........
- // vadd.u32 q1, q1, q7 // .........................................................................*...............
- // vmul.s32 q7, q4, r3 // .......................................................*.................................
- // vqrdmulh.s32 q4, q4, r4 // .................................................................*.......................
- // vmla.s32 q7, q4, r12 // ...................................................................*.....................
- // vsub.u32 q4, q2, q7 // ..............................................................................*..........
- // vadd.u32 q2, q2, q7 // .....................................................................*................... - // ldrd r3, r4, [r11, #((-7 + 5)*8)] // ..............................................................*.......................... - // vmul.s32 q7, q2, r3 // ......................................................................*.................. - // vqrdmulh.s32 q2, q2, r4 // ........................................................................*................ - // vmla.s32 q7, q2, r12 // ..........................................................................*.............. - // vsub.u32 q2, q1, q7 // ..................................................................................*...... - // vadd.u32 q1, q1, q7 // ............................................................................*............ - // ldrd r3, r4, [r11, #((-7 + 6)*8)] // ..................................................................*...................... - // vmul.s32 q7, q4, r3 // ...............................................................................*......... - // vqrdmulh.s32 q4, q4, r4 // .................................................................................*....... - // vmla.s32 q7, q4, r12 // ...................................................................................*..... - // vsub.u32 q4, q3, q7 // .......................................................................................*. - // vadd.u32 q3, q3, q7 // .....................................................................................*... - // vstrw.32 q1, [r0, #(-128+64)] // .............................................................................*........... - // vstrw.32 q2, [r0, #(-128+80)] // ....................................................................................*.... - // vstrw.32 q3, [r0, #(-128+96)] // ......................................................................................*.. 
- // vstrw.32 q4, [r0, #(-128+112)] // ........................................................................................* - - le lr, layer456_loop - - sub in, in, #(4*256) - - .unreq rtmp - .unreq rtmp_tw - .unreq root2 - .unreq root2_tw - - /* Layers 7,8 */ - - root0 .req q5 - root0_tw .req q6 - root1 .req q5 - root1_tw .req q6 - root2 .req q5 - root2_tw .req q6 - - mov lr, #16 - .p2align 2 - vld40.32 {q2,q3,q4,q5}, [r0] // *..... - // gap // ...... - vld41.32 {q2,q3,q4,q5}, [r0] // .*.... - // gap // ...... - vld42.32 {q2,q3,q4,q5}, [r0] // ..*... - // gap // ...... - vld43.32 {q2,q3,q4,q5}, [r0]! // ...*.. - // gap // ...... - vldrw.32 q6, [r11, #16] // ....*. - vqrdmulh.s32 q1, q4, q6 // .....* - - // original source code - // vld40.32 {q2,q3,q4,q5}, [r0] // *..... - // vld41.32 {q2,q3,q4,q5}, [r0] // .*.... - // vld42.32 {q2,q3,q4,q5}, [r0] // ..*... - // vld43.32 {q2,q3,q4,q5}, [r0]! // ...*.. - // vldrw.32 q6, [r11, #16] // ....*. - // vqrdmulh.s32 q1, q4, q6 // .....* - - sub lr, lr, #1 -.p2align 2 -layer78_loop: - vqrdmulh.s32 q0, q5, q6 // ............*..................... - vldrw.32 q6, [r11] , #96 // ....*............................. - vmul.s32 q7, q5, q6 // ...........*...................... - vldrw.32 q5, [r11, #-16] // ........................*......... - vmla.s32 q7, q0, r12 // .............*.................... - vldrw.32 q0, [r11, #-48] // .................*................ - vmul.s32 q4, q4, q6 // ......*........................... - vsub.u32 q6, q3, q7 // ..............*................... - vqrdmulh.s32 q5, q6, q5 // ..........................*....... - vadd.u32 q7, q3, q7 // ...............*.................. - vmla.s32 q4, q1, r12 // ........*......................... - vldrw.32 q1, [r11, #-32] // .......................*.......... - vmul.s32 q6, q6, q1 // .........................*........ - vsub.u32 q3, q2, q4 // .........*........................ - vmla.s32 q6, q5, r12 // ...........................*...... 
- vldrw.32 q1, [r11, #-64] // ................*................. - vsub.u32 q5, q3, q6 // ............................*..... - vstrw.u32 q5, [r0, #-16] // .................................* - vadd.u32 q5, q3, q6 // .............................*.... - vstrw.u32 q5, [r0, #-32] // ................................*. - vadd.u32 q6, q2, q4 // ..........*....................... - vld40.32 {q2,q3,q4,q5}, [r0] // e................................. - vmul.s32 q1, q7, q1 // ..................*............... - vld41.32 {q2,q3,q4,q5}, [r0] // .e................................ - vqrdmulh.s32 q0, q7, q0 // ...................*.............. - vld42.32 {q2,q3,q4,q5}, [r0] // ..e............................... - vmla.s32 q1, q0, r12 // ....................*............. - vld43.32 {q2,q3,q4,q5}, [r0]! // ...e.............................. - vsub.u32 q0, q6, q1 // .....................*............ - vstrw.u32 q0, [r0, #-112] // ...............................*.. - vadd.u32 q0, q6, q1 // ......................*........... - vldrw.32 q6, [r11, #16] // .....e............................ - vqrdmulh.s32 q1, q4, q6 // .......e.......................... - vstrw.u32 q0, [r0, #-128] // ..............................*... - - // original source code - // vld40.32 {q0, q1, q2, q3}, [r0] // e............|....................e............ - // vld41.32 {q0, q1, q2, q3}, [r0] // ..e..........|......................e.......... - // vld42.32 {q0, q1, q2, q3}, [r0] // ....e........|........................e........ - // vld43.32 {q0, q1, q2, q3}, [r0]! // ......e......|..........................e...... - // vldrw.32 q5, [r11], #+96 // .............|*................................ - // vldrw.32 q6, [r11, #(+16-96)] // ..........e..|..............................e.. - // vmul.s32 q7, q2, q5 // .............|.....*........................... - // vqrdmulh.s32 q2, q2, q6 // ...........e.|...............................e. - // vmla.s32 q7, q2, r12 // .............|.........*....................... 
- // vsub.u32 q2, q0, q7 // .............|............*.................... - // vadd.u32 q0, q0, q7 // .............|...................*............. - // vmul.s32 q7, q3, q5 // .............|.*............................... - // vqrdmulh.s32 q3, q3, q6 // .............*................................. - // vmla.s32 q7, q3, r12 // .............|...*............................. - // vsub.u32 q3, q1, q7 // .............|......*.......................... - // vadd.u32 q1, q1, q7 // .............|........*........................ - // vldrw.32 q5, [r11, #(32 - 96)] // .............|..............*.................. - // vldrw.32 q6, [r11, #(48 - 96)] // .............|....*............................ - // vmul.s32 q7, q1, q5 // .*...........|.....................*........... - // vqrdmulh.s32 q1, q1, q6 // ...*.........|.......................*......... - // vmla.s32 q7, q1, r12 // .....*.......|.........................*....... - // vsub.u32 q1, q0, q7 // .......*.....|...........................*..... - // vadd.u32 q0, q0, q7 // .........*...|.............................*... - // vldrw.32 q5, [r11, #(64-96)] // .............|..........*...................... - // vldrw.32 q6, [r11, #(80-96)] // .............|..*.............................. - // vmul.s32 q7, q3, q5 // .............|...........*..................... - // vqrdmulh.s32 q3, q3, q6 // .............|.......*......................... - // vmla.s32 q7, q3, r12 // .............|.............*................... - // vsub.u32 q3, q2, q7 // .............|...............*................. - // vadd.u32 q2, q2, q7 // .............|.................*............... - // vstrw.32 q0, [r0, #( 0 - 64)] // ............*|................................* - // vstrw.32 q1, [r0, #(16 - 64)] // ........*....|............................*.... - // vstrw.32 q2, [r0, #(32 - 64)] // .............|..................*.............. - // vstrw.32 q3, [r0, #(48 - 64)] // .............|................*................ 
- - le lr, layer78_loop - vqrdmulh.s32 q6, q5, q6 // *........................... - vldrw.32 q0, [r11] , #96 // .*.......................... - vmul.s32 q7, q5, q0 // ..*......................... - vldrw.32 q5, [r11, #-64] // ...............*............ - vmul.s32 q0, q4, q0 // ......*..................... - vldrw.32 q4, [r11, #-32] // ...........*................ - vmla.s32 q7, q6, r12 // ....*....................... - vldrw.32 q6, [r11, #-16] // ...*........................ - vmla.s32 q0, q1, r12 // ..........*................. - vsub.u32 q1, q3, q7 // .......*.................... - vqrdmulh.s32 q6, q1, q6 // ........*................... - vadd.u32 q7, q3, q7 // .........*.................. - vmul.s32 q1, q1, q4 // ............*............... - vldrw.32 q4, [r11, #-48] // .....*...................... - vmul.s32 q3, q7, q5 // .....................*...... - vadd.u32 q5, q2, q0 // ....................*....... - vmla.s32 q1, q6, r12 // ..............*............. - vsub.u32 q2, q2, q0 // .............*.............. - vqrdmulh.s32 q4, q7, q4 // ......................*..... - vadd.u32 q6, q2, q1 // ..................*......... - vmla.s32 q3, q4, r12 // .......................*.... - vstrw.u32 q6, [r0, #-32] // ...................*........ - vadd.u32 q4, q5, q3 // ..........................*. - vstrw.u32 q4, [r0, #-64] // ...........................* - vsub.u32 q5, q5, q3 // ........................*... - vstrw.u32 q5, [r0, #-48] // .........................*.. - vsub.u32 q0, q2, q1 // ................*........... - vstrw.u32 q0, [r0, #-16] // .................*.......... - - // original source code - // vqrdmulh.s32 q0, q5, q6 // *........................... - // vldrw.32 q6, [r11] , #96 // .*.......................... - // vmul.s32 q7, q5, q6 // ..*......................... - // vldrw.32 q5, [r11, #-16] // .......*.................... - // vmla.s32 q7, q0, r12 // ......*..................... - // vldrw.32 q0, [r11, #-48] // .............*.............. 
- // vmul.s32 q4, q4, q6 // ....*....................... - // vsub.u32 q6, q3, q7 // .........*.................. - // vqrdmulh.s32 q5, q6, q5 // ..........*................. - // vadd.u32 q7, q3, q7 // ...........*................ - // vmla.s32 q4, q1, r12 // ........*................... - // vldrw.32 q1, [r11, #-32] // .....*...................... - // vmul.s32 q6, q6, q1 // ............*............... - // vsub.u32 q3, q2, q4 // .................*.......... - // vmla.s32 q6, q5, r12 // ................*........... - // vldrw.32 q1, [r11, #-64] // ...*........................ - // vsub.u32 q5, q3, q6 // ..........................*. - // vstrw.u32 q5, [r0, #-16] // ...........................* - // vadd.u32 q5, q3, q6 // ...................*........ - // vstrw.u32 q5, [r0, #-32] // .....................*...... - // vadd.u32 q6, q2, q4 // ...............*............ - // vmul.s32 q1, q7, q1 // ..............*............. - // vqrdmulh.s32 q0, q7, q0 // ..................*......... - // vmla.s32 q1, q0, r12 // ....................*....... - // vsub.u32 q0, q6, q1 // ........................*... - // vstrw.u32 q0, [r0, #-48] // .........................*.. - // vadd.u32 q0, q6, q1 // ......................*..... - // vstrw.u32 q0, [r0, #-64] // .......................*.... 
- - - add sp, sp, #STACK_SIZE - align_stack_undo - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m85.s b/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m85.s deleted file mode 100644 index ef75d4e..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_opt_size_m85.s +++ /dev/null @@ -1,718 +0,0 @@ -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.data -.p2align 4 -roots: -#include "ntt_dilithium_123_456_78_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s32 \dst, \src, \const - vqrdmulh.s32 \src, \src, \const_twisted - vmla.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.macro qsave loc, a // slothy:no-unfold - vstrw.32 \a, [sp, #\loc\()] -.endm -.macro qrestore a, loc // slothy:no-unfold - vldrw.32 \a, [sp, #\loc\()] -.endm -.macro restored a, b, loc // slothy:no-unfold - ldrd \a, \b, [sp, #\loc\()] -.endm -.macro saved loc, a, b // slothy:no-unfold - strd \a, \b, [sp, #\loc\()] -.endm -.macro restore a, loc // slothy:no-unfold - ldr \a, [sp, #\loc\()] -.endm -.macro save loc, a // slothy:no-unfold - str \a, [sp, #\loc\()] -.endm - -// Aligns stack to 0 mod 16 -.macro align_stack_do // slothy:no-unfold - mov r11, sp - and r12, r11, #0xC // 8 if sp == 8 mod 16, 0 otherwise - sub sp, sp, r12 // Align stack to 16 bytes - sub sp, sp, #16 - str r12, [sp] -.endm - -// Reverts initial stack correction -.macro align_stack_undo // slothy:no-unfold - ldr r12, [sp] - add sp, sp, #16 - add sp, sp, r12 -.endm - -#define STACK_SIZE (5*16+8) // +8 is for alignment -#define QSTACK4 (0*16) -#define QSTACK5 (1*16) -#define QSTACK6 (2*16) - -#define ROOT0_STACK (3*16) -#define ROOT1_STACK (3*16 + 8) -#define ROOT4_STACK (4*16) -#define RPTR_STACK (4*16 + 8) - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_dilithium_123_456_78_opt_size_m85, %function -.global ntt_dilithium_123_456_78_opt_size_m85 -ntt_dilithium_123_456_78_opt_size_m85: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - align_stack_do - sub sp, sp, #STACK_SIZE - - modulus .req r12 - r_ptr .req r11 - - .equ modulus_const, -8380417 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr r_ptr, roots_addr - - in .req 
r0 - in_low .req in - in_high .req r1 - - add in_high, in, #(4*128) - - root2 .req r2 - root2_tw .req r3 - root3 .req r4 - root3_tw .req r5 - root5 .req r6 - root5_tw .req r7 - root6 .req r8 - root6_tw .req r9 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - data4 .req q1 - data5 .req q2 - data6 .req q3 - data7 .req q4 - - tmp .req q7 - - /* Layers 1-3 */ - - rtmp .req root6 - rtmp_tw .req root6_tw - - ldrd rtmp, rtmp_tw, [r_ptr], #(7*8) - saved ROOT0_STACK, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #(1*8 - 7*8)] - saved ROOT1_STACK, rtmp, rtmp_tw - ldrd root2, root2_tw, [r_ptr, #(2*8 - 7*8)] - ldrd root3, root3_tw, [r_ptr, #(3*8 - 7*8)] - ldrd rtmp, rtmp_tw, [r_ptr, #(4*8 - 7*8)] - saved ROOT4_STACK, rtmp, rtmp_tw - ldrd root5, root5_tw, [r_ptr, #(5*8 - 7*8)] - ldrd root6, root6_tw, [r_ptr, #(6*8 - 7*8)] - save RPTR_STACK, r_ptr - - .unreq rtmp - .unreq rtmp_tw - rtmp .req r10 - rtmp_tw .req r11 - - mov lr, #8 - .p2align 2 -.p2align 2 -layer123_loop: - restored r10, r11, ROOT0_STACK // ..*.................................................................................. - vldrw.32 q4, [r1, #128] // ..........*.......................................................................... - vmul.s32 q2, q4, r10 // ...........*......................................................................... - vldrw.32 q5, [r1] // .*................................................................................... - vqrdmulh.s32 q0, q4, r11 // ............*........................................................................ - vldrw.32 q3, [r0, #128] // .........*........................................................................... - vmla.s32 q2, q0, r12 // .............*....................................................................... - vldrw.32 q1, [r1, #256] // ..................*.................................................................. 
- vadd.u32 q4, q3, q2 // ...............*..................................................................... - vqrdmulh.s32 q7, q5, r11 // ....*................................................................................ - vldrw.32 q6, [r1, #384] // ..........................*.......................................................... - vmul.s32 q5, q5, r10 // ...*................................................................................. - vsub.u32 q2, q3, q2 // ..............*...................................................................... - vmla.s32 q5, q7, r12 // .....*............................................................................... - vldrw.32 q7, [r0] // *.................................................................................... - vqrdmulh.s32 q0, q1, r11 // ....................*................................................................ - vadd.u32 q3, q7, q5 // .......*............................................................................. - vmul.s32 q1, q1, r10 // ...................*................................................................. - vsub.u32 q7, q7, q5 // ......*.............................................................................. - vmul.s32 q5, q6, r10 // ...........................*......................................................... - qsave QSTACK4, q7 // ........*............................................................................ - vqrdmulh.s32 q7, q6, r11 // ............................*........................................................ - restored r11, r10, ROOT1_STACK // ................................*.................................................... - vmla.s32 q5, q7, r12 // .............................*....................................................... - vldrw.32 q7, [r0, #384] // .........................*........................................................... 
- vmla.s32 q1, q0, r12 // .....................*............................................................... - vadd.u32 q6, q7, q5 // ...............................*..................................................... - vmul.s32 q0, q6, r11 // ......................................*.............................................. - qsave QSTACK5, q2 // ................*.................................................................... - vqrdmulh.s32 q2, q6, r10 // .......................................*............................................. - vsub.u32 q6, q7, q5 // ..............................*...................................................... - vmla.s32 q0, q2, r12 // ........................................*............................................ - vldrw.32 q7, [r0, #256] // .................*................................................................... - vsub.u32 q5, q4, q0 // .........................................*........................................... - vqrdmulh.s32 q2, q5, r5 // .................................................*................................... - vadd.u32 q0, q4, q0 // ..........................................*.......................................... - vmul.s32 q5, q5, r4 // ................................................*.................................... - vadd.u32 q4, q7, q1 // .......................*............................................................. - vmla.s32 q5, q2, r12 // ..................................................*.................................. - vsub.u32 q1, q7, q1 // ......................*.............................................................. - vqrdmulh.s32 q7, q4, r10 // ..................................*.................................................. - qsave QSTACK6, q1 // ........................*............................................................ 
- vmul.s32 q2, q4, r11 // .................................*................................................... - restored r11, r10, ROOT4_STACK // ............................................................*........................ - vmla.s32 q2, q7, r12 // ...................................*................................................. - qrestore q4, QSTACK6 // ...........................................................*......................... - vsub.u32 q7, q3, q2 // ....................................*................................................ - vqrdmulh.s32 q1, q6, r10 // ...................................................................*................. - vadd.u32 q3, q3, q2 // .....................................*............................................... - vmul.s32 q2, q6, r11 // ..................................................................*.................. - vsub.u32 q6, q7, q5 // ...................................................*................................. - vstrw.u32 q6, [r0, #384] // ........................................................*............................ - vqrdmulh.s32 q6, q0, r3 // ............................................*........................................ - vadd.u32 q7, q7, q5 // ....................................................*................................ - vmul.s32 q5, q0, r2 // ...........................................*......................................... - vstrw.u32 q7, [r0, #256] // .......................................................*............................. - vmla.s32 q5, q6, r12 // .............................................*....................................... - qrestore q0, QSTACK5 // ..........................................................*.......................... - vqrdmulh.s32 q7, q4, r10 // ..............................................................*...................... 
- vsub.u32 q6, q3, q5 // ..............................................*...................................... - vstrw.u32 q6, [r0, #128] // ......................................................*.............................. - vmul.s32 q6, q4, r11 // .............................................................*....................... - vadd.u32 q3, q3, q5 // ...............................................*..................................... - vmla.s32 q2, q1, r12 // ....................................................................*................ - vstrw.u32 q3, [r0] , #16 // .....................................................*............................... - vmla.s32 q6, q7, r12 // ...............................................................*..................... - vsub.u32 q7, q0, q2 // .....................................................................*............... - qrestore q4, QSTACK4 // .........................................................*........................... - vmul.s32 q1, q7, r8 // ............................................................................*........ - vadd.u32 q2, q0, q2 // ......................................................................*.............. - vqrdmulh.s32 q5, q7, r9 // .............................................................................*....... - vsub.u32 q0, q4, q6 // ................................................................*.................... - vmla.s32 q1, q5, r12 // ..............................................................................*...... - vadd.u32 q6, q4, q6 // .................................................................*................... - vmul.s32 q3, q2, r6 // .......................................................................*............. - vadd.u32 q4, q0, q1 // ................................................................................*.... 
- vqrdmulh.s32 q5, q2, r7 // ........................................................................*............ - vsub.u32 q2, q0, q1 // ...............................................................................*..... - vstrw.u32 q4, [r1, #256] // ...................................................................................*. - vmla.s32 q3, q5, r12 // .........................................................................*........... - vstrw.u32 q2, [r1, #384] // ....................................................................................* - vadd.u32 q1, q6, q3 // ...........................................................................*......... - vstrw.u32 q1, [r1] , #16 // .................................................................................*... - vsub.u32 q6, q6, q3 // ..........................................................................*.......... - vstrw.u32 q6, [r1, #112] // ..................................................................................*.. - - // original source code - // vldrw.32 q0, [r0] // ..............*...................................................................... - // vldrw.32 q1, [r1] // ...*................................................................................. - // restored r10, r11, ROOT0_STACK // *.................................................................................... - // vmul.s32 q7, q1, r10 // ...........*......................................................................... - // vqrdmulh.s32 q1, q1, r11 // .........*........................................................................... - // vmla.s32 q7, q1, r12 // .............*....................................................................... - // vsub.u32 q1, q0, q7 // ..................*.................................................................. - // vadd.u32 q0, q0, q7 // ................*.................................................................... 
- // qsave QSTACK4, q1 // ....................*................................................................ - // vldrw.32 q1, [r0, #128] // .....*............................................................................... - // vldrw.32 q2, [r1, #128] // .*................................................................................... - // vmul.s32 q7, q2, r10 // ..*.................................................................................. - // vqrdmulh.s32 q2, q2, r11 // ....*................................................................................ - // vmla.s32 q7, q2, r12 // ......*.............................................................................. - // vsub.u32 q2, q1, q7 // ............*........................................................................ - // vadd.u32 q1, q1, q7 // ........*............................................................................ - // qsave QSTACK5, q2 // ............................*........................................................ - // vldrw.32 q2, [r0, #256] // ................................*.................................................... - // vldrw.32 q3, [r1, #256] // .......*............................................................................. - // vmul.s32 q7, q3, r10 // .................*................................................................... - // vqrdmulh.s32 q3, q3, r11 // ...............*..................................................................... - // vmla.s32 q7, q3, r12 // .........................*........................................................... - // vsub.u32 q3, q2, q7 // .......................................*............................................. - // vadd.u32 q2, q2, q7 // .....................................*............................................... - // qsave QSTACK6, q3 // .........................................*........................................... 
- // vldrw.32 q3, [r0, #384] // ........................*............................................................ - // vldrw.32 q4, [r1, #384] // ..........*.......................................................................... - // vmul.s32 q7, q4, r10 // ...................*................................................................. - // vqrdmulh.s32 q4, q4, r11 // .....................*............................................................... - // vmla.s32 q7, q4, r12 // .......................*............................................................. - // vsub.u32 q4, q3, q7 // ..............................*...................................................... - // vadd.u32 q3, q3, q7 // ..........................*.......................................................... - // restored r10, r11, ROOT1_STACK // ......................*.............................................................. - // vmul.s32 q7, q2, r10 // ..........................................*.......................................... - // vqrdmulh.s32 q2, q2, r11 // ........................................*............................................ - // vmla.s32 q7, q2, r12 // ............................................*........................................ - // vsub.u32 q2, q0, q7 // ..............................................*...................................... - // vadd.u32 q0, q0, q7 // ................................................*.................................... - // vmul.s32 q7, q3, r10 // ...........................*......................................................... - // vqrdmulh.s32 q3, q3, r11 // .............................*....................................................... - // vmla.s32 q7, q3, r12 // ...............................*..................................................... - // vsub.u32 q3, q1, q7 // .................................*................................................... 
- // vadd.u32 q1, q1, q7 // ...................................*................................................. - // vmul.s32 q7, q1, r2 // ......................................................*.............................. - // vqrdmulh.s32 q1, q1, r3 // ....................................................*................................ - // vmla.s32 q7, q1, r12 // ........................................................*............................ - // vsub.u32 q1, q0, q7 // ...........................................................*......................... - // vadd.u32 q0, q0, q7 // ..............................................................*...................... - // vmul.s32 q7, q3, r4 // ....................................*................................................ - // vqrdmulh.s32 q3, q3, r5 // ..................................*.................................................. - // vmla.s32 q7, q3, r12 // ......................................*.............................................. - // vsub.u32 q3, q2, q7 // ..................................................*.................................. - // vadd.u32 q2, q2, q7 // .....................................................*............................... - // vstrw.32 q0, [r0], #16 // ................................................................*.................... - // vstrw.32 q1, [r0, #(128-16)] // ............................................................*........................ - // vstrw.32 q2, [r0, #(256-16)] // .......................................................*............................. - // vstrw.32 q3, [r0, #(384-16)] // ...................................................*................................. - // qrestore q1, QSTACK4 // ...................................................................*................. - // qrestore q2, QSTACK5 // .........................................................*........................... 
- // qrestore q3, QSTACK6 // .............................................*....................................... - // restored r10, r11, ROOT4_STACK // ...........................................*......................................... - // vmul.s32 q7, q3, r10 // .............................................................*....................... - // vqrdmulh.s32 q3, q3, r11 // ..........................................................*.......................... - // vmla.s32 q7, q3, r12 // .................................................................*................... - // vsub.u32 q3, q1, q7 // .......................................................................*............. - // vadd.u32 q1, q1, q7 // .........................................................................*........... - // vmul.s32 q7, q4, r10 // .................................................*................................... - // vqrdmulh.s32 q4, q4, r11 // ...............................................*..................................... - // vmla.s32 q7, q4, r12 // ...............................................................*..................... - // vsub.u32 q4, q2, q7 // ..................................................................*.................. - // vadd.u32 q2, q2, q7 // .....................................................................*............... - // vmul.s32 q7, q2, r6 // ..........................................................................*.......... - // vqrdmulh.s32 q2, q2, r7 // ............................................................................*........ - // vmla.s32 q7, q2, r12 // ...............................................................................*..... - // vsub.u32 q2, q1, q7 // ...................................................................................*. - // vadd.u32 q1, q1, q7 // .................................................................................*... 
- // vmul.s32 q7, q4, r8 // ....................................................................*................ - // vqrdmulh.s32 q4, q4, r9 // ......................................................................*.............. - // vmla.s32 q7, q4, r12 // ........................................................................*............ - // vsub.u32 q4, q3, q7 // .............................................................................*....... - // vadd.u32 q3, q3, q7 // ...........................................................................*......... - // vstrw.32 q1, [r1], #16 // ..................................................................................*.. - // vstrw.32 q2, [r1, #(128-16)] // ....................................................................................* - // vstrw.32 q3, [r1, #(256-16)] // ..............................................................................*...... - // vstrw.32 q4, [r1, #(384-16)] // ................................................................................*.... - - le lr, layer123_loop - .unreq in_high - .unreq in_low - - sub in, in, #(128) - restore r_ptr, RPTR_STACK - - /* Layers 4,5,6 */ - - .unreq rtmp - .unreq rtmp_tw - rtmp .req r3 - rtmp_tw .req r4 - - mov lr, #8 - .p2align 2 -.p2align 2 -layer456_loop: - ldrd r2, r7, [r11] , #56 // *........................................................................................ - vldrw.32 q2, [r0, #96] // ..................*...................................................................... - vmul.s32 q6, q2, r2 // ...................*..................................................................... - vldrw.32 q7, [r0, #112] // ..........................*.............................................................. - vqrdmulh.s32 q3, q2, r7 // ....................*.................................................................... 
- vldrw.32 q0, [r0, #64] // ..*...................................................................................... - vmla.s32 q6, q3, r12 // .....................*................................................................... - vldrw.32 q2, [r0, #32] // .................*....................................................................... - vsub.u32 q4, q2, q6 // ......................*.................................................................. - vqrdmulh.s32 q3, q0, r7 // ....*.................................................................................... - vadd.u32 q6, q2, q6 // .......................*................................................................. - vmul.s32 q0, q0, r2 // ...*..................................................................................... - vldrw.32 q2, [r0, #80] // ..........*.............................................................................. - vmla.s32 q0, q3, r12 // .....*................................................................................... - vldrw.32 q1, [r0] // .*....................................................................................... - vmul.s32 q5, q7, r2 // ...........................*............................................................. - vsub.u32 q3, q1, q0 // ......*.................................................................................. - vqrdmulh.s32 q7, q7, r7 // ............................*............................................................ - vadd.u32 q1, q1, q0 // .......*................................................................................. - vmul.s32 q0, q2, r2 // ...........*............................................................................. - qsave QSTACK6, q4 // ........................*................................................................ - vqrdmulh.s32 q4, q2, r7 // ............*............................................................................ 
- ldrd r9, r4, [r11, #-48] // ................................*........................................................ - vmla.s32 q0, q4, r12 // .............*........................................................................... - vldrw.32 q4, [r0, #16] // .........*............................................................................... - vmla.s32 q5, q7, r12 // .............................*........................................................... - vadd.u32 q7, q4, q0 // ...............*......................................................................... - vldrw.32 q2, [r0, #48] // .........................*............................................................... - vsub.u32 q4, q4, q0 // ..............*.......................................................................... - qsave QSTACK5, q4 // ................*........................................................................ - vadd.u32 q4, q2, q5 // ...............................*......................................................... - vqrdmulh.s32 q0, q4, r4 // .......................................*................................................. - qsave QSTACK4, q3 // ........*................................................................................ - vmul.s32 q4, q4, r9 // ......................................*.................................................. - vsub.u32 q5, q2, q5 // ..............................*.......................................................... - vmla.s32 q4, q0, r12 // ........................................*................................................ - ldrd r5, r1, [r11, #-32] // .................................................*....................................... - vsub.u32 q3, q7, q4 // .........................................*............................................... - vqrdmulh.s32 q0, q3, r1 // ...................................................*..................................... 
- vadd.u32 q2, q7, q4 // ..........................................*.............................................. - vmul.s32 q7, q3, r5 // ..................................................*...................................... - ldrd r8, r3, [r11, #-8] // ...............................................................................*......... - vmul.s32 q4, q6, r9 // .................................*....................................................... - ldrd r2, r6, [r11, #-40] // ...........................................*............................................. - vqrdmulh.s32 q6, q6, r4 // ..................................*...................................................... - ldrd r7, r10, [r11, #-16] // .........................................................................*............... - vmla.s32 q4, q6, r12 // ...................................*..................................................... - ldrd r9, r5, [r11, #-24] // ..............................................................*.......................... - vmla.s32 q7, q0, r12 // ....................................................*.................................... - vsub.u32 q6, q1, q4 // ....................................*.................................................... - vqrdmulh.s32 q0, q2, r6 // .............................................*........................................... - vadd.u32 q3, q6, q7 // ......................................................*.................................. - vstrw.u32 q3, [r0, #32] // .........................................................*............................... - vsub.u32 q3, q6, q7 // .....................................................*................................... - vmul.s32 q7, q2, r2 // ............................................*............................................ - qrestore q2, QSTACK6 // .............................................................*........................... 
- vqrdmulh.s32 q6, q2, r5 // ................................................................*........................ - vstrw.u32 q3, [r0, #48] // ..........................................................*.............................. - vmla.s32 q7, q0, r12 // ..............................................*.......................................... - qrestore q3, QSTACK4 // ...........................................................*............................. - vmul.s32 q0, q2, r9 // ...............................................................*......................... - vadd.u32 q2, q1, q4 // .....................................*................................................... - vmla.s32 q0, q6, r12 // .................................................................*....................... - qrestore q6, QSTACK5 // ............................................................*............................ - vadd.u32 q1, q3, q0 // ...................................................................*..................... - vqrdmulh.s32 q4, q5, r5 // .....................................................................*................... - vsub.u32 q3, q3, q0 // ..................................................................*...................... - vmul.s32 q0, q5, r9 // ....................................................................*.................... - vadd.u32 q5, q2, q7 // ................................................*........................................ - vmla.s32 q0, q4, r12 // ......................................................................*.................. - vstrw.u32 q5, [r0] , #128 // .......................................................*................................. - vadd.u32 q5, q6, q0 // ........................................................................*................ - vmul.s32 q4, q5, r7 // ..........................................................................*.............. 
- vsub.u32 q2, q2, q7 // ...............................................*......................................... - vqrdmulh.s32 q7, q5, r10 // ...........................................................................*............. - vsub.u32 q5, q6, q0 // .......................................................................*................. - vmla.s32 q4, q7, r12 // ............................................................................*............ - vstrw.u32 q2, [r0, #-112] // ........................................................*................................ - vsub.u32 q7, q1, q4 // .............................................................................*........... - vqrdmulh.s32 q2, q5, r3 // .................................................................................*....... - vadd.u32 q1, q1, q4 // ..............................................................................*.......... - vmul.s32 q5, q5, r8 // ................................................................................*........ - vstrw.u32 q7, [r0, #-48] // ......................................................................................*.. - vmla.s32 q5, q2, r12 // ..................................................................................*...... - vstrw.u32 q1, [r0, #-64] // .....................................................................................*... - vsub.u32 q7, q3, q5 // ...................................................................................*..... - vstrw.u32 q7, [r0, #-16] // ........................................................................................* - vadd.u32 q0, q3, q5 // ....................................................................................*.... - vstrw.u32 q0, [r0, #-32] // .......................................................................................*. 
- - // original source code - // ldrd r3, r4, [r11], #(7*8) // *........................................................................................ - // vldrw.32 q0, [r0] // ..............*.......................................................................... - // vldrw.32 q1, [r0, #64] // .....*................................................................................... - // vmul.s32 q7, q1, r3 // ...........*............................................................................. - // vqrdmulh.s32 q1, q1, r4 // .........*............................................................................... - // vmla.s32 q7, q1, r12 // .............*........................................................................... - // vsub.u32 q1, q0, q7 // ................*........................................................................ - // vadd.u32 q0, q0, q7 // ..................*...................................................................... - // qsave QSTACK4, q1 // ................................*........................................................ - // vldrw.32 q1, [r0, #16] // ........................*................................................................ - // vldrw.32 q2, [r0, #80] // ............*............................................................................ - // vmul.s32 q7, q2, r3 // ...................*..................................................................... - // vqrdmulh.s32 q2, q2, r4 // .....................*................................................................... - // vmla.s32 q7, q2, r12 // .......................*................................................................. - // vsub.u32 q2, q1, q7 // ............................*............................................................ - // vadd.u32 q1, q1, q7 // ..........................*.............................................................. 
- // qsave QSTACK5, q2 // .............................*........................................................... - // vldrw.32 q2, [r0, #32] // .......*................................................................................. - // vldrw.32 q3, [r0, #96] // .*....................................................................................... - // vmul.s32 q7, q3, r3 // ..*...................................................................................... - // vqrdmulh.s32 q3, q3, r4 // ....*.................................................................................... - // vmla.s32 q7, q3, r12 // ......*.................................................................................. - // vsub.u32 q3, q2, q7 // ........*................................................................................ - // vadd.u32 q2, q2, q7 // ..........*.............................................................................. - // qsave QSTACK6, q3 // ....................*.................................................................... - // vldrw.32 q3, [r0, #48] // ...........................*............................................................. - // vldrw.32 q4, [r0, #112] // ...*..................................................................................... - // vmul.s32 q7, q4, r3 // ...............*......................................................................... - // vqrdmulh.s32 q4, q4, r4 // .................*....................................................................... - // vmla.s32 q7, q4, r12 // .........................*............................................................... - // vsub.u32 q4, q3, q7 // ..................................*...................................................... - // vadd.u32 q3, q3, q7 // ..............................*.......................................................... 
- // ldrd r3, r4, [r11, #((-7 + 1)*8)] // ......................*.................................................................. - // vmul.s32 q7, q2, r3 // ..........................................*.............................................. - // vqrdmulh.s32 q2, q2, r4 // ............................................*............................................ - // vmla.s32 q7, q2, r12 // ..............................................*.......................................... - // vsub.u32 q2, q0, q7 // .................................................*....................................... - // vadd.u32 q0, q0, q7 // .............................................................*........................... - // vmul.s32 q7, q3, r3 // .................................*....................................................... - // vqrdmulh.s32 q3, q3, r4 // ...............................*......................................................... - // vmla.s32 q7, q3, r12 // ...................................*..................................................... - // vsub.u32 q3, q1, q7 // .....................................*................................................... - // vadd.u32 q1, q1, q7 // .......................................*................................................. - // ldrd r3, r4, [r11, #((-7 + 2)*8)] // ...........................................*............................................. - // vmul.s32 q7, q1, r3 // ......................................................*.................................. - // vqrdmulh.s32 q1, q1, r4 // ..................................................*...................................... - // vmla.s32 q7, q1, r12 // ..........................................................*.............................. - // vsub.u32 q1, q0, q7 // .........................................................................*............... 
- // vadd.u32 q0, q0, q7 // ....................................................................*.................... - // ldrd r3, r4, [r11, #((-7 + 3)*8)] // ....................................*.................................................... - // vmul.s32 q7, q3, r3 // ........................................*................................................ - // vqrdmulh.s32 q3, q3, r4 // ......................................*.................................................. - // vmla.s32 q7, q3, r12 // ................................................*........................................ - // vsub.u32 q3, q2, q7 // .....................................................*................................... - // vadd.u32 q2, q2, q7 // ...................................................*..................................... - // vstrw.32 q0, [r0], #128 // ......................................................................*.................. - // vstrw.32 q1, [r0, #(-128+16)] // .............................................................................*........... - // vstrw.32 q2, [r0, #(-128+32)] // ....................................................*.................................... - // vstrw.32 q3, [r0, #(-128+48)] // .........................................................*............................... - // qrestore q1, QSTACK4 // ...........................................................*............................. - // qrestore q2, QSTACK5 // ...............................................................*......................... - // qrestore q3, QSTACK6 // .......................................................*................................. - // ldrd r3, r4, [r11, #((-7 + 4)*8)] // ...............................................*......................................... - // vmul.s32 q7, q3, r3 // ............................................................*............................ 
- // vqrdmulh.s32 q3, q3, r4 // ........................................................*................................ - // vmla.s32 q7, q3, r12 // ..............................................................*.......................... - // vsub.u32 q3, q1, q7 // ..................................................................*...................... - // vadd.u32 q1, q1, q7 // ................................................................*........................ - // vmul.s32 q7, q4, r3 // ...................................................................*..................... - // vqrdmulh.s32 q4, q4, r4 // .................................................................*....................... - // vmla.s32 q7, q4, r12 // .....................................................................*................... - // vsub.u32 q4, q2, q7 // ...........................................................................*............. - // vadd.u32 q2, q2, q7 // .......................................................................*................. - // ldrd r3, r4, [r11, #((-7 + 5)*8)] // .............................................*........................................... - // vmul.s32 q7, q2, r3 // ........................................................................*................ - // vqrdmulh.s32 q2, q2, r4 // ..........................................................................*.............. - // vmla.s32 q7, q2, r12 // ............................................................................*............ - // vsub.u32 q2, q1, q7 // ..............................................................................*.......... - // vadd.u32 q1, q1, q7 // ................................................................................*........ - // ldrd r3, r4, [r11, #((-7 + 6)*8)] // .........................................*............................................... 
- // vmul.s32 q7, q4, r3 // .................................................................................*....... - // vqrdmulh.s32 q4, q4, r4 // ...............................................................................*......... - // vmla.s32 q7, q4, r12 // ...................................................................................*..... - // vsub.u32 q4, q3, q7 // .....................................................................................*... - // vadd.u32 q3, q3, q7 // .......................................................................................*. - // vstrw.32 q1, [r0, #(-128+64)] // ....................................................................................*.... - // vstrw.32 q2, [r0, #(-128+80)] // ..................................................................................*...... - // vstrw.32 q3, [r0, #(-128+96)] // ........................................................................................* - // vstrw.32 q4, [r0, #(-128+112)] // ......................................................................................*.. - - le lr, layer456_loop - - sub in, in, #(4*256) - - .unreq rtmp - .unreq rtmp_tw - .unreq root2 - .unreq root2_tw - - /* Layers 7,8 */ - - root0 .req q5 - root0_tw .req q6 - root1 .req q5 - root1_tw .req q6 - root2 .req q5 - root2_tw .req q6 - - mov lr, #16 - .p2align 2 - vld40.32 {q1,q2,q3,q4}, [r0] // *..... - // gap // ...... - vld41.32 {q1,q2,q3,q4}, [r0] // .*.... - // gap // ...... - vld42.32 {q1,q2,q3,q4}, [r0] // ..*... - // gap // ...... - vld43.32 {q1,q2,q3,q4}, [r0]! // ...*.. - // gap // ...... - vldrw.32 q6, [r11] , #96 // ....*. - // gap // ...... - vmul.s32 q0, q4, q6 // .....* - - // original source code - // vld40.32 {q1,q2,q3,q4}, [r0] // *..... - // vld41.32 {q1,q2,q3,q4}, [r0] // .*.... - // vld42.32 {q1,q2,q3,q4}, [r0] // ..*... - // vld43.32 {q1,q2,q3,q4}, [r0]! // ...*.. - // vldrw.32 q6, [r11] , #96 // ....*. 
- // vmul.s32 q0, q4, q6 // .....* - - sub lr, lr, #1 -.p2align 2 -layer78_loop: - vmul.s32 q7, q3, q6 // ......*........................... - vldrw.32 q6, [r11, #-80] // .....*............................ - vqrdmulh.s32 q4, q4, q6 // ............*..................... - vldrw.32 q5, [r11, #-16] // ........................*......... - vmla.s32 q0, q4, r12 // .............*.................... - vldrw.32 q4, [r11, #-32] // .......................*.......... - vqrdmulh.s32 q3, q3, q6 // .......*.......................... - vadd.u32 q6, q2, q0 // ...............*.................. - vmla.s32 q7, q3, r12 // ........*......................... - vsub.u32 q0, q2, q0 // ..............*................... - vmul.s32 q2, q0, q4 // .........................*........ - vsub.u32 q3, q1, q7 // .........*........................ - vqrdmulh.s32 q4, q0, q5 // ..........................*....... - vadd.u32 q7, q1, q7 // ..........*....................... - vmla.s32 q2, q4, r12 // ...........................*...... - vldrw.32 q5, [r11, #-48] // .................*................ - vadd.u32 q0, q3, q2 // .............................*.... - vstrw.u32 q0, [r0, #-32] // ................................*. - vsub.u32 q4, q3, q2 // ............................*..... - vldrw.32 q3, [r11, #-64] // ................*................. - vmul.s32 q0, q6, q3 // ..................*............... - vstrw.u32 q4, [r0, #-16] // .................................* - vld40.32 {q1,q2,q3,q4}, [r0] // e................................. - vqrdmulh.s32 q6, q6, q5 // ...................*.............. - vld41.32 {q1,q2,q3,q4}, [r0] // .e................................ - vmla.s32 q0, q6, r12 // ....................*............. - vld42.32 {q1,q2,q3,q4}, [r0] // ..e............................... - vadd.u32 q5, q7, q0 // ......................*........... - vld43.32 {q1,q2,q3,q4}, [r0]! // ...e.............................. - vsub.u32 q7, q7, q0 // .....................*............ 
- vldrw.32 q6, [r11] , #96 // ....e............................. - vstrw.u32 q7, [r0, #-112] // ...............................*.. - vmul.s32 q0, q4, q6 // ...........e...................... - vstrw.u32 q5, [r0, #-128] // ..............................*... - - // original source code - // vld40.32 {q0, q1, q2, q3}, [r0] // e...........|.....................e........... - // vld41.32 {q0, q1, q2, q3}, [r0] // ..e.........|.......................e......... - // vld42.32 {q0, q1, q2, q3}, [r0] // ....e.......|.........................e....... - // vld43.32 {q0, q1, q2, q3}, [r0]! // ......e.....|...........................e..... - // vldrw.32 q5, [r11], #+96 // ........e...|.............................e... - // vldrw.32 q6, [r11, #(+16-96)] // ............|*................................ - // vmul.s32 q7, q2, q5 // ............*................................. - // vqrdmulh.s32 q2, q2, q6 // ............|.....*........................... - // vmla.s32 q7, q2, r12 // ............|.......*......................... - // vsub.u32 q2, q0, q7 // ............|..........*...................... - // vadd.u32 q0, q0, q7 // ............|............*.................... - // vmul.s32 q7, q3, q5 // ..........e.|...............................e. - // vqrdmulh.s32 q3, q3, q6 // ............|.*............................... - // vmla.s32 q7, q3, r12 // ............|...*............................. - // vsub.u32 q3, q1, q7 // ............|........*........................ - // vadd.u32 q1, q1, q7 // ............|......*.......................... - // vldrw.32 q5, [r11, #(32 - 96)] // ............|..................*.............. - // vldrw.32 q6, [r11, #(48 - 96)] // ............|..............*.................. - // vmul.s32 q7, q1, q5 // ............|...................*............. - // vqrdmulh.s32 q1, q1, q6 // .*..........|......................*.......... - // vmla.s32 q7, q1, r12 // ...*........|........................*........ 
-        // vsub.u32 q1, q0, q7 // .......*....|............................*....
-        // vadd.u32 q0, q0, q7 // .....*......|..........................*......
-        // vldrw.32 q5, [r11, #(64-96)] // ............|....*............................
-        // vldrw.32 q6, [r11, #(80-96)] // ............|..*..............................
-        // vmul.s32 q7, q3, q5 // ............|.........*.......................
-        // vqrdmulh.s32 q3, q3, q6 // ............|...........*.....................
-        // vmla.s32 q7, q3, r12 // ............|.............*...................
-        // vsub.u32 q3, q2, q7 // ............|.................*...............
-        // vadd.u32 q2, q2, q7 // ............|...............*.................
-        // vstrw.32 q0, [r0, #( 0 - 64)] // ...........*|................................*
-        // vstrw.32 q1, [r0, #(16 - 64)] // .........*..|..............................*..
-        // vstrw.32 q2, [r0, #(32 - 64)] // ............|................*................
-        // vstrw.32 q3, [r0, #(48 - 64)] // ............|....................*............
-
-        le lr, layer78_loop
-        vmul.s32 q6, q3, q6 // *...........................
-        vldrw.32 q5, [r11, #-80] // .*..........................
-        vqrdmulh.s32 q4, q4, q5 // ..*.........................
-        vldrw.32 q7, [r11, #-32] // .....*......................
-        vmla.s32 q0, q4, r12 // ....*.......................
-        vldrw.32 q4, [r11, #-16] // ...*........................
-        vqrdmulh.s32 q5, q3, q5 // ......*.....................
-        vsub.u32 q3, q2, q0 // .........*..................
-        vmla.s32 q6, q5, r12 // ........*...................
-        vadd.u32 q2, q2, q0 // .......*....................
-        vmul.s32 q5, q3, q7 // ..........*.................
-        vadd.u32 q7, q1, q6 // .............*..............
-        vqrdmulh.s32 q3, q3, q4 // ............*...............
-        vsub.u32 q6, q1, q6 // ...........*................
-        vmla.s32 q5, q3, r12 // ..............*.............
-        vldrw.32 q0, [r11, #-48] // ...............*............
-        vadd.u32 q3, q6, q5 // ................*...........
-        vstrw.u32 q3, [r0, #-32] // .................*..........
-        vqrdmulh.s32 q3, q2, q0 // ......................*.....
-        vldrw.32 q0, [r11, #-64] // ...................*........
-        vmul.s32 q0, q2, q0 // ....................*.......
-        vsub.u32 q6, q6, q5 // ..................*.........
-        vmla.s32 q0, q3, r12 // .......................*....
-        vstrw.u32 q6, [r0, #-16] // .....................*......
-        vadd.u32 q3, q7, q0 // ........................*...
-        vstrw.u32 q3, [r0, #-64] // ...........................*
-        vsub.u32 q3, q7, q0 // .........................*..
-        vstrw.u32 q3, [r0, #-48] // ..........................*.
-
-        // original source code
-        // vmul.s32 q7, q3, q6 // *...........................
-        // vldrw.32 q6, [r11, #-80] // .*..........................
-        // vqrdmulh.s32 q4, q4, q6 // ..*.........................
-        // vldrw.32 q5, [r11, #-16] // .....*......................
-        // vmla.s32 q0, q4, r12 // ....*.......................
-        // vldrw.32 q4, [r11, #-32] // ...*........................
-        // vqrdmulh.s32 q3, q3, q6 // ......*.....................
-        // vadd.u32 q6, q2, q0 // .........*..................
-        // vmla.s32 q7, q3, r12 // ........*...................
-        // vsub.u32 q0, q2, q0 // .......*....................
-        // vmul.s32 q2, q0, q4 // ..........*.................
-        // vsub.u32 q3, q1, q7 // .............*..............
-        // vqrdmulh.s32 q4, q0, q5 // ............*...............
-        // vadd.u32 q7, q1, q7 // ...........*................
-        // vmla.s32 q2, q4, r12 // ..............*.............
-        // vldrw.32 q5, [r11, #-48] // ...............*............
-        // vadd.u32 q0, q3, q2 // ................*...........
-        // vstrw.u32 q0, [r0, #-32] // .................*..........
-        // vsub.u32 q4, q3, q2 // .....................*......
-        // vldrw.32 q3, [r11, #-64] // ...................*........
-        // vmul.s32 q0, q6, q3 // ....................*.......
-        // vstrw.u32 q4, [r0, #-16] // .......................*....
-        // vqrdmulh.s32 q6, q6, q5 // ..................*.........
-        // vmla.s32 q0, q6, r12 // ......................*.....
-        // vadd.u32 q5, q7, q0 // ........................*...
-        // vsub.u32 q7, q7, q0 // ..........................*.
-        // vstrw.u32 q7, [r0, #-48] // ...........................*
-        // vstrw.u32 q5, [r0, #-64] // .........................*..
-
-
-        add sp, sp, #STACK_SIZE
-        align_stack_undo
-
-        // Restore MVE vector registers
-        vpop {d8-d15}
-        // Restore GPRs
-        pop {r4-r11,lr}
-        bx lr
\ No newline at end of file
diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_twiddles.s b/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_twiddles.s
deleted file mode 100644
index 6a715ac..0000000
--- a/tests/ntt_dilithium/manual/ntt_dilithium_123_456_78_twiddles.s
+++ /dev/null
@@ -1,537 +0,0 @@
-
-///
-/// Copyright (c) 2022 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.word -3572223
-.word -915382907
-.word 3765607
-.word 964937599
-.word -3201494
-.word -820383522
-.word -2883726
-.word -738955404
-.word 3761513
-.word 963888510
-.word -3145678
-.word -806080660
-.word -3201430
-.word -820367122
-.word -601683
-.word -154181397
-.word -3370349
-.word -863652652
-.word 3602218
-.word 923069133
-.word 3182878
-.word 815613168
-.word -4063053
-.word -1041158200
-.word 2740543
-.word 702264730
-.word -3586446
-.word -919027554
-.word 3542485
-.word 907762539
-.word 2663378
-.word 682491182
-.word -3110818
-.word -797147778
-.word 2101410
-.word 538486762
-.word -1674615
-.word -429120452
-.word 3704823
-.word 949361686
-.word 1159875
-.word 297218217
-.word 2682288
-.word 687336873
-.word -3524442
-.word -903139016
-.word 394148
-.word 101000509
-.word 928749
-.word 237992130
-.word -434125
-.word -111244624
-.word 1095468
-.word 280713909
-.word -3506380
-.word -898510625
-.word 2129892
-.word 545785280
-.word 676590
-.word 173376332
-.word 2071829
-.word 530906624
-.word -4018989
-.word -1029866791
-.word -1335936
-.word -342333886
-.word 3241972
-.word 830756018
-.word 2156050
-.word 552488273
-.word 3764867
-.word 964747974
-.word -3227876
-.word -827143915
-.word 3415069
-.word 875112161
-.word 1759347
-.word 450833045
-.word 1714295
-.word 439288460
-.word -817536
-.word -209493775
-.word -3574466
-.word -915957677
-.word -1005239
-.word -257592709
-.word 2453983
-.word 628833668
-.word 3756790
-.word 962678241
-.word -1935799
-.word -496048908
-.word 1460718
-.word 374309300
-.word -1716988
-.word -439978542
-.word -3950053
-.word -1012201926
-.word 557458
-.word 142848732
-.word -642628
-.word -164673562
-.word -2897314
-.word -742437332
-.word 3192354
-.word 818041395
-.word -3585098
-.word -918682129
-.word 556856
-.word 142694469
-.word 3870317
-.word 991769559
-.word -1221177
-.word -312926867
-.word 2815639
-.word 721508096
-.word 2917338
-.word 747568486
-.word 1853806
-.word 475038184
-.word 2283733
-.word 585207070
-.word 3345963
-.word 857403734
-.word 1858416
-.word 476219497
-// Blocked layers start
-.word 3073009
-.word 1277625
-.word -2635473
-.word 3852015
-.word 787459213
-.word 327391679
-.word -675340520
-.word 987079667
-.word 1753
-.word -2659525
-.word 2660408
-.word -59148
-.word 449207
-.word -681503850
-.word 681730119
-.word -15156688
-.word -1935420
-.word -1455890
-.word -1780227
-.word 2772600
-.word -495951789
-.word -373072124
-.word -456183549
-.word 710479343
-.word 4183372
-.word -3222807
-.word -3121440
-.word -274060
-.word 1071989969
-.word -825844983
-.word -799869667
-.word -70227934
-.word 1182243
-.word 636927
-.word -3956745
-.word -3284915
-.word 302950022
-.word 163212680
-.word -1013916752
-.word -841760171
-.word 87208
-.word -3965306
-.word -2296397
-.word -3716946
-.word 22347069
-.word -1016110510
-.word -588452222
-.word -952468207
-.word 2508980
-.word 2028118
-.word 1937570
-.word -3815725
-.word 642926661
-.word 519705671
-.word 496502727
-.word -977780347
-.word -27812
-.word 1009365
-.word -1979497
-.word -3956944
-.word -7126831
-.word 258649997
-.word -507246529
-.word -1013967746
-.word 822541
-.word -2454145
-.word 1596822
-.word -3759465
-.word 210776307
-.word -628875181
-.word 409185979
-.word -963363710
-.word 2811291
-.word -2983781
-.word -1109516
-.word 4158088
-.word 720393920
-.word -764594519
-.word -284313712
-.word 1065510939
-.word -1685153
-.word 2678278
-.word -3551006
-.word -250446
-.word -431820817
-.word 686309310
-.word -909946047
-.word -64176841
-.word -3410568
-.word -3768948
-.word 635956
-.word -2455377
-.word -873958779
-.word -965793731
-.word 162963861
-.word -629190881
-.word 1528066
-.word 482649
-.word 1148858
-.word -2962264
-.word 391567239
-.word 123678909
-.word 294395108
-.word -759080783
-.word -4146264
-.word 2192938
-.word 2387513
-.word -268456
-.word -1062481036
-.word 561940831
-.word 611800717
-.word -68791907
-.word -1772588
-.word -1727088
-.word -3611750
-.word -3180456
-.word -454226054
-.word -442566669
-.word -925511710
-.word -814992530
-.word -565603
-.word 169688
-.word 2462444
-.word -3334383
-.word -144935890
-.word 43482586
-.word 631001801
-.word -854436357
-.word 3747250
-.word 1239911
-.word 3195676
-.word 1254190
-.word 960233614
-.word 317727459
-.word 818892658
-.word 321386456
-.word 2296099
-.word -3838479
-.word 2642980
-.word -12417
-.word 588375860
-.word -983611064
-.word 677264190
-.word -3181859
-.word -4166425
-.word -3488383
-.word 1987814
-.word -3197248
-.word -1067647297
-.word -893898890
-.word 509377762
-.word -819295484
-.word 2998219
-.word -89301
-.word -1354892
-.word -1310261
-.word 768294260
-.word -22883400
-.word -347191365
-.word -335754661
-.word 141835
-.word 2513018
-.word 613238
-.word -2218467
-.word 36345249
-.word 643961400
-.word 157142369
-.word -568482643
-.word 1736313
-.word 235407
-.word -3250154
-.word 3258457
-.word 444930577
-.word 60323094
-.word -832852657
-.word 834980303
-.word -458740
-.word 4040196
-.word 2039144
-.word -818761
-.word -117552223
-.word 1035301089
-.word 522531086
-.word -209807681
-.word -1921994
-.word -3472069
-.word -1879878
-.word -2178965
-.word -492511373
-.word -889718424
-.word -481719139
-.word -558360247
-.word -2579253
-.word 1787943
-.word -2391089
-.word -2254727
-.word -660934133
-.word 458160776
-.word -612717067
-.word -577774276
-.word -1623354
-.word -2374402
-.word 586241
-.word 527981
-.word -415984810
-.word -608441020
-.word 150224382
-.word 135295244
-.word 2105286
-.word -2033807
-.word -1179613
-.word -2743411
-.word 539479988
-.word -521163479
-.word -302276083
-.word -702999655
-.word 3482206
-.word -4182915
-.word -1300016
-.word -2362063
-.word 892316032
-.word -1071872863
-.word -333129378
-.word -605279149
-.word -1476985
-.word 2491325
-.word 507927
-.word -724804
-.word -378477722
-.word 638402564
-.word 130156402
-.word -185731180
-.word 1994046
-.word -1393159
-.word -1187885
-.word -1834526
-.word 510974714
-.word -356997292
-.word -304395785
-.word -470097680
-.word -1317678
-.word 2461387
-.word 3035980
-.word 621164
-.word -337655269
-.word 630730945
-.word 777970524
-.word 159173408
-.word -3033742
-.word 2647994
-.word -2612853
-.word 749577
-.word -777397036
-.word 678549029
-.word -669544140
-.word 192079267
-.word -338420
-.word 3009748
-.word 4148469
-.word -4022750
-.word -86720197
-.word 771248568
-.word 1063046068
-.word -1030830548
-.word 3901472
-.word -1226661
-.word 2925816
-.word 3374250
-.word 999753034
-.word -314332144
-.word 749740976
-.word 864652284
-.word 3980599
-.word -1615530
-.word 1665318
-.word 1163598
-.word 1020029345
-.word -413979908
-.word 426738094
-.word 298172236
-.word 2569011
-.word 1723229
-.word 2028038
-.word -3369273
-.word 658309618
-.word 441577800
-.word 519685171
-.word -863376927
-.word 1356448
-.word -2775755
-.word 2683270
-.word -2778788
-.word 347590090
-.word -711287812
-.word 687588511
-.word -712065019
-.word 3994671
-.word -1370517
-.word 3363542
-.word 545376
-.word 1023635298
-.word -351195274
-.word 861908357
-.word 139752717
-.word -11879
-.word 3020393
-.word 214880
-.word -770441
-.word -3043996
-.word 773976352
-.word 55063046
-.word -197425671
-.word -3467665
-.word 2312838
-.word -653275
-.word -459163
-.word -888589898
-.word 592665232
-.word -167401858
-.word -117660617
-.word 3105558
-.word 508145
-.word 860144
-.word 140244
-.word 795799901
-.word 130212265
-.word 220412084
-.word 35937555
-.word -1103344
-.word -553718
-.word 3430436
-.word -1514152
-.word -282732136
-.word -141890356
-.word 879049958
-.word -388001774
-.word 348812
-.word -327848
-.word 1011223
-.word -2354215
-.word 89383150
-.word -84011120
-.word 259126110
-.word -603268097
-.word -2185084
-.word 2358373
-.word -3014420
-.word 2926054
-.word -559928242
-.word 604333585
-.word -772445769
-.word 749801963
-.word 3123762
-.word -2193087
-.word -1716814
-.word -392707
-.word 800464680
-.word -561979013
-.word -439933955
-.word -100631253
-.word -3818627
-.word -1922253
-.word -2236726
-.word 1744507
-.word -978523985
-.word -492577742
-.word -573161516
-.word 447030292
-.word -303005
-.word -3974485
-.word 1900052
-.word 1054478
-.word -77645096
-.word -1018462631
-.word 486888731
-.word 270210213
-.word 3531229
-.word -3773731
-.word -781875
-.word -731434
-.word 904878186
-.word -967019376
-.word -200355636
-.word -187430119
\ No newline at end of file
diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78.s
deleted file mode 100644
index 60b6074..0000000
--- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78.s
+++ /dev/null
@@ -1,210 +0,0 @@
-///
-/// Copyright (c) 2022 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.data
-roots:
-#include "ntt_dilithium_12_34_56_78_twiddles.s"
-.text
-
-// Barrett multiplication
-.macro mulmod dst, src, const, const_twisted
-        vmul.s32 \dst, \src, \const
-        vqrdmulh.s32 \src, \src, \const_twisted
-        vmla.s32 \dst, \src, modulus
-.endm
-
-.macro ct_butterfly a, b, root, root_twisted
-        mulmod tmp, \b, \root, \root_twisted
-        vsub.u32 \b, \a, tmp
-        vadd.u32 \a, \a, tmp
-.endm
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_dilithium_12_34_56_78, %function
-.global ntt_dilithium_12_34_56_78
-ntt_dilithium_12_34_56_78:
-
-        push {r4-r11,lr}
-        // Save MVE vector registers
-        vpush {d8-d15}
-
-        modulus .req r12
-        root_ptr .req r11
-
-        .equ modulus_const, -8380417
-        movw modulus, #:lower16:modulus_const
-        movt modulus, #:upper16:modulus_const
-        ldr root_ptr, roots_addr
-
-        in_low .req r0
-        in_high .req r1
-
-        add in_high, in_low, #(4*128)
-
-        root0 .req r2
-        root0_twisted .req r3
-        root1 .req r4
-        root1_twisted .req r5
-        root2 .req r6
-        root2_twisted .req r7
-
-        data0 .req q0
-        data1 .req q1
-        data2 .req q2
-        data3 .req q3
-
-        tmp .req q4
-
-        /* Layers 1-2 */
-        ldrd root0, root0_twisted, [root_ptr], #+8
-        ldrd root1, root1_twisted, [root_ptr], #+8
-        ldrd root2, root2_twisted, [root_ptr], #+8
-
-        mov lr, #16
-layer12_loop:
-        vldrw.u32 data0, [in_low]
-        vldrw.u32 data1, [in_low, #(4*64)]
-        vldrw.u32 data2, [in_high]
-        vldrw.u32 data3, [in_high, #(4*64)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-        vstrw.u32 data0, [in_low], #16
-        vstrw.u32 data1, [in_low, #(4*64 - 16)]
-        vstrw.u32 data2, [in_high], #16
-        vstrw.u32 data3, [in_high, #(4*64-16)]
-        le lr, layer12_loop
-
-        .unreq in_high
-        .unreq in_low
-        in .req r0
-
-        /* Layers 3,4 */
-        sub in, in, #(64*4)
-
-        // 4 butterfly blocks per root config, 4 root configs
-        // loop over root configs
-
-        count .req r1
-        mov count, #4
-
-out_start:
-        ldrd root0, root0_twisted, [root_ptr], #+8
-        ldrd root1, root1_twisted, [root_ptr], #+8
-        ldrd root2, root2_twisted, [root_ptr], #+8
-
-        mov lr, #4
-layer34_loop:
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #(4*1*16)]
-        vldrw.u32 data2, [in, #(4*2*16)]
-        vldrw.u32 data3, [in, #(4*3*16)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-        vstrw.u32 data0, [in], #16
-        vstrw.u32 data1, [in, #(4*1*16 - 16)]
-        vstrw.u32 data2, [in, #(4*2*16 - 16)]
-        vstrw.u32 data3, [in, #(4*3*16 - 16)]
-        le lr, layer34_loop
-
-        add in, in, #(4*64 - 4*16)
-        subs count, count, #1
-        bne out_start
-
-        /* Layers 5,6 */
-        sub in, in, #(4*256)
-
-        mov lr, #16
-layer56_loop:
-        ldrd root0, root0_twisted, [root_ptr], #+24
-        ldrd root1, root1_twisted, [root_ptr, #(-16)]
-        ldrd root2, root2_twisted, [root_ptr, #(-8)]
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #(4*1*4)]
-        vldrw.u32 data2, [in, #(4*2*4)]
-        vldrw.u32 data3, [in, #(4*3*4)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-        vst40.u32 {data0, data1, data2, data3}, [in]
-        vst41.u32 {data0, data1, data2, data3}, [in]
-        vst42.u32 {data0, data1, data2, data3}, [in]
-        vst43.u32 {data0, data1, data2, data3}, [in]!
-        le lr, layer56_loop
-
-        /* Layers 7,8 */
-        sub in, in, #(4*256)
-
-        .unreq root0
-        .unreq root0_twisted
-        .unreq root1
-        .unreq root1_twisted
-        .unreq root2
-        .unreq root2_twisted
-        root0 .req q5
-        root0_twisted .req q6
-        root1 .req q5
-        root1_twisted .req q6
-        root2 .req q5
-        root2_twisted .req q6
-
-        mov lr, #16
-layer78_loop:
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #16]
-        vldrw.u32 data2, [in, #32]
-        vldrw.u32 data3, [in, #48]
-        vldrw.u32 root0, [root_ptr], #+96
-        vldrw.u32 root0_twisted, [root_ptr, #(+16-96)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        vldrw.u32 root1, [root_ptr, #(32 - 96)]
-        vldrw.u32 root1_twisted, [root_ptr, #(48 - 96)]
-        ct_butterfly data0, data1, root1, root1_twisted
-        vldrw.u32 root2, [root_ptr, #(64-96)]
-        vldrw.u32 root2_twisted, [root_ptr, #(80-96)]
-        ct_butterfly data2, data3, root2, root2_twisted
-
-        vstrw.u32 data0, [in], #64
-        vstrw.u32 data1, [in, #-48]
-        vstrw.u32 data2, [in, #-32]
-        vstrw.u32 data3, [in, #-16]
-        // vst40.u32 {data0, data1, data2, data3}, [in]
-        // vst41.u32 {data0, data1, data2, data3}, [in]
-        // vst42.u32 {data0, data1, data2, data3}, [in]
-        // vst43.u32 {data0, data1, data2, data3}, [in]!
-        le lr, layer78_loop
-
-        // Restore MVE vector registers
-        vpop {d8-d15}
-        // Restore GPRs
-        pop {r4-r11,lr}
-        bx lr
diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4.s
deleted file mode 100644
index 0e44160..0000000
--- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4.s
+++ /dev/null
@@ -1,209 +0,0 @@
-///
-/// Copyright (c) 2022 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.data
-roots:
-#include "ntt_dilithium_12_34_56_78_twiddles.s"
-.text
-
-// Barrett multiplication
-.macro mulmod dst, src, const, const_twisted
-        vmul.s32 \dst, \src, \const
-        vqrdmulh.s32 \src, \src, \const_twisted
-        vmla.s32 \dst, \src, modulus
-.endm
-
-.macro ct_butterfly a, b, root, root_twisted
-        mulmod tmp, \b, \root, \root_twisted
-        vsub.u32 \b, \a, tmp
-        vadd.u32 \a, \a, tmp
-.endm
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_dilithium_12_34_56_78_no_trans_vld4, %function
-.global ntt_dilithium_12_34_56_78_no_trans_vld4
-ntt_dilithium_12_34_56_78_no_trans_vld4:
-
-        push {r4-r11,lr}
-        // Save MVE vector registers
-        vpush {d8-d15}
-
-        modulus .req r12
-        root_ptr .req r11
-
-        .equ modulus_const, -8380417
-        movw modulus, #:lower16:modulus_const
-        movt modulus, #:upper16:modulus_const
-        ldr root_ptr, roots_addr
-
-        in_low .req r0
-        in_high .req r1
-
-        add in_high, in_low, #(4*128)
-
-        root0 .req r2
-        root0_twisted .req r3
-        root1 .req r4
-        root1_twisted .req r5
-        root2 .req r6
-        root2_twisted .req r7
-
-        data0 .req q0
-        data1 .req q1
-        data2 .req q2
-        data3 .req q3
-
-        tmp .req q4
-
-        /* Layers 1-2 */
-        ldrd root0, root0_twisted, [root_ptr], #+8
-        ldrd root1, root1_twisted, [root_ptr], #+8
-        ldrd root2, root2_twisted, [root_ptr], #+8
-
-        mov lr, #16
-layer12_loop:
-        vldrw.u32 data0, [in_low]
-        vldrw.u32 data1, [in_low, #(4*64)]
-        vldrw.u32 data2, [in_high]
-        vldrw.u32 data3, [in_high, #(4*64)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-        vstrw.u32 data0, [in_low], #16
-        vstrw.u32 data1, [in_low, #(4*64 - 16)]
-        vstrw.u32 data2, [in_high], #16
-        vstrw.u32 data3, [in_high, #(4*64-16)]
-        le lr, layer12_loop
-
-        .unreq in_high
-        .unreq in_low
-        in .req r0
-
-        /* Layers 3,4 */
-        sub in, in, #(64*4)
-
-        // 4 butterfly blocks per root config, 4 root configs
-        // loop over root configs
-
-        count .req r1
-        mov count, #4
-
-out_start:
-        ldrd root0, root0_twisted, [root_ptr], #+8
-        ldrd root1, root1_twisted, [root_ptr], #+8
-        ldrd root2, root2_twisted, [root_ptr], #+8
-
-        mov lr, #4
-layer34_loop:
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #(4*1*16)]
-        vldrw.u32 data2, [in, #(4*2*16)]
-        vldrw.u32 data3, [in, #(4*3*16)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-        vstrw.u32 data0, [in], #16
-        vstrw.u32 data1, [in, #(4*1*16 - 16)]
-        vstrw.u32 data2, [in, #(4*2*16 - 16)]
-        vstrw.u32 data3, [in, #(4*3*16 - 16)]
-        le lr, layer34_loop
-
-        add in, in, #(4*64 - 4*16)
-        subs count, count, #1
-        bne out_start
-
-        /* Layers 5,6 */
-        sub in, in, #(4*256)
-
-        mov lr, #16
-layer56_loop:
-        ldrd root0, root0_twisted, [root_ptr], #+24
-        ldrd root1, root1_twisted, [root_ptr, #(-16)]
-        ldrd root2, root2_twisted, [root_ptr, #(-8)]
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #(4*1*4)]
-        vldrw.u32 data2, [in, #(4*2*4)]
-        vldrw.u32 data3, [in, #(4*3*4)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-        vstrw.u32 data0, [in], #64
-        vstrw.u32 data1, [in, #(-64+16)]
-        vstrw.u32 data2, [in, #(-64+32)]
-        vstrw.u32 data3, [in, #(-64+48)]
-        le lr, layer56_loop
-
-        /* Layers 7,8 */
-        sub in, in, #(4*256)
-
-        .unreq root0
-        .unreq root0_twisted
-        .unreq root1
-        .unreq root1_twisted
-        .unreq root2
-        .unreq root2_twisted
-        root0 .req q5
-        root0_twisted .req q6
-        root1 .req q5
-        root1_twisted .req q6
-        root2 .req q5
-        root2_twisted .req q6
-
-        mov lr, #16
-layer78_loop:
-        vld40.u32 {data0, data1, data2, data3}, [in]
-        vld41.u32 {data0, data1, data2, data3}, [in]
-        vld42.u32 {data0, data1, data2, data3}, [in]
-        vld43.u32 {data0, data1, data2, data3}, [in]!
-
-        vldrw.u32 root0, [root_ptr], #+96
-        vldrw.u32 root0_twisted, [root_ptr, #(+16-96)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-
-        vldrw.u32 root1, [root_ptr, #(32 - 96)]
-        vldrw.u32 root1_twisted, [root_ptr, #(48 - 96)]
-        ct_butterfly data0, data1, root1, root1_twisted
-
-        vldrw.u32 root2, [root_ptr, #(64-96)]
-        vldrw.u32 root2_twisted, [root_ptr, #(80-96)]
-        ct_butterfly data2, data3, root2, root2_twisted
-
-        vstrw.32 data0, [in, #( 0 - 64)]
-        vstrw.32 data1, [in, #(16 - 64)]
-        vstrw.32 data2, [in, #(32 - 64)]
-        vstrw.32 data3, [in, #(48 - 64)]
-        le lr, layer78_loop
-
-        // Restore MVE vector registers
-        vpop {d8-d15}
-        // Restore GPRs
-        pop {r4-r11,lr}
-        bx lr
diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55.s
deleted file mode 100644
index c985263..0000000
--- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55.s
+++ /dev/null
@@ -1,668 +0,0 @@
-///
-/// Copyright (c) 2022 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.data
-roots:
-#include "ntt_dilithium_12_34_56_78_twiddles.s"
-.text
-
-// Barrett multiplication
-.macro mulmod dst, src, const, const_twisted
-        vmul.s32 \dst, \src, \const
-        vqrdmulh.s32 \src, \src, \const_twisted
-        vmla.s32 \dst, \src, modulus
-.endm
-
-.macro ct_butterfly a, b, root, root_twisted
-        mulmod tmp, \b, \root, \root_twisted
-        vsub.u32 \b, \a, tmp
-        vadd.u32 \a, \a, tmp
-.endm
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55, %function
-.global ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55
-ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55:
-
-        push {r4-r11,lr}
-        // Save MVE vector registers
-        vpush {d8-d15}
-
-        modulus .req r12
-        root_ptr .req r11
-
-        .equ modulus_const, -8380417
-        movw modulus, #:lower16:modulus_const
-        movt modulus, #:upper16:modulus_const
-        ldr root_ptr, roots_addr
-
-        in_low .req r0
-        in_high .req r1
-
-        add in_high, in_low, #(4*128)
-
-        root0 .req r2
-        root0_twisted .req r3
-        root1 .req r4
-        root1_twisted .req r5
-        root2 .req r6
-        root2_twisted .req r7
-
-        data0 .req q0
-        data1 .req q1
-        data2 .req q2
-        data3 .req q3
-
-        tmp .req q4
-
-        /* Layers 1-2 */
-        ldrd root0, root0_twisted, [root_ptr], #+8
-        ldrd root1, root1_twisted, [root_ptr], #+8
-        ldrd root2, root2_twisted, [root_ptr], #+8
-
-        mov lr, #16
-        vldrw.u32 q1, [r1, #256] // .*...
-        vmul.s32 q0, q1, r2 // ...*.
-        // gap // .....
-        vqrdmulh.s32 q1, q1, r3 // ..*..
-        vldrw.u32 q2, [r0, #256] // *....
-        vmla.s32 q0, q1, r12 // ....*
-
-        // original source code
-        // vldrw.u32 q2, [r0, #256] // ...*.
-        // vldrw.u32 q0, [r1, #256] // *....
-        // vqrdmulh.s32 q3, q0, r3 // ..*..
-        // vmul.s32 q0, q0, r2 // .*...
-        // vmla.s32 q0, q3, r12 // ....*
-
-        sub lr, lr, #1
-.p2align 2
-layer12_loop:
-        vsub.u32 q1, q2, q0 // ............*...............
-        vmul.s32 q5, q1, r6 // ...................*........
-        vldrw.u32 q7, [r1] // ..*.........................
-        vqrdmulh.s32 q6, q7, r3 // .....*......................
-        vadd.u32 q3, q2, q0 // .............*..............
-        vmul.s32 q0, q7, r2 // ....*.......................
-        vldrw.u32 q4, [r0] // *...........................
-        vmla.s32 q0, q6, r12 // ......*.....................
-        vldrw.u32 q2, [r0, #272] // .e..........................
-        vqrdmulh.s32 q6, q3, r5 // ...............*............
-        vsub.u32 q7, q4, q0 // .......*....................
-        vmul.s32 q3, q3, r4 // ..............*.............
-        vadd.u32 q4, q4, q0 // ........*...................
-        vmla.s32 q3, q6, r12 // ................*...........
-        vldrw.u32 q0, [r1, #272] // ...e........................
-        vqrdmulh.s32 q1, q1, r7 // ....................*.......
-        vadd.u32 q6, q4, q3 // ..................*.........
-        vmla.s32 q5, q1, r12 // .....................*......
-        vsub.u32 q4, q4, q3 // .................*..........
-        vqrdmulh.s32 q3, q0, r3 // ..........e.................
-        vsub.u32 q1, q7, q5 // ......................*.....
-        vstrw.u32 q1, [r1, #256] // ...........................*
-        vadd.u32 q7, q7, q5 // .......................*....
-        vstrw.u32 q7, [r1], #16 // ..........................*.
-        vmul.s32 q0, q0, r2 // .........e..................
-        vstrw.u32 q6, [r0], #16 // ........................*...
-        vmla.s32 q0, q3, r12 // ...........e................
-        vstrw.u32 q4, [r0, #240] // .........................*..
-
-        // original source code
-        // vldrw.u32 q0, [r0] // ....................|.....*.....................
- // vldrw.u32 q1, [r0, #(4*64)] // e...................|.......e................... - // vldrw.u32 q2, [r1] // ....................|.*......................... - // vldrw.u32 q3, [r1, #(4*64)] // ......e.............|.............e............. - // vmul.s32 q4, q2, r2 // ....................|....*...................... - // vqrdmulh.s32 q2, q2, r3 // ....................|..*........................ - // vmla.s32 q4, q2, r12 // ....................|......*.................... - // vsub.u32 q2, q0, q4 // ..*.................|.........*................. - // vadd.u32 q0, q0, q4 // ....*...............|...........*............... - // vmul.s32 q4, q3, r2 // ................e...|.......................e... - // vqrdmulh.s32 q3, q3, r3 // ...........e........|..................e........ - // vmla.s32 q4, q3, r12 // ..................e.|.........................e. - // vsub.u32 q3, q1, q4 // ....................*........................... - // vadd.u32 q1, q1, q4 // ....................|...*....................... - // vmul.s32 q4, q1, r4 // ...*................|..........*................ - // vqrdmulh.s32 q1, q1, r5 // .*..................|........*.................. - // vmla.s32 q4, q1, r12 // .....*..............|............*.............. - // vsub.u32 q1, q0, q4 // ..........*.........|.................*......... - // vadd.u32 q0, q0, q4 // ........*...........|...............*........... - // vmul.s32 q4, q3, r6 // ....................|*.......................... - // vqrdmulh.s32 q3, q3, r7 // .......*............|..............*............ - // vmla.s32 q4, q3, r12 // .........*..........|................*.......... - // vsub.u32 q3, q2, q4 // ............*.......|...................*....... - // vadd.u32 q2, q2, q4 // ..............*.....|.....................*..... - // vstrw.u32 q0, [r0], #16 // .................*..|........................*.. 
- // vstrw.u32 q1, [r0, #(4*64 - 16)] // ...................*|..........................* - // vstrw.u32 q2, [r1], #16 // ...............*....|......................*.... - // vstrw.u32 q3, [r1, #(4*64-16)] // .............*......|....................*...... - - le lr, layer12_loop - vldrw.u32 q4, [r1] // ..*.................... - vqrdmulh.s32 q6, q4, r3 // ...*................... - // gap // ....................... - vmul.s32 q1, q4, r2 // .....*................. - vsub.u32 q5, q2, q0 // *...................... - vmla.s32 q1, q6, r12 // .......*............... - vadd.u32 q4, q2, q0 // ....*.................. - vqrdmulh.s32 q6, q4, r5 // ........*.............. - vldrw.u32 q0, [r0] // ......*................ - vmul.s32 q4, q4, r4 // ..........*............ - vadd.u32 q2, q0, q1 // ...........*........... - vmla.s32 q4, q6, r12 // ............*.......... - vsub.u32 q3, q0, q1 // .........*............. - vqrdmulh.s32 q7, q5, r7 // .............*......... - vsub.u32 q6, q2, q4 // ................*...... - vmul.s32 q0, q5, r6 // .*..................... - vstrw.u32 q6, [r0, #256] // ......................* - vadd.u32 q5, q2, q4 // ..............*........ - vmla.s32 q0, q7, r12 // ...............*....... - vstrw.u32 q5, [r0] , #16 // .....................*. - vadd.u32 q2, q3, q0 // ...................*... - vstrw.u32 q2, [r1] , #16 // ....................*.. - vsub.u32 q0, q3, q0 // .................*..... - vstrw.u32 q0, [r1, #240] // ..................*.... - - // original source code - // vsub.u32 q1, q2, q0 // ...*................... - // vmul.s32 q5, q1, r6 // ..............*........ - // vldrw.u32 q7, [r1] // *...................... - // vqrdmulh.s32 q6, q7, r3 // .*..................... - // vadd.u32 q3, q2, q0 // .....*................. - // vmul.s32 q0, q7, r2 // ..*.................... - // vldrw.u32 q4, [r0] // .......*............... - // vmla.s32 q0, q6, r12 // ....*.................. - // vqrdmulh.s32 q6, q3, r5 // ......*................ 
- // vsub.u32 q7, q4, q0 // ...........*........... - // vmul.s32 q3, q3, r4 // ........*.............. - // vadd.u32 q4, q4, q0 // .........*............. - // vmla.s32 q3, q6, r12 // ..........*............ - // vqrdmulh.s32 q1, q1, r7 // ............*.......... - // vadd.u32 q6, q4, q3 // ................*...... - // vmla.s32 q5, q1, r12 // .................*..... - // vsub.u32 q4, q4, q3 // .............*......... - // vsub.u32 q1, q7, q5 // .....................*. - // vstrw.u32 q1, [r1, #256] // ......................* - // vadd.u32 q7, q7, q5 // ...................*... - // vstrw.u32 q7, [r1] , #16 // ....................*.. - // vstrw.u32 q6, [r0] , #16 // ..................*.... - // vstrw.u32 q4, [r0, #240] // ...............*....... - - - .unreq in_high - .unreq in_low - in .req r0 - - /* Layers 3,4 */ - sub in, in, #(64*4) - - // 4 butterfly blocks per root config, 4 root configs - // loop over root configs - - count .req r1 - mov count, #4 - -out_start: - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #4 - vldrw.u32 q2, [r0, #192] // .*... - vmul.s32 q0, q2, r2 // ...*. - // gap // ..... - vqrdmulh.s32 q1, q2, r3 // ..*.. - vldrw.u32 q2, [r0, #64] // *.... - vmla.s32 q0, q1, r12 // ....* - - // original source code - // vldrw.u32 q2, [r0, #64] // ...*. - // vldrw.u32 q0, [r0, #192] // *.... - // vqrdmulh.s32 q3, q0, r3 // ..*.. - // vmul.s32 q0, q0, r2 // .*... - // vmla.s32 q0, q3, r12 // ....* - - sub lr, lr, #1 -.p2align 2 -layer34_loop: - vsub.u32 q1, q2, q0 // ............*............... - vmul.s32 q5, q1, r6 // ...................*........ - vldrw.u32 q7, [r0, #128] // ..*......................... - vqrdmulh.s32 q6, q7, r3 // .....*...................... - vadd.u32 q3, q2, q0 // .............*.............. - vmul.s32 q0, q7, r2 // ....*....................... - vldrw.u32 q4, [r0] // *........................... 
- vmla.s32 q0, q6, r12 // ......*..................... - vldrw.u32 q2, [r0, #80] // .e.......................... - vqrdmulh.s32 q6, q3, r5 // ...............*............ - vsub.u32 q7, q4, q0 // .......*.................... - vmul.s32 q3, q3, r4 // ..............*............. - vadd.u32 q4, q4, q0 // ........*................... - vmla.s32 q3, q6, r12 // ................*........... - vldrw.u32 q0, [r0, #208] // ...e........................ - vqrdmulh.s32 q1, q1, r7 // ....................*....... - vadd.u32 q6, q4, q3 // ..................*......... - vmla.s32 q5, q1, r12 // .....................*...... - vsub.u32 q4, q4, q3 // .................*.......... - vqrdmulh.s32 q3, q0, r3 // ..........e................. - vsub.u32 q1, q7, q5 // ......................*..... - vstrw.u32 q1, [r0, #192] // ...........................* - vadd.u32 q7, q7, q5 // .......................*.... - vstrw.u32 q7, [r0, #128] // ..........................*. - vmul.s32 q0, q0, r2 // .........e.................. - vstrw.u32 q6, [r0] , #16 // ........................*... - vmla.s32 q0, q3, r12 // ...........e................ - vstrw.u32 q4, [r0, #48] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // ....................|.....*..................... - // vldrw.u32 q1, [r0, #(4*1*16)] // e...................|.......e................... - // vldrw.u32 q2, [r0, #(4*2*16)] // ....................|.*......................... - // vldrw.u32 q3, [r0, #(4*3*16)] // ......e.............|.............e............. - // vmul.s32 q4, q2, r2 // ....................|....*...................... - // vqrdmulh.s32 q2, q2, r3 // ....................|..*........................ - // vmla.s32 q4, q2, r12 // ....................|......*.................... - // vsub.u32 q2, q0, q4 // ..*.................|.........*................. - // vadd.u32 q0, q0, q4 // ....*...............|...........*............... 
- // vmul.s32 q4, q3, r2 // ................e...|.......................e... - // vqrdmulh.s32 q3, q3, r3 // ...........e........|..................e........ - // vmla.s32 q4, q3, r12 // ..................e.|.........................e. - // vsub.u32 q3, q1, q4 // ....................*........................... - // vadd.u32 q1, q1, q4 // ....................|...*....................... - // vmul.s32 q4, q1, r4 // ...*................|..........*................ - // vqrdmulh.s32 q1, q1, r5 // .*..................|........*.................. - // vmla.s32 q4, q1, r12 // .....*..............|............*.............. - // vsub.u32 q1, q0, q4 // ..........*.........|.................*......... - // vadd.u32 q0, q0, q4 // ........*...........|...............*........... - // vmul.s32 q4, q3, r6 // ....................|*.......................... - // vqrdmulh.s32 q3, q3, r7 // .......*............|..............*............ - // vmla.s32 q4, q3, r12 // .........*..........|................*.......... - // vsub.u32 q3, q2, q4 // ............*.......|...................*....... - // vadd.u32 q2, q2, q4 // ..............*.....|.....................*..... - // vstrw.u32 q0, [r0], #16 // .................*..|........................*.. - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ...................*|..........................* - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // ...............*....|......................*.... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // .............*......|....................*...... - - le lr, layer34_loop - vldrw.u32 q3, [r0, #128] // ..*.................... - vqrdmulh.s32 q4, q3, r3 // ...*................... - vadd.u32 q5, q2, q0 // ....*.................. - vmul.s32 q1, q3, r2 // .....*................. - vsub.u32 q0, q2, q0 // *...................... - vmla.s32 q1, q4, r12 // .......*............... - vldrw.u32 q3, [r0] // ......*................ - vqrdmulh.s32 q4, q5, r5 // ........*.............. 
- vsub.u32 q7, q3, q1 // .........*............. - vmul.s32 q2, q5, r4 // ..........*............ - // gap // ....................... - vmla.s32 q2, q4, r12 // ............*.......... - vadd.u32 q5, q3, q1 // ...........*........... - vqrdmulh.s32 q3, q0, r7 // .............*......... - vadd.u32 q4, q5, q2 // ..............*........ - vmul.s32 q1, q0, r6 // .*..................... - vstrw.u32 q4, [r0] , #16 // .....................*. - vmla.s32 q1, q3, r12 // ...............*....... - vsub.u32 q4, q5, q2 // ................*...... - vstrw.u32 q4, [r0, #48] // ......................* - vadd.u32 q3, q7, q1 // ...................*... - vstrw.u32 q3, [r0, #112] // ....................*.. - vsub.u32 q7, q7, q1 // .................*..... - vstrw.u32 q7, [r0, #176] // ..................*.... - - // original source code - // vsub.u32 q1, q2, q0 // ....*.................. - // vmul.s32 q5, q1, r6 // ..............*........ - // vldrw.u32 q7, [r0, #128] // *...................... - // vqrdmulh.s32 q6, q7, r3 // .*..................... - // vadd.u32 q3, q2, q0 // ..*.................... - // vmul.s32 q0, q7, r2 // ...*................... - // vldrw.u32 q4, [r0] // ......*................ - // vmla.s32 q0, q6, r12 // .....*................. - // vqrdmulh.s32 q6, q3, r5 // .......*............... - // vsub.u32 q7, q4, q0 // ........*.............. - // vmul.s32 q3, q3, r4 // .........*............. - // vadd.u32 q4, q4, q0 // ...........*........... - // vmla.s32 q3, q6, r12 // ..........*............ - // vqrdmulh.s32 q1, q1, r7 // ............*.......... - // vadd.u32 q6, q4, q3 // .............*......... - // vmla.s32 q5, q1, r12 // ................*...... - // vsub.u32 q4, q4, q3 // .................*..... - // vsub.u32 q1, q7, q5 // .....................*. - // vstrw.u32 q1, [r0, #192] // ......................* - // vadd.u32 q7, q7, q5 // ...................*... - // vstrw.u32 q7, [r0, #128] // ....................*.. 
- // vstrw.u32 q6, [r0] , #16 // ...............*....... - // vstrw.u32 q4, [r0, #48] // ..................*.... - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - /* Layers 5,6 */ - sub in, in, #(4*256) - - mov lr, #16 - ldrd r9, r8, [r11] , #24 // *...... - vldrw.u32 q5, [r0, #48] // ...*... - vqrdmulh.s32 q1, q5, r8 // .....*. - ldrd r4, r7, [r11, #-8] // ..*.... - vmul.s32 q0, q5, r9 // ....*.. - vldrw.u32 q5, [r0, #16] // .*..... - vmla.s32 q0, q1, r12 // ......* - - // original source code - // ldrd r9, r8, [r11] , #24 // *...... - // vldrw.u32 q5, [r0, #16] // .....*. - // ldrd r4, r7, [r11, #-8] // ...*... - // vldrw.u32 q1, [r0, #48] // .*..... - // vmul.s32 q0, q1, r9 // ....*.. - // vqrdmulh.s32 q4, q1, r8 // ..*.... - // vmla.s32 q0, q4, r12 // ......* - - sub lr, lr, #1 -.p2align 2 -layer56_loop: - vsub.u32 q1, q5, q0 // ...............*............... - vqrdmulh.s32 q6, q1, r7 // .......................*....... - vldrw.u32 q7, [r0, #32] // .....*......................... - vqrdmulh.s32 q2, q7, r8 // ........*...................... - vadd.u32 q4, q5, q0 // ................*.............. - vmul.s32 q3, q1, r4 // ......................*........ - vldrw.u32 q5, [r0] // ...*........................... - vmul.s32 q0, q7, r9 // .......*....................... - ldrd r10, r1, [r11, #-16] // .*............................. - vmla.s32 q0, q2, r12 // .........*..................... - ldrd r9, r8, [r11] , #24 // e.............................. - vqrdmulh.s32 q2, q4, r1 // ..................*............ - vsub.u32 q7, q5, q0 // ..........*.................... - vmul.s32 q4, q4, r10 // .................*............. - vadd.u32 q0, q5, q0 // ...........*................... - vmla.s32 q3, q6, r12 // ........................*...... - vldrw.u32 q5, [r0, #80] // ....e.......................... - vmla.s32 q4, q2, r12 // ...................*........... - vsub.u32 q2, q7, q3 // .........................*..... 
- ldrd r4, r7, [r11, #-8] // ..e............................ - vldrw.u32 q1, [r0, #112] // ......e........................ - vsub.u32 q6, q0, q4 // ....................*.......... - vstrw.u32 q6, [r0, #16] // ............................*.. - vadd.u32 q6, q0, q4 // .....................*......... - vmul.s32 q0, q1, r9 // ............e.................. - vstrw.u32 q2, [r0, #48] // ..............................* - vqrdmulh.s32 q4, q1, r8 // .............e................. - vstrw.u32 q6, [r0] , #64 // ...........................*... - vmla.s32 q0, q4, r12 // ..............e................ - vadd.u32 q1, q7, q3 // ..........................*.... - vstrw.u32 q1, [r0, #-32] // .............................*. - - // original source code - // ldrd r2, r3, [r11], #+24 // e....................|.........e.................... - // ldrd r4, r5, [r11, #(-16)] // .....................|.......*...................... - // ldrd r6, r7, [r11, #(-8)] // .........e...........|..................e........... - // vldrw.u32 q0, [r0] // .....................|.....*........................ - // vldrw.u32 q1, [r0, #(4*1*4)] // ......e..............|...............e.............. - // vldrw.u32 q2, [r0, #(4*2*4)] // .....................|.*............................ - // vldrw.u32 q3, [r0, #(4*3*4)] // ..........e..........|...................e.......... - // vmul.s32 q4, q2, r2 // .....................|......*....................... - // vqrdmulh.s32 q2, q2, r3 // .....................|..*........................... - // vmla.s32 q4, q2, r12 // .....................|........*..................... - // vsub.u32 q2, q0, q4 // ..*..................|...........*.................. - // vadd.u32 q0, q0, q4 // ....*................|.............*................ - // vmul.s32 q4, q3, r2 // ..............e......|.......................e...... - // vqrdmulh.s32 q3, q3, r3 // ................e....|.........................e.... 
- // vmla.s32 q4, q3, r12 // ..................e..|...........................e.. - // vsub.u32 q3, q1, q4 // .....................*.............................. - // vadd.u32 q1, q1, q4 // .....................|...*.......................... - // vmul.s32 q4, q1, r4 // ...*.................|............*................. - // vqrdmulh.s32 q1, q1, r5 // .*...................|..........*................... - // vmla.s32 q4, q1, r12 // .......*.............|................*............. - // vsub.u32 q1, q0, q4 // ...........*.........|....................*......... - // vadd.u32 q0, q0, q4 // .............*.......|......................*....... - // vmul.s32 q4, q3, r6 // .....................|....*......................... - // vqrdmulh.s32 q3, q3, r7 // .....................|*............................. - // vmla.s32 q4, q3, r12 // .....*...............|..............*............... - // vsub.u32 q3, q2, q4 // ........*............|.................*............ - // vadd.u32 q2, q2, q4 // ...................*.|............................*. - // vstrw.u32 q0, [r0], #64 // .................*...|..........................*... - // vstrw.u32 q1, [r0, #(-64+16)] // ............*........|.....................*........ - // vstrw.u32 q2, [r0, #(-64+32)] // ....................*|.............................* - // vstrw.u32 q3, [r0, #(-64+48)] // ...............*.....|........................*..... - - le lr, layer56_loop - layer56_loop_end: - ldrd r10, r1, [r11, #-16] // ........*........................ - vadd.u32 q6, q5, q0 // ....*............................ - vmul.s32 q3, q6, r10 // ............*.................... - vldrw.u32 q1, [r0, #32] // ..*.............................. - vmul.s32 q7, q1, r9 // .......*......................... - vsub.u32 q2, q5, q0 // *................................ - vqrdmulh.s32 q1, q1, r8 // ...*............................. - vldrw.u32 q4, [r0] // ......*.......................... 
- vmla.s32 q7, q1, r12 // .........*....................... - vldrw.u32 q0, [r11] , #96 // ............................*.... - vqrdmulh.s32 q1, q6, r1 // ..........*...................... - vadd.u32 q6, q4, q7 // .............*................... - vmla.s32 q3, q1, r12 // ...............*................. - vsub.u32 q4, q4, q7 // ...........*..................... - vmul.s32 q1, q2, r4 // .....*........................... - vsub.u32 q5, q6, q3 // .................*............... - vstrw.u32 q5, [r0, #16] // ..................*.............. - vqrdmulh.s32 q7, q2, r7 // .*............................... - vadd.u32 q6, q6, q3 // ...................*............. - vmla.s32 q1, q7, r12 // ..............*.................. - vstrw.u32 q6, [r0] , #64 // .....................*........... - vadd.u32 q2, q4, q1 // ......................*.......... - vstrw.u32 q2, [r0, #-32] // .......................*......... - vsub.u32 q6, q4, q1 // ................*................ - vstrw.u32 q6, [r0, #-16] // ....................*............ - sub r0, r0, #(4*256) // ........................*........ - // gap // ................................. - vld40.u32 {q2,q3,q4,q5}, [r0] // ..........................*...... - // gap // ................................. - vld41.u32 {q2,q3,q4,q5}, [r0] // ...........................*..... - // gap // ................................. - vld42.u32 {q2,q3,q4,q5}, [r0] // .............................*... - mov r14, #16 // .........................*....... - vld43.u32 {q2,q3,q4,q5}, [r0]! // ..............................*.. - sub r14, r14, #1 // ................................* - vmul.s32 q6, q4, q0 // ...............................*. - - // original source code - // vsub.u32 q1, q5, q0 // .....*........................... - // vqrdmulh.s32 q6, q1, r7 // .................*............... - // vldrw.u32 q7, [r0, #32] // ...*............................. - // vqrdmulh.s32 q2, q7, r8 // ......*.......................... 
- // vadd.u32 q4, q5, q0 // .*............................... - // vmul.s32 q3, q1, r4 // ..............*.................. - // vldrw.u32 q5, [r0] // .......*......................... - // vmul.s32 q0, q7, r9 // ....*............................ - // ldrd r10, r1, [r11, #-16] // *................................ - // vmla.s32 q0, q2, r12 // ........*........................ - // vqrdmulh.s32 q2, q4, r1 // ..........*...................... - // vsub.u32 q7, q5, q0 // .............*................... - // vmul.s32 q4, q4, r10 // ..*.............................. - // vadd.u32 q0, q5, q0 // ...........*..................... - // vmla.s32 q3, q6, r12 // ...................*............. - // vmla.s32 q4, q2, r12 // ............*.................... - // vsub.u32 q2, q7, q3 // .......................*......... - // vsub.u32 q6, q0, q4 // ...............*................. - // vstrw.u32 q6, [r0, #16] // ................*................ - // vadd.u32 q6, q0, q4 // ..................*.............. - // vstrw.u32 q2, [r0, #48] // ........................*........ - // vstrw.u32 q6, [r0] , #64 // ....................*............ - // vadd.u32 q1, q7, q3 // .....................*........... - // vstrw.u32 q1, [r0, #-32] // ......................*.......... - // sub r0, r0, #(4*256) // .........................*....... - // mov r14, #16 // .............................*... - // vld40.u32 {q2,q3,q4,q5}, [r0] // ..........................*...... - // vld41.u32 {q2,q3,q4,q5}, [r0] // ...........................*..... - // vldrw.u32 q0, [r11] , #96 // .........*....................... - // vld42.u32 {q2,q3,q4,q5}, [r0] // ............................*.... - // vld43.u32 {q2,q3,q4,q5}, [r0]! // ..............................*.. - // vmul.s32 q6, q4, q0 // ................................* - // sub r14, r14, #1 // ...............................*. - - layer78_loop: - - vmul.s32 q7, q5, q0 // ...........*...................... 
- vldrw.u32 q0, [r11, #-80] // .....*............................ - vqrdmulh.s32 q1, q4, q0 // .......*.......................... - vldrw.u32 q4, [r11, #-16] // ........................*......... - vqrdmulh.s32 q0, q5, q0 // ............*..................... - vldrw.u32 q5, [r11, #-32] // .......................*.......... - vmla.s32 q7, q0, r12 // .............*.................... - vldrw.u32 q0, [r11, #-48] // .................*................ - vmla.s32 q6, q1, r12 // ........*......................... - vsub.u32 q1, q3, q7 // ..............*................... - vmul.s32 q5, q1, q5 // .........................*........ - vadd.u32 q7, q3, q7 // ...............*.................. - vqrdmulh.s32 q4, q1, q4 // ..........................*....... - vsub.u32 q3, q2, q6 // .........*........................ - vmla.s32 q5, q4, r12 // ...........................*...... - vldrw.u32 q1, [r11, #-64] // ................*................. - vsub.u32 q4, q3, q5 // ............................*..... - vstrw.u32 q4, [r0, #-16] // .................................* - vadd.u32 q4, q3, q5 // .............................*.... - vstrw.u32 q4, [r0, #-32] // ................................*. - vadd.u32 q6, q2, q6 // ..........*....................... - vld40.u32 {q2,q3,q4,q5}, [r0] // e................................. - vmul.s32 q1, q7, q1 // ..................*............... - vld41.u32 {q2,q3,q4,q5}, [r0] // .e................................ - vqrdmulh.s32 q7, q7, q0 // ...................*.............. - vldrw.u32 q0, [r11] , #96 // ....e............................. - vmla.s32 q1, q7, r12 // ....................*............. - vld42.u32 {q2,q3,q4,q5}, [r0] // ..e............................... - vsub.u32 q7, q6, q1 // .....................*............ - vld43.u32 {q2,q3,q4,q5}, [r0]! // ...e.............................. - vadd.u32 q1, q6, q1 // ......................*........... - vstrw.u32 q1, [r0, #-128] // ..............................*... 
- vmul.s32 q6, q4, q0 // ......e........................... - vstrw.u32 q7, [r0, #-112] // ...............................*.. - - // original source code - // vld40.u32 {q0, q1, q2, q3}, [r0] // e............|....................e............ - // vld41.u32 {q0, q1, q2, q3}, [r0] // ..e..........|......................e.......... - // vld42.u32 {q0, q1, q2, q3}, [r0] // ......e......|..........................e...... - // vld43.u32 {q0, q1, q2, q3}, [r0]! // ........e....|............................e.... - // vldrw.u32 q5, [r11], #+96 // ....e........|........................e........ - // vldrw.u32 q6, [r11, #(+16-96)] // .............|*................................ - // vmul.s32 q4, q2, q5 // ...........e.|...............................e. - // vqrdmulh.s32 q2, q2, q6 // .............|.*............................... - // vmla.s32 q4, q2, r12 // .............|.......*......................... - // vsub.u32 q2, q0, q4 // .............|............*.................... - // vadd.u32 q0, q0, q4 // .............|...................*............. - // vmul.s32 q4, q3, q5 // .............*................................. - // vqrdmulh.s32 q3, q3, q6 // .............|...*............................. - // vmla.s32 q4, q3, r12 // .............|.....*........................... - // vsub.u32 q3, q1, q4 // .............|........*........................ - // vadd.u32 q1, q1, q4 // .............|..........*...................... - // vldrw.u32 q5, [r11, #(32 - 96)] // .............|..............*.................. - // vldrw.u32 q6, [r11, #(48 - 96)] // .............|......*.......................... - // vmul.s32 q4, q1, q5 // .*...........|.....................*........... - // vqrdmulh.s32 q1, q1, q6 // ...*.........|.......................*......... - // vmla.s32 q4, q1, r12 // .....*.......|.........................*....... - // vsub.u32 q1, q0, q4 // .......*.....|...........................*..... 
- // vadd.u32 q0, q0, q4 // .........*...|.............................*... - // vldrw.u32 q5, [r11, #(64-96)] // .............|....*............................ - // vldrw.u32 q6, [r11, #(80-96)] // .............|..*.............................. - // vmul.s32 q4, q3, q5 // .............|.........*....................... - // vqrdmulh.s32 q3, q3, q6 // .............|...........*..................... - // vmla.s32 q4, q3, r12 // .............|.............*................... - // vsub.u32 q3, q2, q4 // .............|...............*................. - // vadd.u32 q2, q2, q4 // .............|.................*............... - // vstrw.32 q0, [r0, #( 0 - 64)] // ..........*..|..............................*.. - // vstrw.32 q1, [r0, #(16 - 64)] // ............*|................................* - // vstrw.32 q2, [r0, #(32 - 64)] // .............|..................*.............. - // vstrw.32 q3, [r0, #(48 - 64)] // .............|................*................ - - le lr, layer78_loop - vmul.s32 q7, q5, q0 // *........................... - vldrw.u32 q1, [r11, #-80] // .*.......................... - vqrdmulh.s32 q0, q5, q1 // ....*....................... - vldrw.u32 q5, [r11, #-64] // ...............*............ - vmla.s32 q7, q0, r12 // ......*..................... - vldrw.u32 q0, [r11, #-48] // .......*.................... - vqrdmulh.s32 q4, q4, q1 // ..*......................... - vadd.u32 q1, q3, q7 // ...........*................ - vqrdmulh.s32 q0, q1, q0 // ......................*..... - vsub.u32 q3, q3, q7 // .........*.................. - vmla.s32 q6, q4, r12 // ........*................... - vldrw.u32 q7, [r11, #-32] // .....*...................... - vmul.s32 q5, q1, q5 // .....................*...... - vadd.u32 q1, q2, q6 // ....................*....... - vmla.s32 q5, q0, r12 // .......................*.... - vldrw.u32 q0, [r11, #-16] // ...*........................ - vsub.u32 q4, q1, q5 // ........................*... 
- vmul.s32 q7, q3, q7 // ..........*................. - vstrw.u32 q4, [r0, #-48] // ...........................* - vadd.u32 q4, q1, q5 // .........................*.. - vqrdmulh.s32 q5, q3, q0 // ............*............... - vsub.u32 q0, q2, q6 // .............*.............. - vmla.s32 q7, q5, r12 // ..............*............. - vstrw.u32 q4, [r0, #-64] // ..........................*. - vadd.u32 q4, q0, q7 // ..................*......... - vstrw.u32 q4, [r0, #-32] // ...................*........ - vsub.u32 q4, q0, q7 // ................*........... - vstrw.u32 q4, [r0, #-16] // .................*.......... - - // original source code - // vmul.s32 q7, q5, q0 // *........................... - // vldrw.u32 q0, [r11, #-80] // .*.......................... - // vqrdmulh.s32 q1, q4, q0 // ......*..................... - // vldrw.u32 q4, [r11, #-16] // ...............*............ - // vqrdmulh.s32 q0, q5, q0 // ..*......................... - // vldrw.u32 q5, [r11, #-32] // ...........*................ - // vmla.s32 q7, q0, r12 // ....*....................... - // vldrw.u32 q0, [r11, #-48] // .....*...................... - // vmla.s32 q6, q1, r12 // ..........*................. - // vsub.u32 q1, q3, q7 // .........*.................. - // vmul.s32 q5, q1, q5 // .................*.......... - // vadd.u32 q7, q3, q7 // .......*.................... - // vqrdmulh.s32 q4, q1, q4 // ....................*....... - // vsub.u32 q3, q2, q6 // .....................*...... - // vmla.s32 q5, q4, r12 // ......................*..... - // vldrw.u32 q1, [r11, #-64] // ...*........................ - // vsub.u32 q4, q3, q5 // ..........................*. - // vstrw.u32 q4, [r0, #-16] // ...........................* - // vadd.u32 q4, q3, q5 // ........................*... - // vstrw.u32 q4, [r0, #-32] // .........................*.. - // vadd.u32 q6, q2, q6 // .............*.............. - // vmul.s32 q1, q7, q1 // ............*............... 
- // vqrdmulh.s32 q7, q7, q0 // ........*................... - // vmla.s32 q1, q7, r12 // ..............*............. - // vsub.u32 q7, q6, q1 // ................*........... - // vadd.u32 q1, q6, q1 // ...................*........ - // vstrw.u32 q1, [r0, #-64] // .......................*.... - // vstrw.u32 q7, [r0, #-48] // ..................*......... - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85.s deleted file mode 100644 index 98f2f01..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85.s +++ /dev/null @@ -1,600 +0,0 @@ -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -roots: -#include "ntt_dilithium_12_34_56_78_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s32 \dst, \src, \const - vqrdmulh.s32 \src, \src, \const_twisted - vmla.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85, %function -.global ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85 -ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -8380417 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*128) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 1-2 */ - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #16 - vldrw.u32 q4, [r1] // *. - vqrdmulh.s32 q5, q4, r3 // .* - - // original source code - // vldrw.u32 q4, [r1] // *. - // vqrdmulh.s32 q5, q4, r3 // .* - - sub lr, lr, #1 -.p2align 2 -layer12_loop: - vmul.s32 q1, q4, r2 // ....*....................... - vldrw.u32 q0, [r1, #256] // ...*........................ - vqrdmulh.s32 q6, q0, r3 // ..........*................. 
- vldrw.u32 q2, [r0] // *........................... - vmul.s32 q7, q0, r2 // .........*.................. - vldrw.u32 q3, [r0, #256] // .*.......................... - vmla.s32 q7, q6, r12 // ...........*................ - vldrw.u32 q4, [r1, #16] // ..e......................... - vmla.s32 q1, q5, r12 // ......*..................... - vadd.u32 q6, q3, q7 // .............*.............. - vmul.s32 q5, q6, r4 // ..............*............. - vsub.u32 q3, q3, q7 // ............*............... - vqrdmulh.s32 q7, q6, r5 // ...............*............ - vadd.u32 q6, q2, q1 // ........*................... - vmla.s32 q5, q7, r12 // ................*........... - vsub.u32 q2, q2, q1 // .......*.................... - vqrdmulh.s32 q7, q3, r7 // ....................*....... - vadd.u32 q1, q6, q5 // ..................*......... - vmul.s32 q3, q3, r6 // ...................*........ - vsub.u32 q6, q6, q5 // .................*.......... - vmla.s32 q3, q7, r12 // .....................*...... - vstrw.u32 q6, [r0, #256] // .........................*.. - vsub.u32 q6, q2, q3 // ......................*..... - vstrw.u32 q6, [r1, #256] // ...........................* - vadd.u32 q7, q2, q3 // .......................*.... - vstrw.u32 q1, [r0] , #16 // ........................*... - vqrdmulh.s32 q5, q4, r3 // .....e...................... - vstrw.u32 q7, [r1] , #16 // ..........................*. - - // original source code - // vldrw.u32 q0, [r0] // .....................|..*........................ - // vldrw.u32 q1, [r0, #(4*64)] // .....................|....*...................... - // vldrw.u32 q2, [r1] // e....................|......e.................... - // vldrw.u32 q3, [r1, #(4*64)] // .....................|*.......................... - // vmul.s32 q4, q2, r2 // .....................*........................... - // vqrdmulh.s32 q2, q2, r3 // ...................e.|.........................e. - // vmla.s32 q4, q2, r12 // .*...................|.......*................... 
- // vsub.u32 q2, q0, q4 // ........*............|..............*............ - // vadd.u32 q0, q0, q4 // ......*..............|............*.............. - // vmul.s32 q4, q3, r2 // .....................|...*....................... - // vqrdmulh.s32 q3, q3, r3 // .....................|.*......................... - // vmla.s32 q4, q3, r12 // .....................|.....*..................... - // vsub.u32 q3, q1, q4 // ....*................|..........*................ - // vadd.u32 q1, q1, q4 // ..*..................|........*.................. - // vmul.s32 q4, q1, r4 // ...*.................|.........*................. - // vqrdmulh.s32 q1, q1, r5 // .....*...............|...........*............... - // vmla.s32 q4, q1, r12 // .......*.............|.............*............. - // vsub.u32 q1, q0, q4 // ............*........|..................*........ - // vadd.u32 q0, q0, q4 // ..........*..........|................*.......... - // vmul.s32 q4, q3, r6 // ...........*.........|.................*......... - // vqrdmulh.s32 q3, q3, r7 // .........*...........|...............*........... - // vmla.s32 q4, q3, r12 // .............*.......|...................*....... - // vsub.u32 q3, q2, q4 // ...............*.....|.....................*..... - // vadd.u32 q2, q2, q4 // .................*...|.......................*... - // vstrw.u32 q0, [r0], #16 // ..................*..|........................*.. - // vstrw.u32 q1, [r0, #(4*64 - 16)] // ..............*......|....................*...... - // vstrw.u32 q2, [r1], #16 // ....................*|..........................* - // vstrw.u32 q3, [r1, #(4*64-16)] // ................*....|......................*.... - - le lr, layer12_loop - vmul.s32 q6, q4, r2 // *......................... - vldrw.u32 q1, [r1, #256] // .*........................ - vqrdmulh.s32 q0, q1, r3 // ..*....................... - vldrw.u32 q2, [r0] // ...*...................... - vmul.s32 q3, q1, r2 // ....*..................... 
- vldrw.u32 q4, [r0, #256] // .....*.................... - vmla.s32 q3, q0, r12 // ......*................... - // gap // .......................... - vmla.s32 q6, q5, r12 // .......*.................. - vadd.u32 q0, q4, q3 // ........*................. - vqrdmulh.s32 q5, q0, r5 // ...........*.............. - vsub.u32 q1, q2, q6 // ..............*........... - vmul.s32 q7, q0, r4 // .........*................ - vadd.u32 q2, q2, q6 // ............*............. - vmla.s32 q7, q5, r12 // .............*............ - vsub.u32 q4, q4, q3 // ..........*............... - vqrdmulh.s32 q0, q4, r7 // ...............*.......... - vsub.u32 q3, q2, q7 // ..................*....... - vstrw.u32 q3, [r0, #256] // ....................*..... - vmul.s32 q4, q4, r6 // .................*........ - vadd.u32 q7, q2, q7 // ................*......... - vmla.s32 q4, q0, r12 // ...................*...... - vstrw.u32 q7, [r0] , #16 // ........................*. - vsub.u32 q5, q1, q4 // .....................*.... - vstrw.u32 q5, [r1, #256] // ......................*... - vadd.u32 q7, q1, q4 // .......................*.. - vstrw.u32 q7, [r1] , #16 // .........................* - - // original source code - // vmul.s32 q1, q4, r2 // *......................... - // vldrw.u32 q0, [r1, #256] // .*........................ - // vqrdmulh.s32 q6, q0, r3 // ..*....................... - // vldrw.u32 q2, [r0] // ...*...................... - // vmul.s32 q7, q0, r2 // ....*..................... - // vldrw.u32 q3, [r0, #256] // .....*.................... - // vmla.s32 q7, q6, r12 // ......*................... - // vmla.s32 q1, q5, r12 // .......*.................. - // vadd.u32 q6, q3, q7 // ........*................. - // vmul.s32 q5, q6, r4 // ...........*.............. - // vsub.u32 q3, q3, q7 // ..............*........... - // vqrdmulh.s32 q7, q6, r5 // .........*................ - // vadd.u32 q6, q2, q1 // ............*............. - // vmla.s32 q5, q7, r12 // .............*............ 
- // vsub.u32 q2, q2, q1 // ..........*............... - // vqrdmulh.s32 q7, q3, r7 // ...............*.......... - // vadd.u32 q1, q6, q5 // ...................*...... - // vmul.s32 q3, q3, r6 // ..................*....... - // vsub.u32 q6, q6, q5 // ................*......... - // vmla.s32 q3, q7, r12 // ....................*..... - // vstrw.u32 q6, [r0, #256] // .................*........ - // vsub.u32 q6, q2, q3 // ......................*... - // vstrw.u32 q6, [r1, #256] // .......................*.. - // vadd.u32 q7, q2, q3 // ........................*. - // vstrw.u32 q1, [r0] , #16 // .....................*.... - // vstrw.u32 q7, [r1] , #16 // .........................* - - - .unreq in_high - .unreq in_low - in .req r0 - - /* Layers 3,4 */ - sub in, in, #(64*4) - - // 4 butterfly blocks per root config, 4 root configs - // loop over root configs - - count .req r1 - mov count, #4 - -out_start: - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #4 - vldrw.u32 q0, [r0, #192] // *. - vmul.s32 q2, q0, r2 // .* - - // original source code - // vldrw.u32 q0, [r0, #192] // *. - // vmul.s32 q2, q0, r2 // .* - - sub lr, lr, #1 -.p2align 2 -layer34_loop: - vqrdmulh.s32 q0, q0, r3 // ..........*................. - vldrw.u32 q6, [r0, #128] // ..*......................... - vqrdmulh.s32 q4, q6, r3 // .....*...................... - vldrw.u32 q7, [r0] // *........................... - vmla.s32 q2, q0, r12 // ...........*................ - vldrw.u32 q1, [r0, #64] // .*.......................... - vmul.s32 q5, q6, r2 // ....*....................... - vsub.u32 q0, q1, q2 // ............*............... - vmla.s32 q5, q4, r12 // ......*..................... - vadd.u32 q1, q1, q2 // .............*.............. - vmul.s32 q2, q0, r6 // ...................*........ - vadd.u32 q4, q7, q5 // ........*................... - vqrdmulh.s32 q3, q0, r7 // ....................*....... 
- vsub.u32 q5, q7, q5 // .......*.................... - vmla.s32 q2, q3, r12 // .....................*...... - vldrw.u32 q0, [r0, #208] // ...e........................ - vsub.u32 q7, q5, q2 // ......................*..... - vmul.s32 q3, q1, r4 // ..............*............. - vadd.u32 q5, q5, q2 // .......................*.... - vqrdmulh.s32 q2, q1, r5 // ...............*............ - vstrw.u32 q7, [r0, #192] // ...........................* - vmla.s32 q3, q2, r12 // ................*........... - vstrw.u32 q5, [r0, #128] // ..........................*. - vadd.u32 q2, q4, q3 // ..................*......... - vstrw.u32 q2, [r0] , #16 // ........................*... - vsub.u32 q4, q4, q3 // .................*.......... - vmul.s32 q2, q0, r2 // .........e.................. - vstrw.u32 q4, [r0, #48] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // .............|..*........................ - // vldrw.u32 q1, [r0, #(4*1*16)] // .............|....*...................... - // vldrw.u32 q2, [r0, #(4*2*16)] // .............|*.......................... - // vldrw.u32 q3, [r0, #(4*3*16)] // e............|..............e............ - // vmul.s32 q4, q2, r2 // .............|.....*..................... - // vqrdmulh.s32 q2, q2, r3 // .............|.*......................... - // vmla.s32 q4, q2, r12 // .............|.......*................... - // vsub.u32 q2, q0, q4 // .............|............*.............. - // vadd.u32 q0, q0, q4 // .............|..........*................ - // vmul.s32 q4, q3, r2 // ...........e.|.........................e. - // vqrdmulh.s32 q3, q3, r3 // .............*........................... - // vmla.s32 q4, q3, r12 // .............|...*....................... - // vsub.u32 q3, q1, q4 // .............|......*.................... - // vadd.u32 q1, q1, q4 // .............|........*.................. - // vmul.s32 q4, q1, r4 // ..*..........|................*.......... 
- // vqrdmulh.s32 q1, q1, r5 // ....*........|..................*........ - // vmla.s32 q4, q1, r12 // ......*......|....................*...... - // vsub.u32 q1, q0, q4 // ..........*..|........................*.. - // vadd.u32 q0, q0, q4 // ........*....|......................*.... - // vmul.s32 q4, q3, r6 // .............|.........*................. - // vqrdmulh.s32 q3, q3, r7 // .............|...........*............... - // vmla.s32 q4, q3, r12 // .............|.............*............. - // vsub.u32 q3, q2, q4 // .*...........|...............*........... - // vadd.u32 q2, q2, q4 // ...*.........|.................*......... - // vstrw.u32 q0, [r0], #16 // .........*...|.......................*... - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ............*|..........................* - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // .......*.....|.....................*..... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // .....*.......|...................*....... - - le lr, layer34_loop - vqrdmulh.s32 q4, q0, r3 // *......................... - vldrw.u32 q6, [r0, #128] // .*........................ - vmla.s32 q2, q4, r12 // ....*..................... - vldrw.u32 q4, [r0, #64] // .....*.................... - vqrdmulh.s32 q0, q6, r3 // ..*....................... - vsub.u32 q5, q4, q2 // .......*.................. - vmul.s32 q6, q6, r2 // ......*................... - vadd.u32 q4, q4, q2 // .........*................ - vmla.s32 q6, q0, r12 // ........*................. - vldrw.u32 q0, [r0] // ...*...................... - vmul.s32 q7, q5, r6 // ..........*............... - vadd.u32 q1, q0, q6 // ...........*.............. - vqrdmulh.s32 q5, q5, r7 // ............*............. - vsub.u32 q6, q0, q6 // .............*............ - vmla.s32 q7, q5, r12 // ..............*........... - // gap // .......................... - vmul.s32 q0, q4, r4 // ................*......... - vsub.u32 q5, q6, q7 // ...............*.......... - vqrdmulh.s32 q4, q4, r5 // ..................*....... 
- vadd.u32 q6, q6, q7 // .................*........ - vstrw.u32 q5, [r0, #192] // ...................*...... - vmla.s32 q0, q4, r12 // ....................*..... - vstrw.u32 q6, [r0, #128] // .....................*.... - vadd.u32 q4, q1, q0 // ......................*... - vstrw.u32 q4, [r0] , #16 // .......................*.. - vsub.u32 q4, q1, q0 // ........................*. - vstrw.u32 q4, [r0, #48] // .........................* - - // original source code - // vqrdmulh.s32 q0, q0, r3 // *......................... - // vldrw.u32 q6, [r0, #128] // .*........................ - // vqrdmulh.s32 q4, q6, r3 // ....*..................... - // vldrw.u32 q7, [r0] // .........*................ - // vmla.s32 q2, q0, r12 // ..*....................... - // vldrw.u32 q1, [r0, #64] // ...*...................... - // vmul.s32 q5, q6, r2 // ......*................... - // vsub.u32 q0, q1, q2 // .....*.................... - // vmla.s32 q5, q4, r12 // ........*................. - // vadd.u32 q1, q1, q2 // .......*.................. - // vmul.s32 q2, q0, r6 // ..........*............... - // vadd.u32 q4, q7, q5 // ...........*.............. - // vqrdmulh.s32 q3, q0, r7 // ............*............. - // vsub.u32 q5, q7, q5 // .............*............ - // vmla.s32 q2, q3, r12 // ..............*........... - // vsub.u32 q7, q5, q2 // ................*......... - // vmul.s32 q3, q1, r4 // ...............*.......... - // vadd.u32 q5, q5, q2 // ..................*....... - // vqrdmulh.s32 q2, q1, r5 // .................*........ - // vstrw.u32 q7, [r0, #192] // ...................*...... - // vmla.s32 q3, q2, r12 // ....................*..... - // vstrw.u32 q5, [r0, #128] // .....................*.... - // vadd.u32 q2, q4, q3 // ......................*... - // vstrw.u32 q2, [r0] , #16 // .......................*.. - // vsub.u32 q4, q4, q3 // ........................*. 
- // vstrw.u32 q4, [r0, #48] // .........................* - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - /* Layers 5,6 */ - sub in, in, #(4*256) - - mov lr, #16 -.p2align 2 -layer56_loop: - ldrd r9, r3, [r11] , #24 // *.............................. - vldrw.u32 q3, [r0, #48] // ......*........................ - vqrdmulh.s32 q4, q3, r3 // .............*................. - vldrw.u32 q1, [r0, #32] // .....*......................... - vmul.s32 q3, q3, r9 // ............*.................. - vldrw.u32 q5, [r0, #16] // ....*.......................... - vmla.s32 q3, q4, r12 // ..............*................ - ldrd r10, r6, [r11, #-8] // ..*............................ - vmul.s32 q6, q1, r9 // .......*....................... - vsub.u32 q4, q5, q3 // ...............*............... - vqrdmulh.s32 q0, q1, r3 // ........*...................... - vadd.u32 q5, q5, q3 // ................*.............. - vmla.s32 q6, q0, r12 // .........*..................... - vldrw.u32 q7, [r0] // ...*........................... - vqrdmulh.s32 q1, q4, r6 // .......................*....... - vsub.u32 q3, q7, q6 // ..........*.................... - vmul.s32 q0, q4, r10 // ......................*........ - vadd.u32 q7, q7, q6 // ...........*................... - vmla.s32 q0, q1, r12 // ........................*...... - ldrd r5, r10, [r11, #-16] // .*............................. - vadd.u32 q4, q3, q0 // ..........................*.... - vmul.s32 q6, q5, r5 // .................*............. - vsub.u32 q0, q3, q0 // .........................*..... - vqrdmulh.s32 q2, q5, r10 // ..................*............ - vstrw.u32 q0, [r0, #48] // ..............................* - vmla.s32 q6, q2, r12 // ...................*........... - vstrw.u32 q4, [r0, #32] // .............................*. - vsub.u32 q0, q7, q6 // ....................*.......... - vstrw.u32 q0, [r0, #16] // ............................*.. - vadd.u32 q4, q7, q6 // .....................*......... 
- vstrw.u32 q4, [r0] , #64 // ...........................*... - - // original source code - // ldrd r2, r3, [r11], #+24 // *.............................. - // ldrd r4, r5, [r11, #(-16)] // ...................*........... - // ldrd r6, r7, [r11, #(-8)] // .......*....................... - // vldrw.u32 q0, [r0] // .............*................. - // vldrw.u32 q1, [r0, #(4*1*4)] // .....*......................... - // vldrw.u32 q2, [r0, #(4*2*4)] // ...*........................... - // vldrw.u32 q3, [r0, #(4*3*4)] // .*............................. - // vmul.s32 q4, q2, r2 // ........*...................... - // vqrdmulh.s32 q2, q2, r3 // ..........*.................... - // vmla.s32 q4, q2, r12 // ............*.................. - // vsub.u32 q2, q0, q4 // ...............*............... - // vadd.u32 q0, q0, q4 // .................*............. - // vmul.s32 q4, q3, r2 // ....*.......................... - // vqrdmulh.s32 q3, q3, r3 // ..*............................ - // vmla.s32 q4, q3, r12 // ......*........................ - // vsub.u32 q3, q1, q4 // .........*..................... - // vadd.u32 q1, q1, q4 // ...........*................... - // vmul.s32 q4, q1, r4 // .....................*......... - // vqrdmulh.s32 q1, q1, r5 // .......................*....... - // vmla.s32 q4, q1, r12 // .........................*..... - // vsub.u32 q1, q0, q4 // ...........................*... - // vadd.u32 q0, q0, q4 // .............................*. - // vmul.s32 q4, q3, r6 // ................*.............. - // vqrdmulh.s32 q3, q3, r7 // ..............*................ - // vmla.s32 q4, q3, r12 // ..................*............ - // vsub.u32 q3, q2, q4 // ......................*........ - // vadd.u32 q2, q2, q4 // ....................*.......... - // vstrw.u32 q0, [r0], #64 // ..............................* - // vstrw.u32 q1, [r0, #(-64+16)] // ............................*.. - // vstrw.u32 q2, [r0, #(-64+32)] // ..........................*.... 
- // vstrw.u32 q3, [r0, #(-64+48)] // ........................*...... - - le lr, layer56_loop - layer56_loop_end: - sub r0, r0, #(4*256) // *........ - vld40.u32 {q4,q5,q6,q7}, [r0] // ..*...... - mov r14, #16 // .*....... - vld41.u32 {q4,q5,q6,q7}, [r0] // ...*..... - sub r14, r14, #1 // ........* - vld42.u32 {q4,q5,q6,q7}, [r0] // ....*.... - // gap // ......... - vld43.u32 {q4,q5,q6,q7}, [r0]! // .....*... - // gap // ......... - vldrw.u32 q2, [r11, #16] // ......*.. - // gap // ......... - vqrdmulh.s32 q1, q7, q2 // .......*. - - // original source code - // sub r0, r0, #(4*256) // *........ - // mov r14, #16 // ..*...... - // vld40.u32 {q4,q5,q6,q7}, [r0] // .*....... - // vld41.u32 {q4,q5,q6,q7}, [r0] // ...*..... - // vld42.u32 {q4,q5,q6,q7}, [r0] // .....*... - // vld43.u32 {q4,q5,q6,q7}, [r0]! // ......*.. - // vldrw.u32 q2, [r11, #16] // .......*. - // vqrdmulh.s32 q1, q7, q2 // ........* - // sub r14, r14, #1 // ....*.... - - layer78_loop: - - vqrdmulh.s32 q0, q6, q2 // .......*.......................... - vldrw.u32 q3, [r11] , #96 // ....*............................. - vmul.s32 q7, q7, q3 // ...........*...................... - vldrw.u32 q2, [r11, #-48] // .................*................ - vmla.s32 q7, q1, r12 // .............*.................... - vldrw.u32 q1, [r11, #-64] // ................*................. - vmul.s32 q6, q6, q3 // ......*........................... - vsub.u32 q3, q5, q7 // ..............*................... - vmla.s32 q6, q0, r12 // ........*......................... - vadd.u32 q7, q5, q7 // ...............*.................. - vqrdmulh.s32 q2, q7, q2 // ...................*.............. - vsub.u32 q0, q4, q6 // .........*........................ - vmul.s32 q5, q7, q1 // ..................*............... - vadd.u32 q4, q4, q6 // ..........*....................... - vmla.s32 q5, q2, r12 // ....................*............. - vldrw.u32 q1, [r11, #-32] // .......................*.......... 
- vsub.u32 q7, q4, q5 // .....................*............ - vstrw.u32 q7, [r0, #-48] // ...............................*.. - vadd.u32 q2, q4, q5 // ......................*........... - vld40.u32 {q4,q5,q6,q7}, [r0] // e................................. - vstrw.u32 q2, [r0, #-64] // ..............................*... - vmul.s32 q2, q3, q1 // .........................*........ - vldrw.u32 q1, [r11, #-16] // ........................*......... - vqrdmulh.s32 q1, q3, q1 // ..........................*....... - vld41.u32 {q4,q5,q6,q7}, [r0] // .e................................ - vmla.s32 q2, q1, r12 // ...........................*...... - vld42.u32 {q4,q5,q6,q7}, [r0] // ..e............................... - vadd.u32 q3, q0, q2 // .............................*.... - vld43.u32 {q4,q5,q6,q7}, [r0]! // ...e.............................. - vstrw.u32 q3, [r0, #-96] // ................................*. - vsub.u32 q3, q0, q2 // ............................*..... - vldrw.u32 q2, [r11, #16] // .....e............................ - vqrdmulh.s32 q1, q7, q2 // ............e..................... - vstrw.u32 q3, [r0, #-80] // .................................* - - // original source code - // vld40.u32 {q0, q1, q2, q3}, [r0] // e..............|..................e.............. - // vld41.u32 {q0, q1, q2, q3}, [r0] // .....e.........|.......................e......... - // vld42.u32 {q0, q1, q2, q3}, [r0] // .......e.......|.........................e....... - // vld43.u32 {q0, q1, q2, q3}, [r0]! // .........e.....|...........................e..... - // vldrw.u32 q5, [r11], #+96 // ...............|*................................ - // vldrw.u32 q6, [r11, #(+16-96)] // ............e..|..............................e.. - // vmul.s32 q4, q2, q5 // ...............|.....*........................... - // vqrdmulh.s32 q2, q2, q6 // ...............*................................. - // vmla.s32 q4, q2, r12 // ...............|.......*......................... 
- // vsub.u32 q2, q0, q4 // ...............|..........*...................... - // vadd.u32 q0, q0, q4 // ...............|............*.................... - // vmul.s32 q4, q3, q5 // ...............|.*............................... - // vqrdmulh.s32 q3, q3, q6 // .............e.|...............................e. - // vmla.s32 q4, q3, r12 // ...............|...*............................. - // vsub.u32 q3, q1, q4 // ...............|......*.......................... - // vadd.u32 q1, q1, q4 // ...............|........*........................ - // vldrw.u32 q5, [r11, #(32 - 96)] // ...............|....*............................ - // vldrw.u32 q6, [r11, #(48 - 96)] // ...............|..*.............................. - // vmul.s32 q4, q1, q5 // ...............|...........*..................... - // vqrdmulh.s32 q1, q1, q6 // ...............|.........*....................... - // vmla.s32 q4, q1, r12 // ...............|.............*................... - // vsub.u32 q1, q0, q4 // ...............|...............*................. - // vadd.u32 q0, q0, q4 // ...............|.................*............... - // vldrw.u32 q5, [r11, #(64-96)] // ...............|..............*.................. - // vldrw.u32 q6, [r11, #(80-96)] // ...*...........|.....................*........... - // vmul.s32 q4, q3, q5 // ..*............|....................*............ - // vqrdmulh.s32 q3, q3, q6 // ....*..........|......................*.......... - // vmla.s32 q4, q3, r12 // ......*........|........................*........ - // vsub.u32 q3, q2, q4 // ...........*...|.............................*... - // vadd.u32 q2, q2, q4 // ........*......|..........................*...... - // vstrw.32 q0, [r0, #( 0 - 64)] // .*.............|...................*............. - // vstrw.32 q1, [r0, #(16 - 64)] // ...............|................*................ - // vstrw.32 q2, [r0, #(32 - 64)] // ..........*....|............................*.... 
- // vstrw.32 q3, [r0, #(48 - 64)] // ..............*|................................* - - le lr, layer78_loop - vqrdmulh.s32 q3, q6, q2 // *........................... - vldrw.u32 q2, [r11] , #96 // .*.......................... - vmul.s32 q0, q7, q2 // ..*......................... - vldrw.u32 q7, [r11, #-48] // ...*........................ - vmla.s32 q0, q1, r12 // ....*....................... - vldrw.u32 q1, [r11, #-64] // .....*...................... - vmul.s32 q2, q6, q2 // ......*..................... - vadd.u32 q6, q5, q0 // .........*.................. - vmla.s32 q2, q3, r12 // ........*................... - vsub.u32 q3, q5, q0 // .......*.................... - vmul.s32 q5, q6, q1 // ............*............... - vadd.u32 q0, q4, q2 // .............*.............. - vqrdmulh.s32 q6, q6, q7 // ..........*................. - vsub.u32 q7, q4, q2 // ...........*................ - vmla.s32 q5, q6, r12 // ..............*............. - vldrw.u32 q6, [r11, #-32] // ...............*............ - vadd.u32 q4, q0, q5 // ..................*......... - vstrw.u32 q4, [r0, #-64] // ...................*........ - vmul.s32 q6, q3, q6 // ....................*....... - vldrw.u32 q4, [r11, #-16] // .....................*...... - vqrdmulh.s32 q4, q3, q4 // ......................*..... - vsub.u32 q0, q0, q5 // ................*........... - vmla.s32 q6, q4, r12 // .......................*.... - vstrw.u32 q0, [r0, #-48] // .................*.......... - vsub.u32 q4, q7, q6 // ..........................*. - vstrw.u32 q4, [r0, #-16] // ...........................* - vadd.u32 q4, q7, q6 // ........................*... - vstrw.u32 q4, [r0, #-32] // .........................*.. - - // original source code - // vqrdmulh.s32 q0, q6, q2 // *........................... - // vldrw.u32 q3, [r11] , #96 // .*.......................... - // vmul.s32 q7, q7, q3 // ..*......................... - // vldrw.u32 q2, [r11, #-48] // ...*........................ 
- // vmla.s32 q7, q1, r12 // ....*....................... - // vldrw.u32 q1, [r11, #-64] // .....*...................... - // vmul.s32 q6, q6, q3 // ......*..................... - // vsub.u32 q3, q5, q7 // .........*.................. - // vmla.s32 q6, q0, r12 // ........*................... - // vadd.u32 q7, q5, q7 // .......*.................... - // vqrdmulh.s32 q2, q7, q2 // ............*............... - // vsub.u32 q0, q4, q6 // .............*.............. - // vmul.s32 q5, q7, q1 // ..........*................. - // vadd.u32 q4, q4, q6 // ...........*................ - // vmla.s32 q5, q2, r12 // ..............*............. - // vldrw.u32 q1, [r11, #-32] // ...............*............ - // vsub.u32 q7, q4, q5 // .....................*...... - // vstrw.u32 q7, [r0, #-48] // .......................*.... - // vadd.u32 q2, q4, q5 // ................*........... - // vstrw.u32 q2, [r0, #-64] // .................*.......... - // vmul.s32 q2, q3, q1 // ..................*......... - // vldrw.u32 q1, [r11, #-16] // ...................*........ - // vqrdmulh.s32 q1, q3, q1 // ....................*....... - // vmla.s32 q2, q1, r12 // ......................*..... - // vadd.u32 q3, q0, q2 // ..........................*. - // vstrw.u32 q3, [r0, #-32] // ...........................* - // vsub.u32 q3, q0, q2 // ........................*... - // vstrw.u32 q3, [r0, #-16] // .........................*.. 
- - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m55.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m55.s deleted file mode 100644 index 02989c4..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m55.s +++ /dev/null @@ -1,670 +0,0 @@ -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.data -roots: -#include "ntt_dilithium_12_34_56_78_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s32 \dst, \src, \const - vqrdmulh.s32 \src, \src, \const_twisted - vmla.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_dilithium_12_34_56_78_opt_m55, %function -.global ntt_dilithium_12_34_56_78_opt_m55 -ntt_dilithium_12_34_56_78_opt_m55: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -8380417 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*128) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 1-2 */ - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #16 - vldrw.u32 q2, [r1, #256] // .*... - vmul.s32 q7, q2, r2 // ...*. - // gap // ..... - vqrdmulh.s32 q4, q2, r3 // ..*.. - vldrw.u32 q2, [r0, #256] // *.... - vmla.s32 q7, q4, r12 // ....* - - // original source code - // vldrw.u32 q2, [r0, #256] // ...*. - // vldrw.u32 q7, [r1, #256] // *.... - // vqrdmulh.s32 q0, q7, r3 // ..*.. - // vmul.s32 q7, q7, r2 // .*... - // vmla.s32 q7, q0, r12 // ....* - - sub lr, lr, #1 -.p2align 2 -layer12_loop: - vsub.u32 q4, q2, q7 // ............*............... - vmul.s32 q1, q4, r6 // ...................*........ - vldrw.u32 q5, [r1] // ..*......................... - vqrdmulh.s32 q6, q5, r3 // .....*...................... 
- vadd.u32 q0, q2, q7 // .............*.............. - vmul.s32 q7, q5, r2 // ....*....................... - vldrw.u32 q3, [r0] // *........................... - vmla.s32 q7, q6, r12 // ......*..................... - vldrw.u32 q2, [r0, #272] // .e.......................... - vqrdmulh.s32 q6, q0, r5 // ...............*............ - vsub.u32 q5, q3, q7 // .......*.................... - vmul.s32 q0, q0, r4 // ..............*............. - vadd.u32 q3, q3, q7 // ........*................... - vmla.s32 q0, q6, r12 // ................*........... - vldrw.u32 q7, [r1, #272] // ...e........................ - vqrdmulh.s32 q4, q4, r7 // ....................*....... - vadd.u32 q6, q3, q0 // ..................*......... - vmla.s32 q1, q4, r12 // .....................*...... - vsub.u32 q3, q3, q0 // .................*.......... - vqrdmulh.s32 q0, q7, r3 // ..........e................. - vsub.u32 q4, q5, q1 // ......................*..... - vstrw.u32 q4, [r1, #256] // ...........................* - vadd.u32 q5, q5, q1 // .......................*.... - vstrw.u32 q5, [r1] , #16 // ..........................*. - vmul.s32 q7, q7, r2 // .........e.................. - vstrw.u32 q6, [r0] , #16 // ........................*... - vmla.s32 q7, q0, r12 // ...........e................ - vstrw.u32 q3, [r0, #240] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // ....................|.....*..................... - // vldrw.u32 q1, [r0, #(4*64)] // e...................|.......e................... - // vldrw.u32 q2, [r1] // ....................|.*......................... - // vldrw.u32 q3, [r1, #(4*64)] // ......e.............|.............e............. - // vmul.s32 q4, q2, r2 // ....................|....*...................... - // vqrdmulh.s32 q2, q2, r3 // ....................|..*........................ - // vmla.s32 q4, q2, r12 // ....................|......*.................... 
- // vsub.u32 q2, q0, q4 // ..*.................|.........*................. - // vadd.u32 q0, q0, q4 // ....*...............|...........*............... - // vmul.s32 q4, q3, r2 // ................e...|.......................e... - // vqrdmulh.s32 q3, q3, r3 // ...........e........|..................e........ - // vmla.s32 q4, q3, r12 // ..................e.|.........................e. - // vsub.u32 q3, q1, q4 // ....................*........................... - // vadd.u32 q1, q1, q4 // ....................|...*....................... - // vmul.s32 q4, q1, r4 // ...*................|..........*................ - // vqrdmulh.s32 q1, q1, r5 // .*..................|........*.................. - // vmla.s32 q4, q1, r12 // .....*..............|............*.............. - // vsub.u32 q1, q0, q4 // ..........*.........|.................*......... - // vadd.u32 q0, q0, q4 // ........*...........|...............*........... - // vmul.s32 q4, q3, r6 // ....................|*.......................... - // vqrdmulh.s32 q3, q3, r7 // .......*............|..............*............ - // vmla.s32 q4, q3, r12 // .........*..........|................*.......... - // vsub.u32 q3, q2, q4 // ............*.......|...................*....... - // vadd.u32 q2, q2, q4 // ..............*.....|.....................*..... - // vstrw.u32 q0, [r0], #16 // .................*..|........................*.. - // vstrw.u32 q1, [r0, #(4*64 - 16)] // ...................*|..........................* - // vstrw.u32 q2, [r1], #16 // ...............*....|......................*.... - // vstrw.u32 q3, [r1, #(4*64-16)] // .............*......|....................*...... - - le lr, layer12_loop - vldrw.u32 q0, [r1] // ..*.................... - vqrdmulh.s32 q3, q0, r3 // ...*................... - vadd.u32 q1, q2, q7 // ....*.................. - vmul.s32 q4, q0, r2 // .....*................. - vsub.u32 q7, q2, q7 // *...................... - vmla.s32 q4, q3, r12 // .......*............... 
- vldrw.u32 q0, [r0] // ......*................ - vqrdmulh.s32 q3, q1, r5 // ........*.............. - vsub.u32 q5, q0, q4 // .........*............. - vmul.s32 q2, q1, r4 // ..........*............ - // gap // ....................... - vmla.s32 q2, q3, r12 // ............*.......... - vadd.u32 q1, q0, q4 // ...........*........... - vqrdmulh.s32 q0, q7, r7 // .............*......... - vadd.u32 q3, q1, q2 // ..............*........ - vmul.s32 q4, q7, r6 // .*..................... - vstrw.u32 q3, [r0] , #16 // .....................*. - vmla.s32 q4, q0, r12 // ...............*....... - vsub.u32 q3, q1, q2 // ................*...... - vstrw.u32 q3, [r0, #240] // ......................* - vadd.u32 q0, q5, q4 // ...................*... - vstrw.u32 q0, [r1] , #16 // ....................*.. - vsub.u32 q5, q5, q4 // .................*..... - vstrw.u32 q5, [r1, #240] // ..................*.... - - // original source code - // vsub.u32 q4, q2, q7 // ....*.................. - // vmul.s32 q1, q4, r6 // ..............*........ - // vldrw.u32 q5, [r1] // *...................... - // vqrdmulh.s32 q6, q5, r3 // .*..................... - // vadd.u32 q0, q2, q7 // ..*.................... - // vmul.s32 q7, q5, r2 // ...*................... - // vldrw.u32 q3, [r0] // ......*................ - // vmla.s32 q7, q6, r12 // .....*................. - // vqrdmulh.s32 q6, q0, r5 // .......*............... - // vsub.u32 q5, q3, q7 // ........*.............. - // vmul.s32 q0, q0, r4 // .........*............. - // vadd.u32 q3, q3, q7 // ...........*........... - // vmla.s32 q0, q6, r12 // ..........*............ - // vqrdmulh.s32 q4, q4, r7 // ............*.......... - // vadd.u32 q6, q3, q0 // .............*......... - // vmla.s32 q1, q4, r12 // ................*...... - // vsub.u32 q3, q3, q0 // .................*..... - // vsub.u32 q4, q5, q1 // .....................*. - // vstrw.u32 q4, [r1, #256] // ......................* - // vadd.u32 q5, q5, q1 // ...................*... 
- // vstrw.u32 q5, [r1] , #16 // ....................*.. - // vstrw.u32 q6, [r0] , #16 // ...............*....... - // vstrw.u32 q3, [r0, #240] // ..................*.... - - - .unreq in_high - .unreq in_low - in .req r0 - - /* Layers 3,4 */ - sub in, in, #(64*4) - - // 4 butterfly blocks per root config, 4 root configs - // loop over root configs - - count .req r1 - mov count, #4 - -out_start: - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #4 - vldrw.u32 q2, [r0, #192] // .*... - vqrdmulh.s32 q4, q2, r3 // ..*.. - // gap // ..... - vmul.s32 q7, q2, r2 // ...*. - vldrw.u32 q2, [r0, #64] // *.... - vmla.s32 q7, q4, r12 // ....* - - // original source code - // vldrw.u32 q2, [r0, #64] // ...*. - // vldrw.u32 q7, [r0, #192] // *.... - // vqrdmulh.s32 q0, q7, r3 // .*... - // vmul.s32 q7, q7, r2 // ..*.. - // vmla.s32 q7, q0, r12 // ....* - - sub lr, lr, #1 -.p2align 2 -layer34_loop: - vsub.u32 q4, q2, q7 // ............*............... - vmul.s32 q1, q4, r6 // ...................*........ - vldrw.u32 q5, [r0, #128] // ..*......................... - vqrdmulh.s32 q6, q5, r3 // .....*...................... - vadd.u32 q0, q2, q7 // .............*.............. - vmul.s32 q7, q5, r2 // ....*....................... - vldrw.u32 q3, [r0] // *........................... - vmla.s32 q7, q6, r12 // ......*..................... - vldrw.u32 q2, [r0, #80] // .e.......................... - vqrdmulh.s32 q6, q0, r5 // ...............*............ - vsub.u32 q5, q3, q7 // .......*.................... - vmul.s32 q0, q0, r4 // ..............*............. - vadd.u32 q3, q3, q7 // ........*................... - vmla.s32 q0, q6, r12 // ................*........... - vldrw.u32 q7, [r0, #208] // ...e........................ - vqrdmulh.s32 q4, q4, r7 // ....................*....... - vadd.u32 q6, q3, q0 // ..................*......... - vmla.s32 q1, q4, r12 // .....................*...... 
- vsub.u32 q3, q3, q0 // .................*.......... - vqrdmulh.s32 q0, q7, r3 // ..........e................. - vsub.u32 q4, q5, q1 // ......................*..... - vstrw.u32 q4, [r0, #192] // ...........................* - vadd.u32 q5, q5, q1 // .......................*.... - vstrw.u32 q5, [r0, #128] // ..........................*. - vmul.s32 q7, q7, r2 // .........e.................. - vstrw.u32 q6, [r0] , #16 // ........................*... - vmla.s32 q7, q0, r12 // ...........e................ - vstrw.u32 q3, [r0, #48] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // ....................|.....*..................... - // vldrw.u32 q1, [r0, #(4*1*16)] // e...................|.......e................... - // vldrw.u32 q2, [r0, #(4*2*16)] // ....................|.*......................... - // vldrw.u32 q3, [r0, #(4*3*16)] // ......e.............|.............e............. - // vmul.s32 q4, q2, r2 // ....................|....*...................... - // vqrdmulh.s32 q2, q2, r3 // ....................|..*........................ - // vmla.s32 q4, q2, r12 // ....................|......*.................... - // vsub.u32 q2, q0, q4 // ..*.................|.........*................. - // vadd.u32 q0, q0, q4 // ....*...............|...........*............... - // vmul.s32 q4, q3, r2 // ................e...|.......................e... - // vqrdmulh.s32 q3, q3, r3 // ...........e........|..................e........ - // vmla.s32 q4, q3, r12 // ..................e.|.........................e. - // vsub.u32 q3, q1, q4 // ....................*........................... - // vadd.u32 q1, q1, q4 // ....................|...*....................... - // vmul.s32 q4, q1, r4 // ...*................|..........*................ - // vqrdmulh.s32 q1, q1, r5 // .*..................|........*.................. - // vmla.s32 q4, q1, r12 // .....*..............|............*.............. 
- // vsub.u32 q1, q0, q4 // ..........*.........|.................*......... - // vadd.u32 q0, q0, q4 // ........*...........|...............*........... - // vmul.s32 q4, q3, r6 // ....................|*.......................... - // vqrdmulh.s32 q3, q3, r7 // .......*............|..............*............ - // vmla.s32 q4, q3, r12 // .........*..........|................*.......... - // vsub.u32 q3, q2, q4 // ............*.......|...................*....... - // vadd.u32 q2, q2, q4 // ..............*.....|.....................*..... - // vstrw.u32 q0, [r0], #16 // .................*..|........................*.. - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ...................*|..........................* - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // ...............*....|......................*.... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // .............*......|....................*...... - - le lr, layer34_loop - vldrw.u32 q0, [r0, #128] // ..*.................... - vqrdmulh.s32 q3, q0, r3 // ...*................... - vadd.u32 q1, q2, q7 // ....*.................. - vmul.s32 q4, q0, r2 // .....*................. - vsub.u32 q7, q2, q7 // *...................... - vmla.s32 q4, q3, r12 // .......*............... - vldrw.u32 q0, [r0] // ......*................ - vqrdmulh.s32 q3, q1, r5 // ........*.............. - vsub.u32 q5, q0, q4 // .........*............. - vmul.s32 q2, q1, r4 // ..........*............ - // gap // ....................... - vmla.s32 q2, q3, r12 // ............*.......... - vadd.u32 q1, q0, q4 // ...........*........... - vqrdmulh.s32 q0, q7, r7 // .............*......... - vadd.u32 q3, q1, q2 // ..............*........ - vmul.s32 q4, q7, r6 // .*..................... - vstrw.u32 q3, [r0] , #16 // .....................*. - vmla.s32 q4, q0, r12 // ...............*....... - vsub.u32 q3, q1, q2 // ................*...... - vstrw.u32 q3, [r0, #48] // ......................* - vadd.u32 q0, q5, q4 // ...................*... 
- vstrw.u32 q0, [r0, #112] // ....................*.. - vsub.u32 q5, q5, q4 // .................*..... - vstrw.u32 q5, [r0, #176] // ..................*.... - - // original source code - // vsub.u32 q4, q2, q7 // ....*.................. - // vmul.s32 q1, q4, r6 // ..............*........ - // vldrw.u32 q5, [r0, #128] // *...................... - // vqrdmulh.s32 q6, q5, r3 // .*..................... - // vadd.u32 q0, q2, q7 // ..*.................... - // vmul.s32 q7, q5, r2 // ...*................... - // vldrw.u32 q3, [r0] // ......*................ - // vmla.s32 q7, q6, r12 // .....*................. - // vqrdmulh.s32 q6, q0, r5 // .......*............... - // vsub.u32 q5, q3, q7 // ........*.............. - // vmul.s32 q0, q0, r4 // .........*............. - // vadd.u32 q3, q3, q7 // ...........*........... - // vmla.s32 q0, q6, r12 // ..........*............ - // vqrdmulh.s32 q4, q4, r7 // ............*.......... - // vadd.u32 q6, q3, q0 // .............*......... - // vmla.s32 q1, q4, r12 // ................*...... - // vsub.u32 q3, q3, q0 // .................*..... - // vsub.u32 q4, q5, q1 // .....................*. - // vstrw.u32 q4, [r0, #192] // ......................* - // vadd.u32 q5, q5, q1 // ...................*... - // vstrw.u32 q5, [r0, #128] // ....................*.. - // vstrw.u32 q6, [r0] , #16 // ...............*....... - // vstrw.u32 q3, [r0, #48] // ..................*.... - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - /* Layers 5,6 */ - sub in, in, #(4*256) - - mov lr, #16 - ldrd r7, r3, [r11] , #24 // ...*... - vldrw.u32 q5, [r0, #48] // *...... - vmul.s32 q7, q5, r7 // .....*. - ldrd r5, r6, [r11, #-8] // ..*.... - vqrdmulh.s32 q4, q5, r3 // ....*.. - vldrw.u32 q5, [r0, #16] // .*..... - vmla.s32 q7, q4, r12 // ......* - - // original source code - // vldrw.u32 q7, [r0, #48] // .*..... - // vldrw.u32 q5, [r0, #16] // .....*. - // ldrd r5, r6, [r11, #16] // ...*... - // ldrd r7, r3, [r11] , #24 // *...... 
- // vqrdmulh.s32 q6, q7, r3 // ....*.. - // vmul.s32 q7, q7, r7 // ..*.... - // vmla.s32 q7, q6, r12 // ......* - - sub lr, lr, #1 -.p2align 2 -layer56_loop: - vsub.u32 q0, q5, q7 // ...............*............... - vqrdmulh.s32 q3, q0, r6 // .......................*....... - vldrw.u32 q6, [r0, #32] // .....*......................... - vmul.s32 q4, q6, r7 // .......*....................... - ldrd r7, r8, [r11, #-16] // .*............................. - vqrdmulh.s32 q6, q6, r3 // ........*...................... - vadd.u32 q2, q5, q7 // ................*.............. - vmla.s32 q4, q6, r12 // .........*..................... - vldrw.u32 q1, [r0] // ...*........................... - vmul.s32 q5, q0, r5 // ......................*........ - vsub.u32 q0, q1, q4 // ..........*.................... - vmla.s32 q5, q3, r12 // ........................*...... - vadd.u32 q1, q1, q4 // ...........*................... - vmul.s32 q4, q2, r7 // .................*............. - vsub.u32 q3, q0, q5 // .........................*..... - vqrdmulh.s32 q6, q2, r8 // ..................*............ - vadd.u32 q2, q0, q5 // ..........................*.... - vmla.s32 q4, q6, r12 // ...................*........... - vldrw.u32 q7, [r0, #112] // ......e........................ - vadd.u32 q0, q1, q4 // .....................*......... - vldrw.u32 q5, [r0, #80] // ....e.......................... - vsub.u32 q1, q1, q4 // ....................*.......... - ldrd r5, r6, [r11, #16] // ..e............................ - ldrd r7, r3, [r11] , #24 // e.............................. - vst40.u32 {q0,q1,q2,q3}, [r0] // ...........................*... - vqrdmulh.s32 q6, q7, r3 // .............e................. - vst41.u32 {q0,q1,q2,q3}, [r0] // ............................*.. - vmul.s32 q7, q7, r7 // ............e.................. - vst42.u32 {q0,q1,q2,q3}, [r0] // .............................*. - vmla.s32 q7, q6, r12 // ..............e................ - vst43.u32 {q0,q1,q2,q3}, [r0]! 
// ..............................* - - // original source code - // ldrd r2, r3, [r11], #+24 // .....e.......|......................e....... - // ldrd r4, r5, [r11, #(-16)] // .............|...*.......................... - // ldrd r6, r7, [r11, #(-8)] // ....e........|.....................e........ - // vldrw.u32 q0, [r0] // .............|.......*...................... - // vldrw.u32 q1, [r0, #(4*1*4)] // ..e..........|...................e.......... - // vldrw.u32 q2, [r0, #(4*2*4)] // .............|.*............................ - // vldrw.u32 q3, [r0, #(4*3*4)] // e............|.................e............ - // vmul.s32 q4, q2, r2 // .............|..*........................... - // vqrdmulh.s32 q2, q2, r3 // .............|....*......................... - // vmla.s32 q4, q2, r12 // .............|......*....................... - // vsub.u32 q2, q0, q4 // .............|.........*.................... - // vadd.u32 q0, q0, q4 // .............|...........*.................. - // vmul.s32 q4, q3, r2 // .........e...|..........................e... - // vqrdmulh.s32 q3, q3, r3 // .......e.....|........................e..... - // vmla.s32 q4, q3, r12 // ...........e.|............................e. - // vsub.u32 q3, q1, q4 // .............*.............................. - // vadd.u32 q1, q1, q4 // .............|.....*........................ - // vmul.s32 q4, q1, r4 // .............|............*................. - // vqrdmulh.s32 q1, q1, r5 // .............|..............*............... - // vmla.s32 q4, q1, r12 // .............|................*............. - // vsub.u32 q1, q0, q4 // ...*.........|....................*......... - // vadd.u32 q0, q0, q4 // .*...........|..................*........... - // vmul.s32 q4, q3, r6 // .............|........*..................... - // vqrdmulh.s32 q3, q3, r7 // .............|*............................. - // vmla.s32 q4, q3, r12 // .............|..........*................... 
- // vsub.u32 q3, q2, q4 // .............|.............*................ - // vadd.u32 q2, q2, q4 // .............|...............*.............. - // vst40.u32 {q0, q1, q2, q3}, [r0] // ......*......|.......................*...... - // vst41.u32 {q0, q1, q2, q3}, [r0] // ........*....|.........................*.... - // vst42.u32 {q0, q1, q2, q3}, [r0] // ..........*..|...........................*.. - // vst43.u32 {q0, q1, q2, q3}, [r0]! // ............*|.............................* - - le lr, layer56_loop - layer56_loop_end: - vldrw.u32 q1, [r0, #32] // ..*.......................... - vqrdmulh.s32 q2, q1, r3 // .....*....................... - vldrw.u32 q0, [r0] // ........*.................... - vmul.s32 q6, q1, r7 // ...*......................... - ldrd r7, r8, [r11, #-16] // ....*........................ - vmla.s32 q6, q2, r12 // .......*..................... - vsub.u32 q3, q5, q7 // *............................ - vmul.s32 q4, q3, r5 // .........*................... - vsub.u32 q1, q0, q6 // ..........*.................. - vqrdmulh.s32 q3, q3, r6 // .*........................... - vadd.u32 q2, q5, q7 // ......*...................... - vmla.s32 q4, q3, r12 // ...........*................. - vadd.u32 q5, q0, q6 // ............*................ - vmul.s32 q6, q2, r7 // .............*............... - vsub.u32 q3, q1, q4 // ..............*.............. - vqrdmulh.s32 q0, q2, r8 // ...............*............. - vadd.u32 q2, q1, q4 // ................*............ - vmla.s32 q6, q0, r12 // .................*........... - mov r14, #16 // .........................*... - vsub.u32 q1, q5, q6 // ...................*......... - sub r14, r14, #1 // ............................* - vadd.u32 q0, q5, q6 // ..................*.......... - vldrw.u32 q4, [r11] , #96 // ..........................*.. - // gap // ............................. - vst40.u32 {q0,q1,q2,q3}, [r0] // ....................*........ - // gap // ............................. 
- vst41.u32 {q0,q1,q2,q3}, [r0] // .....................*....... - // gap // ............................. - vst42.u32 {q0,q1,q2,q3}, [r0] // ......................*...... - // gap // ............................. - vst43.u32 {q0,q1,q2,q3}, [r0]! // .......................*..... - sub r0, r0, #(4*256) // ........................*.... - // gap // ............................. - vldrw.u32 q5, [r0, #48] // ...........................*. - - // original source code - // vsub.u32 q0, q5, q7 // ......*...................... - // vqrdmulh.s32 q3, q0, r6 // .........*................... - // vldrw.u32 q6, [r0, #32] // *............................ - // vmul.s32 q4, q6, r7 // ...*......................... - // ldrd r7, r8, [r11, #-16] // ....*........................ - // vqrdmulh.s32 q6, q6, r3 // .*........................... - // vadd.u32 q2, q5, q7 // ..........*.................. - // vmla.s32 q4, q6, r12 // .....*....................... - // vldrw.u32 q1, [r0] // ..*.......................... - // vmul.s32 q5, q0, r5 // .......*..................... - // vsub.u32 q0, q1, q4 // ........*.................... - // vmla.s32 q5, q3, r12 // ...........*................. - // vadd.u32 q1, q1, q4 // ............*................ - // vmul.s32 q4, q2, r7 // .............*............... - // vsub.u32 q3, q0, q5 // ..............*.............. - // vqrdmulh.s32 q6, q2, r8 // ...............*............. - // vadd.u32 q2, q0, q5 // ................*............ - // vmla.s32 q4, q6, r12 // .................*........... - // vadd.u32 q0, q1, q4 // .....................*....... - // vsub.u32 q1, q1, q4 // ...................*......... - // vst40.u32 {q0,q1,q2,q3}, [r0] // .......................*..... - // vst41.u32 {q0,q1,q2,q3}, [r0] // ........................*.... - // vst42.u32 {q0,q1,q2,q3}, [r0] // .........................*... - // vst43.u32 {q0,q1,q2,q3}, [r0]! // ..........................*.. - // sub r0, r0, #(4*256) // ...........................*. 
- // mov r14, #16 // ..................*.......... - // vldrw.u32 q4, [r11] , #96 // ......................*...... - // vldrw.u32 q5, [r0, #48] // ............................* - // sub r14, r14, #1 // ....................*........ - - layer78_loop: - - vmul.s32 q3, q5, q4 // ...........*...................... - vldrw.u32 q6, [r0, #32] // ..*............................... - vmul.s32 q7, q6, q4 // ......*........................... - vldrw.u32 q4, [r11, #-80] // .....*............................ - vqrdmulh.s32 q0, q5, q4 // ............*..................... - vldrw.u32 q2, [r0, #16] // .*................................ - vmla.s32 q3, q0, r12 // .............*.................... - vldrw.u32 q5, [r0] // *................................. - vqrdmulh.s32 q4, q6, q4 // .......*.......................... - vadd.u32 q1, q2, q3 // ...............*.................. - vmla.s32 q7, q4, r12 // ........*......................... - vldrw.u32 q0, [r11, #-64] // ................*................. - vadd.u32 q6, q5, q7 // ..........*....................... - vmul.s32 q0, q1, q0 // ..................*............... - vldrw.u32 q4, [r11, #-48] // .................*................ - vqrdmulh.s32 q4, q1, q4 // ...................*.............. - vsub.u32 q7, q5, q7 // .........*........................ - vmla.s32 q0, q4, r12 // ....................*............. - vldrw.u32 q4, [r11] , #96 // ....e............................. - vadd.u32 q5, q6, q0 // ......................*........... - vldrw.u32 q1, [r11, #-128] // .......................*.......... - vsub.u32 q6, q6, q0 // .....................*............ - vstrw.u32 q5, [r0] , #64 // ..............................*... - vsub.u32 q5, q2, q3 // ..............*................... - vmul.s32 q0, q5, q1 // .........................*........ - vldrw.u32 q2, [r11, #-112] // ........................*......... - vqrdmulh.s32 q1, q5, q2 // ..........................*....... 
- vstrw.u32 q6, [r0, #-48] // ...............................*.. - vmla.s32 q0, q1, r12 // ...........................*...... - vldrw.u32 q5, [r0, #48] // ...e.............................. - vadd.u32 q3, q7, q0 // .............................*.... - vstrw.u32 q3, [r0, #-32] // ................................*. - vsub.u32 q3, q7, q0 // ............................*..... - vstrw.u32 q3, [r0, #-16] // .................................* - - // original source code - // vldrw.u32 q0, [r0] // ................|......*.......................... - // vldrw.u32 q1, [r0, #16] // ................|....*............................ - // vldrw.u32 q2, [r0, #32] // ................|*................................ - // vldrw.u32 q3, [r0, #48] // ...........e....|............................e.... - // vldrw.u32 q5, [r11], #+96 // e...............|.................e............... - // vldrw.u32 q6, [r11, #(+16-96)] // ................|..*.............................. - // vmul.s32 q4, q2, q5 // ................|.*............................... - // vqrdmulh.s32 q2, q2, q6 // ................|.......*......................... - // vmla.s32 q4, q2, r12 // ................|.........*....................... - // vsub.u32 q2, q0, q4 // ................|...............*................. - // vadd.u32 q0, q0, q4 // ................|...........*..................... - // vmul.s32 q4, q3, q5 // ................*................................. - // vqrdmulh.s32 q3, q3, q6 // ................|...*............................. - // vmla.s32 q4, q3, r12 // ................|.....*........................... - // vsub.u32 q3, q1, q4 // .....*..........|......................*.......... - // vadd.u32 q1, q1, q4 // ................|........*........................ - // vldrw.u32 q5, [r11, #(32 - 96)] // ................|..........*...................... - // vldrw.u32 q6, [r11, #(48 - 96)] // ................|.............*................... 
- // vmul.s32 q4, q1, q5 // ................|............*.................... - // vqrdmulh.s32 q1, q1, q6 // ................|..............*.................. - // vmla.s32 q4, q1, r12 // ................|................*................ - // vsub.u32 q1, q0, q4 // ...*............|....................*............ - // vadd.u32 q0, q0, q4 // .*..............|..................*.............. - // vldrw.u32 q5, [r11, #(64-96)] // ..*.............|...................*............. - // vldrw.u32 q6, [r11, #(80-96)] // .......*........|........................*........ - // vmul.s32 q4, q3, q5 // ......*.........|.......................*......... - // vqrdmulh.s32 q3, q3, q6 // ........*.......|.........................*....... - // vmla.s32 q4, q3, r12 // ..........*.....|...........................*..... - // vsub.u32 q3, q2, q4 // ..............*.|...............................*. - // vadd.u32 q2, q2, q4 // ............*...|.............................*... - // vstrw.u32 q0, [r0], #64 // ....*...........|.....................*........... - // vstrw.u32 q1, [r0, #-48] // .........*......|..........................*...... - // vstrw.u32 q2, [r0, #-32] // .............*..|..............................*.. - // vstrw.u32 q3, [r0, #-16] // ...............*|................................* - - le lr, layer78_loop - vmul.s32 q7, q5, q4 // *............................... - vldrw.u32 q2, [r11, #-80] // ...*............................ - vqrdmulh.s32 q0, q5, q2 // ....*........................... - vldrw.u32 q1, [r0, #32] // .*.............................. - vmla.s32 q7, q0, r12 // ......*......................... - vldrw.u32 q5, [r0, #16] // .....*.......................... - vmul.s32 q4, q1, q4 // ..*............................. - vldrw.u32 q3, [r11, #-16] // ........................*....... - vqrdmulh.s32 q2, q1, q2 // ........*....................... - vsub.u32 q0, q5, q7 // ......................*......... 
- vqrdmulh.s32 q3, q0, q3 // .........................*...... - vldrw.u32 q1, [r11, #-32] // ...................*............ - vmul.s32 q0, q0, q1 // .......................*........ - vldrw.u32 q6, [r11, #-48] // ..............*................. - vmla.s32 q4, q2, r12 // ..........*..................... - vldrw.u32 q1, [r0] // .......*........................ - vsub.u32 q2, q1, q4 // ................*............... - vmla.s32 q0, q3, r12 // ...........................*.... - vadd.u32 q5, q5, q7 // .........*...................... - vqrdmulh.s32 q3, q5, q6 // ...............*................ - vldrw.u32 q6, [r11, #-64] // ...........*.................... - vmul.s32 q5, q5, q6 // .............*.................. - vadd.u32 q1, q1, q4 // ............*................... - vmla.s32 q5, q3, r12 // .................*.............. - vadd.u32 q3, q2, q0 // ............................*... - vstrw.u32 q3, [r0, #32] // .............................*.. - vsub.u32 q3, q1, q5 // ....................*........... - vstrw.u32 q3, [r0, #16] // ..........................*..... - vsub.u32 q3, q2, q0 // ..............................*. - vstrw.u32 q3, [r0, #48] // ...............................* - vadd.u32 q3, q1, q5 // ..................*............. - vstrw.u32 q3, [r0] , #64 // .....................*.......... - - // original source code - // vmul.s32 q3, q5, q4 // *............................... - // vldrw.u32 q6, [r0, #32] // ...*............................ - // vmul.s32 q7, q6, q4 // ......*......................... - // vldrw.u32 q4, [r11, #-80] // .*.............................. - // vqrdmulh.s32 q0, q5, q4 // ..*............................. - // vldrw.u32 q2, [r0, #16] // .....*.......................... - // vmla.s32 q3, q0, r12 // ....*........................... - // vldrw.u32 q5, [r0] // ...............*................ - // vqrdmulh.s32 q4, q6, q4 // ........*....................... - // vadd.u32 q1, q2, q3 // ..................*............. 
- // vmla.s32 q7, q4, r12 // ..............*................. - // vldrw.u32 q0, [r11, #-64] // ....................*........... - // vadd.u32 q6, q5, q7 // ......................*......... - // vmul.s32 q0, q1, q0 // .....................*.......... - // vldrw.u32 q4, [r11, #-48] // .............*.................. - // vqrdmulh.s32 q4, q1, q4 // ...................*............ - // vsub.u32 q7, q5, q7 // ................*............... - // vmla.s32 q0, q4, r12 // .......................*........ - // vadd.u32 q5, q6, q0 // ..............................*. - // vldrw.u32 q1, [r11, #-32] // ...........*.................... - // vsub.u32 q6, q6, q0 // ..........................*..... - // vstrw.u32 q5, [r0] , #64 // ...............................* - // vsub.u32 q5, q2, q3 // .........*...................... - // vmul.s32 q0, q5, q1 // ............*................... - // vldrw.u32 q2, [r11, #-16] // .......*........................ - // vqrdmulh.s32 q1, q5, q2 // ..........*..................... - // vstrw.u32 q6, [r0, #-48] // ...........................*.... - // vmla.s32 q0, q1, r12 // .................*.............. - // vadd.u32 q3, q7, q0 // ........................*....... - // vstrw.u32 q3, [r0, #-32] // .........................*...... - // vsub.u32 q3, q7, q0 // ............................*... - // vstrw.u32 q3, [r0, #-16] // .............................*.. 
- - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m85.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m85.s deleted file mode 100644 index 28b488d..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_opt_m85.s +++ /dev/null @@ -1,668 +0,0 @@ -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.data -roots: -#include "ntt_dilithium_12_34_56_78_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s32 \dst, \src, \const - vqrdmulh.s32 \src, \src, \const_twisted - vmla.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_dilithium_12_34_56_78_opt_m85, %function -.global ntt_dilithium_12_34_56_78_opt_m85 -ntt_dilithium_12_34_56_78_opt_m85: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -8380417 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*128) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 1-2 */ - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #16 - vldrw.u32 q3, [r1, #256] // *. - vmul.s32 q5, q3, r2 // .* - - // original source code - // vldrw.u32 q3, [r1, #256] // *. - // vmul.s32 q5, q3, r2 // .* - - sub lr, lr, #1 -.p2align 2 -layer12_loop: - vqrdmulh.s32 q2, q3, r3 // ..........*................. - vldrw.u32 q0, [r0] // *........................... - vmla.s32 q5, q2, r12 // ...........*................ - vldrw.u32 q6, [r1] // ..*......................... - vqrdmulh.s32 q2, q6, r3 // .....*...................... - vldrw.u32 q4, [r0, #256] // .*.......................... - vmul.s32 q7, q6, r2 // ....*....................... - vsub.u32 q1, q4, q5 // ............*............... - vmul.s32 q6, q1, r6 // ...................*........ 
- vldrw.u32 q3, [r1, #272] // ...e........................ - vmla.s32 q7, q2, r12 // ......*..................... - vadd.u32 q2, q4, q5 // .............*.............. - vqrdmulh.s32 q5, q1, r7 // ....................*....... - vsub.u32 q4, q0, q7 // .......*.................... - vmla.s32 q6, q5, r12 // .....................*...... - vadd.u32 q5, q0, q7 // ........*................... - vmul.s32 q1, q2, r4 // ..............*............. - vadd.u32 q0, q4, q6 // .......................*.... - vqrdmulh.s32 q2, q2, r5 // ...............*............ - vsub.u32 q6, q4, q6 // ......................*..... - vmla.s32 q1, q2, r12 // ................*........... - vstrw.u32 q6, [r1, #256] // ...........................* - vadd.u32 q6, q5, q1 // ..................*......... - vstrw.u32 q0, [r1] , #16 // ..........................*. - vsub.u32 q1, q5, q1 // .................*.......... - vstrw.u32 q6, [r0] , #16 // ........................*... - vmul.s32 q5, q3, r2 // .........e.................. - vstrw.u32 q1, [r0, #240] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // ...................|*.......................... - // vldrw.u32 q1, [r0, #(4*64)] // ...................|....*...................... - // vldrw.u32 q2, [r1] // ...................|..*........................ - // vldrw.u32 q3, [r1, #(4*64)] // e..................|........e.................. - // vmul.s32 q4, q2, r2 // ...................|.....*..................... - // vqrdmulh.s32 q2, q2, r3 // ...................|...*....................... - // vmla.s32 q4, q2, r12 // .*.................|.........*................. - // vsub.u32 q2, q0, q4 // ....*..............|............*.............. - // vadd.u32 q0, q0, q4 // ......*............|..............*............ - // vmul.s32 q4, q3, r2 // .................e.|.........................e. - // vqrdmulh.s32 q3, q3, r3 // ...................*........................... 
- // vmla.s32 q4, q3, r12 // ...................|.*......................... - // vsub.u32 q3, q1, q4 // ...................|......*.................... - // vadd.u32 q1, q1, q4 // ..*................|..........*................ - // vmul.s32 q4, q1, r4 // .......*...........|...............*........... - // vqrdmulh.s32 q1, q1, r5 // .........*.........|.................*......... - // vmla.s32 q4, q1, r12 // ...........*.......|...................*....... - // vsub.u32 q1, q0, q4 // ...............*...|.......................*... - // vadd.u32 q0, q0, q4 // .............*.....|.....................*..... - // vmul.s32 q4, q3, r6 // ...................|.......*................... - // vqrdmulh.s32 q3, q3, r7 // ...*...............|...........*............... - // vmla.s32 q4, q3, r12 // .....*.............|.............*............. - // vsub.u32 q3, q2, q4 // ..........*........|..................*........ - // vadd.u32 q2, q2, q4 // ........*..........|................*.......... - // vstrw.u32 q0, [r0], #16 // ................*..|........................*.. - // vstrw.u32 q1, [r0, #(4*64 - 16)] // ..................*|..........................* - // vstrw.u32 q2, [r1], #16 // ..............*....|......................*.... - // vstrw.u32 q3, [r1, #(4*64-16)] // ............*......|....................*...... - - le lr, layer12_loop - vqrdmulh.s32 q1, q3, r3 // *......................... - vldrw.u32 q7, [r0] // .*........................ - vmla.s32 q5, q1, r12 // ..*....................... - vldrw.u32 q1, [r1] // ...*...................... - vqrdmulh.s32 q6, q1, r3 // ....*..................... - vldrw.u32 q2, [r0, #256] // .....*.................... - vmul.s32 q1, q1, r2 // ......*................... - vsub.u32 q3, q2, q5 // .......*.................. - vmla.s32 q1, q6, r12 // .........*................ - vadd.u32 q5, q2, q5 // ..........*............... - vqrdmulh.s32 q6, q3, r7 // ...........*.............. - vsub.u32 q2, q7, q1 // ............*............. 
- vmul.s32 q3, q3, r6 // ........*................. - vadd.u32 q1, q7, q1 // ..............*........... - vmla.s32 q3, q6, r12 // .............*............ - // gap // .......................... - vmul.s32 q7, q5, r4 // ...............*.......... - vadd.u32 q6, q2, q3 // ................*......... - vqrdmulh.s32 q5, q5, r5 // .................*........ - vsub.u32 q2, q2, q3 // ..................*....... - vstrw.u32 q6, [r1] , #16 // ......................*... - vmla.s32 q7, q5, r12 // ...................*...... - vstrw.u32 q2, [r1, #240] // ....................*..... - vadd.u32 q5, q1, q7 // .....................*.... - vstrw.u32 q5, [r0] , #16 // ........................*. - vsub.u32 q5, q1, q7 // .......................*.. - vstrw.u32 q5, [r0, #240] // .........................* - - // original source code - // vqrdmulh.s32 q2, q3, r3 // *......................... - // vldrw.u32 q0, [r0] // .*........................ - // vmla.s32 q5, q2, r12 // ..*....................... - // vldrw.u32 q6, [r1] // ...*...................... - // vqrdmulh.s32 q2, q6, r3 // ....*..................... - // vldrw.u32 q4, [r0, #256] // .....*.................... - // vmul.s32 q7, q6, r2 // ......*................... - // vsub.u32 q1, q4, q5 // .......*.................. - // vmul.s32 q6, q1, r6 // ............*............. - // vmla.s32 q7, q2, r12 // ........*................. - // vadd.u32 q2, q4, q5 // .........*................ - // vqrdmulh.s32 q5, q1, r7 // ..........*............... - // vsub.u32 q4, q0, q7 // ...........*.............. - // vmla.s32 q6, q5, r12 // ..............*........... - // vadd.u32 q5, q0, q7 // .............*............ - // vmul.s32 q1, q2, r4 // ...............*.......... - // vadd.u32 q0, q4, q6 // ................*......... - // vqrdmulh.s32 q2, q2, r5 // .................*........ - // vsub.u32 q6, q4, q6 // ..................*....... - // vmla.s32 q1, q2, r12 // ....................*..... 
- // vstrw.u32 q6, [r1, #256] // .....................*.... - // vadd.u32 q6, q5, q1 // ......................*... - // vstrw.u32 q0, [r1] , #16 // ...................*...... - // vsub.u32 q1, q5, q1 // ........................*. - // vstrw.u32 q6, [r0] , #16 // .......................*.. - // vstrw.u32 q1, [r0, #240] // .........................* - - - .unreq in_high - .unreq in_low - in .req r0 - - /* Layers 3,4 */ - sub in, in, #(64*4) - - // 4 butterfly blocks per root config, 4 root configs - // loop over root configs - - count .req r1 - mov count, #4 - -out_start: - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #4 - vldrw.u32 q6, [r0, #192] // *. - vmul.s32 q4, q6, r2 // .* - - // original source code - // vldrw.u32 q6, [r0, #192] // *. - // vmul.s32 q4, q6, r2 // .* - - sub lr, lr, #1 -.p2align 2 -layer34_loop: - vqrdmulh.s32 q3, q6, r3 // ..........*................. - vldrw.u32 q6, [r0, #64] // .*.......................... - vmla.s32 q4, q3, r12 // ...........*................ - vldrw.u32 q0, [r0, #128] // ..*......................... - vqrdmulh.s32 q7, q0, r3 // .....*...................... - vadd.u32 q3, q6, q4 // .............*.............. - vmul.s32 q5, q0, r2 // ....*....................... - vldrw.u32 q0, [r0] // *........................... - vmla.s32 q5, q7, r12 // ......*..................... - vsub.u32 q4, q6, q4 // ............*............... - vqrdmulh.s32 q1, q4, r7 // ....................*....... - vadd.u32 q7, q0, q5 // ........*................... - vmul.s32 q2, q4, r6 // ...................*........ - vsub.u32 q0, q0, q5 // .......*.................... - vmla.s32 q2, q1, r12 // .....................*...... - vldrw.u32 q6, [r0, #208] // ...e........................ - vmul.s32 q1, q3, r4 // ..............*............. - vsub.u32 q4, q0, q2 // ......................*..... - vqrdmulh.s32 q5, q3, r5 // ...............*............ 
- vstrw.u32 q4, [r0, #192] // ...........................* - vmla.s32 q1, q5, r12 // ................*........... - vadd.u32 q2, q0, q2 // .......................*.... - vstrw.u32 q2, [r0, #128] // ..........................*. - vadd.u32 q3, q7, q1 // ..................*......... - vstrw.u32 q3, [r0] , #16 // ........................*... - vsub.u32 q5, q7, q1 // .................*.......... - vmul.s32 q4, q6, r2 // .........e.................. - vstrw.u32 q5, [r0, #48] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // .............|......*.................... - // vldrw.u32 q1, [r0, #(4*1*16)] // .............|*.......................... - // vldrw.u32 q2, [r0, #(4*2*16)] // .............|..*........................ - // vldrw.u32 q3, [r0, #(4*3*16)] // e............|..............e............ - // vmul.s32 q4, q2, r2 // .............|.....*..................... - // vqrdmulh.s32 q2, q2, r3 // .............|...*....................... - // vmla.s32 q4, q2, r12 // .............|.......*................... - // vsub.u32 q2, q0, q4 // .............|............*.............. - // vadd.u32 q0, q0, q4 // .............|..........*................ - // vmul.s32 q4, q3, r2 // ...........e.|.........................e. - // vqrdmulh.s32 q3, q3, r3 // .............*........................... - // vmla.s32 q4, q3, r12 // .............|.*......................... - // vsub.u32 q3, q1, q4 // .............|........*.................. - // vadd.u32 q1, q1, q4 // .............|....*...................... - // vmul.s32 q4, q1, r4 // .*...........|...............*........... - // vqrdmulh.s32 q1, q1, r5 // ...*.........|.................*......... - // vmla.s32 q4, q1, r12 // .....*.......|...................*....... - // vsub.u32 q1, q0, q4 // ..........*..|........................*.. - // vadd.u32 q0, q0, q4 // ........*....|......................*.... - // vmul.s32 q4, q3, r6 // .............|...........*............... 
- // vqrdmulh.s32 q3, q3, r7 // .............|.........*................. - // vmla.s32 q4, q3, r12 // .............|.............*............. - // vsub.u32 q3, q2, q4 // ..*..........|................*.......... - // vadd.u32 q2, q2, q4 // ......*......|....................*...... - // vstrw.u32 q0, [r0], #16 // .........*...|.......................*... - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ............*|..........................* - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // .......*.....|.....................*..... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // ....*........|..................*........ - - le lr, layer34_loop - vqrdmulh.s32 q1, q6, r3 // *......................... - vldrw.u32 q7, [r0, #128] // ...*...................... - vmla.s32 q4, q1, r12 // ..*....................... - vldrw.u32 q6, [r0, #64] // .*........................ - vmul.s32 q5, q7, r2 // ......*................... - vsub.u32 q1, q6, q4 // .........*................ - vqrdmulh.s32 q3, q7, r3 // ....*..................... - vldrw.u32 q2, [r0] // .......*.................. - vmla.s32 q5, q3, r12 // ........*................. - vadd.u32 q7, q6, q4 // .....*.................... - vqrdmulh.s32 q6, q1, r7 // ..........*............... - vadd.u32 q0, q2, q5 // ...........*.............. - vmul.s32 q1, q1, r6 // ............*............. - vsub.u32 q5, q2, q5 // .............*............ - vmla.s32 q1, q6, r12 // ..............*........... - // gap // .......................... - vadd.u32 q4, q5, q1 // ....................*..... - vmul.s32 q6, q7, r4 // ...............*.......... - vsub.u32 q5, q5, q1 // ................*......... - vqrdmulh.s32 q1, q7, r5 // .................*........ - vstrw.u32 q4, [r0, #128] // .....................*.... - vmla.s32 q6, q1, r12 // ...................*...... - vstrw.u32 q5, [r0, #192] // ..................*....... - vsub.u32 q5, q0, q6 // ........................*. 
- vstrw.u32 q5, [r0, #64] // .........................* - vadd.u32 q0, q0, q6 // ......................*... - vstrw.u32 q0, [r0] , #16 // .......................*.. - - // original source code - // vqrdmulh.s32 q3, q6, r3 // *......................... - // vldrw.u32 q6, [r0, #64] // ...*...................... - // vmla.s32 q4, q3, r12 // ..*....................... - // vldrw.u32 q0, [r0, #128] // .*........................ - // vqrdmulh.s32 q7, q0, r3 // ......*................... - // vadd.u32 q3, q6, q4 // .........*................ - // vmul.s32 q5, q0, r2 // ....*..................... - // vldrw.u32 q0, [r0] // .......*.................. - // vmla.s32 q5, q7, r12 // ........*................. - // vsub.u32 q4, q6, q4 // .....*.................... - // vqrdmulh.s32 q1, q4, r7 // ..........*............... - // vadd.u32 q7, q0, q5 // ...........*.............. - // vmul.s32 q2, q4, r6 // ............*............. - // vsub.u32 q0, q0, q5 // .............*............ - // vmla.s32 q2, q1, r12 // ..............*........... - // vmul.s32 q1, q3, r4 // ................*......... - // vsub.u32 q4, q0, q2 // .................*........ - // vqrdmulh.s32 q5, q3, r5 // ..................*....... - // vstrw.u32 q4, [r0, #192] // .....................*.... - // vmla.s32 q1, q5, r12 // ....................*..... - // vadd.u32 q2, q0, q2 // ...............*.......... - // vstrw.u32 q2, [r0, #128] // ...................*...... - // vadd.u32 q3, q7, q1 // ........................*. - // vstrw.u32 q3, [r0] , #16 // .........................* - // vsub.u32 q5, q7, q1 // ......................*... - // vstrw.u32 q5, [r0, #48] // .......................*.. - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - /* Layers 5,6 */ - sub in, in, #(4*256) - - mov lr, #16 - ldrd r6, r4, [r11] , #24 // *...... - vldrw.u32 q5, [r0, #48] // .*..... - vqrdmulh.s32 q0, q5, r4 // ....*.. - vldrw.u32 q1, [r0, #32] // ..*.... - vqrdmulh.s32 q3, q1, r4 // ...*... 
- ldrd r10, r3, [r11, #-8] // ......* - vmul.s32 q2, q5, r6 // .....*. - - // original source code - // ldrd r6, r2, [r11] , #24 // *...... - // vldrw.u32 q2, [r0, #48] // .*..... - // vldrw.u32 q1, [r0, #32] // ...*... - // vqrdmulh.s32 q3, q1, r2 // ....*.. - // vqrdmulh.s32 q0, q2, r2 // ..*.... - // vmul.s32 q2, q2, r6 // ......* - // ldrd r10, r3, [r11, #-8] // .....*. - - sub lr, lr, #1 -.p2align 2 -layer56_loop: - vmla.s32 q2, q0, r12 // ..............*................ - vldrw.u32 q5, [r0, #16] // ....*.......................... - vmul.s32 q1, q1, r6 // .......*....................... - vsub.u32 q7, q5, q2 // ...............*............... - vmla.s32 q1, q3, r12 // .........*..................... - ldrd r2, r4, [r11, #-16] // .*............................. - vmul.s32 q6, q7, r10 // ......................*........ - vadd.u32 q5, q5, q2 // ................*.............. - vqrdmulh.s32 q2, q7, r3 // .......................*....... - vldrw.u32 q0, [r0] // ...*........................... - vmul.s32 q3, q5, r2 // .................*............. - vsub.u32 q4, q0, q1 // ..........*.................... - vmla.s32 q6, q2, r12 // ........................*...... - vadd.u32 q1, q0, q1 // ...........*................... - vqrdmulh.s32 q2, q5, r4 // ..................*............ - vsub.u32 q7, q4, q6 // .........................*..... - vmla.s32 q3, q2, r12 // ...................*........... - vadd.u32 q6, q4, q6 // ..........................*.... - ldrd r6, r2, [r11] , #24 // e.............................. - vsub.u32 q5, q1, q3 // ....................*.......... - vldrw.u32 q2, [r0, #112] // ......e........................ - vadd.u32 q4, q1, q3 // .....................*......... - vldrw.u32 q1, [r0, #96] // .....e......................... - vqrdmulh.s32 q3, q1, r2 // ........e...................... - vst40.u32 {q4,q5,q6,q7}, [r0] // ...........................*... - vqrdmulh.s32 q0, q2, r2 // .............e................. 
- vst41.u32 {q4,q5,q6,q7}, [r0] // ............................*.. - vmul.s32 q2, q2, r6 // ............e.................. - vst42.u32 {q4,q5,q6,q7}, [r0] // .............................*. - ldrd r10, r3, [r11, #-8] // ..e............................ - vst43.u32 {q4,q5,q6,q7}, [r0]! // ..............................* - - // original source code - // ldrd r2, r3, [r11], #+24 // e............|.................e............ - // ldrd r4, r5, [r11, #(-16)] // .............|....*......................... - // ldrd r6, r7, [r11, #(-8)] // ...........e.|............................e. - // vldrw.u32 q0, [r0] // .............|........*..................... - // vldrw.u32 q1, [r0, #(4*1*4)] // .............|*............................. - // vldrw.u32 q2, [r0, #(4*2*4)] // ....e........|.....................e........ - // vldrw.u32 q3, [r0, #(4*3*4)] // ..e..........|...................e.......... - // vmul.s32 q4, q2, r2 // .............|.*............................ - // vqrdmulh.s32 q2, q2, r3 // .....e.......|......................e....... - // vmla.s32 q4, q2, r12 // .............|...*.......................... - // vsub.u32 q2, q0, q4 // .............|..........*................... - // vadd.u32 q0, q0, q4 // .............|............*................. - // vmul.s32 q4, q3, r2 // .........e...|..........................e... - // vqrdmulh.s32 q3, q3, r3 // .......e.....|........................e..... - // vmla.s32 q4, q3, r12 // .............*.............................. - // vsub.u32 q3, q1, q4 // .............|..*........................... - // vadd.u32 q1, q1, q4 // .............|......*....................... - // vmul.s32 q4, q1, r4 // .............|.........*.................... - // vqrdmulh.s32 q1, q1, r5 // .............|.............*................ - // vmla.s32 q4, q1, r12 // .............|...............*.............. - // vsub.u32 q1, q0, q4 // .*...........|..................*........... 
- // vadd.u32 q0, q0, q4 // ...*.........|....................*......... - // vmul.s32 q4, q3, r6 // .............|.....*........................ - // vqrdmulh.s32 q3, q3, r7 // .............|.......*...................... - // vmla.s32 q4, q3, r12 // .............|...........*.................. - // vsub.u32 q3, q2, q4 // .............|..............*............... - // vadd.u32 q2, q2, q4 // .............|................*............. - // vst40.u32 {q0, q1, q2, q3}, [r0] // ......*......|.......................*...... - // vst41.u32 {q0, q1, q2, q3}, [r0] // ........*....|.........................*.... - // vst42.u32 {q0, q1, q2, q3}, [r0] // ..........*..|...........................*.. - // vst43.u32 {q0, q1, q2, q3}, [r0]! // ............*|.............................* - - le lr, layer56_loop - layer56_loop_end: - vmla.s32 q2, q0, r12 // *........................... - vldrw.u32 q4, [r0, #16] // .*.......................... - vsub.u32 q7, q4, q2 // ...*........................ - vmul.s32 q6, q7, r10 // ......*..................... - vadd.u32 q2, q4, q2 // .......*.................... - vmul.s32 q4, q1, r6 // ..*......................... - ldrd r2, r1, [r11, #-16] // .....*...................... - vmla.s32 q4, q3, r12 // ....*....................... - vldrw.u32 q0, [r0] // .........*.................. - vqrdmulh.s32 q1, q7, r3 // ........*................... - vsub.u32 q5, q0, q4 // ...........*................ - vmla.s32 q6, q1, r12 // ............*............... - vadd.u32 q1, q0, q4 // .............*.............. - vmul.s32 q3, q2, r2 // ..........*................. - vsub.u32 q7, q5, q6 // ...............*............ - vqrdmulh.s32 q2, q2, r1 // ..............*............. - vadd.u32 q6, q5, q6 // .................*.......... - vmla.s32 q3, q2, r12 // ................*........... - mov r14, #16 // .........................*.. - vsub.u32 q5, q1, q3 // ..................*......... 
- sub r14, r14, #1 // ...........................* - vadd.u32 q4, q1, q3 // ...................*........ - // gap // ............................ - // gap // ............................ - vst40.u32 {q4,q5,q6,q7}, [r0] // ....................*....... - // gap // ............................ - vst41.u32 {q4,q5,q6,q7}, [r0] // .....................*...... - // gap // ............................ - vst42.u32 {q4,q5,q6,q7}, [r0] // ......................*..... - // gap // ............................ - vst43.u32 {q4,q5,q6,q7}, [r0]! // .......................*.... - sub r0, r0, #(4*256) // ........................*... - vldrw.u32 q5, [r0, #32] // ..........................*. - - // original source code - // vmla.s32 q2, q0, r12 // *........................... - // vldrw.u32 q5, [r0, #16] // .*.......................... - // vmul.s32 q1, q1, r6 // .....*...................... - // vsub.u32 q7, q5, q2 // ..*......................... - // vmla.s32 q1, q3, r12 // .......*.................... - // ldrd r2, r4, [r11, #-16] // ......*..................... - // vmul.s32 q6, q7, r10 // ...*........................ - // vadd.u32 q5, q5, q2 // ....*....................... - // vqrdmulh.s32 q2, q7, r3 // .........*.................. - // vldrw.u32 q0, [r0] // ........*................... - // vmul.s32 q3, q5, r2 // .............*.............. - // vsub.u32 q4, q0, q1 // ..........*................. - // vmla.s32 q6, q2, r12 // ...........*................ - // vadd.u32 q1, q0, q1 // ............*............... - // vqrdmulh.s32 q2, q5, r4 // ...............*............ - // vsub.u32 q7, q4, q6 // ..............*............. - // vmla.s32 q3, q2, r12 // .................*.......... - // vadd.u32 q6, q4, q6 // ................*........... - // vsub.u32 q5, q1, q3 // ...................*........ - // vadd.u32 q4, q1, q3 // .....................*...... - // vst40.u32 {q4,q5,q6,q7}, [r0] // ......................*..... 
- // vst41.u32 {q4,q5,q6,q7}, [r0] // .......................*.... - // vst42.u32 {q4,q5,q6,q7}, [r0] // ........................*... - // vst43.u32 {q4,q5,q6,q7}, [r0]! // .........................*.. - // sub r0, r0, #(4*256) // ..........................*. - // mov r14, #16 // ..................*......... - // vldrw.u32 q5, [r0, #32] // ...........................* - // sub r14, r14, #1 // ....................*....... - - layer78_loop: - - vldrw.u32 q0, [r11, #16] // .....*............................ - vqrdmulh.s32 q6, q5, q0 // .......*.......................... - vldrw.u32 q1, [r11] , #96 // ....*............................. - vmul.s32 q3, q5, q1 // ......*........................... - vldrw.u32 q7, [r0] // *................................. - vmla.s32 q3, q6, r12 // ........*......................... - vldrw.u32 q4, [r0, #48] // ...*.............................. - vadd.u32 q5, q7, q3 // ..........*....................... - vqrdmulh.s32 q2, q4, q0 // ............*..................... - vsub.u32 q6, q7, q3 // .........*........................ - vmul.s32 q3, q4, q1 // ...........*...................... - vldrw.u32 q4, [r0, #16] // .*................................ - vmla.s32 q3, q2, r12 // .............*.................... - vldrw.u32 q7, [r11, #-48] // .................*................ - vadd.u32 q1, q4, q3 // ...............*.................. - vqrdmulh.s32 q0, q1, q7 // ...................*.............. - vldrw.u32 q2, [r11, #-64] // ................*................. - vmul.s32 q1, q1, q2 // ..................*............... - vsub.u32 q7, q4, q3 // ..............*................... - vmla.s32 q1, q0, r12 // ....................*............. - vldrw.u32 q2, [r11, #-16] // ........................*......... - vadd.u32 q3, q5, q1 // ......................*........... - vstrw.u32 q3, [r0] , #64 // ..............................*... - vsub.u32 q1, q5, q1 // .....................*............ 
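The layer 5/6 loop above writes its four result vectors with `vst40.u32` … `vst43.u32`, which together store the registers interleaved: element i of vector j lands at word offset 4·i + j, i.e. a 4×4 transpose on the fly, so that layers 7/8 can then operate on contiguous blocks. A simplified Python model of that access pattern (not beat- or cycle-accurate):

```python
# Model of the MVE VST4 pattern (vst40.u32 .. vst43.u32 on {q4,q5,q6,q7}):
# memory word 4*i + j receives element i of vector j.
def vst4_u32(mem, base, q):              # q: four 4-element vectors
    for i in range(4):                   # element index within each vector
        for j in range(4):               # vector (register) index
            mem[base + 4 * i + j] = q[j][i]

def vld4_u32(mem, base):
    # the matching de-interleaving load (vld40..vld43): inverse transpose
    return [[mem[base + 4 * i + j] for i in range(4)] for j in range(4)]
```

Round-tripping `vst4_u32` with `vld4_u32` recovers the original vectors, which is the property the NTT relies on when it switches between the strided layout of layers 5/6 and the contiguous layout of layers 7/8.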
- vqrdmulh.s32 q5, q7, q2 // ..........................*....... - vldrw.u32 q2, [r11, #-32] // .......................*.......... - vmul.s32 q2, q7, q2 // .........................*........ - vstrw.u32 q1, [r0, #-48] // ...............................*.. - vmla.s32 q2, q5, r12 // ...........................*...... - vldrw.u32 q5, [r0, #32] // ..e............................... - vsub.u32 q3, q6, q2 // ............................*..... - vstrw.u32 q3, [r0, #-16] // .................................* - vadd.u32 q2, q6, q2 // .............................*.... - vstrw.u32 q2, [r0, #-32] // ................................*. - - // original source code - // vldrw.u32 q0, [r0] // .....|...*............................. - // vldrw.u32 q1, [r0, #16] // .....|..........*...................... - // vldrw.u32 q2, [r0, #32] // e....|............................e.... - // vldrw.u32 q3, [r0, #48] // .....|.....*........................... - // vldrw.u32 q5, [r11], #+96 // .....|.*............................... - // vldrw.u32 q6, [r11, #(+16-96)] // .....*................................. - // vmul.s32 q4, q2, q5 // .....|..*.............................. - // vqrdmulh.s32 q2, q2, q6 // .....|*................................ - // vmla.s32 q4, q2, r12 // .....|....*............................ - // vsub.u32 q2, q0, q4 // .....|........*........................ - // vadd.u32 q0, q0, q4 // .....|......*.......................... - // vmul.s32 q4, q3, q5 // .....|.........*....................... - // vqrdmulh.s32 q3, q3, q6 // .....|.......*......................... - // vmla.s32 q4, q3, r12 // .....|...........*..................... - // vsub.u32 q3, q1, q4 // .....|.................*............... - // vadd.u32 q1, q1, q4 // .....|.............*................... - // vldrw.u32 q5, [r11, #(32 - 96)] // .....|...............*................. - // vldrw.u32 q6, [r11, #(48 - 96)] // .....|............*.................... 
- // vmul.s32 q4, q1, q5 // .....|................*................ - // vqrdmulh.s32 q1, q1, q6 // .....|..............*.................. - // vmla.s32 q4, q1, r12 // .....|..................*.............. - // vsub.u32 q1, q0, q4 // .....|......................*.......... - // vadd.u32 q0, q0, q4 // .....|....................*............ - // vldrw.u32 q5, [r11, #(64-96)] // .....|........................*........ - // vldrw.u32 q6, [r11, #(80-96)] // .....|...................*............. - // vmul.s32 q4, q3, q5 // .....|.........................*....... - // vqrdmulh.s32 q3, q3, q6 // .....|.......................*......... - // vmla.s32 q4, q3, r12 // .....|...........................*..... - // vsub.u32 q3, q2, q4 // .*...|.............................*... - // vadd.u32 q2, q2, q4 // ...*.|...............................*. - // vstrw.u32 q0, [r0], #64 // .....|.....................*........... - // vstrw.u32 q1, [r0, #-48] // .....|..........................*...... - // vstrw.u32 q2, [r0, #-32] // ....*|................................* - // vstrw.u32 q3, [r0, #-16] // ..*..|..............................*.. - - le lr, layer78_loop - vldrw.u32 q1, [r11, #16] // *................................ - vqrdmulh.s32 q7, q5, q1 // .*............................... - vldrw.u32 q6, [r11] , #96 // ..*.............................. - vmul.s32 q5, q5, q6 // ...*............................. - vldrw.u32 q2, [r0, #48] // ......*.......................... - vqrdmulh.s32 q1, q2, q1 // ........*........................ - vldrw.u32 q3, [r0] // ....*............................ - vmla.s32 q5, q7, r12 // .....*........................... - vldrw.u32 q7, [r0, #16] // ...........*..................... - vmul.s32 q6, q2, q6 // ..........*...................... - vadd.u32 q2, q3, q5 // .......*......................... - vmla.s32 q6, q1, r12 // ............*.................... - vsub.u32 q5, q3, q5 // .........*....................... 
- vldrw.u32 q1, [r11, #-48] // .............*................... - vadd.u32 q3, q7, q6 // ..............*.................. - vqrdmulh.s32 q1, q3, q1 // ...............*................. - vsub.u32 q7, q7, q6 // ..................*.............. - vldrw.u32 q6, [r11, #-64] // ................*................ - vmul.s32 q6, q3, q6 // .................*............... - vldrw.u32 q3, [r11, #-16] // ....................*............ - vmla.s32 q6, q1, r12 // ...................*............. - vldrw.u32 q1, [r11, #-32] // .........................*....... - vadd.u32 q0, q2, q6 // .....................*........... - vqrdmulh.s32 q3, q7, q3 // ........................*........ - vsub.u32 q6, q2, q6 // .......................*......... - vmul.s32 q1, q7, q1 // ..........................*...... - vstrw.u32 q0, [r0] , #64 // ......................*.......... - vmla.s32 q1, q3, r12 // ............................*.... - vstrw.u32 q6, [r0, #-48] // ...........................*..... - vsub.u32 q7, q5, q1 // .............................*... - vstrw.u32 q7, [r0, #-16] // ..............................*.. - vadd.u32 q5, q5, q1 // ...............................*. - vstrw.u32 q5, [r0, #-32] // ................................* - - // original source code - // vldrw.u32 q0, [r11, #16] // *................................ - // vqrdmulh.s32 q6, q5, q0 // .*............................... - // vldrw.u32 q1, [r11] , #96 // ..*.............................. - // vmul.s32 q3, q5, q1 // ...*............................. - // vldrw.u32 q7, [r0] // ......*.......................... - // vmla.s32 q3, q6, r12 // .......*......................... - // vldrw.u32 q4, [r0, #48] // ....*............................ - // vadd.u32 q5, q7, q3 // ..........*...................... - // vqrdmulh.s32 q2, q4, q0 // .....*........................... - // vsub.u32 q6, q7, q3 // ............*.................... - // vmul.s32 q3, q4, q1 // .........*....................... 
- // vldrw.u32 q4, [r0, #16] // ........*........................ - // vmla.s32 q3, q2, r12 // ...........*..................... - // vldrw.u32 q7, [r11, #-48] // .............*................... - // vadd.u32 q1, q4, q3 // ..............*.................. - // vqrdmulh.s32 q0, q1, q7 // ...............*................. - // vldrw.u32 q2, [r11, #-64] // .................*............... - // vmul.s32 q1, q1, q2 // ..................*.............. - // vsub.u32 q7, q4, q3 // ................*................ - // vmla.s32 q1, q0, r12 // ....................*............ - // vldrw.u32 q2, [r11, #-16] // ...................*............. - // vadd.u32 q3, q5, q1 // ......................*.......... - // vstrw.u32 q3, [r0] , #64 // ..........................*...... - // vsub.u32 q1, q5, q1 // ........................*........ - // vqrdmulh.s32 q5, q7, q2 // .......................*......... - // vldrw.u32 q2, [r11, #-32] // .....................*........... - // vmul.s32 q2, q7, q2 // .........................*....... - // vstrw.u32 q1, [r0, #-48] // ............................*.... - // vmla.s32 q2, q5, r12 // ...........................*..... - // vsub.u32 q3, q6, q2 // .............................*... - // vstrw.u32 q3, [r0, #-16] // ..............................*.. - // vadd.u32 q2, q6, q2 // ...............................*. 
- // vstrw.u32 q2, [r0, #-32] // ................................* - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_twiddles.s b/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_twiddles.s deleted file mode 100644 index 7624c53..0000000 --- a/tests/ntt_dilithium/manual/ntt_dilithium_12_34_56_78_twiddles.s +++ /dev/null @@ -1,538 +0,0 @@ - -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.word -3572223 -.word -915382907 -.word 3765607 -.word 964937599 -.word 3761513 -.word 963888510 -.word -3201494 -.word -820383522 -.word -601683 -.word -154181397 -.word 3542485 -.word 907762539 -.word -2883726 -.word -738955404 -.word 2682288 -.word 687336873 -.word 2129892 -.word 545785280 -.word -3145678 -.word -806080660 -.word 3764867 -.word 964747974 -.word -1005239 -.word -257592709 -.word -3201430 -.word -820367122 -.word 557458 -.word 142848732 -.word -1221177 -.word -312926867 -.word -3370349 -.word -863652652 -.word 3602218 -.word 923069133 -.word 3182878 -.word 815613168 -.word -4063053 -.word -1041158200 -.word 2740543 -.word 702264730 -.word -3586446 -.word -919027554 -.word 2663378 -.word 682491182 -.word -3110818 -.word -797147778 -.word 2101410 -.word 538486762 -.word -1674615 -.word -429120452 -.word 3704823 -.word 949361686 -.word 1159875 -.word 297218217 -.word -3524442 -.word -903139016 -.word 394148 -.word 101000509 -.word 928749 -.word 237992130 -.word -434125 -.word -111244624 -.word 1095468 -.word 280713909 -.word -3506380 -.word -898510625 -.word 676590 -.word 173376332 -.word 2071829 -.word 530906624 -.word -4018989 -.word -1029866791 -.word -1335936 -.word -342333886 -.word 3241972 -.word 830756018 -.word 2156050 -.word 552488273 -.word -3227876 -.word -827143915 -.word 3415069 -.word 875112161 -.word 1759347 -.word 450833045 -.word 1714295 -.word 439288460 -.word -817536 -.word -209493775 -.word -3574466 -.word -915957677 -.word 2453983 -.word 628833668 -.word 3756790 -.word 962678241 -.word -1935799 -.word -496048908 -.word 1460718 -.word 374309300 -.word -1716988 -.word -439978542 -.word -3950053 -.word -1012201926 -.word -642628 -.word -164673562 -.word -2897314 -.word -742437332 -.word 3192354 -.word 818041395 -.word -3585098 -.word -918682129 -.word 556856 -.word 142694469 -.word 3870317 -.word 991769559 -.word 2815639 -.word 721508096 -.word 2917338 -.word 747568486 -.word 1853806 -.word 475038184 -.word 2283733 -.word 
585207070 -.word 3345963 -.word 857403734 -.word 1858416 -.word 476219497 -// Word count until here: 126 -// Blocked layers start -.word 3073009 -.word 1277625 -.word -2635473 -.word 3852015 -.word 787459213 -.word 327391679 -.word -675340520 -.word 987079667 -.word 1753 -.word -2659525 -.word 2660408 -.word -59148 -.word 449207 -.word -681503850 -.word 681730119 -.word -15156688 -.word -1935420 -.word -1455890 -.word -1780227 -.word 2772600 -.word -495951789 -.word -373072124 -.word -456183549 -.word 710479343 -.word 4183372 -.word -3222807 -.word -3121440 -.word -274060 -.word 1071989969 -.word -825844983 -.word -799869667 -.word -70227934 -.word 1182243 -.word 636927 -.word -3956745 -.word -3284915 -.word 302950022 -.word 163212680 -.word -1013916752 -.word -841760171 -.word 87208 -.word -3965306 -.word -2296397 -.word -3716946 -.word 22347069 -.word -1016110510 -.word -588452222 -.word -952468207 -.word 2508980 -.word 2028118 -.word 1937570 -.word -3815725 -.word 642926661 -.word 519705671 -.word 496502727 -.word -977780347 -.word -27812 -.word 1009365 -.word -1979497 -.word -3956944 -.word -7126831 -.word 258649997 -.word -507246529 -.word -1013967746 -.word 822541 -.word -2454145 -.word 1596822 -.word -3759465 -.word 210776307 -.word -628875181 -.word 409185979 -.word -963363710 -.word 2811291 -.word -2983781 -.word -1109516 -.word 4158088 -.word 720393920 -.word -764594519 -.word -284313712 -.word 1065510939 -.word -1685153 -.word 2678278 -.word -3551006 -.word -250446 -.word -431820817 -.word 686309310 -.word -909946047 -.word -64176841 -.word -3410568 -.word -3768948 -.word 635956 -.word -2455377 -.word -873958779 -.word -965793731 -.word 162963861 -.word -629190881 -.word 1528066 -.word 482649 -.word 1148858 -.word -2962264 -.word 391567239 -.word 123678909 -.word 294395108 -.word -759080783 -.word -4146264 -.word 2192938 -.word 2387513 -.word -268456 -.word -1062481036 -.word 561940831 -.word 611800717 -.word -68791907 -.word -1772588 -.word -1727088 
-.word -3611750 -.word -3180456 -.word -454226054 -.word -442566669 -.word -925511710 -.word -814992530 -.word -565603 -.word 169688 -.word 2462444 -.word -3334383 -.word -144935890 -.word 43482586 -.word 631001801 -.word -854436357 -.word 3747250 -.word 1239911 -.word 3195676 -.word 1254190 -.word 960233614 -.word 317727459 -.word 818892658 -.word 321386456 -.word 2296099 -.word -3838479 -.word 2642980 -.word -12417 -.word 588375860 -.word -983611064 -.word 677264190 -.word -3181859 -.word -4166425 -.word -3488383 -.word 1987814 -.word -3197248 -.word -1067647297 -.word -893898890 -.word 509377762 -.word -819295484 -.word 2998219 -.word -89301 -.word -1354892 -.word -1310261 -.word 768294260 -.word -22883400 -.word -347191365 -.word -335754661 -.word 141835 -.word 2513018 -.word 613238 -.word -2218467 -.word 36345249 -.word 643961400 -.word 157142369 -.word -568482643 -.word 1736313 -.word 235407 -.word -3250154 -.word 3258457 -.word 444930577 -.word 60323094 -.word -832852657 -.word 834980303 -.word -458740 -.word 4040196 -.word 2039144 -.word -818761 -.word -117552223 -.word 1035301089 -.word 522531086 -.word -209807681 -.word -1921994 -.word -3472069 -.word -1879878 -.word -2178965 -.word -492511373 -.word -889718424 -.word -481719139 -.word -558360247 -.word -2579253 -.word 1787943 -.word -2391089 -.word -2254727 -.word -660934133 -.word 458160776 -.word -612717067 -.word -577774276 -.word -1623354 -.word -2374402 -.word 586241 -.word 527981 -.word -415984810 -.word -608441020 -.word 150224382 -.word 135295244 -.word 2105286 -.word -2033807 -.word -1179613 -.word -2743411 -.word 539479988 -.word -521163479 -.word -302276083 -.word -702999655 -.word 3482206 -.word -4182915 -.word -1300016 -.word -2362063 -.word 892316032 -.word -1071872863 -.word -333129378 -.word -605279149 -.word -1476985 -.word 2491325 -.word 507927 -.word -724804 -.word -378477722 -.word 638402564 -.word 130156402 -.word -185731180 -.word 1994046 -.word -1393159 -.word -1187885 -.word 
-1834526 -.word 510974714 -.word -356997292 -.word -304395785 -.word -470097680 -.word -1317678 -.word 2461387 -.word 3035980 -.word 621164 -.word -337655269 -.word 630730945 -.word 777970524 -.word 159173408 -.word -3033742 -.word 2647994 -.word -2612853 -.word 749577 -.word -777397036 -.word 678549029 -.word -669544140 -.word 192079267 -.word -338420 -.word 3009748 -.word 4148469 -.word -4022750 -.word -86720197 -.word 771248568 -.word 1063046068 -.word -1030830548 -.word 3901472 -.word -1226661 -.word 2925816 -.word 3374250 -.word 999753034 -.word -314332144 -.word 749740976 -.word 864652284 -.word 3980599 -.word -1615530 -.word 1665318 -.word 1163598 -.word 1020029345 -.word -413979908 -.word 426738094 -.word 298172236 -.word 2569011 -.word 1723229 -.word 2028038 -.word -3369273 -.word 658309618 -.word 441577800 -.word 519685171 -.word -863376927 -.word 1356448 -.word -2775755 -.word 2683270 -.word -2778788 -.word 347590090 -.word -711287812 -.word 687588511 -.word -712065019 -.word 3994671 -.word -1370517 -.word 3363542 -.word 545376 -.word 1023635298 -.word -351195274 -.word 861908357 -.word 139752717 -.word -11879 -.word 3020393 -.word 214880 -.word -770441 -.word -3043996 -.word 773976352 -.word 55063046 -.word -197425671 -.word -3467665 -.word 2312838 -.word -653275 -.word -459163 -.word -888589898 -.word 592665232 -.word -167401858 -.word -117660617 -.word 3105558 -.word 508145 -.word 860144 -.word 140244 -.word 795799901 -.word 130212265 -.word 220412084 -.word 35937555 -.word -1103344 -.word -553718 -.word 3430436 -.word -1514152 -.word -282732136 -.word -141890356 -.word 879049958 -.word -388001774 -.word 348812 -.word -327848 -.word 1011223 -.word -2354215 -.word 89383150 -.word -84011120 -.word 259126110 -.word -603268097 -.word -2185084 -.word 2358373 -.word -3014420 -.word 2926054 -.word -559928242 -.word 604333585 -.word -772445769 -.word 749801963 -.word 3123762 -.word -2193087 -.word -1716814 -.word -392707 -.word 800464680 -.word -561979013 
-.word -439933955 -.word -100631253 -.word -3818627 -.word -1922253 -.word -2236726 -.word 1744507 -.word -978523985 -.word -492577742 -.word -573161516 -.word 447030292 -.word -303005 -.word -3974485 -.word 1900052 -.word 1054478 -.word -77645096 -.word -1018462631 -.word 486888731 -.word 270210213 -.word 3531229 -.word -3773731 -.word -781875 -.word -731434 -.word 904878186 -.word -967019376 -.word -200355636 -.word -187430119 \ No newline at end of file diff --git a/tests/ntt_dilithium/ntt_dilithium.mk b/tests/ntt_dilithium/ntt_dilithium.mk new file mode 100644 index 0000000..fd3261f --- /dev/null +++ b/tests/ntt_dilithium/ntt_dilithium.mk @@ -0,0 +1,17 @@ +TESTS += ntt_dilithium + +NTT_DILITHIUM_PLATFORMS += m55-an547 +NTT_DILITHIUM_PLATFORMS += m85-an555 + +NTT_DILITHIUM_SOURCES += main.c + +NTT_DILITHIUM_ASM_DIR = ../../asm/manual/ntt_dilithium +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_no_trans_vld4.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_opt_m55.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_opt_m85.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_opt_size_m55.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_opt_size_m85.s +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78.s \ No newline at end of file diff --git a/tests/ntt_kyber/manual/ntt_kyber_12_345_67.s b/tests/ntt_kyber/manual/ntt_kyber_12_345_67.s deleted file mode 100644 index fe999c2..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_12_345_67.s +++ /dev/null @@ -1,258 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// 
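The deleted table above stores each Dilithium twiddle as a pair (ζ, ζ_tw), where ζ_tw is the companion constant for Barrett multiplication via `vqrdmulh.s32`/`vmla.s32`. A scalar Python model of that pairing and of the three-instruction multiply sequence, assuming q = 8380417 and the architectural rounding of VQRDMULH.S32 (saturation is not modeled; `tw`, `mulmod32`, and the helper names are illustrative, not from the patch):

```python
Q = 8380417  # Dilithium prime (assumption: the q used by these twiddles)

def tw(zeta):
    """Companion twiddle for Barrett multiplication: round(zeta * 2**31 / Q)."""
    n = zeta * 2**31
    # integer round-to-nearest; no ties occur for the Dilithium zetas
    return (2 * n + Q) // (2 * Q) if n >= 0 else -((-2 * n + Q) // (2 * Q))

def wrap32(x):
    """Wrap to a signed 32-bit lane value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def vqrdmulh_s32(a, b):
    """Model of VQRDMULH.S32 on one lane: round(a * b / 2**31)."""
    return (2 * a * b + (1 << 31)) >> 32

def mulmod32(a, zeta, zeta_tw):
    """vmul / vqrdmulh / vmla sequence: r ≡ a*zeta (mod Q) with |r| < Q."""
    lo = wrap32(a * zeta)                 # vmul.s32
    hi = vqrdmulh_s32(a, zeta_tw)         # vqrdmulh.s32 (quotient estimate)
    return wrap32(lo + wrap32(hi * -Q))   # vmla.s32 with modulus register = -Q

# First pair of the table: .word -3572223 / .word -915382907
zeta, zeta_tw = -3572223, -915382907
assert tw(zeta) == zeta_tw
for a in (1, -1, 12345, -8380416, 2**20 + 7):
    r = mulmod32(a, zeta, zeta_tw)
    assert (r - a * zeta) % Q == 0 and abs(r) < Q
```

The congruence r ≡ a·ζ (mod Q) holds for any ζ_tw; the precomputed rounding is what keeps the representative inside (−Q, Q) without an explicit reduction.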
Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.data -.p2align 4 -roots: -#include "ntt_kyber_12_345_67_twiddles.s" -.text - -#define QSTACK4 (0*16) -#define QSTACK5 (1*16) -#define QSTACK6 (2*16) -#define STACK0 (3*16) - -#define POS_ROOT_1 1 -#define POS_ROOT_2 2 -#define POS_ROOT_3 3 -#define POS_ROOT_4 4 -#define POS_ROOT_5 5 -#define POS_ROOT_6 6 - -#define STACK_SIZE (3*16 + 8) - -.macro qsave loc, a // slothy:no-unfold - vstrw.32 \a, [sp, #\loc\()] -.endm -.macro qrestore a, loc // slothy:no-unfold - vldrw.32 \a, [sp, #\loc\()] -.endm -.macro restored a, b, loc // slothy:no-unfold - ldrd \a, \b, [sp, #\loc\()] -.endm -.macro saved loc, a, b // slothy:no-unfold - strd \a, \b, [sp, #\loc\()] -.endm -.macro restore a, loc // slothy:no-unfold - ldr \a, [sp, #\loc\()] -.endm -.macro save loc, a // slothy:no-unfold - str \a, [sp, #\loc\()] -.endm - -// Barrett multiplication -.macro mulmod dst, src, const, const_tw - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_tw - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_tw - mulmod tmp, \b, \root, \root_tw - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -// Aligns stack =0 mod 16 -.macro align_stack_do // slothy:no-unfold - mov r11, sp - and r12, r11, #0xC - sub sp, sp, r12 // Align stack to 16 byte - sub sp, sp, #16 - str r12, [sp] -.endm - -// Reverts initial stack correction -.macro align_stack_undo // slothy:no-unfold - ldr r12, [sp] - add sp, sp, #16 - add sp, sp, r12 -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_12_345_67, %function -.global ntt_kyber_12_345_67 - - modulus .req r12 - r_ptr .req r11 - .equ modulus_const, -3329 - - in .req r0 - inp .req r1 - in_low .req r0 - in_high .req r1 - - root0 .req r2 - root0_tw .req r3 - root1 .req r4 - root1_tw .req r5 - root2 .req r6 - root2_tw .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - data4 .req q1 - data5 .req q2 - data6 .req q3 - data7 .req q4 - - tmp .req q7 - - rtmp .req r3 - rtmp_tw .req r4 
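The `mulmod` and `ct_butterfly` macros above are the 16-bit Kyber counterpart of the same Barrett-multiplication idea, with the `modulus` register holding −3329. A one-lane Python sketch, assuming VQRDMULH.S16 rounding semantics and assuming the root/root_tw pairing follows the same round(root · 2**15 / q) convention as the 32-bit tables (the helper names here are illustrative):

```python
Q = 3329  # Kyber modulus; modulus_const in the macro file is -3329

def wrap16(x):
    """Wrap to a signed 16-bit lane value."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def vqrdmulh_s16(a, b):
    """Model of VQRDMULH.S16 on one lane: round(a * b / 2**15)."""
    return (2 * a * b + (1 << 15)) >> 16

def tw(root):
    """Assumed companion constant: round(root * 2**15 / Q)."""
    n = root * 2**15
    return (2 * n + Q) // (2 * Q) if n >= 0 else -((-2 * n + Q) // (2 * Q))

def mulmod(src, const, const_tw):
    """The mulmod macro: vmul.s16 + vqrdmulh.s16 + vmla.s16 with modulus = -Q."""
    dst = wrap16(src * const)             # vmul.s16  dst, src, const
    hi = vqrdmulh_s16(src, const_tw)      # vqrdmulh.s16 src, src, const_tw
    return wrap16(dst + wrap16(hi * -Q))  # vmla.s16  dst, src, modulus

def ct_butterfly(a, b, root, root_tw):
    """Cooley-Tukey butterfly: (a, b) -> (a + b*root, a - b*root) mod Q."""
    t = mulmod(b, root, root_tw)
    return wrap16(a + t), wrap16(a - t)

for b in (0, 1, -1, 100, 1664, -1664):
    r = mulmod(b, 17, tw(17))
    assert (r - 17 * b) % Q == 0 and abs(r) <= Q  # small signed representative
```

Note that `vmla` overwrites neither input of the butterfly, which is why the macro needs only one scratch register (`tmp`) per butterfly.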
- - qtmp .req q5 - qtmp_tw .req q6 - -ntt_kyber_12_345_67: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - align_stack_do - - sub sp, sp, #STACK_SIZE - movw modulus, #:lower16:modulus_const - ldr r_ptr, roots_addr - - /* Layers 1,2 */ - - save STACK0, in - add in_high, in_low, #(2*128) - ldrd root0, root0_tw, [r_ptr], #+24 - ldrd root1, root1_tw, [r_ptr, #-16] - ldrd root2, root2_tw, [r_ptr, #-8] - - mov lr, #8 - .p2align 2 -layer12_loop: - vldrw.32 data0, [in_low] - vldrw.32 data1, [in_low, #(2*64)] - vldrw.32 data2, [in_high] - vldrw.32 data3, [in_high, #(2*64)] - ct_butterfly data0, data2, root0, root0_tw - ct_butterfly data1, data3, root0, root0_tw - ct_butterfly data0, data1, root1, root1_tw - ct_butterfly data2, data3, root2, root2_tw - vstrw.u32 data0, [in_low], #16 - vstrw.u32 data1, [in_low, #(2*64 - 16)] - vstrw.u32 data2, [in_high], #16 - vstrw.u32 data3, [in_high, #(2*64 - 16)] - le lr, layer12_loop - - /* Layers 3,4,5 */ - - restore in, STACK0 - mov lr, #4 - .p2align 2 -layer345_loop: - ldrd rtmp, rtmp_tw, [r_ptr], #(7*8) - vldrw.32 data0, [in] - vldrw.32 data4, [in, #64] - ct_butterfly data0, data4, rtmp, rtmp_tw - qsave QSTACK4, data4 - vldrw.32 data1, [in, #16] - vldrw.32 data5, [in, #80] - ct_butterfly data1, data5, rtmp, rtmp_tw - qsave QSTACK5, data5 - vldrw.32 data2, [in, #32] - vldrw.32 data6, [in, #96] - ct_butterfly data2, data6, rtmp, rtmp_tw - qsave QSTACK6, data6 - vldrw.32 data3, [in, #48] - vldrw.32 data7, [in, #112] - ct_butterfly data3, data7, rtmp, rtmp_tw - - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + POS_ROOT_1)*8)] - ct_butterfly data0, data2, rtmp, rtmp_tw - ct_butterfly data1, data3, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + POS_ROOT_2)*8)] - ct_butterfly data0, data1, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + POS_ROOT_3)*8)] - ct_butterfly data2, data3, rtmp, rtmp_tw - vstrw.u32 data0, [in], #128 - vstrw.u32 data1, [in, #(-128+16)] - vstrw.u32 data2, [in, #(-128+32)] - vstrw.u32 data3, [in, 
#(-128+48)] - - qrestore data4, QSTACK4 - qrestore data5, QSTACK5 - qrestore data6, QSTACK6 - - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + POS_ROOT_4)*8)] - ct_butterfly data4, data6, rtmp, rtmp_tw - ct_butterfly data5, data7, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + POS_ROOT_5)*8)] - ct_butterfly data4, data5, rtmp, rtmp_tw - ldrd rtmp, rtmp_tw, [r_ptr, #((-7 + POS_ROOT_6)*8)] - ct_butterfly data6, data7, rtmp, rtmp_tw - - vstrw.u32 data4, [in, #(-128+64)] - vstrw.u32 data5, [in, #(-128+80)] - vstrw.u32 data6, [in, #(-128+96)] - vstrw.u32 data7, [in, #(-128+112)] - le lr, layer345_loop - - // Layer 67 - - // Use a different base register to facilitate Helight being able to - // overlap the first iteration of L67 with the last iteration of L345. - restore inp, STACK0 - mov lr, #8 - .p2align 2 -layer67_loop: - vld40.32 {data0, data1, data2, data3}, [inp] - vld41.32 {data0, data1, data2, data3}, [inp] - vld42.32 {data0, data1, data2, data3}, [inp] - vld43.32 {data0, data1, data2, data3}, [inp]! 
- vldrh.16 qtmp, [r_ptr], #+96 - vldrh.16 qtmp_tw, [r_ptr, #(+16-96)] - ct_butterfly data0, data2, qtmp, qtmp_tw - ct_butterfly data1, data3, qtmp, qtmp_tw - vldrh.16 qtmp, [r_ptr, #(32 - 96)] - vldrh.16 qtmp_tw, [r_ptr, #(48 - 96)] - ct_butterfly data0, data1, qtmp, qtmp_tw - vldrh.16 qtmp, [r_ptr, #(64-96)] - vldrh.16 qtmp_tw, [r_ptr, #(80-96)] - ct_butterfly data2, data3, qtmp, qtmp_tw - vstrw.u32 data0, [inp, #-64] - vstrw.u32 data1, [inp, #-48] - vstrw.u32 data2, [inp, #-32] - vstrw.u32 data3, [inp, #-16] - le lr, layer67_loop - - add sp, sp, #STACK_SIZE - - align_stack_undo - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr diff --git a/tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m55.s b/tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m55.s deleted file mode 100644 index fa5dad8..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m55.s +++ /dev/null @@ -1,642 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
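The layer-6/7 loop of the file deleted above fetches its four working vectors with a `vld40.32`–`vld43.32` group, which de-interleaves 64 contiguous bytes into four Q registers, effectively a free 4×4 transpose that replaces explicit permutation instructions. A sketch of the .32 de-interleave pattern, assuming standard VLD4x semantics (here each 32-bit element is a pair of adjacent 16-bit Kyber coefficients):

```python
def vld4x_32(mem):
    """Model VLD40..VLD43 (.32): 16 contiguous words -> 4 de-interleaved vectors.

    Memory word i lands in lane i//4 of destination vector i%4, so q0..q3
    receive every 4th element starting at offsets 0..3 respectively.
    """
    assert len(mem) == 16
    return [mem[j::4] for j in range(4)]

q0, q1, q2, q3 = vld4x_32(list(range(16)))
assert q0 == [0, 4, 8, 12]
assert q1 == [1, 5, 9, 13]
assert q2 == [2, 6, 10, 14]
assert q3 == [3, 7, 11, 15]
```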
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -.p2align 4 -roots: -#include "ntt_kyber_12_345_67_twiddles.s" -.text - -#define QSTACK4 (0*16) -#define QSTACK5 (1*16) -#define QSTACK6 (2*16) -#define STACK0 (3*16) - -#define POS_ROOT_1 1 -#define POS_ROOT_2 2 -#define POS_ROOT_3 3 -#define POS_ROOT_4 4 -#define POS_ROOT_5 5 -#define POS_ROOT_6 6 - -#define STACK_SIZE (3*16 + 8) - -.macro qsave loc, a // slothy:no-unfold - vstrw.32 \a, [sp, #\loc\()] -.endm -.macro qrestore a, loc // slothy:no-unfold - vldrw.32 \a, [sp, #\loc\()] -.endm -.macro restored a, b, loc // slothy:no-unfold - ldrd \a, \b, [sp, #\loc\()] -.endm -.macro saved loc, a, b // slothy:no-unfold - strd \a, \b, [sp, #\loc\()] -.endm -.macro restore a, loc // slothy:no-unfold - ldr \a, [sp, #\loc\()] -.endm -.macro save loc, a // slothy:no-unfold - str \a, [sp, #\loc\()] -.endm - -// Barrett multiplication -.macro mulmod dst, src, const, const_tw - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_tw - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_tw - mulmod tmp, \b, \root, \root_tw - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -// Aligns stack =0 mod 16 -.macro align_stack_do // slothy:no-unfold - mov r11, sp - and r12, r11, #0xC - sub sp, sp, r12 // Align stack to 16 byte - sub sp, sp, #16 - str r12, [sp] -.endm - -// Reverts initial stack correction -.macro align_stack_undo // slothy:no-unfold - ldr r12, [sp] - add sp, sp, #16 - add sp, sp, r12 -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_12_345_67_opt_size_m55, %function -.global ntt_kyber_12_345_67_opt_size_m55 - - modulus .req r12 - r_ptr .req r11 - .equ modulus_const, -3329 - - in .req r0 - inp .req r1 - in_low .req r0 
- in_high .req r1 - - root0 .req r2 - root0_tw .req r3 - root1 .req r4 - root1_tw .req r5 - root2 .req r6 - root2_tw .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - data4 .req q1 - data5 .req q2 - data6 .req q3 - data7 .req q4 - - tmp .req q7 - - rtmp .req r3 - rtmp_tw .req r4 - - qtmp .req q5 - qtmp_tw .req q6 - -ntt_kyber_12_345_67_opt_size_m55: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - align_stack_do - - sub sp, sp, #STACK_SIZE - movw modulus, #:lower16:modulus_const - ldr r_ptr, roots_addr - - /* Layers 1,2 */ - - save STACK0, in - add in_high, in_low, #(2*128) - ldrd root0, root0_tw, [r_ptr], #+24 - ldrd root1, root1_tw, [r_ptr, #-16] - ldrd root2, root2_tw, [r_ptr, #-8] - - mov lr, #8 - .p2align 2 - vldrw.32 q4, [r1, #128] // *.... - vqrdmulh.s16 q2, q4, r3 // ..*.. - // gap // ..... - vmul.s16 q0, q4, r2 // ...*. - vldrw.32 q4, [r0, #128] // .*... - vmla.s16 q0, q2, r12 // ....* - - // original source code - // vldrw.32 q5, [r1, #128] // *.... - // vldrw.32 q4, [r0, #128] // ...*. - // vqrdmulh.s16 q1, q5, r3 // .*... - // vmul.s16 q0, q5, r2 // ..*.. - // vmla.s16 q0, q1, r12 // ....* - - sub lr, lr, #1 -.p2align 2 -layer12_loop: - vadd.u16 q7, q4, q0 // .............*.............. - vmul.s16 q2, q7, r4 // ..............*............. - vldrw.32 q5, [r1] // ..*......................... - vqrdmulh.s16 q1, q5, r3 // .....*...................... - vldrw.32 q6, [r0] // *........................... - vmul.s16 q3, q5, r2 // ....*....................... - vsub.u16 q0, q4, q0 // ............*............... - vmla.s16 q3, q1, r12 // ......*..................... - vldrw.32 q5, [r1, #144] // ...e........................ - vqrdmulh.s16 q1, q7, r5 // ...............*............ - vadd.u16 q7, q6, q3 // ........*................... - vmla.s16 q2, q1, r12 // ................*........... - vldrw.32 q4, [r0, #144] // .e.......................... - vsub.u16 q1, q7, q2 // .................*.......... 
- vstrw.u32 q1, [r0, #128] // .........................*.. - vqrdmulh.s16 q1, q0, r7 // ....................*....... - vadd.u16 q2, q7, q2 // ..................*......... - vmul.s16 q7, q0, r6 // ...................*........ - vstrw.u32 q2, [r0] , #16 // ........................*... - vmla.s16 q7, q1, r12 // .....................*...... - vsub.u16 q3, q6, q3 // .......*.................... - vqrdmulh.s16 q1, q5, r3 // ..........e................. - vadd.u16 q6, q3, q7 // .......................*.... - vmul.s16 q0, q5, r2 // .........e.................. - vsub.u16 q5, q3, q7 // ......................*..... - vstrw.u32 q6, [r1] , #16 // ..........................*. - vmla.s16 q0, q1, r12 // ...........e................ - vstrw.u32 q5, [r1, #112] // ...........................* - - // original source code - // vldrw.32 q0, [r0] // ....................|...*....................... - // vldrw.32 q1, [r0, #(2*64)] // ....e...............|...........e............... - // vldrw.32 q2, [r1] // ....................|.*......................... - // vldrw.32 q3, [r1, #(2*64)] // e...................|.......e................... - // vmul.s16 q7, q2, r2 // ....................|....*...................... - // vqrdmulh.s16 q2, q2, r3 // ....................|..*........................ - // vmla.s16 q7, q2, r12 // ....................|......*.................... - // vsub.u16 q2, q0, q7 // ............*.......|...................*....... - // vadd.u16 q0, q0, q7 // ..*.................|.........*................. - // vmul.s16 q7, q3, r2 // ...............e....|......................e.... - // vqrdmulh.s16 q3, q3, r3 // .............e......|....................e...... - // vmla.s16 q7, q3, r12 // ..................e.|.........................e. - // vsub.u16 q3, q1, q7 // ....................|.....*..................... - // vadd.u16 q1, q1, q7 // ....................*........................... - // vmul.s16 q7, q1, r4 // ....................|*.......................... 
- // vqrdmulh.s16 q1, q1, r5 // .*..................|........*.................. - // vmla.s16 q7, q1, r12 // ...*................|..........*................ - // vsub.u16 q1, q0, q7 // .....*..............|............*.............. - // vadd.u16 q0, q0, q7 // ........*...........|...............*........... - // vmul.s16 q7, q3, r6 // .........*..........|................*.......... - // vqrdmulh.s16 q3, q3, r7 // .......*............|..............*............ - // vmla.s16 q7, q3, r12 // ...........*........|..................*........ - // vsub.u16 q3, q2, q7 // ................*...|.......................*... - // vadd.u16 q2, q2, q7 // ..............*.....|.....................*..... - // vstrw.u32 q0, [r0], #16 // ..........*.........|.................*......... - // vstrw.u32 q1, [r0, #(2*64 - 16)] // ......*.............|.............*............. - // vstrw.u32 q2, [r1], #16 // .................*..|........................*.. - // vstrw.u32 q3, [r1, #(2*64 - 16)] // ...................*|..........................* - - le lr, layer12_loop -layer12_loop_end: // end of loop kernel - vldrw.32 q5, [r1] // ..*.................... - vmul.s16 q6, q5, r2 // .....*................. - vsub.u16 q1, q4, q0 // ......*................ - vqrdmulh.s16 q2, q5, r3 // ...*................... - vadd.u16 q0, q4, q0 // *...................... - vmla.s16 q6, q2, r12 // .......*............... - vldrw.32 q7, [r0] // ....*.................. - vmul.s16 q3, q1, r6 // ...............*....... - vadd.u16 q4, q7, q6 // .........*............. - vqrdmulh.s16 q2, q1, r7 // .............*......... - // gap // ....................... - vmla.s16 q3, q2, r12 // .................*..... - vsub.u16 q2, q7, q6 // ..................*.... - vmul.s16 q5, q0, r4 // .*..................... - vadd.u16 q7, q2, q3 // ...................*... - vqrdmulh.s16 q6, q0, r5 // ........*.............. - vsub.u16 q3, q2, q3 // ....................*.. - vstrw.u32 q7, [r1] , #16 // .....................*. 
- vmla.s16 q5, q6, r12 // ..........*............ - vstrw.u32 q3, [r1, #112] // ......................* - vadd.u16 q7, q4, q5 // ..............*........ - vstrw.u32 q7, [r0] , #16 // ................*...... - vsub.u16 q4, q4, q5 // ...........*........... - vstrw.u32 q4, [r0, #112] // ............*.......... - - // original source code - // vadd.u16 q7, q4, q0 // ....*.................. - // vmul.s16 q2, q7, r4 // ............*.......... - // vldrw.32 q5, [r1] // *...................... - // vqrdmulh.s16 q1, q5, r3 // ...*................... - // vldrw.32 q6, [r0] // ......*................ - // vmul.s16 q3, q5, r2 // .*..................... - // vsub.u16 q0, q4, q0 // ..*.................... - // vmla.s16 q3, q1, r12 // .....*................. - // vqrdmulh.s16 q1, q7, r5 // ..............*........ - // vadd.u16 q7, q6, q3 // ........*.............. - // vmla.s16 q2, q1, r12 // .................*..... - // vsub.u16 q1, q7, q2 // .....................*. - // vstrw.u32 q1, [r0, #128] // ......................* - // vqrdmulh.s16 q1, q0, r7 // .........*............. - // vadd.u16 q2, q7, q2 // ...................*... - // vmul.s16 q7, q0, r6 // .......*............... - // vstrw.u32 q2, [r0] , #16 // ....................*.. - // vmla.s16 q7, q1, r12 // ..........*............ - // vsub.u16 q3, q6, q3 // ...........*........... - // vadd.u16 q6, q3, q7 // .............*......... - // vsub.u16 q5, q3, q7 // ...............*....... - // vstrw.u32 q6, [r1] , #16 // ................*...... - // vstrw.u32 q5, [r1, #112] // ..................*.... - - - /* Layers 3,4,5 */ - - restore in, STACK0 - mov lr, #4 - .p2align 2 -.p2align 2 -layer345_loop: - ldrd r7, r2, [r11] , #56 // *........................................................................................ - vldrw.32 q1, [r0, #112] // ..........................*.............................................................. 
- vmul.s16 q0, q1, r7 // ...........................*............................................................. - vldrw.32 q3, [r0, #80] // ..........*.............................................................................. - vmul.s16 q6, q3, r7 // ...........*............................................................................. - vldrw.32 q2, [r0, #96] // ..................*...................................................................... - vqrdmulh.s16 q4, q3, r2 // ............*............................................................................ - ldrd r1, r5, [r11, #8*POS_ROOT_1 - 56] // ................................*........................................................ - vqrdmulh.s16 q1, q1, r2 // ............................*............................................................ - vldrw.32 q7, [r0, #48] // .........................*............................................................... - vmla.s16 q0, q1, r12 // .............................*........................................................... - ldrd r4, r8, [r11, #8*POS_ROOT_2 - 56] // ...........................................*............................................. - vmla.s16 q6, q4, r12 // .............*........................................................................... - vadd.u16 q1, q7, q0 // ...............................*......................................................... - vqrdmulh.s16 q5, q1, r5 // .......................................*................................................. - vsub.u16 q4, q7, q0 // ..............................*.......................................................... - vmul.s16 q1, q1, r1 // ......................................*.................................................. - vldrw.32 q0, [r0, #16] // .........*............................................................................... 
- vmla.s16 q1, q5, r12 // ........................................*................................................
- vadd.u16 q5, q0, q6 // ...............*.........................................................................
- vmul.s16 q7, q2, r7 // ...................*.....................................................................
- vadd.u16 q3, q5, q1 // ..........................................*..............................................
- vqrdmulh.s16 q2, q2, r2 // ....................*....................................................................
- vsub.u16 q6, q0, q6 // ..............*..........................................................................
- vmul.s16 q0, q3, r4 // ............................................*............................................
- vsub.u16 q1, q5, q1 // .........................................*...............................................
- vmla.s16 q7, q2, r12 // .....................*...................................................................
- vldrw.32 q5, [r0, #32] // .................*.......................................................................
- vsub.u16 q2, q5, q7 // ......................*..................................................................
- qsave QSTACK5, q6 // ................*........................................................................
- vqrdmulh.s16 q6, q3, r8 // .............................................*...........................................
- qsave QSTACK6, q2 // ........................*................................................................
- vmla.s16 q0, q6, r12 // ..............................................*..........................................
- vadd.u16 q7, q5, q7 // .......................*.................................................................
- vmul.s16 q3, q7, r1 // .................................*.......................................................
- vldrw.32 q6, [r0, #64] // ..*......................................................................................
- vqrdmulh.s16 q5, q7, r5 // ..................................*......................................................
- ldrd r10, r6, [r11, #8*POS_ROOT_4 - 56] // ..............................................................*..........................
- vqrdmulh.s16 q7, q6, r2 // ....*....................................................................................
- ldrd r2, r3, [r11, #8*POS_ROOT_5 - 56] // .........................................................................*...............
- vmul.s16 q2, q6, r7 // ...*.....................................................................................
- ldrd r7, r9, [r11, #8*POS_ROOT_3 - 56] // .................................................*.......................................
- vmla.s16 q2, q7, r12 // .....*...................................................................................
- vldrw.32 q7, [r0] // .*.......................................................................................
- vsub.u16 q6, q7, q2 // ......*..................................................................................
- vmla.s16 q3, q5, r12 // ...................................*.....................................................
- qsave QSTACK4, q6 // ........*................................................................................
- vadd.u16 q5, q7, q2 // .......*.................................................................................
- vqrdmulh.s16 q6, q1, r9 // ...................................................*....................................
- vadd.u16 q7, q5, q3 // .....................................*...................................................
- vmul.s16 q2, q1, r7 // ..................................................*.....................................
- vsub.u16 q5, q5, q3 // ....................................*....................................................
- vmla.s16 q2, q6, r12 // ....................................................*....................................
- vsub.u16 q6, q7, q0 // ...............................................*.........................................
- ldrd r9, r7, [r11, #8*POS_ROOT_6 - 56] // ...............................................................................*.........
- vadd.u16 q0, q7, q0 // ................................................*........................................
- vstrw.u32 q0, [r0] , #128 // .......................................................*.................................
- vsub.u16 q0, q5, q2 // .....................................................*...................................
- vstrw.u32 q0, [r0, #-80] // ..........................................................*..............................
- vqrdmulh.s16 q7, q4, r6 // .....................................................................*...................
- vadd.u16 q0, q5, q2 // ......................................................*..................................
- vmul.s16 q2, q4, r10 // ....................................................................*....................
- qrestore q1, QSTACK6 // .............................................................*...........................
- vqrdmulh.s16 q4, q1, r6 // ................................................................*........................
- qrestore q5, QSTACK5 // ............................................................*............................
- vmul.s16 q1, q1, r10 // ...............................................................*.........................
- qrestore q3, QSTACK4 // ...........................................................*.............................
- vmla.s16 q1, q4, r12 // .................................................................*.......................
- vstrw.u32 q0, [r0, #-96] // .........................................................*...............................
- vmla.s16 q2, q7, r12 // ......................................................................*..................
- vstrw.u32 q6, [r0, #-112] // ........................................................*................................
- vadd.u16 q6, q5, q2 // ........................................................................*................
- vqrdmulh.s16 q4, q6, r3 // ...........................................................................*.............
- vsub.u16 q5, q5, q2 // .......................................................................*.................
- vmul.s16 q2, q6, r2 // ..........................................................................*..............
- vadd.u16 q7, q3, q1 // ...................................................................*.....................
- vmla.s16 q2, q4, r12 // ............................................................................*............
- vsub.u16 q4, q3, q1 // ..................................................................*......................
- vmul.s16 q0, q5, r9 // ................................................................................*........
- vadd.u16 q3, q7, q2 // ..............................................................................*..........
- vqrdmulh.s16 q5, q5, r7 // .................................................................................*.......
- vsub.u16 q2, q7, q2 // .............................................................................*...........
- vstrw.u32 q3, [r0, #-64] // .....................................................................................*...
- vmla.s16 q0, q5, r12 // ..................................................................................*......
- vstrw.u32 q2, [r0, #-48] // ......................................................................................*..
- vadd.u16 q2, q4, q0 // ....................................................................................*....
- vstrw.u32 q2, [r0, #-32] // .......................................................................................*.
- vsub.u16 q1, q4, q0 // ...................................................................................*.....
- vstrw.u32 q1, [r0, #-16] // ........................................................................................*
-
- // original source code
- // ldrd r3, r4, [r11], #(7*8) // *........................................................................................
- // vldrw.32 q0, [r0] // ...........................................*.............................................
- // vldrw.32 q1, [r0, #64] // ...................................*.....................................................
- // vmul.s16 q7, q1, r3 // ........................................*................................................
- // vqrdmulh.s16 q1, q1, r4 // ......................................*..................................................
- // vmla.s16 q7, q1, r12 // ..........................................*..............................................
- // vsub.u16 q1, q0, q7 // ............................................*............................................
- // vadd.u16 q0, q0, q7 // ...............................................*.........................................
- // qsave QSTACK4, q1 // ..............................................*..........................................
- // vldrw.32 q1, [r0, #16] // .................*.......................................................................
- // vldrw.32 q2, [r0, #80] // ...*.....................................................................................
- // vmul.s16 q7, q2, r3 // ....*....................................................................................
- // vqrdmulh.s16 q2, q2, r4 // ......*..................................................................................
- // vmla.s16 q7, q2, r12 // ............*............................................................................
- // vsub.u16 q2, q1, q7 // .......................*.................................................................
- // vadd.u16 q1, q1, q7 // ...................*.....................................................................
- // qsave QSTACK5, q2 // .............................*...........................................................
- // vldrw.32 q2, [r0, #32] // ...........................*.............................................................
- // vldrw.32 q3, [r0, #96] // .....*...................................................................................
- // vmul.s16 q7, q3, r3 // ....................*....................................................................
- // vqrdmulh.s16 q3, q3, r4 // ......................*..................................................................
- // vmla.s16 q7, q3, r12 // ..........................*..............................................................
- // vsub.u16 q3, q2, q7 // ............................*............................................................
- // vadd.u16 q2, q2, q7 // .................................*.......................................................
- // qsave QSTACK6, q3 // ...............................*.........................................................
- // vldrw.32 q3, [r0, #48] // .........*...............................................................................
- // vldrw.32 q4, [r0, #112] // .*.......................................................................................
- // vmul.s16 q7, q4, r3 // ..*......................................................................................
- // vqrdmulh.s16 q4, q4, r4 // ........*................................................................................
- // vmla.s16 q7, q4, r12 // ..........*..............................................................................
- // vsub.u16 q4, q3, q7 // ...............*.........................................................................
- // vadd.u16 q3, q3, q7 // .............*...........................................................................
- // ldrd r3, r4, [r11, #((-7 + POS_ROOT_1)*8)] // .......*.................................................................................
- // vmul.s16 q7, q2, r3 // ..................................*......................................................
- // vqrdmulh.s16 q2, q2, r4 // ....................................*....................................................
- // vmla.s16 q7, q2, r12 // .............................................*...........................................
- // vsub.u16 q2, q0, q7 // ...................................................*.....................................
- // vadd.u16 q0, q0, q7 // .................................................*.......................................
- // vmul.s16 q7, q3, r3 // ................*........................................................................
- // vqrdmulh.s16 q3, q3, r4 // ..............*..........................................................................
- // vmla.s16 q7, q3, r12 // ..................*......................................................................
- // vsub.u16 q3, q1, q7 // .........................*...............................................................
- // vadd.u16 q1, q1, q7 // .....................*...................................................................
- // ldrd r3, r4, [r11, #((-7 + POS_ROOT_2)*8)] // ...........*.............................................................................
- // vmul.s16 q7, q1, r3 // ........................*................................................................
- // vqrdmulh.s16 q1, q1, r4 // ..............................*..........................................................
- // vmla.s16 q7, q1, r12 // ................................*........................................................
- // vsub.u16 q1, q0, q7 // .....................................................*...................................
- // vadd.u16 q0, q0, q7 // .......................................................*.................................
- // ldrd r3, r4, [r11, #((-7 + POS_ROOT_3)*8)] // .........................................*...............................................
- // vmul.s16 q7, q3, r3 // ..................................................*......................................
- // vqrdmulh.s16 q3, q3, r4 // ................................................*........................................
- // vmla.s16 q7, q3, r12 // ....................................................*....................................
- // vsub.u16 q3, q2, q7 // .........................................................*...............................
- // vadd.u16 q2, q2, q7 // ............................................................*............................
- // vstrw.u32 q0, [r0], #128 // ........................................................*................................
- // vstrw.u32 q1, [r0, #(-128+16)] // ......................................................................*..................
- // vstrw.u32 q2, [r0, #(-128+32)] // ....................................................................*....................
- // vstrw.u32 q3, [r0, #(-128+48)] // ..........................................................*..............................
- // qrestore q1, QSTACK4 // ..................................................................*......................
- // qrestore q2, QSTACK5 // ................................................................*........................
- // qrestore q3, QSTACK6 // ..............................................................*..........................
- // ldrd r3, r4, [r11, #((-7 + POS_ROOT_4)*8)] // .....................................*...................................................
- // vmul.s16 q7, q3, r3 // .................................................................*.......................
- // vqrdmulh.s16 q3, q3, r4 // ...............................................................*.........................
- // vmla.s16 q7, q3, r12 // ...................................................................*.....................
- // vsub.u16 q3, q1, q7 // .............................................................................*...........
- // vadd.u16 q1, q1, q7 // ...........................................................................*.............
- // vmul.s16 q7, q4, r3 // .............................................................*...........................
- // vqrdmulh.s16 q4, q4, r4 // ...........................................................*.............................
- // vmla.s16 q7, q4, r12 // .....................................................................*...................
- // vsub.u16 q4, q2, q7 // .........................................................................*...............
- // vadd.u16 q2, q2, q7 // .......................................................................*.................
- // ldrd r3, r4, [r11, #((-7 + POS_ROOT_5)*8)] // .......................................*.................................................
- // vmul.s16 q7, q2, r3 // ..........................................................................*..............
- // vqrdmulh.s16 q2, q2, r4 // ........................................................................*................
- // vmla.s16 q7, q2, r12 // ............................................................................*............
- // vsub.u16 q2, q1, q7 // .................................................................................*.......
- // vadd.u16 q1, q1, q7 // ...............................................................................*.........
- // ldrd r3, r4, [r11, #((-7 + POS_ROOT_6)*8)] // ......................................................*..................................
- // vmul.s16 q7, q4, r3 // ..............................................................................*..........
- // vqrdmulh.s16 q4, q4, r4 // ................................................................................*........
- // vmla.s16 q7, q4, r12 // ...................................................................................*.....
- // vsub.u16 q4, q3, q7 // .......................................................................................*.
- // vadd.u16 q3, q3, q7 // .....................................................................................*...
- // vstrw.u32 q1, [r0, #(-128+64)] // ..................................................................................*......
- // vstrw.u32 q2, [r0, #(-128+80)] // ....................................................................................*....
- // vstrw.u32 q3, [r0, #(-128+96)] // ......................................................................................*..
- // vstrw.u32 q4, [r0, #(-128+112)] // ........................................................................................*
-
- le lr, layer345_loop
-
- // Layer 67
-
- // Use a different base register to facilitate Helight being able to
- // overlap the first iteration of L67 with the last iteration of L345.
- restore inp, STACK0
- mov lr, #8
- .p2align 2
- vld40.32 {q1,q2,q3,q4}, [r1] // *.....
- // gap // ......
- vld41.32 {q1,q2,q3,q4}, [r1] // .*....
- // gap // ......
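As background on the deleted Layer 67 code: the `vld40`-`vld43` group is Helium's four-beat, de-interleaving structure load. The four instructions together consume 64 contiguous bytes and scatter them across q1-q4 so that 32-bit word `4*j + k` lands in lane `j` of the k-th destination register, giving the NTT a free transpose of coefficient blocks. A minimal Python sketch of that lane mapping, assuming the standard MVE VLD4 semantics (illustration only, not code from this patch):

```python
def vld4x(mem):
    """Model of the MVE VLD4{0,1,2,3} de-interleaving load.

    mem holds 16 consecutive 32-bit words; destination register k
    receives words k, k+4, k+8, k+12, i.e. lane j of Qk = mem[4*j + k].
    """
    assert len(mem) == 16
    return [[mem[4 * j + k] for j in range(4)] for k in range(4)]

# De-interleave the words 0..15 into four model "registers"
q1, q2, q3, q4 = vld4x(list(range(16)))
```

So `q1` holds words 0, 4, 8, 12 and `q4` holds 3, 7, 11, 15, which is why the kernel can run the last two butterfly layers on strided coefficients without any explicit permute instructions.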
- vld42.32 {q1,q2,q3,q4}, [r1] // ..*...
- // gap // ......
- vld43.32 {q1,q2,q3,q4}, [r1]! // ...*..
- // gap // ......
- vldrh.16 q5, [r11] , #96 // ....*.
- vmul.s16 q6, q3, q5 // .....*
-
- // original source code
- // vld40.32 {q1,q2,q3,q4}, [r1] // *.....
- // vld41.32 {q1,q2,q3,q4}, [r1] // .*....
- // vld42.32 {q1,q2,q3,q4}, [r1] // ..*...
- // vld43.32 {q1,q2,q3,q4}, [r1]! // ...*..
- // vldrh.16 q5, [r11] , #96 // ....*.
- // vmul.s16 q6, q3, q5 // .....*
-
- sub lr, lr, #1
-.p2align 2
-layer67_loop:
- vmul.s16 q5, q4, q5 // ...........*......................
- vldrh.16 q7, [r11, #-80] // .....*............................
- vqrdmulh.s16 q0, q4, q7 // ............*.....................
- vldrh.16 q4, [r11, #-32] // .......................*..........
- vqrdmulh.s16 q7, q3, q7 // .......*..........................
- vldrh.16 q3, [r11, #-16] // ........................*.........
- vmla.s16 q5, q0, r12 // .............*....................
- vldrh.16 q0, [r11, #-64] // ................*.................
- vmla.s16 q6, q7, r12 // ........*.........................
- vsub.u16 q7, q2, q5 // ..............*...................
- vmul.s16 q4, q7, q4 // .........................*........
- vadd.u16 q5, q2, q5 // ...............*..................
- vqrdmulh.s16 q7, q7, q3 // ..........................*.......
- vsub.u16 q3, q1, q6 // .........*........................
- vmla.s16 q4, q7, r12 // ...........................*......
- vldrh.16 q7, [r11, #-48] // .................*................
- vadd.u16 q2, q3, q4 // .............................*....
- vstrw.u32 q2, [r1, #-32] // ................................*.
- vsub.u16 q4, q3, q4 // ............................*.....
- vstrw.u32 q4, [r1, #-16] // .................................*
- vadd.u16 q6, q1, q6 // ..........*.......................
- vld40.32 {q1,q2,q3,q4}, [r1] // e.................................
- vmul.s16 q0, q5, q0 // ..................*...............
- vld41.32 {q1,q2,q3,q4}, [r1] // .e................................
- vqrdmulh.s16 q7, q5, q7 // ...................*..............
- vld42.32 {q1,q2,q3,q4}, [r1] // ..e...............................
- vmla.s16 q0, q7, r12 // ....................*.............
- vld43.32 {q1,q2,q3,q4}, [r1]! // ...e..............................
- vadd.u16 q5, q6, q0 // ......................*...........
- vstrw.u32 q5, [r1, #-128] // ..............................*...
- vsub.u16 q7, q6, q0 // .....................*............
- vldrh.16 q5, [r11] , #96 // ....e.............................
- vmul.s16 q6, q3, q5 // ......e...........................
- vstrw.u32 q7, [r1, #-112] // ...............................*..
-
- // original source code
- // vld40.32 {q0, q1, q2, q3}, [r1] // e............|....................e............
- // vld41.32 {q0, q1, q2, q3}, [r1] // ..e..........|......................e..........
- // vld42.32 {q0, q1, q2, q3}, [r1] // ....e........|........................e........
- // vld43.32 {q0, q1, q2, q3}, [r1]! // ......e......|..........................e......
- // vldrh.16 q5, [r11], #+96 // ..........e..|..............................e..
- // vldrh.16 q6, [r11, #(+16-96)] // .............|*................................
- // vmul.s16 q7, q2, q5 // ...........e.|...............................e.
- // vqrdmulh.s16 q2, q2, q6 // .............|...*.............................
- // vmla.s16 q7, q2, r12 // .............|.......*.........................
- // vsub.u16 q2, q0, q7 // .............|............*....................
- // vadd.u16 q0, q0, q7 // .............|...................*.............
- // vmul.s16 q7, q3, q5 // .............*.................................
- // vqrdmulh.s16 q3, q3, q6 // .............|.*...............................
- // vmla.s16 q7, q3, r12 // .............|.....*...........................
- // vsub.u16 q3, q1, q7 // .............|........*........................
- // vadd.u16 q1, q1, q7 // .............|..........*......................
- // vldrh.16 q5, [r11, #(32 - 96)] // .............|......*..........................
- // vldrh.16 q6, [r11, #(48 - 96)] // .............|..............*..................
- // vmul.s16 q7, q1, q5 // .*...........|.....................*...........
- // vqrdmulh.s16 q1, q1, q6 // ...*.........|.......................*.........
- // vmla.s16 q7, q1, r12 // .....*.......|.........................*.......
- // vsub.u16 q1, q0, q7 // .........*...|.............................*...
- // vadd.u16 q0, q0, q7 // .......*.....|...........................*.....
- // vldrh.16 q5, [r11, #(64-96)] // .............|..*..............................
- // vldrh.16 q6, [r11, #(80-96)] // .............|....*............................
- // vmul.s16 q7, q3, q5 // .............|.........*.......................
- // vqrdmulh.s16 q3, q3, q6 // .............|...........*.....................
- // vmla.s16 q7, q3, r12 // .............|.............*...................
- // vsub.u16 q3, q2, q7 // .............|.................*...............
- // vadd.u16 q2, q2, q7 // .............|...............*.................
- // vstrw.u32 q0, [r1, #-64] // ........*....|............................*....
- // vstrw.u32 q1, [r1, #-48] // ............*|................................*
- // vstrw.u32 q2, [r1, #-32] // .............|................*................
- // vstrw.u32 q3, [r1, #-16] // .............|..................*..............
-
- le lr, layer67_loop
- vmul.s16 q0, q4, q5 // *...........................
- vldrh.16 q5, [r11, #-80] // .*..........................
- vqrdmulh.s16 q4, q4, q5 // ..*.........................
- vldrh.16 q7, [r11, #-48] // ...............*............
- vqrdmulh.s16 q5, q3, q5 // ....*.......................
- vldrh.16 q3, [r11, #-16] // .....*......................
- vmla.s16 q0, q4, r12 // ......*.....................
- vldrh.16 q4, [r11, #-64] // .......*....................
- vmla.s16 q6, q5, r12 // ........*...................
- vsub.u16 q5, q2, q0 // .........*..................
- vqrdmulh.s16 q3, q5, q3 // ............*...............
- vadd.u16 q2, q2, q0 // ...........*................
- vldrh.16 q0, [r11, #-32] // ...*........................
- vmul.s16 q5, q5, q0 // ..........*.................
- vadd.u16 q0, q1, q6 // ....................*.......
- vmla.s16 q5, q3, r12 // ..............*.............
- vsub.u16 q3, q1, q6 // .............*..............
- vqrdmulh.s16 q6, q2, q7 // ......................*.....
- vsub.u16 q7, q3, q5 // ..................*.........
- vmul.s16 q1, q2, q4 // .....................*......
- vstrw.u32 q7, [r1, #-16] // ...................*........
- vmla.s16 q1, q6, r12 // .......................*....
- vadd.u16 q6, q3, q5 // ................*...........
- vstrw.u32 q6, [r1, #-32] // .................*..........
- vsub.u16 q4, q0, q1 // ..........................*.
- vstrw.u32 q4, [r1, #-48] // ...........................*
- vadd.u16 q0, q0, q1 // ........................*...
- vstrw.u32 q0, [r1, #-64] // .........................*..
-
- // original source code
- // vmul.s16 q5, q4, q5 // *...........................
- // vldrh.16 q7, [r11, #-80] // .*..........................
- // vqrdmulh.s16 q0, q4, q7 // ..*.........................
- // vldrh.16 q4, [r11, #-32] // ............*...............
- // vqrdmulh.s16 q7, q3, q7 // ....*.......................
- // vldrh.16 q3, [r11, #-16] // .....*......................
- // vmla.s16 q5, q0, r12 // ......*.....................
- // vldrh.16 q0, [r11, #-64] // .......*....................
- // vmla.s16 q6, q7, r12 // ........*...................
- // vsub.u16 q7, q2, q5 // .........*..................
- // vmul.s16 q4, q7, q4 // .............*..............
- // vadd.u16 q5, q2, q5 // ...........*................
- // vqrdmulh.s16 q7, q7, q3 // ..........*.................
- // vsub.u16 q3, q1, q6 // ................*...........
- // vmla.s16 q4, q7, r12 // ...............*............
- // vldrh.16 q7, [r11, #-48] // ...*........................
- // vadd.u16 q2, q3, q4 // ......................*.....
- // vstrw.u32 q2, [r1, #-32] // .......................*....
- // vsub.u16 q4, q3, q4 // ..................*.........
- // vstrw.u32 q4, [r1, #-16] // ....................*.......
- // vadd.u16 q6, q1, q6 // ..............*.............
- // vmul.s16 q0, q5, q0 // ...................*........
- // vqrdmulh.s16 q7, q5, q7 // .................*..........
- // vmla.s16 q0, q7, r12 // .....................*......
- // vadd.u16 q5, q6, q0 // ..........................*.
- // vstrw.u32 q5, [r1, #-64] // ...........................*
- // vsub.u16 q7, q6, q0 // ........................*...
- // vstrw.u32 q7, [r1, #-48] // .........................*..
-
-
- add sp, sp, #STACK_SIZE
-
- align_stack_undo
- // Restore MVE vector registers
- vpop {d8-d15}
- // Restore GPRs
- pop {r4-r11,lr}
- bx lr
\ No newline at end of file
diff --git a/tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m85.s b/tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m85.s
deleted file mode 100644
index 0de67fa..0000000
--- a/tests/ntt_kyber/manual/ntt_kyber_12_345_67_opt_size_m85.s
+++ /dev/null
@@ -1,642 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.data
-.p2align 4
-roots:
-#include "ntt_kyber_12_345_67_twiddles.s"
-.text
-
-#define QSTACK4 (0*16)
-#define QSTACK5 (1*16)
-#define QSTACK6 (2*16)
-#define STACK0 (3*16)
-
-#define POS_ROOT_1 1
-#define POS_ROOT_2 2
-#define POS_ROOT_3 3
-#define POS_ROOT_4 4
-#define POS_ROOT_5 5
-#define POS_ROOT_6 6
-
-#define STACK_SIZE (3*16 + 8)
-
-.macro qsave loc, a // slothy:no-unfold
- vstrw.32 \a, [sp, #\loc\()]
-.endm
-.macro qrestore a, loc // slothy:no-unfold
- vldrw.32 \a, [sp, #\loc\()]
-.endm
-.macro restored a, b, loc // slothy:no-unfold
- ldrd \a, \b, [sp, #\loc\()]
-.endm
-.macro saved loc, a, b // slothy:no-unfold
- strd \a, \b, [sp, #\loc\()]
-.endm
-.macro restore a, loc // slothy:no-unfold
- ldr \a, [sp, #\loc\()]
-.endm
-.macro save loc, a // slothy:no-unfold
- str \a, [sp, #\loc\()]
-.endm
-
-// Barrett multiplication
-.macro mulmod dst, src, const, const_tw
- vmul.s16 \dst, \src, \const
- vqrdmulh.s16 \src, \src, \const_tw
- vmla.s16 \dst, \src, modulus
-.endm
-
-.macro ct_butterfly a, b, root, root_tw
- mulmod tmp, \b, \root, \root_tw
- vsub.u16 \b, \a, tmp
- vadd.u16 \a, \a, tmp
-.endm
-
-// Aligns stack =0 mod 16
-.macro align_stack_do // slothy:no-unfold
- mov r11, sp
- and r12, r11, #0xC
- sub sp, sp, r12 // Align stack to 16 byte
- sub sp, sp, #16
- str r12, [sp]
-.endm
-
-// Reverts initial stack correction
-.macro align_stack_undo // slothy:no-unfold
- ldr r12, [sp]
- add sp, sp, #16
- add sp, sp, r12
-.endm
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_kyber_12_345_67_opt_size_m85, %function
-.global ntt_kyber_12_345_67_opt_size_m85
-
- modulus .req r12
- r_ptr .req r11
- .equ modulus_const, -3329
-
- in .req r0
- inp .req r1
- in_low .req r0
- in_high .req r1
-
- root0 .req r2
- root0_tw .req r3
- root1 .req r4
- root1_tw .req r5
- root2 .req r6
- root2_tw .req r7
-
- data0 .req q0
- data1 .req q1
- data2 .req q2
- data3 .req q3
- data4 .req q1
- data5 .req q2
- data6 .req q3
- data7 .req q4
-
- tmp .req q7
-
- rtmp .req r3
- rtmp_tw .req r4
-
- qtmp .req q5
- qtmp_tw .req q6
-
-ntt_kyber_12_345_67_opt_size_m85:
-
- push {r4-r11,lr}
- // Save MVE vector registers
- vpush {d8-d15}
- align_stack_do
-
- sub sp, sp, #STACK_SIZE
- movw modulus, #:lower16:modulus_const
- ldr r_ptr, roots_addr
-
- /* Layers 1,2 */
-
- save STACK0, in
- add in_high, in_low, #(2*128)
- ldrd root0, root0_tw, [r_ptr], #+24
- ldrd root1, root1_tw, [r_ptr, #-16]
- ldrd root2, root2_tw, [r_ptr, #-8]
-
- mov lr, #8
- .p2align 2
- vldrw.32 q3, [r1] // *.
- vqrdmulh.s16 q2, q3, r3 // .*
-
- // original source code
- // vldrw.32 q3, [r1] // *.
- // vqrdmulh.s16 q2, q3, r3 // .*
-
- sub lr, lr, #1
-.p2align 2
-layer12_loop:
- vmul.s16 q0, q3, r2 // ....*.......................
- vldrw.32 q1, [r1, #128] // ...*........................
- vqrdmulh.s16 q5, q1, r3 // ..........*.................
- vldrw.32 q4, [r0] // *...........................
- vmul.s16 q1, q1, r2 // .........*..................
- vldrw.32 q6, [r0, #128] // .*..........................
- vmla.s16 q1, q5, r12 // ...........*................
- vldrw.32 q3, [r1, #16] // ..e.........................
- vmla.s16 q0, q2, r12 // ......*.....................
- vadd.u16 q5, q6, q1 // .............*..............
- vqrdmulh.s16 q7, q5, r5 // ...............*............
- vsub.u16 q2, q4, q0 // .......*....................
- vmul.s16 q5, q5, r4 // ..............*.............
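For context on the kernel being deleted here: the `mulmod` macro is the Barrett multiplication used throughout these NTTs. `vmul` keeps the low 16 bits of `a*b`, `vqrdmulh` against a twisted constant estimates `round(a*b/q)`, and `vmla` with the `modulus` register (loaded with -3329) folds that many multiples of q back in, leaving a small representative of `a*b mod q`. A Python model of this arithmetic follows; the `twist` precomputation `round(b * 2^15 / q)` is my assumption about how the twiddle pairs are derived, while the wrapping and rounding mirror the Arm definitions of `vmul`/`vmla` (low half) and `vqrdmulh` (doubling, rounding, high half):

```python
Q = 3329  # Kyber modulus

def _low16(x):
    """Wrap to signed 16 bits, as vmul/vmla keep only the low half."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def _vqrdmulh_s16(a, b):
    """Doubling, rounding, high-half multiply. Saturation is omitted
    for brevity; it only triggers for a = b = -2**15."""
    return (2 * a * b + (1 << 15)) >> 16

def twist(b, q=Q):
    """Assumed twiddle precomputation: round(b * 2^15 / q)."""
    return round(b * (1 << 15) / q)

def mulmod(a, b, q=Q):
    """Model of the mulmod macro: a value congruent to a*b mod q."""
    t = _vqrdmulh_s16(a, twist(b, q))          # ~ round(a*b/q)
    return _low16(_low16(a * b) + _low16(t * -q))

def ct_butterfly(x, y, root, q=Q):
    """Cooley-Tukey butterfly as in the ct_butterfly macro."""
    t = mulmod(y, root, q)
    return _low16(x + t), _low16(x - t)
```

The result equals `a*b - q*t` for an integer `t`, so it is exactly congruent to `a*b` modulo q whenever the true value fits in 16 bits; the `ct_butterfly` model shows why one `mulmod` plus an add/sub pair gives the butterfly `(x, y) -> (x + root*y, x - root*y) mod q` that every NTT layer is built from.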
- vsub.u16 q6, q6, q1 // ............*...............
- vmla.s16 q5, q7, r12 // ................*...........
- vadd.u16 q4, q4, q0 // ........*...................
- vmul.s16 q7, q6, r6 // ...................*........
- vsub.u16 q1, q4, q5 // .................*..........
- vqrdmulh.s16 q6, q6, r7 // ....................*.......
- vstrw.u32 q1, [r0, #128] // .........................*..
- vmla.s16 q7, q6, r12 // .....................*......
- vadd.u16 q1, q4, q5 // ..................*.........
- vstrw.u32 q1, [r0] , #16 // ........................*...
- vadd.u16 q1, q2, q7 // .......................*....
- vstrw.u32 q1, [r1] , #16 // ..........................*.
- vsub.u16 q1, q2, q7 // ......................*.....
- vqrdmulh.s16 q2, q3, r3 // .....e......................
- vstrw.u32 q1, [r1, #112] // ...........................*
-
- // original source code
- // vldrw.32 q0, [r0] // .....................|..*........................
- // vldrw.32 q1, [r0, #(2*64)] // .....................|....*......................
- // vldrw.32 q2, [r1] // e....................|......e....................
- // vldrw.32 q3, [r1, #(2*64)] // .....................|*..........................
- // vmul.s16 q7, q2, r2 // .....................*...........................
- // vqrdmulh.s16 q2, q2, r3 // ...................e.|.........................e.
- // vmla.s16 q7, q2, r12 // .*...................|.......*...................
- // vsub.u16 q2, q0, q7 // ....*................|..........*................
- // vadd.u16 q0, q0, q7 // ........*............|..............*............
- // vmul.s16 q7, q3, r2 // .....................|...*.......................
- // vqrdmulh.s16 q3, q3, r3 // .....................|.*.........................
- // vmla.s16 q7, q3, r12 // .....................|.....*.....................
- // vsub.u16 q3, q1, q7 // ......*..............|............*..............
- // vadd.u16 q1, q1, q7 // ..*..................|........*..................
- // vmul.s16 q7, q1, r4 // .....*...............|...........*...............
- // vqrdmulh.s16 q1, q1, r5 // ...*.................|.........*.................
- // vmla.s16 q7, q1, r12 // .......*.............|.............*.............
- // vsub.u16 q1, q0, q7 // ..........*..........|................*..........
- // vadd.u16 q0, q0, q7 // ..............*......|....................*......
- // vmul.s16 q7, q3, r6 // .........*...........|...............*...........
- // vqrdmulh.s16 q3, q3, r7 // ...........*.........|.................*.........
- // vmla.s16 q7, q3, r12 // .............*.......|...................*.......
- // vsub.u16 q3, q2, q7 // ..................*..|........................*..
- // vadd.u16 q2, q2, q7 // ................*....|......................*....
- // vstrw.u32 q0, [r0], #16 // ...............*.....|.....................*.....
- // vstrw.u32 q1, [r0, #(2*64 - 16)] // ............*........|..................*........
- // vstrw.u32 q2, [r1], #16 // .................*...|.......................*...
- // vstrw.u32 q3, [r1, #(2*64 - 16)] // ....................*|..........................*
-
- le lr, layer12_loop
-layer12_loop_end: // end of loop kernel
- vmul.s16 q1, q3, r2 // *.........................
- vldrw.32 q6, [r1, #128] // .*........................
- vmul.s16 q7, q6, r2 // ....*.....................
- vldrw.32 q0, [r0, #128] // .....*....................
- vqrdmulh.s16 q6, q6, r3 // ..*.......................
- vldrw.32 q3, [r0] // ...*......................
- vmla.s16 q7, q6, r12 // ......*...................
- // gap // ..........................
- vmla.s16 q1, q2, r12 // .......*..................
- vsub.u16 q6, q0, q7 // ............*.............
- vmul.s16 q4, q6, r6 // ...............*..........
- vsub.u16 q5, q3, q1 // ..........*...............
- vqrdmulh.s16 q6, q6, r7 // .................*........
- vadd.u16 q2, q3, q1 // ..............*...........
- vmla.s16 q4, q6, r12 // ...................*......
- vadd.u16 q7, q0, q7 // ........*.................
- vqrdmulh.s16 q0, q7, r5 // .........*................
- vadd.u16 q6, q5, q4 // ......................*...
- vmul.s16 q1, q7, r4 // ...........*..............
- vstrw.u32 q6, [r1] , #16 // .......................*..
- vsub.u16 q6, q5, q4 // ........................*.
- vmla.s16 q1, q0, r12 // .............*............
- vstrw.u32 q6, [r1, #112] // .........................*
- vsub.u16 q6, q2, q1 // ................*.........
- vstrw.u32 q6, [r0, #128] // ..................*.......
- vadd.u16 q6, q2, q1 // ....................*.....
- vstrw.u32 q6, [r0] , #16 // .....................*....
-
- // original source code
- // vmul.s16 q0, q3, r2 // *.........................
- // vldrw.32 q1, [r1, #128] // .*........................
- // vqrdmulh.s16 q5, q1, r3 // ....*.....................
- // vldrw.32 q4, [r0] // .....*....................
- // vmul.s16 q1, q1, r2 // ..*.......................
- // vldrw.32 q6, [r0, #128] // ...*......................
- // vmla.s16 q1, q5, r12 // ......*...................
- // vmla.s16 q0, q2, r12 // .......*..................
- // vadd.u16 q5, q6, q1 // ..............*...........
- // vqrdmulh.s16 q7, q5, r5 // ...............*..........
- // vsub.u16 q2, q4, q0 // ..........*...............
- // vmul.s16 q5, q5, r4 // .................*........
- // vsub.u16 q6, q6, q1 // ........*.................
- // vmla.s16 q5, q7, r12 // ....................*.....
- // vadd.u16 q4, q4, q0 // ............*.............
- // vmul.s16 q7, q6, r6 // .........*................
- // vsub.u16 q1, q4, q5 // ......................*...
- // vqrdmulh.s16 q6, q6, r7 // ...........*..............
- // vstrw.u32 q1, [r0, #128] // .......................*..
- // vmla.s16 q7, q6, r12 // .............*............
- // vadd.u16 q1, q4, q5 // ........................*.
- // vstrw.u32 q1, [r0] , #16 // .........................*
- // vadd.u16 q1, q2, q7 // ................*.........
- // vstrw.u32 q1, [r1] , #16 // ..................*.......
- // vsub.u16 q1, q2, q7 // ...................*......
- // vstrw.u32 q1, [r1, #112] // .....................*....
-
-
- /* Layers 3,4,5 */
-
- restore in, STACK0
- mov lr, #4
- .p2align 2
-.p2align 2
-layer345_loop:
- ldrd r1, r2, [r11] , #56 // *........................................................................................
- vldrw.32 q6, [r0, #96] // ..................*......................................................................
- vqrdmulh.s16 q0, q6, r2 // ....................*....................................................................
- vldrw.32 q7, [r0, #80] // ..........*..............................................................................
- vmul.s16 q4, q6, r1 // ...................*.....................................................................
- vldrw.32 q5, [r0, #64] // ..*......................................................................................
- vmla.s16 q4, q0, r12 // .....................*...................................................................
- vldrw.32 q0, [r0, #32] // .................*.......................................................................
- vqrdmulh.s16 q2, q7, r2 // ............*............................................................................
- vsub.u16 q6, q0, q4 // ......................*..................................................................
- vmul.s16 q3, q7, r1 // ...........*............................................................................
- vadd.u16 q0, q0, q4 // .......................*.................................................................
- vmla.s16 q3, q2, r12 // .............*...........................................................................
- vldrw.32 q7, [r0, #112] // ..........................*..............................................................
- vqrdmulh.s16 q2, q5, r2 // ....*.................................................................................... - qsave QSTACK6, q6 // ........................*................................................................ - vmul.s16 q5, q5, r1 // ...*..................................................................................... - ldrd r10, r5, [r11, #8*POS_ROOT_1 - 56] // ................................*........................................................ - vmla.s16 q5, q2, r12 // .....*................................................................................... - vldrw.32 q1, [r0] // .*....................................................................................... - vqrdmulh.s16 q6, q7, r2 // ............................*............................................................ - vsub.u16 q2, q1, q5 // ......*.................................................................................. - vmul.s16 q7, q7, r1 // ...........................*............................................................. - qsave QSTACK4, q2 // ........*................................................................................ - vmla.s16 q7, q6, r12 // .............................*........................................................... - vadd.u16 q6, q1, q5 // .......*................................................................................. - vldrw.32 q2, [r0, #48] // .........................*............................................................... - vsub.u16 q5, q2, q7 // ..............................*.......................................................... - vqrdmulh.s16 q1, q0, r5 // ..................................*...................................................... - vadd.u16 q2, q2, q7 // ...............................*......................................................... - vqrdmulh.s16 q7, q2, r5 // .......................................*................................................. 
- ldrd r4, r3, [r11, #8*POS_ROOT_3 - 56] // .................................................*....................................... - vmul.s16 q0, q0, r10 // .................................*....................................................... - ldrd r6, r8, [r11, #8*POS_ROOT_2 - 56] // ...........................................*............................................. - vldrw.32 q4, [r0, #16] // .........*............................................................................... - vmul.s16 q2, q2, r10 // ......................................*.................................................. - ldrd r1, r10, [r11, #8*POS_ROOT_6 - 56] // ...............................................................................*......... - vmla.s16 q0, q1, r12 // ...................................*..................................................... - ldrd r9, r5, [r11, #8*POS_ROOT_4 - 56] // ..............................................................*.......................... - vsub.u16 q1, q6, q0 // ....................................*.................................................... - vmla.s16 q2, q7, r12 // ........................................*................................................ - vadd.u16 q7, q4, q3 // ...............*......................................................................... - ldrd r7, r2, [r11, #8*POS_ROOT_5 - 56] // .........................................................................*............... - vsub.u16 q3, q4, q3 // ..............*.......................................................................... - qsave QSTACK5, q3 // ................*........................................................................ - vadd.u16 q4, q7, q2 // ..........................................*.............................................. - vmul.s16 q3, q4, r6 // ............................................*............................................ 
- vadd.u16 q0, q6, q0 // .....................................*................................................... - vqrdmulh.s16 q6, q4, r8 // .............................................*........................................... - vsub.u16 q7, q7, q2 // .........................................*............................................... - vmla.s16 q3, q6, r12 // ..............................................*.......................................... - qrestore q2, QSTACK6 // .............................................................*........................... - vmul.s16 q4, q7, r4 // ..................................................*...................................... - vsub.u16 q6, q0, q3 // ...............................................*......................................... - vqrdmulh.s16 q7, q7, r3 // ...................................................*..................................... - vadd.u16 q3, q0, q3 // ................................................*........................................ - vmla.s16 q4, q7, r12 // ....................................................*.................................... - vstrw.u32 q6, [r0, #16] // ........................................................*................................ - vsub.u16 q6, q1, q4 // .....................................................*................................... - vmul.s16 q7, q2, r9 // ...............................................................*......................... - vadd.u16 q1, q1, q4 // ......................................................*.................................. - vqrdmulh.s16 q0, q2, r5 // ................................................................*........................ - vstrw.u32 q1, [r0, #32] // .........................................................*............................... - vqrdmulh.s16 q1, q5, r5 // .....................................................................*................... 
- qrestore q4, QSTACK4 // ...........................................................*............................. - vmul.s16 q5, q5, r9 // ....................................................................*.................... - vstrw.u32 q6, [r0, #48] // ..........................................................*.............................. - vmla.s16 q7, q0, r12 // .................................................................*....................... - qrestore q0, QSTACK5 // ............................................................*............................ - vmla.s16 q5, q1, r12 // ......................................................................*.................. - vstrw.u32 q3, [r0] , #128 // .......................................................*................................. - vsub.u16 q2, q0, q5 // .......................................................................*................. - vmul.s16 q6, q2, r1 // ................................................................................*........ - vadd.u16 q1, q4, q7 // ...................................................................*..................... - vqrdmulh.s16 q2, q2, r10 // .................................................................................*....... - vsub.u16 q4, q4, q7 // ..................................................................*...................... - vmla.s16 q6, q2, r12 // ..................................................................................*...... - vadd.u16 q5, q0, q5 // ........................................................................*................ - vqrdmulh.s16 q0, q5, r2 // ...........................................................................*............. - vadd.u16 q3, q4, q6 // ....................................................................................*.... - vmul.s16 q2, q5, r7 // ..........................................................................*.............. 
- vsub.u16 q6, q4, q6 // ...................................................................................*..... - vstrw.u32 q3, [r0, #-32] // .......................................................................................*. - vmla.s16 q2, q0, r12 // ............................................................................*............ - vstrw.u32 q6, [r0, #-16] // ........................................................................................* - vsub.u16 q5, q1, q2 // .............................................................................*........... - vstrw.u32 q5, [r0, #-48] // ......................................................................................*.. - vadd.u16 q6, q1, q2 // ..............................................................................*.......... - vstrw.u32 q6, [r0, #-64] // .....................................................................................*... - - // original source code - // ldrd r3, r4, [r11], #(7*8) // *........................................................................................ - // vldrw.32 q0, [r0] // ...................*..................................................................... - // vldrw.32 q1, [r0, #64] // .....*................................................................................... - // vmul.s16 q7, q1, r3 // ................*........................................................................ - // vqrdmulh.s16 q1, q1, r4 // ..............*.......................................................................... - // vmla.s16 q7, q1, r12 // ..................*...................................................................... - // vsub.u16 q1, q0, q7 // .....................*................................................................... - // vadd.u16 q0, q0, q7 // .........................*............................................................... 
- // qsave QSTACK4, q1 // .......................*................................................................. - // vldrw.32 q1, [r0, #16] // ..................................*...................................................... - // vldrw.32 q2, [r0, #80] // ...*..................................................................................... - // vmul.s16 q7, q2, r3 // ..........*.............................................................................. - // vqrdmulh.s16 q2, q2, r4 // ........*................................................................................ - // vmla.s16 q7, q2, r12 // ............*............................................................................ - // vsub.u16 q2, q1, q7 // ...........................................*............................................. - // vadd.u16 q1, q1, q7 // .........................................*............................................... - // qsave QSTACK5, q2 // ............................................*............................................ - // vldrw.32 q2, [r0, #32] // .......*................................................................................. - // vldrw.32 q3, [r0, #96] // .*....................................................................................... - // vmul.s16 q7, q3, r3 // ....*.................................................................................... - // vqrdmulh.s16 q3, q3, r4 // ..*...................................................................................... - // vmla.s16 q7, q3, r12 // ......*.................................................................................. - // vsub.u16 q3, q2, q7 // .........*............................................................................... - // vadd.u16 q2, q2, q7 // ...........*............................................................................. 
- // qsave QSTACK6, q3 // ...............*......................................................................... - // vldrw.32 q3, [r0, #48] // ..........................*.............................................................. - // vldrw.32 q4, [r0, #112] // .............*........................................................................... - // vmul.s16 q7, q4, r3 // ......................*.................................................................. - // vqrdmulh.s16 q4, q4, r4 // ....................*.................................................................... - // vmla.s16 q7, q4, r12 // ........................*................................................................ - // vsub.u16 q4, q3, q7 // ...........................*............................................................. - // vadd.u16 q3, q3, q7 // .............................*........................................................... - // ldrd r3, r4, [r11, #((-7 + POS_ROOT_1)*8)] // .................*....................................................................... - // vmul.s16 q7, q2, r3 // ................................*........................................................ - // vqrdmulh.s16 q2, q2, r4 // ............................*............................................................ - // vmla.s16 q7, q2, r12 // .....................................*................................................... - // vsub.u16 q2, q0, q7 // .......................................*................................................. - // vadd.u16 q0, q0, q7 // ...............................................*......................................... - // vmul.s16 q7, q3, r3 // ...................................*..................................................... - // vqrdmulh.s16 q3, q3, r4 // ..............................*.......................................................... 
- // vmla.s16 q7, q3, r12 // ........................................*................................................ - // vsub.u16 q3, q1, q7 // .................................................*....................................... - // vadd.u16 q1, q1, q7 // .............................................*........................................... - // ldrd r3, r4, [r11, #((-7 + POS_ROOT_2)*8)] // .................................*....................................................... - // vmul.s16 q7, q1, r3 // ..............................................*.......................................... - // vqrdmulh.s16 q1, q1, r4 // ................................................*........................................ - // vmla.s16 q7, q1, r12 // ..................................................*...................................... - // vsub.u16 q1, q0, q7 // .....................................................*................................... - // vadd.u16 q0, q0, q7 // .......................................................*................................. - // ldrd r3, r4, [r11, #((-7 + POS_ROOT_3)*8)] // ...............................*......................................................... - // vmul.s16 q7, q3, r3 // ....................................................*.................................... - // vqrdmulh.s16 q3, q3, r4 // ......................................................*.................................. - // vmla.s16 q7, q3, r12 // ........................................................*................................ - // vsub.u16 q3, q2, q7 // ..........................................................*.............................. - // vadd.u16 q2, q2, q7 // ............................................................*............................ - // vstrw.u32 q0, [r0], #128 // ......................................................................*.................. 
- // vstrw.u32 q1, [r0, #(-128+16)] // .........................................................*............................... - // vstrw.u32 q2, [r0, #(-128+32)] // ..............................................................*.......................... - // vstrw.u32 q3, [r0, #(-128+48)] // ..................................................................*...................... - // qrestore q1, QSTACK4 // ................................................................*........................ - // qrestore q2, QSTACK5 // ....................................................................*.................... - // qrestore q3, QSTACK6 // ...................................................*..................................... - // ldrd r3, r4, [r11, #((-7 + POS_ROOT_4)*8)] // ......................................*.................................................. - // vmul.s16 q7, q3, r3 // ...........................................................*............................. - // vqrdmulh.s16 q3, q3, r4 // .............................................................*........................... - // vmla.s16 q7, q3, r12 // ...................................................................*..................... - // vsub.u16 q3, q1, q7 // ...........................................................................*............. - // vadd.u16 q1, q1, q7 // .........................................................................*............... - // vmul.s16 q7, q4, r3 // .................................................................*....................... - // vqrdmulh.s16 q4, q4, r4 // ...............................................................*......................... - // vmla.s16 q7, q4, r12 // .....................................................................*................... - // vsub.u16 q4, q2, q7 // .......................................................................*................. 
- // vadd.u16 q2, q2, q7 // .............................................................................*........... - // ldrd r3, r4, [r11, #((-7 + POS_ROOT_5)*8)] // ..........................................*.............................................. - // vmul.s16 q7, q2, r3 // ................................................................................*........ - // vqrdmulh.s16 q2, q2, r4 // ..............................................................................*.......... - // vmla.s16 q7, q2, r12 // ...................................................................................*..... - // vsub.u16 q2, q1, q7 // .....................................................................................*... - // vadd.u16 q1, q1, q7 // .......................................................................................*. - // ldrd r3, r4, [r11, #((-7 + POS_ROOT_6)*8)] // ....................................*.................................................... - // vmul.s16 q7, q4, r3 // ........................................................................*................ - // vqrdmulh.s16 q4, q4, r4 // ..........................................................................*.............. - // vmla.s16 q7, q4, r12 // ............................................................................*............ - // vsub.u16 q4, q3, q7 // .................................................................................*....... - // vadd.u16 q3, q3, q7 // ...............................................................................*......... - // vstrw.u32 q1, [r0, #(-128+64)] // ........................................................................................* - // vstrw.u32 q2, [r0, #(-128+80)] // ......................................................................................*.. - // vstrw.u32 q3, [r0, #(-128+96)] // ..................................................................................*...... 
- // vstrw.u32 q4, [r0, #(-128+112)] // ....................................................................................*.... - - le lr, layer345_loop - - // Layer 67 - - // Use a different base register to facilitate Helight being able to - // overlap the first iteration of L67 with the last iteration of L345. - restore inp, STACK0 - mov lr, #8 - .p2align 2 - vld40.32 {q0,q1,q2,q3}, [r1] // *..... - // gap // ...... - vld41.32 {q0,q1,q2,q3}, [r1] // .*.... - // gap // ...... - vld42.32 {q0,q1,q2,q3}, [r1] // ..*... - // gap // ...... - vld43.32 {q0,q1,q2,q3}, [r1]! // ...*.. - // gap // ...... - vldrh.16 q7, [r11, #16] // ....*. - // gap // ...... - vqrdmulh.s16 q5, q2, q7 // .....* - - // original source code - // vld40.32 {q0,q1,q2,q3}, [r1] // *..... - // vld41.32 {q0,q1,q2,q3}, [r1] // .*.... - // vld42.32 {q0,q1,q2,q3}, [r1] // ..*... - // vld43.32 {q0,q1,q2,q3}, [r1]! // ...*.. - // vldrh.16 q7, [r11, #16] // ....*. - // vqrdmulh.s16 q5, q2, q7 // .....* - - sub lr, lr, #1 -.p2align 2 -layer67_loop: - vqrdmulh.s16 q6, q3, q7 // ............*..................... - vldrh.16 q4, [r11] , #96 // ....*............................. - vmul.s16 q7, q3, q4 // ...........*...................... - vldrh.16 q3, [r11, #-48] // .................*................ - vmla.s16 q7, q6, r12 // .............*.................... - vldrh.16 q6, [r11, #-64] // ................*................. - vmul.s16 q4, q2, q4 // ......*........................... - vadd.u16 q2, q1, q7 // ...............*.................. - vmla.s16 q4, q5, r12 // ........*......................... - vsub.u16 q7, q1, q7 // ..............*................... - vqrdmulh.s16 q1, q2, q3 // ...................*.............. - vsub.u16 q5, q0, q4 // .........*........................ - vmul.s16 q2, q2, q6 // ..................*............... - vadd.u16 q4, q0, q4 // ..........*....................... - vmla.s16 q2, q1, r12 // ....................*............. 
- vldrh.16 q6, [r11, #-16] // ........................*......... - vsub.u16 q1, q4, q2 // .....................*............ - vstrw.u32 q1, [r1, #-48] // ...............................*.. - vadd.u16 q4, q4, q2 // ......................*........... - vstrw.u32 q4, [r1, #-64] // ..............................*... - vld40.32 {q0,q1,q2,q3}, [r1] // e................................. - vqrdmulh.s16 q4, q7, q6 // ..........................*....... - vldrh.16 q6, [r11, #-32] // .......................*.......... - vmul.s16 q7, q7, q6 // .........................*........ - vld41.32 {q0,q1,q2,q3}, [r1] // .e................................ - vmla.s16 q7, q4, r12 // ...........................*...... - vld42.32 {q0,q1,q2,q3}, [r1] // ..e............................... - vadd.u16 q6, q5, q7 // .............................*.... - vld43.32 {q0,q1,q2,q3}, [r1]! // ...e.............................. - vsub.u16 q4, q5, q7 // ............................*..... - vldrh.16 q7, [r11, #16] // .....e............................ - vstrw.u32 q4, [r1, #-80] // .................................* - vqrdmulh.s16 q5, q2, q7 // .......e.......................... - vstrw.u32 q6, [r1, #-96] // ................................*. - - // original source code - // vld40.32 {q0, q1, q2, q3}, [r1] // e.............|...................e............. - // vld41.32 {q0, q1, q2, q3}, [r1] // ....e.........|.......................e......... - // vld42.32 {q0, q1, q2, q3}, [r1] // ......e.......|.........................e....... - // vld43.32 {q0, q1, q2, q3}, [r1]! // ........e.....|...........................e..... - // vldrh.16 q5, [r11], #+96 // ..............|*................................ - // vldrh.16 q6, [r11, #(+16-96)] // ..........e...|.............................e... - // vmul.s16 q7, q2, q5 // ..............|.....*........................... - // vqrdmulh.s16 q2, q2, q6 // ............e.|...............................e. 
- // vmla.s16 q7, q2, r12 // ..............|.......*......................... - // vsub.u16 q2, q0, q7 // ..............|..........*...................... - // vadd.u16 q0, q0, q7 // ..............|............*.................... - // vmul.s16 q7, q3, q5 // ..............|.*............................... - // vqrdmulh.s16 q3, q3, q6 // ..............*................................. - // vmla.s16 q7, q3, r12 // ..............|...*............................. - // vsub.u16 q3, q1, q7 // ..............|........*........................ - // vadd.u16 q1, q1, q7 // ..............|......*.......................... - // vldrh.16 q5, [r11, #(32 - 96)] // ..............|....*............................ - // vldrh.16 q6, [r11, #(48 - 96)] // ..............|..*.............................. - // vmul.s16 q7, q1, q5 // ..............|...........*..................... - // vqrdmulh.s16 q1, q1, q6 // ..............|.........*....................... - // vmla.s16 q7, q1, r12 // ..............|.............*................... - // vsub.u16 q1, q0, q7 // ..............|...............*................. - // vadd.u16 q0, q0, q7 // ..............|.................*............... - // vldrh.16 q5, [r11, #(64-96)] // ..*...........|.....................*........... - // vldrh.16 q6, [r11, #(80-96)] // ..............|..............*.................. - // vmul.s16 q7, q3, q5 // ...*..........|......................*.......... - // vqrdmulh.s16 q3, q3, q6 // .*............|....................*............ - // vmla.s16 q7, q3, r12 // .....*........|........................*........ - // vsub.u16 q3, q2, q7 // .........*....|............................*.... - // vadd.u16 q2, q2, q7 // .......*......|..........................*...... - // vstrw.u32 q0, [r1, #-64] // ..............|..................*.............. - // vstrw.u32 q1, [r1, #-48] // ..............|................*................ 
- // vstrw.u32 q2, [r1, #-32] // .............*|................................* - // vstrw.u32 q3, [r1, #-16] // ...........*..|..............................*.. - - le lr, layer67_loop - vqrdmulh.s16 q4, q3, q7 // *........................... - vldrh.16 q6, [r11] , #96 // .*.......................... - vmul.s16 q3, q3, q6 // ..*......................... - vldrh.16 q7, [r11, #-64] // .....*...................... - vmla.s16 q3, q4, r12 // ....*....................... - vldrh.16 q4, [r11, #-48] // ...*........................ - vmul.s16 q6, q2, q6 // ......*..................... - vsub.u16 q2, q1, q3 // .........*.................. - vmla.s16 q6, q5, r12 // ........*................... - vadd.u16 q5, q1, q3 // .......*.................... - vqrdmulh.s16 q1, q5, q4 // ..........*................. - vsub.u16 q4, q0, q6 // ...........*................ - vmul.s16 q5, q5, q7 // ............*............... - vadd.u16 q0, q0, q6 // .............*.............. - vmla.s16 q5, q1, r12 // ..............*............. - vldrh.16 q7, [r11, #-16] // ...............*............ - vadd.u16 q1, q0, q5 // ..................*......... - vstrw.u32 q1, [r1, #-64] // ...................*........ - vqrdmulh.s16 q3, q2, q7 // ....................*....... - vldrh.16 q1, [r11, #-32] // .....................*...... - vmul.s16 q7, q2, q1 // ......................*..... - vsub.u16 q1, q0, q5 // ................*........... - vmla.s16 q7, q3, r12 // .......................*.... - vstrw.u32 q1, [r1, #-48] // .................*.......... - vadd.u16 q1, q4, q7 // ........................*... - vstrw.u32 q1, [r1, #-32] // ...........................* - vsub.u16 q1, q4, q7 // .........................*.. - vstrw.u32 q1, [r1, #-16] // ..........................*. - - // original source code - // vqrdmulh.s16 q6, q3, q7 // *........................... - // vldrh.16 q4, [r11] , #96 // .*.......................... - // vmul.s16 q7, q3, q4 // ..*......................... 
- // vldrh.16 q3, [r11, #-48] // .....*...................... - // vmla.s16 q7, q6, r12 // ....*....................... - // vldrh.16 q6, [r11, #-64] // ...*........................ - // vmul.s16 q4, q2, q4 // ......*..................... - // vadd.u16 q2, q1, q7 // .........*.................. - // vmla.s16 q4, q5, r12 // ........*................... - // vsub.u16 q7, q1, q7 // .......*.................... - // vqrdmulh.s16 q1, q2, q3 // ..........*................. - // vsub.u16 q5, q0, q4 // ...........*................ - // vmul.s16 q2, q2, q6 // ............*............... - // vadd.u16 q4, q0, q4 // .............*.............. - // vmla.s16 q2, q1, r12 // ..............*............. - // vldrh.16 q6, [r11, #-16] // ...............*............ - // vsub.u16 q1, q4, q2 // .....................*...... - // vstrw.u32 q1, [r1, #-48] // .......................*.... - // vadd.u16 q4, q4, q2 // ................*........... - // vstrw.u32 q4, [r1, #-64] // .................*.......... - // vqrdmulh.s16 q4, q7, q6 // ..................*......... - // vldrh.16 q6, [r11, #-32] // ...................*........ - // vmul.s16 q7, q7, q6 // ....................*....... - // vmla.s16 q7, q4, r12 // ......................*..... - // vadd.u16 q6, q5, q7 // ........................*... - // vsub.u16 q4, q5, q7 // ..........................*. - // vstrw.u32 q4, [r1, #-16] // ...........................* - // vstrw.u32 q6, [r1, #-32] // .........................*.. 
- - - add sp, sp, #STACK_SIZE - - align_stack_undo - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_kyber/manual/ntt_kyber_12_345_67_twiddles.s b/tests/ntt_kyber/manual/ntt_kyber_12_345_67_twiddles.s deleted file mode 100644 index 5030cb2..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_12_345_67_twiddles.s +++ /dev/null @@ -1,474 +0,0 @@ - -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.word -1600 -.word -15749 -.word -749 -.word -7373 -.word -40 -.word -394 -.word -687 -.word -6762 -.word 1062 -.word 10453 -.word 296 -.word 2914 -.word -882 -.word -8682 -.word -1410 -.word -13879 -.word 1339 -.word 13180 -.word 1476 -.word 14529 -.word 630 -.word 6201 -.word 193 -.word 1900 -.word -283 -.word -2786 -.word 56 -.word 551 -.word 797 -.word 7845 -.word -1089 -.word -10719 -.word 1333 -.word 13121 -.word -1432 -.word -14095 -.word -543 -.word -5345 -.word 1426 -.word 14036 -.word -1235 -.word -12156 -.word -69 -.word -679 -.word 535 -.word 5266 -.word -447 -.word -4400 -.word 848 -.word 8347 -.word 569 -.word 5601 -.word -936 -.word -9213 -.word -450 -.word -4429 -.word -1583 -.word -15582 -.word -1355 -.word -13338 -.word 821 -.word 8081 -// Word count until here: 62 -// Blocked layers start -.short 289 -.short 289 -.short 331 -.short 331 -.short -76 -.short -76 -.short -1573 -.short -1573 -.short 2845 -.short 2845 -.short 3258 -.short 3258 -.short -748 -.short -748 -.short -15483 -.short -15483 -.short 17 -.short 17 -.short 583 -.short 583 -.short 1637 -.short 1637 -.short -1041 -.short -1041 -.short 167 -.short 167 -.short 5739 -.short 5739 -.short 16113 -.short 16113 -.short -10247 -.short -10247 -.short -568 -.short -568 -.short -680 -.short -680 -.short 723 -.short 723 -.short 1100 -.short 1100 -.short -5591 -.short -5591 -.short -6693 -.short -6693 -.short 7117 -.short 7117 -.short 10828 -.short 10828 -.short 1197 -.short 1197 -.short -1025 -.short -1025 -.short -1052 -.short -1052 -.short -1274 -.short -1274 -.short 11782 -.short 11782 -.short -10089 -.short -10089 -.short -10355 -.short -10355 -.short -12540 -.short -12540 -.short 1409 -.short 1409 -.short -48 -.short -48 -.short 756 -.short 756 -.short -314 -.short -314 -.short 13869 -.short 13869 -.short -472 -.short -472 -.short 7441 -.short 7441 -.short -3091 -.short -3091 -.short -667 -.short -667 -.short 233 -.short 233 -.short -1173 -.short -1173 -.short -279 -.short -279 
-.short -6565 -.short -6565 -.short 2293 -.short 2293 -.short -11546 -.short -11546 -.short -2746 -.short -2746 -.short 650 -.short 650 -.short -1352 -.short -1352 -.short -816 -.short -816 -.short 632 -.short 632 -.short 6398 -.short 6398 -.short -13308 -.short -13308 -.short -8032 -.short -8032 -.short 6221 -.short 6221 -.short -1626 -.short -1626 -.short -540 -.short -540 -.short -1482 -.short -1482 -.short 1461 -.short 1461 -.short -16005 -.short -16005 -.short -5315 -.short -5315 -.short -14588 -.short -14588 -.short 14381 -.short 14381 -.short 1651 -.short 1651 -.short -1540 -.short -1540 -.short 952 -.short 952 -.short -642 -.short -642 -.short 16251 -.short 16251 -.short -15159 -.short -15159 -.short 9371 -.short 9371 -.short -6319 -.short -6319 -.short -464 -.short -464 -.short 33 -.short 33 -.short 1320 -.short 1320 -.short -1414 -.short -1414 -.short -4567 -.short -4567 -.short 325 -.short 325 -.short 12993 -.short 12993 -.short -13918 -.short -13918 -.short 939 -.short 939 -.short -892 -.short -892 -.short 733 -.short 733 -.short 268 -.short 268 -.short 9243 -.short 9243 -.short -8780 -.short -8780 -.short 7215 -.short 7215 -.short 2638 -.short 2638 -.short -1021 -.short -1021 -.short -941 -.short -941 -.short -992 -.short -992 -.short 641 -.short 641 -.short -10050 -.short -10050 -.short -9262 -.short -9262 -.short -9764 -.short -9764 -.short 6309 -.short 6309 -.short -1010 -.short -1010 -.short 1435 -.short 1435 -.short 807 -.short 807 -.short 452 -.short 452 -.short -9942 -.short -9942 -.short 14125 -.short 14125 -.short 7943 -.short 7943 -.short 4449 -.short 4449 -.short 1584 -.short 1584 -.short -1292 -.short -1292 -.short 375 -.short 375 -.short -1239 -.short -1239 -.short 15592 -.short 15592 -.short -12717 -.short -12717 -.short 3691 -.short 3691 -.short -12196 -.short -12196 -.short -1031 -.short -1031 -.short -109 -.short -109 -.short -780 -.short -780 -.short 1645 -.short 1645 -.short -10148 -.short -10148 -.short -1073 -.short -1073 -.short 
-7678 -.short -7678 -.short 16192 -.short 16192 -.short 1438 -.short 1438 -.short -461 -.short -461 -.short 1534 -.short 1534 -.short -927 -.short -927 -.short 14155 -.short 14155 -.short -4538 -.short -4538 -.short 15099 -.short 15099 -.short -9125 -.short -9125 -.short 1063 -.short 1063 -.short -556 -.short -556 -.short -1230 -.short -1230 -.short -863 -.short -863 -.short 10463 -.short 10463 -.short -5473 -.short -5473 -.short -12107 -.short -12107 -.short -8495 -.short -8495 -.short 319 -.short 319 -.short 757 -.short 757 -.short 561 -.short 561 -.short -735 -.short -735 -.short 3140 -.short 3140 -.short 7451 -.short 7451 -.short 5522 -.short 5522 -.short -7235 -.short -7235 -.short -682 -.short -682 -.short -712 -.short -712 -.short 1481 -.short 1481 -.short 648 -.short 648 -.short -6713 -.short -6713 -.short -7008 -.short -7008 -.short 14578 -.short 14578 -.short 6378 -.short 6378 -.short -525 -.short -525 -.short 403 -.short 403 -.short 1143 -.short 1143 -.short -554 -.short -554 -.short -5168 -.short -5168 -.short 3967 -.short 3967 -.short 11251 -.short 11251 -.short -5453 -.short -5453 -.short 1092 -.short 1092 -.short 1026 -.short 1026 -.short -1179 -.short -1179 -.short 886 -.short 886 -.short 10749 -.short 10749 -.short 10099 -.short 10099 -.short -11605 -.short -11605 -.short 8721 -.short 8721 -.short -855 -.short -855 -.short -219 -.short -219 -.short 1227 -.short 1227 -.short 910 -.short 910 -.short -8416 -.short -8416 -.short -2156 -.short -2156 -.short 12078 -.short 12078 -.short 8957 -.short 8957 -.short -1607 -.short -1607 -.short -1455 -.short -1455 -.short -1219 -.short -1219 -.short 885 -.short 885 -.short -15818 -.short -15818 -.short -14322 -.short -14322 -.short -11999 -.short -11999 -.short 8711 -.short 8711 -.short 1212 -.short 1212 -.short 1029 -.short 1029 -.short -394 -.short -394 -.short -1175 -.short -1175 -.short 11930 -.short 11930 -.short 10129 -.short 10129 -.short -3878 -.short -3878 -.short -11566 -.short -11566 \ No newline at 
end of file
diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans.s
deleted file mode 100644
index 143ab6e..0000000
--- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans.s
+++ /dev/null
@@ -1,217 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.data
-roots:
-#include "ntt_kyber_1_23_45_67_twiddles.s"
-.text
-
-// Barrett multiplication
-.macro mulmod dst, src, const, const_twisted
-        vmul.s16 \dst, \src, \const
-        vqrdmulh.s16 \src, \src, \const_twisted
-        vmla.s16 \dst, \src, modulus
-.endm
-
-.macro ct_butterfly a, b, root, root_twisted
-        mulmod tmp, \b, \root, \root_twisted
-        vsub.u16 \b, \a, tmp
-        vadd.u16 \a, \a, tmp
-.endm
-
-.macro load_first_root root0, root0_twisted
-        ldrd root0, root0_twisted, [root_ptr], #+8
-.endm
-
-.macro load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted
-        ldrd root0, root0_twisted, [root_ptr], #+24
-        ldrd root1, root1_twisted, [root_ptr, #(-16)]
-        ldrd root2, root2_twisted, [root_ptr, #(-8)]
-.endm
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_kyber_1_23_45_67_no_trans, %function
-.global ntt_kyber_1_23_45_67_no_trans
-ntt_kyber_1_23_45_67_no_trans:
-
-        push {r4-r11,lr}
-        // Save MVE vector registers
-        vpush {d8-d15}
-
-        modulus .req r12
-        root_ptr .req r11
-
-        .equ modulus_const, -3329
-        movw modulus, #:lower16:modulus_const
-        ldr root_ptr, roots_addr
-
-        in_low .req r0
-        in_high .req r1
-
-        add in_high, in_low, #(4*64)
-
-        root0 .req r2
-        root0_twisted .req r3
-        root1 .req r4
-        root1_twisted .req r5
-        root2 .req r6
-        root2_twisted .req r7
-
-        data0 .req q0
-        data1 .req q1
-        data2 .req q2
-        data3 .req q3
-
-        tmp .req q4
-
-        /* Layers 1 */
-
-        load_first_root root0, root0_twisted
-
-        mov lr, #16
-layer1_loop:
-        vldrw.u32 data0, [in_low]
-        vldrw.u32 data1, [in_high]
-
-        ct_butterfly data0, data1, root0, root0_twisted
-
-        vstrw.u32 data0, [in_low], #16
-        vstrw.u32 data1, [in_high], #16
-
-        le lr, layer1_loop
-        .unreq in_high
-        .unreq in_low
-
-        in .req r0
-        sub in, in, #(4*64)
-
-        /* Layers 2,3 */
-
-        count .req r1
-        mov count, #2
-
-out_start:
-        load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted
-
-        mov lr, #4
-layer23_loop:
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #(4*1*16)]
-        vldrw.u32 data2, [in, #(4*2*16)]
-        vldrw.u32 data3, [in, #(4*3*16)]
-
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-
-        vstrw.u32 data0, [in], #16
-        vstrw.u32 data1, [in, #(4*1*16 - 16)]
-        vstrw.u32 data2, [in, #(4*2*16 - 16)]
-        vstrw.u32 data3, [in, #(4*3*16 - 16)]
-
-        le lr, layer23_loop
-
-        add in, in, #(4*64 - 4*16)
-        subs count, count, #1
-        bne out_start
-
-        sub in, in, #(4*128)
-
-        /* Layers 4,5 */
-
-        mov lr, #8
-layer45_loop:
-        load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted
-
-        vldrw.u32 data0, [in]
-        vldrw.u32 data1, [in, #16]
-        vldrw.u32 data2, [in, #32]
-        vldrw.u32 data3, [in, #48]
-
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-        ct_butterfly data0, data1, root1, root1_twisted
-        ct_butterfly data2, data3, root2, root2_twisted
-
-        vst40.u32 {data0, data1, data2, data3}, [in]
-        vst41.u32 {data0, data1, data2, data3}, [in]
-        vst42.u32 {data0, data1, data2, data3}, [in]
-        vst43.u32 {data0, data1, data2, data3}, [in]!
-
-        le lr, layer45_loop
-
-        sub in, in, #(4*128)
-
-        /* Layers 6,7 */
-
-        .unreq root0
-        .unreq root0_twisted
-        .unreq root1
-        .unreq root1_twisted
-        .unreq root2
-        .unreq root2_twisted
-
-        root0 .req q5
-        root0_twisted .req q6
-        root1 .req q5
-        root1_twisted .req q6
-        root2 .req q5
-        root2_twisted .req q6
-
-        mov lr, #8
-layer67_loop:
-        vldrw.u32 data0, [in], #64
-        vldrw.u32 data1, [in, #(16 - 64)]
-        vldrw.u32 data2, [in, #(32 - 64)]
-        vldrw.u32 data3, [in, #(48 - 64)]
-
-        vldrh.u16 root0, [root_ptr], #+96
-        vldrh.u16 root0_twisted, [root_ptr, #(+16-96)]
-        ct_butterfly data0, data2, root0, root0_twisted
-        ct_butterfly data1, data3, root0, root0_twisted
-
-        vldrh.u16 root1, [root_ptr, #(32 - 96)]
-        vldrh.u16 root1_twisted, [root_ptr, #(48 - 96)]
-        ct_butterfly data0, data1, root1, root1_twisted
-
-        vldrh.u16 root2, [root_ptr, #(64-96)]
-        vldrh.u16 root2_twisted, [root_ptr, #(80-96)]
-        ct_butterfly data2, data3, root2, root2_twisted
-
-        vstrw.32 data0, [in, #( 0 - 64)]
-        vstrw.32 data1, [in, #(16 - 64)]
-        vstrw.32 data2, [in, #(32 - 64)]
-        vstrw.32 data3, [in, #(48 - 64)]
-        le lr, layer67_loop
-
-        // Restore MVE vector registers
-        vpop {d8-d15}
-        // Restore GPRs
-        pop {r4-r11,lr}
-        bx lr
diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m55.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m55.s
deleted file mode 100644
index 364e8f9..0000000
--- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m55.s
+++ /dev/null
@@ -1,620 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// Copyright (c) 2022 Hanno Becker
-/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -roots: -#include "ntt_kyber_1_23_45_67_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_twisted - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -.macro load_first_root root0, root0_twisted - ldrd root0, root0_twisted, [root_ptr], #+8 -.endm - -.macro load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - ldrd root0, root0_twisted, [root_ptr], #+24 - ldrd root1, root1_twisted, [root_ptr, #(-16)] - ldrd root2, root2_twisted, [root_ptr, #(-8)] -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_1_23_45_67_no_trans_opt_m55, %function -.global ntt_kyber_1_23_45_67_no_trans_opt_m55 -ntt_kyber_1_23_45_67_no_trans_opt_m55: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -3329 - movw modulus, #:lower16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*64) - - 
root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 1 */ - - load_first_root root0, root0_twisted - - mov lr, #16 - vldrw.u32 q0, [r1] // * - - // original source code - // vldrw.u32 q0, [r1] // * - - sub lr, lr, #1 -.p2align 2 -layer1_loop: - vmul.s16 q7, q0, r2 // ..*...... - // gap // ......... - vqrdmulh.s16 q4, q0, r3 // ...*..... - vldrw.u32 q5, [r0] // *........ - vmla.s16 q7, q4, r12 // ....*.... - vldrw.u32 q0, [r1, #16] // .e....... - vsub.u16 q4, q5, q7 // .....*... - vstrw.u32 q4, [r1] , #16 // ........* - vadd.u16 q4, q5, q7 // ......*.. - vstrw.u32 q4, [r0] , #16 // .......*. - - // original source code - // vldrw.u32 q0, [r0] // .....|.*...... - // vldrw.u32 q1, [r1] // e....|...e.... - // vmul.s16 q4, q1, r2 // .....*........ - // vqrdmulh.s16 q1, q1, r3 // .....|*....... - // vmla.s16 q4, q1, r12 // .....|..*..... - // vsub.u16 q1, q0, q4 // .*...|....*... - // vadd.u16 q0, q0, q4 // ...*.|......*. - // vstrw.u32 q0, [r0], #16 // ....*|.......* - // vstrw.u32 q1, [r1], #16 // ..*..|.....*.. - - le lr, layer1_loop - vmul.s16 q6, q0, r2 // *....... - vldrw.u32 q1, [r0] // ..*..... - vqrdmulh.s16 q0, q0, r3 // .*...... - // gap // ........ - vmla.s16 q6, q0, r12 // ...*.... - // gap // ........ - vadd.u16 q3, q1, q6 // ......*. - vstrw.u32 q3, [r0] , #16 // .......* - vsub.u16 q1, q1, q6 // ....*... - vstrw.u32 q1, [r1] , #16 // .....*.. - - // original source code - // vmul.s16 q7, q0, r2 // *....... - // vqrdmulh.s16 q4, q0, r3 // ..*..... - // vldrw.u32 q5, [r0] // .*...... - // vmla.s16 q7, q4, r12 // ...*.... - // vsub.u16 q4, q5, q7 // ......*. - // vstrw.u32 q4, [r1] , #16 // .......* - // vadd.u16 q4, q5, q7 // ....*... - // vstrw.u32 q4, [r0] , #16 // .....*.. 
- - .unreq in_high - .unreq in_low - - in .req r0 - sub in, in, #(4*64) - - /* Layers 2,3 */ - - count .req r1 - mov count, #2 - -out_start: - load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - - mov lr, #4 - vldrw.u32 q4, [r0, #192] // .*... - vmul.s16 q7, q4, r2 // ...*. - vldrw.u32 q1, [r0, #64] // *.... - vqrdmulh.s16 q4, q4, r3 // ..*.. - // gap // ..... - vmla.s16 q7, q4, r12 // ....* - - // original source code - // vldrw.u32 q1, [r0, #64] // ..*.. - // vldrw.u32 q7, [r0, #192] // *.... - // vqrdmulh.s16 q3, q7, r3 // ...*. - // vmul.s16 q7, q7, r2 // .*... - // vmla.s16 q7, q3, r12 // ....* - - sub lr, lr, #1 -.p2align 2 -layer23_loop: - vadd.u16 q3, q1, q7 // .............*.............. - vqrdmulh.s16 q6, q3, r5 // ...............*............ - vldrw.u32 q0, [r0, #128] // ..*......................... - vmul.s16 q2, q0, r2 // ....*....................... - vldrw.u32 q5, [r0] // *........................... - vqrdmulh.s16 q4, q0, r3 // .....*...................... - vsub.u16 q0, q1, q7 // ............*............... - vmla.s16 q2, q4, r12 // ......*..................... - vldrw.u32 q1, [r0, #80] // .e.......................... - vmul.s16 q4, q3, r4 // ..............*............. - vadd.u16 q3, q5, q2 // ........*................... - vmla.s16 q4, q6, r12 // ................*........... - vldrw.u32 q7, [r0, #208] // ...e........................ - vsub.u16 q6, q3, q4 // .................*.......... - vstrw.u32 q6, [r0, #64] // .........................*.. - vmul.s16 q6, q0, r6 // ...................*........ - vsub.u16 q5, q5, q2 // .......*.................... - vqrdmulh.s16 q0, q0, r7 // ....................*....... - vadd.u16 q4, q3, q4 // ..................*......... - vmla.s16 q6, q0, r12 // .....................*...... - vstrw.u32 q4, [r0] , #16 // ........................*... - vadd.u16 q0, q5, q6 // .......................*.... - vqrdmulh.s16 q3, q7, r3 // ..........e................. 
- vstrw.u32 q0, [r0, #112] // ..........................*. - vmul.s16 q7, q7, r2 // .........e.................. - vsub.u16 q4, q5, q6 // ......................*..... - vmla.s16 q7, q3, r12 // ...........e................ - vstrw.u32 q4, [r0, #176] // ...........................* - - // original source code - // vldrw.u32 q0, [r0] // ....................|...*....................... - // vldrw.u32 q1, [r0, #(4*1*16)] // e...................|.......e................... - // vldrw.u32 q2, [r0, #(4*2*16)] // ....................|.*......................... - // vldrw.u32 q3, [r0, #(4*3*16)] // ....e...............|...........e............... - // vmul.s16 q4, q2, r2 // ....................|..*........................ - // vqrdmulh.s16 q2, q2, r3 // ....................|....*...................... - // vmla.s16 q4, q2, r12 // ....................|......*.................... - // vsub.u16 q2, q0, q4 // ........*...........|...............*........... - // vadd.u16 q0, q0, q4 // ..*.................|.........*................. - // vmul.s16 q4, q3, r2 // ................e...|.......................e... - // vqrdmulh.s16 q3, q3, r3 // ..............e.....|.....................e..... - // vmla.s16 q4, q3, r12 // ..................e.|.........................e. - // vsub.u16 q3, q1, q4 // ....................|.....*..................... - // vadd.u16 q1, q1, q4 // ....................*........................... - // vmul.s16 q4, q1, r4 // .*..................|........*.................. - // vqrdmulh.s16 q1, q1, r5 // ....................|*.......................... - // vmla.s16 q4, q1, r12 // ...*................|..........*................ - // vsub.u16 q1, q0, q4 // .....*..............|............*.............. - // vadd.u16 q0, q0, q4 // ..........*.........|.................*......... - // vmul.s16 q4, q3, r6 // .......*............|..............*............ - // vqrdmulh.s16 q3, q3, r7 // .........*..........|................*.......... 
- // vmla.s16 q4, q3, r12 // ...........*........|..................*........ - // vsub.u16 q3, q2, q4 // .................*..|........................*.. - // vadd.u16 q2, q2, q4 // .............*......|....................*...... - // vstrw.u32 q0, [r0], #16 // ............*.......|...................*....... - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ......*.............|.............*............. - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // ...............*....|......................*.... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // ...................*|..........................* - - le lr, layer23_loop - vldrw.u32 q0, [r0, #128] // ..*.................... - vmul.s16 q6, q0, r2 // ...*................... - vadd.u16 q5, q1, q7 // *...................... - vqrdmulh.s16 q3, q5, r5 // .*..................... - vsub.u16 q4, q1, q7 // ......*................ - vqrdmulh.s16 q2, q0, r3 // .....*................. - vldrw.u32 q1, [r0] // ....*.................. - vmla.s16 q6, q2, r12 // .......*............... - // gap // ....................... - vmul.s16 q7, q5, r4 // ........*.............. - vadd.u16 q0, q1, q6 // .........*............. - vmla.s16 q7, q3, r12 // ..........*............ - vsub.u16 q6, q1, q6 // ..............*........ - vqrdmulh.s16 q2, q4, r7 // ...............*....... - vsub.u16 q5, q0, q7 // ...........*........... - vmul.s16 q4, q4, r6 // .............*......... - vstrw.u32 q5, [r0, #64] // ............*.......... - vmla.s16 q4, q2, r12 // .................*..... - vadd.u16 q0, q0, q7 // ................*...... - vstrw.u32 q0, [r0] , #16 // ..................*.... - vadd.u16 q5, q6, q4 // ...................*... - vstrw.u32 q5, [r0, #112] // ....................*.. - vsub.u16 q2, q6, q4 // .....................*. - vstrw.u32 q2, [r0, #176] // ......................* - - // original source code - // vadd.u16 q3, q1, q7 // ..*.................... - // vqrdmulh.s16 q6, q3, r5 // ...*................... 
- // vldrw.u32 q0, [r0, #128] // *...................... - // vmul.s16 q2, q0, r2 // .*..................... - // vldrw.u32 q5, [r0] // ......*................ - // vqrdmulh.s16 q4, q0, r3 // .....*................. - // vsub.u16 q0, q1, q7 // ....*.................. - // vmla.s16 q2, q4, r12 // .......*............... - // vmul.s16 q4, q3, r4 // ........*.............. - // vadd.u16 q3, q5, q2 // .........*............. - // vmla.s16 q4, q6, r12 // ..........*............ - // vsub.u16 q6, q3, q4 // .............*......... - // vstrw.u32 q6, [r0, #64] // ...............*....... - // vmul.s16 q6, q0, r6 // ..............*........ - // vsub.u16 q5, q5, q2 // ...........*........... - // vqrdmulh.s16 q0, q0, r7 // ............*.......... - // vadd.u16 q4, q3, q4 // .................*..... - // vmla.s16 q6, q0, r12 // ................*...... - // vstrw.u32 q4, [r0] , #16 // ..................*.... - // vadd.u16 q0, q5, q6 // ...................*... - // vstrw.u32 q0, [r0, #112] // ....................*.. - // vsub.u16 q4, q5, q6 // .....................*. - // vstrw.u32 q4, [r0, #176] // ......................* - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - sub in, in, #(4*128) - - /* Layers 4,5 */ - - mov lr, #8 - ldrd r8, r3, [r11] , #24 // ..*.... - vldrw.u32 q4, [r0, #48] // .*..... - vmul.s16 q0, q4, r8 // ....*.. - ldrd r5, r7, [r11, #-16] // ...*... - vqrdmulh.s16 q4, q4, r3 // .....*. - vldrw.u32 q2, [r0, #16] // *...... - vmla.s16 q0, q4, r12 // ......* - - // original source code - // vldrw.u32 q2, [r0, #16] // .....*. - // vldrw.u32 q3, [r0, #48] // .*..... - // ldrd r8, r3, [r11] , #24 // *...... - // ldrd r5, r7, [r11, #-16] // ...*... - // vmul.s16 q0, q3, r8 // ..*.... - // vqrdmulh.s16 q3, q3, r3 // ....*.. - // vmla.s16 q0, q3, r12 // ......* - - sub lr, lr, #1 -.p2align 2 -layer45_loop: - vadd.u16 q7, q2, q0 // ................*.............. - vqrdmulh.s16 q6, q7, r7 // ..................*............ 
- vldrw.u32 q4, [r0, #32] // .....*......................... - vmul.s16 q5, q4, r8 // .......*....................... - vldrw.u32 q1, [r0] // ...*........................... - vqrdmulh.s16 q4, q4, r3 // ........*...................... - vsub.u16 q0, q2, q0 // ...............*............... - vmla.s16 q5, q4, r12 // .........*..................... - ldrd r1, r2, [r11, #-8] // ..*............................ - vmul.s16 q2, q7, r5 // .................*............. - vadd.u16 q4, q1, q5 // ...........*................... - vmla.s16 q2, q6, r12 // ...................*........... - vsub.u16 q1, q1, q5 // ..........*.................... - vmul.s16 q7, q0, r1 // ......................*........ - vsub.u16 q5, q4, q2 // ....................*.......... - vqrdmulh.s16 q0, q0, r2 // .......................*....... - vadd.u16 q4, q4, q2 // .....................*......... - vmla.s16 q7, q0, r12 // ........................*...... - vldrw.u32 q2, [r0, #80] // ....e.......................... - vadd.u16 q6, q1, q7 // ..........................*.... - vldrw.u32 q3, [r0, #112] // ......e........................ - vsub.u16 q7, q1, q7 // .........................*..... - ldrd r8, r3, [r11] , #24 // e.............................. - ldrd r5, r7, [r11, #-16] // .e............................. - vst40.u32 {q4,q5,q6,q7}, [r0] // ...........................*... - vmul.s16 q0, q3, r8 // ............e.................. - vst41.u32 {q4,q5,q6,q7}, [r0] // ............................*.. - vqrdmulh.s16 q3, q3, r3 // .............e................. - vst42.u32 {q4,q5,q6,q7}, [r0] // .............................*. - vmla.s16 q0, q3, r12 // ..............e................ - vst43.u32 {q4,q5,q6,q7}, [r0]! // ..............................* - - // original source code - // ldrd r2, r3, [r11], #+24 // ....e........|.....................e........ - // ldrd r4, r5, [r11, #(-16)] // .....e.......|......................e....... 
- // ldrd r6, r7, [r11, #(-8)] // .............|.......*...................... - // vldrw.u32 q0, [r0] // .............|...*.......................... - // vldrw.u32 q1, [r0, #16] // e............|.................e............ - // vldrw.u32 q2, [r0, #32] // .............|.*............................ - // vldrw.u32 q3, [r0, #48] // ..e..........|...................e.......... - // vmul.s16 q4, q2, r2 // .............|..*........................... - // vqrdmulh.s16 q2, q2, r3 // .............|....*......................... - // vmla.s16 q4, q2, r12 // .............|......*....................... - // vsub.u16 q2, q0, q4 // .............|...........*.................. - // vadd.u16 q0, q0, q4 // .............|.........*.................... - // vmul.s16 q4, q3, r2 // .......e.....|........................e..... - // vqrdmulh.s16 q3, q3, r3 // .........e...|..........................e... - // vmla.s16 q4, q3, r12 // ...........e.|............................e. - // vsub.u16 q3, q1, q4 // .............|.....*........................ - // vadd.u16 q1, q1, q4 // .............*.............................. - // vmul.s16 q4, q1, r4 // .............|........*..................... - // vqrdmulh.s16 q1, q1, r5 // .............|*............................. - // vmla.s16 q4, q1, r12 // .............|..........*................... - // vsub.u16 q1, q0, q4 // .............|.............*................ - // vadd.u16 q0, q0, q4 // .............|...............*.............. - // vmul.s16 q4, q3, r6 // .............|............*................. - // vqrdmulh.s16 q3, q3, r7 // .............|..............*............... - // vmla.s16 q4, q3, r12 // .............|................*............. - // vsub.u16 q3, q2, q4 // ...*.........|....................*......... - // vadd.u16 q2, q2, q4 // .*...........|..................*........... - // vst40.u32 {q0, q1, q2, q3}, [r0] // ......*......|.......................*...... 
- // vst41.u32 {q0, q1, q2, q3}, [r0] // ........*....|.........................*.... - // vst42.u32 {q0, q1, q2, q3}, [r0] // ..........*..|...........................*.. - // vst43.u32 {q0, q1, q2, q3}, [r0]! // ............*|.............................* - - le lr, layer45_loop - vadd.u16 q6, q2, q0 // *....................... - vmul.s16 q3, q6, r5 // .........*.............. - vldrw.u32 q1, [r0, #32] // ..*..................... - vqrdmulh.s16 q5, q1, r3 // .....*.................. - vsub.u16 q4, q2, q0 // ......*................. - vmul.s16 q1, q1, r8 // ...*.................... - ldrd r9, r10, [r11, #-8] // ........*............... - vmla.s16 q1, q5, r12 // .......*................ - vldrw.u32 q5, [r0] // ....*................... - vqrdmulh.s16 q2, q6, r7 // .*...................... - vadd.u16 q7, q5, q1 // ..........*............. - vmla.s16 q3, q2, r12 // ...........*............ - vsub.u16 q1, q5, q1 // ............*........... - vmul.s16 q5, q4, r9 // .............*.......... - vadd.u16 q2, q7, q3 // ................*....... - vqrdmulh.s16 q0, q4, r10 // ...............*........ - vsub.u16 q3, q7, q3 // ..............*......... - vmla.s16 q5, q0, r12 // .................*...... - // gap // ........................ - vadd.u16 q4, q1, q5 // ..................*..... - // gap // ........................ - vsub.u16 q5, q1, q5 // ...................*.... - // gap // ........................ - // gap // ........................ - vst40.u32 {q2,q3,q4,q5}, [r0] // ....................*... - // gap // ........................ - vst41.u32 {q2,q3,q4,q5}, [r0] // .....................*.. - // gap // ........................ - vst42.u32 {q2,q3,q4,q5}, [r0] // ......................*. - // gap // ........................ - vst43.u32 {q2,q3,q4,q5}, [r0]! // .......................* - - // original source code - // vadd.u16 q7, q2, q0 // *....................... - // vqrdmulh.s16 q6, q7, r7 // .........*.............. 
- // vldrw.u32 q4, [r0, #32] // ..*..................... - // vmul.s16 q5, q4, r8 // .....*.................. - // vldrw.u32 q1, [r0] // ........*............... - // vqrdmulh.s16 q4, q4, r3 // ...*.................... - // vsub.u16 q0, q2, q0 // ....*................... - // vmla.s16 q5, q4, r12 // .......*................ - // ldrd r1, r2, [r11, #-8] // ......*................. - // vmul.s16 q2, q7, r5 // .*...................... - // vadd.u16 q4, q1, q5 // ..........*............. - // vmla.s16 q2, q6, r12 // ...........*............ - // vsub.u16 q1, q1, q5 // ............*........... - // vmul.s16 q7, q0, r1 // .............*.......... - // vsub.u16 q5, q4, q2 // ................*....... - // vqrdmulh.s16 q0, q0, r2 // ...............*........ - // vadd.u16 q4, q4, q2 // ..............*......... - // vmla.s16 q7, q0, r12 // .................*...... - // vadd.u16 q6, q1, q7 // ..................*..... - // vsub.u16 q7, q1, q7 // ...................*.... - // vst40.u32 {q4,q5,q6,q7}, [r0] // ....................*... - // vst41.u32 {q4,q5,q6,q7}, [r0] // .....................*.. - // vst42.u32 {q4,q5,q6,q7}, [r0] // ......................*. - // vst43.u32 {q4,q5,q6,q7}, [r0]! // .......................* - - - sub in, in, #(4*128) - - /* Layers 6,7 */ - - .unreq root0 - .unreq root0_twisted - .unreq root1 - .unreq root1_twisted - .unreq root2 - .unreq root2_twisted - - root0 .req q5 - root0_twisted .req q6 - root1 .req q5 - root1_twisted .req q6 - root2 .req q5 - root2_twisted .req q6 - - mov lr, #8 - vldrh.u16 q4, [r11, #16] // *....... - // gap // ........ - vldrw.u32 q7, [r0, #48] // ..*..... - vqrdmulh.s16 q5, q7, q4 // ....*... - vldrh.u16 q3, [r11] , #96 // .*...... - vmul.s16 q0, q7, q3 // ......*. - vldrw.u32 q7, [r0, #16] // .....*.. - vmla.s16 q0, q5, r12 // .......* - vldrw.u32 q1, [r0, #32] // ...*.... - - // original source code - // vldrh.u16 q4, [r11, #16] // *....... - // vldrh.u16 q3, [r11] , #96 // ...*.... 
- // vldrw.u32 q0, [r0, #48] // .*...... - // vldrw.u32 q1, [r0, #32] // .......* - // vqrdmulh.s16 q2, q0, q4 // ..*..... - // vldrw.u32 q7, [r0, #16] // .....*.. - // vmul.s16 q0, q0, q3 // ....*... - // vmla.s16 q0, q2, r12 // ......*. - - sub lr, lr, #1 -.p2align 2 -layer67_loop: - vmul.s16 q6, q1, q3 // ......*........................... - vadd.u16 q2, q7, q0 // ...............*.................. - vldrw.u32 q3, [r0] , #64 // *................................. - vsub.u16 q7, q7, q0 // ..............*................... - vqrdmulh.s16 q0, q1, q4 // .......*.......................... - vldrh.u16 q4, [r11, #16] // .....e............................ - vmla.s16 q6, q0, r12 // ........*......................... - vldrh.u16 q1, [r11, #-32] // .......................*.......... - vsub.u16 q0, q3, q6 // .........*........................ - vmul.s16 q5, q7, q1 // .........................*........ - vadd.u16 q6, q3, q6 // ..........*....................... - vldrh.u16 q3, [r11, #-16] // ........................*......... - vqrdmulh.s16 q7, q7, q3 // ..........................*....... - vldrh.u16 q3, [r11] , #96 // ....e............................. - vmla.s16 q5, q7, r12 // ...........................*...... - vldrh.u16 q7, [r11, #-144] // .................*................ - vadd.u16 q1, q0, q5 // .............................*.... - vstrw.u32 q1, [r0, #-32] // ................................*. - vsub.u16 q5, q0, q5 // ............................*..... - vqrdmulh.s16 q1, q2, q7 // ...................*.............. - vldrh.u16 q7, [r11, #-160] // ................*................. - vmul.s16 q7, q2, q7 // ..................*............... - vldrw.u32 q0, [r0, #48] // ...e.............................. - vmla.s16 q7, q1, r12 // ....................*............. - vldrw.u32 q1, [r0, #32] // ..e............................... - vsub.u16 q2, q6, q7 // .....................*............ - vstrw.u32 q2, [r0, #-48] // ...............................*.. 
- vadd.u16 q6, q6, q7 // ......................*........... - vqrdmulh.s16 q2, q0, q4 // ............e..................... - vldrw.u32 q7, [r0, #16] // .e................................ - vmul.s16 q0, q0, q3 // ...........e...................... - vstrw.u32 q6, [r0, #-64] // ..............................*... - vmla.s16 q0, q2, r12 // .............e.................... - vstrw.u32 q5, [r0, #-16] // .................................* - - // original source code - // vldrw.u32 q0, [r0], #64 // .............................|.*............................... - // vldrw.u32 q1, [r0, #(16 - 64)] // ........................e....|............................e.... - // vldrw.u32 q2, [r0, #(32 - 64)] // ...................e.........|.......................e......... - // vldrw.u32 q3, [r0, #(48 - 64)] // .................e...........|.....................e........... - // vldrh.u16 q5, [r11], #+96 // ........e....................|............e.................... - // vldrh.u16 q6, [r11, #(+16-96)] // e............................|....e............................ - // vmul.s16 q4, q2, q5 // .............................*................................. - // vqrdmulh.s16 q2, q2, q6 // .............................|...*............................. - // vmla.s16 q4, q2, r12 // .*...........................|.....*........................... - // vsub.u16 q2, q0, q4 // ...*.........................|.......*......................... - // vadd.u16 q0, q0, q4 // .....*.......................|.........*....................... - // vmul.s16 q4, q3, q5 // .........................e...|.............................e... - // vqrdmulh.s16 q3, q3, q6 // .......................e.....|...........................e..... - // vmla.s16 q4, q3, r12 // ...........................e.|...............................e. - // vsub.u16 q3, q1, q4 // .............................|..*.............................. 
- // vadd.u16 q1, q1, q4 // .............................|*................................ - // vldrh.u16 q5, [r11, #(32 - 96)] // ...............*.............|...................*............. - // vldrh.u16 q6, [r11, #(48 - 96)] // ..........*..................|..............*.................. - // vmul.s16 q4, q1, q5 // ................*............|....................*............ - // vqrdmulh.s16 q1, q1, q6 // ..............*..............|..................*.............. - // vmla.s16 q4, q1, r12 // ..................*..........|......................*.......... - // vsub.u16 q1, q0, q4 // ....................*........|........................*........ - // vadd.u16 q0, q0, q4 // ......................*......|..........................*...... - // vldrh.u16 q5, [r11, #(64-96)] // ..*..........................|......*.......................... - // vldrh.u16 q6, [r11, #(80-96)] // ......*......................|..........*...................... - // vmul.s16 q4, q3, q5 // ....*........................|........*........................ - // vqrdmulh.s16 q3, q3, q6 // .......*.....................|...........*..................... - // vmla.s16 q4, q3, r12 // .........*...................|.............*................... - // vsub.u16 q3, q2, q4 // .............*...............|.................*............... - // vadd.u16 q2, q2, q4 // ...........*.................|...............*................. - // vstrw.32 q0, [r0, #( 0 - 64)] // ..........................*..|..............................*.. - // vstrw.32 q1, [r0, #(16 - 64)] // .....................*.......|.........................*....... - // vstrw.32 q2, [r0, #(32 - 64)] // ............*................|................*................ - // vstrw.32 q3, [r0, #(48 - 64)] // ............................*|................................* - - le lr, layer67_loop - vmul.s16 q5, q1, q3 // *......................... - vadd.u16 q6, q7, q0 // .*........................ 
- vqrdmulh.s16 q4, q1, q4 // ....*..................... - vsub.u16 q7, q7, q0 // ...*...................... - vmla.s16 q5, q4, r12 // .....*.................... - vldrh.u16 q4, [r11, #-64] // ..................*....... - vmul.s16 q0, q6, q4 // ...................*...... - vldrh.u16 q4, [r11, #-48] // .............*............ - vqrdmulh.s16 q4, q6, q4 // .................*........ - vldrh.u16 q6, [r11, #-32] // ......*................... - vmul.s16 q6, q7, q6 // ........*................. - vldrh.u16 q2, [r11, #-16] // ..........*............... - vqrdmulh.s16 q7, q7, q2 // ...........*.............. - vldrw.u32 q2, [r0] , #64 // ..*....................... - vsub.u16 q3, q2, q5 // .......*.................. - vmla.s16 q0, q4, r12 // ....................*..... - vadd.u16 q4, q2, q5 // .........*................ - vmla.s16 q6, q7, r12 // ............*............. - vsub.u16 q7, q4, q0 // .....................*.... - vstrw.u32 q7, [r0, #-48] // ......................*... - vadd.u16 q7, q3, q6 // ..............*........... - vstrw.u32 q7, [r0, #-32] // ...............*.......... - vsub.u16 q7, q3, q6 // ................*......... - vstrw.u32 q7, [r0, #-16] // .........................* - vadd.u16 q4, q4, q0 // .......................*.. - vstrw.u32 q4, [r0, #-64] // ........................*. - - // original source code - // vmul.s16 q6, q1, q3 // *......................... - // vadd.u16 q2, q7, q0 // .*........................ - // vldrw.u32 q3, [r0] , #64 // .............*............ - // vsub.u16 q7, q7, q0 // ...*...................... - // vqrdmulh.s16 q0, q1, q4 // ..*....................... - // vmla.s16 q6, q0, r12 // ....*..................... - // vldrh.u16 q1, [r11, #-32] // .........*................ - // vsub.u16 q0, q3, q6 // ..............*........... - // vmul.s16 q5, q7, q1 // ..........*............... - // vadd.u16 q6, q3, q6 // ................*......... - // vldrh.u16 q3, [r11, #-16] // ...........*.............. 
- // vqrdmulh.s16 q7, q7, q3 // ............*............. - // vmla.s16 q5, q7, r12 // .................*........ - // vldrh.u16 q7, [r11, #-48] // .......*.................. - // vadd.u16 q1, q0, q5 // ....................*..... - // vstrw.u32 q1, [r0, #-32] // .....................*.... - // vsub.u16 q5, q0, q5 // ......................*... - // vqrdmulh.s16 q1, q2, q7 // ........*................. - // vldrh.u16 q7, [r11, #-64] // .....*.................... - // vmul.s16 q7, q2, q7 // ......*................... - // vmla.s16 q7, q1, r12 // ...............*.......... - // vsub.u16 q2, q6, q7 // ..................*....... - // vstrw.u32 q2, [r0, #-48] // ...................*...... - // vadd.u16 q6, q6, q7 // ........................*. - // vstrw.u32 q6, [r0, #-64] // .........................* - // vstrw.u32 q5, [r0, #-16] // .......................*.. - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m85.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m85.s deleted file mode 100644 index 125d5b0..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_opt_m85.s +++ /dev/null @@ -1,618 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall 
be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -roots: -#include "ntt_kyber_1_23_45_67_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_twisted - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -.macro load_first_root root0, root0_twisted - ldrd root0, root0_twisted, [root_ptr], #+8 -.endm - -.macro load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - ldrd root0, root0_twisted, [root_ptr], #+24 - ldrd root1, root1_twisted, [root_ptr, #(-16)] - ldrd root2, root2_twisted, [root_ptr, #(-8)] -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_1_23_45_67_no_trans_opt_m85, %function -.global ntt_kyber_1_23_45_67_no_trans_opt_m85 -ntt_kyber_1_23_45_67_no_trans_opt_m85: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -3329 - movw modulus, #:lower16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*64) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* 
Layers 1 */ - - load_first_root root0, root0_twisted - - mov lr, #16 - vldrw.u32 q0, [r1] // *. - vqrdmulh.s16 q7, q0, r3 // .* - - // original source code - // vldrw.u32 q0, [r1] // *. - // vqrdmulh.s16 q7, q0, r3 // .* - - sub lr, lr, #1 -.p2align 2 -layer1_loop: - vmul.s16 q6, q0, r2 // ..*...... - vldrw.u32 q5, [r0] // *........ - vmla.s16 q6, q7, r12 // ....*.... - vldrw.u32 q0, [r1, #16] // .e....... - vsub.u16 q4, q5, q6 // .....*... - vstrw.u32 q4, [r1] , #16 // ........* - vadd.u16 q4, q5, q6 // ......*.. - vqrdmulh.s16 q7, q0, r3 // ...e..... - vstrw.u32 q4, [r0] , #16 // .......*. - - // original source code - // vldrw.u32 q0, [r0] // ......|*....... - // vldrw.u32 q1, [r1] // e.....|..e..... - // vmul.s16 q4, q1, r2 // ......*........ - // vqrdmulh.s16 q1, q1, r3 // ....e.|......e. - // vmla.s16 q4, q1, r12 // ......|.*...... - // vsub.u16 q1, q0, q4 // .*....|...*.... - // vadd.u16 q0, q0, q4 // ...*..|.....*.. - // vstrw.u32 q0, [r0], #16 // .....*|.......* - // vstrw.u32 q1, [r1], #16 // ..*...|....*... - - le lr, layer1_loop - vmul.s16 q0, q0, r2 // *...... - // gap // ....... - vmla.s16 q0, q7, r12 // ..*.... - vldrw.u32 q7, [r0] // .*..... - vsub.u16 q4, q7, q0 // ...*... - vstrw.u32 q4, [r1] , #16 // ....*.. - vadd.u16 q4, q7, q0 // .....*. - vstrw.u32 q4, [r0] , #16 // ......* - - // original source code - // vmul.s16 q6, q0, r2 // *...... - // vldrw.u32 q5, [r0] // ..*.... - // vmla.s16 q6, q7, r12 // .*..... - // vsub.u16 q4, q5, q6 // ...*... - // vstrw.u32 q4, [r1] , #16 // ....*.. - // vadd.u16 q4, q5, q6 // .....*. - // vstrw.u32 q4, [r0] , #16 // ......* - - .unreq in_high - .unreq in_low - - in .req r0 - sub in, in, #(4*64) - - /* Layers 2,3 */ - - count .req r1 - mov count, #2 - -out_start: - load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - - mov lr, #4 - vldrw.u32 q1, [r0, #128] // *. - vqrdmulh.s16 q3, q1, r3 // .* - - // original source code - // vldrw.u32 q1, [r0, #128] // *. 
- // vqrdmulh.s16 q3, q1, r3 // .* - - sub lr, lr, #1 -.p2align 2 -layer23_loop: - vmul.s16 q2, q1, r2 // ....*....................... - vldrw.u32 q4, [r0, #192] // ...*........................ - vqrdmulh.s16 q6, q4, r3 // ..........*................. - vldrw.u32 q1, [r0, #144] // ..e......................... - vmul.s16 q7, q4, r2 // .........*.................. - vldrw.u32 q5, [r0] // *........................... - vmla.s16 q7, q6, r12 // ...........*................ - vldrw.u32 q6, [r0, #64] // .*.......................... - vmla.s16 q2, q3, r12 // ......*..................... - vsub.u16 q0, q6, q7 // ............*............... - vmul.s16 q4, q0, r6 // ...................*........ - vadd.u16 q7, q6, q7 // .............*.............. - vqrdmulh.s16 q3, q0, r7 // ....................*....... - vadd.u16 q0, q5, q2 // ........*................... - vmla.s16 q4, q3, r12 // .....................*...... - vsub.u16 q2, q5, q2 // .......*.................... - vqrdmulh.s16 q3, q1, r3 // .....e...................... - vsub.u16 q5, q2, q4 // ......................*..... - vqrdmulh.s16 q6, q7, r5 // ...............*............ - vstrw.u32 q5, [r0, #192] // ...........................* - vmul.s16 q5, q7, r4 // ..............*............. - vadd.u16 q7, q2, q4 // .......................*.... - vmla.s16 q5, q6, r12 // ................*........... - vstrw.u32 q7, [r0, #128] // ..........................*. - vadd.u16 q2, q0, q5 // ..................*......... - vstrw.u32 q2, [r0] , #16 // ........................*... - vsub.u16 q4, q0, q5 // .................*.......... - vstrw.u32 q4, [r0, #48] // .........................*.. - - // original source code - // vldrw.u32 q0, [r0] // ..*......................|....*...................... - // vldrw.u32 q1, [r0, #(4*1*16)] // ....*....................|......*.................... - // vldrw.u32 q2, [r0, #(4*2*16)] // e........................|..e........................ 
- // vldrw.u32 q3, [r0, #(4*3*16)] // .........................|*.......................... - // vmul.s16 q4, q2, r2 // .........................*........................... - // vqrdmulh.s16 q2, q2, r3 // .............e...........|...............e........... - // vmla.s16 q4, q2, r12 // .....*...................|.......*................... - // vsub.u16 q2, q0, q4 // ............*............|..............*............ - // vadd.u16 q0, q0, q4 // ..........*..............|............*.............. - // vmul.s16 q4, q3, r2 // .*.......................|...*....................... - // vqrdmulh.s16 q3, q3, r3 // .........................|.*......................... - // vmla.s16 q4, q3, r12 // ...*.....................|.....*..................... - // vsub.u16 q3, q1, q4 // ......*..................|........*.................. - // vadd.u16 q1, q1, q4 // ........*................|..........*................ - // vmul.s16 q4, q1, r4 // .................*.......|...................*....... - // vqrdmulh.s16 q1, q1, r5 // ...............*.........|.................*......... - // vmla.s16 q4, q1, r12 // ...................*.....|.....................*..... - // vsub.u16 q1, q0, q4 // .......................*.|.........................*. - // vadd.u16 q0, q0, q4 // .....................*...|.......................*... - // vmul.s16 q4, q3, r6 // .......*.................|.........*................. - // vqrdmulh.s16 q3, q3, r7 // .........*...............|...........*............... - // vmla.s16 q4, q3, r12 // ...........*.............|.............*............. - // vsub.u16 q3, q2, q4 // ..............*..........|................*.......... - // vadd.u16 q2, q2, q4 // ..................*......|....................*...... - // vstrw.u32 q0, [r0], #16 // ......................*..|........................*.. 
- // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ........................*|..........................* - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // ....................*....|......................*.... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // ................*........|..................*........ - - le lr, layer23_loop - vmul.s16 q4, q1, r2 // *......................... - vldrw.u32 q7, [r0, #192] // .*........................ - vmul.s16 q6, q7, r2 // ...*...................... - // gap // .......................... - vqrdmulh.s16 q2, q7, r3 // ..*....................... - vldrw.u32 q5, [r0] // ....*..................... - vmla.s16 q6, q2, r12 // .....*.................... - vldrw.u32 q2, [r0, #64] // ......*................... - vmla.s16 q4, q3, r12 // .......*.................. - vadd.u16 q3, q2, q6 // ..........*............... - vqrdmulh.s16 q7, q3, r5 // ................*......... - vsub.u16 q0, q5, q4 // ..............*........... - vmul.s16 q1, q3, r4 // ..................*....... - vadd.u16 q5, q5, q4 // ............*............. - vmla.s16 q1, q7, r12 // ....................*..... - vsub.u16 q3, q2, q6 // ........*................. - vqrdmulh.s16 q2, q3, r7 // ...........*.............. - vsub.u16 q6, q5, q1 // ........................*. - vmul.s16 q3, q3, r6 // .........*................ - vadd.u16 q1, q5, q1 // ......................*... - vstrw.u32 q6, [r0, #64] // .........................* - vmla.s16 q3, q2, r12 // .............*............ - vstrw.u32 q1, [r0] , #16 // .......................*.. - vsub.u16 q4, q0, q3 // ...............*.......... - vstrw.u32 q4, [r0, #176] // .................*........ - vadd.u16 q1, q0, q3 // ...................*...... - vstrw.u32 q1, [r0, #112] // .....................*.... - - // original source code - // vmul.s16 q2, q1, r2 // *......................... - // vldrw.u32 q4, [r0, #192] // .*........................ - // vqrdmulh.s16 q6, q4, r3 // ...*...................... 
- // vmul.s16 q7, q4, r2 // ..*....................... - // vldrw.u32 q5, [r0] // ....*..................... - // vmla.s16 q7, q6, r12 // .....*.................... - // vldrw.u32 q6, [r0, #64] // ......*................... - // vmla.s16 q2, q3, r12 // .......*.................. - // vsub.u16 q0, q6, q7 // ..............*........... - // vmul.s16 q4, q0, r6 // .................*........ - // vadd.u16 q7, q6, q7 // ........*................. - // vqrdmulh.s16 q3, q0, r7 // ...............*.......... - // vadd.u16 q0, q5, q2 // ............*............. - // vmla.s16 q4, q3, r12 // ....................*..... - // vsub.u16 q2, q5, q2 // ..........*............... - // vsub.u16 q5, q2, q4 // ......................*... - // vqrdmulh.s16 q6, q7, r5 // .........*................ - // vstrw.u32 q5, [r0, #192] // .......................*.. - // vmul.s16 q5, q7, r4 // ...........*.............. - // vadd.u16 q7, q2, q4 // ........................*. - // vmla.s16 q5, q6, r12 // .............*............ - // vstrw.u32 q7, [r0, #128] // .........................* - // vadd.u16 q2, q0, q5 // ..................*....... - // vstrw.u32 q2, [r0] , #16 // .....................*.... - // vsub.u16 q4, q0, q5 // ................*......... - // vstrw.u32 q4, [r0, #48] // ...................*...... - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - sub in, in, #(4*128) - - /* Layers 4,5 */ - - mov lr, #8 - ldrd r10, r8, [r11] , #24 // *...... - vldrw.u32 q4, [r0, #32] // .*..... - vmul.s16 q6, q4, r10 // ....*.. - vldrw.u32 q7, [r0, #48] // ..*.... - vmul.s16 q5, q7, r10 // .....*. - // gap // ....... - vqrdmulh.s16 q0, q4, r8 // ...*... - // gap // ....... - vqrdmulh.s16 q7, q7, r8 // ......* - - // original source code - // ldrd r8, r10, [r11] , #24 // *...... - // vldrw.u32 q5, [r0, #32] // .*..... - // vldrw.u32 q7, [r0, #48] // ...*... - // vqrdmulh.s16 q0, q5, r10 // .....*. - // vmul.s16 q6, q5, r8 // ..*.... - // vmul.s16 q5, q7, r8 // ....*.. 
- // vqrdmulh.s16 q7, q7, r10 // ......* - - sub lr, lr, #1 -.p2align 2 -layer45_loop: - ldrd r2, r3, [r11, #-8] // ..*............................ - ldrd r8, r10, [r11, #-16] // .*............................. - vmla.s16 q6, q0, r12 // .........*..................... - vldrw.u32 q2, [r0] // ...*........................... - vmla.s16 q5, q7, r12 // ..............*................ - vldrw.u32 q4, [r0, #16] // ....*.......................... - vadd.u16 q0, q4, q5 // ................*.............. - vqrdmulh.s16 q7, q0, r10 // ..................*............ - vadd.u16 q3, q2, q6 // ...........*................... - vmul.s16 q0, q0, r8 // .................*............. - vsub.u16 q6, q2, q6 // ..........*.................... - vmla.s16 q0, q7, r12 // ...................*........... - vsub.u16 q7, q4, q5 // ...............*............... - vmul.s16 q4, q7, r2 // ......................*........ - vadd.u16 q1, q3, q0 // .....................*......... - vqrdmulh.s16 q7, q7, r3 // .......................*....... - vsub.u16 q2, q3, q0 // ....................*.......... - vmla.s16 q4, q7, r12 // ........................*...... - ldrd r8, r10, [r11] , #24 // e.............................. - vadd.u16 q3, q6, q4 // ..........................*.... - vldrw.u32 q5, [r0, #96] // .....e......................... - vsub.u16 q4, q6, q4 // .........................*..... - vldrw.u32 q7, [r0, #112] // ......e........................ - vqrdmulh.s16 q0, q5, r10 // ........e...................... - vst40.u32 {q1,q2,q3,q4}, [r0] // ...........................*... - vmul.s16 q6, q5, r8 // .......e....................... - vst41.u32 {q1,q2,q3,q4}, [r0] // ............................*.. - vmul.s16 q5, q7, r8 // ............e.................. - vst42.u32 {q1,q2,q3,q4}, [r0] // .............................*. - vqrdmulh.s16 q7, q7, r10 // .............e................. - vst43.u32 {q1,q2,q3,q4}, [r0]! 
// ..............................* - - // original source code - // ldrd r2, r3, [r11], #+24 // e............|.................e............ - // ldrd r4, r5, [r11, #(-16)] // .............|*............................. - // ldrd r6, r7, [r11, #(-8)] // .............*.............................. - // vldrw.u32 q0, [r0] // .............|..*........................... - // vldrw.u32 q1, [r0, #16] // .............|....*......................... - // vldrw.u32 q2, [r0, #32] // ..e..........|...................e.......... - // vldrw.u32 q3, [r0, #48] // ....e........|.....................e........ - // vmul.s16 q4, q2, r2 // .......e.....|........................e..... - // vqrdmulh.s16 q2, q2, r3 // .....e.......|......................e....... - // vmla.s16 q4, q2, r12 // .............|.*............................ - // vsub.u16 q2, q0, q4 // .............|.........*.................... - // vadd.u16 q0, q0, q4 // .............|.......*...................... - // vmul.s16 q4, q3, r2 // .........e...|..........................e... - // vqrdmulh.s16 q3, q3, r3 // ...........e.|............................e. - // vmla.s16 q4, q3, r12 // .............|...*.......................... - // vsub.u16 q3, q1, q4 // .............|...........*.................. - // vadd.u16 q1, q1, q4 // .............|.....*........................ - // vmul.s16 q4, q1, r4 // .............|........*..................... - // vqrdmulh.s16 q1, q1, r5 // .............|......*....................... - // vmla.s16 q4, q1, r12 // .............|..........*................... - // vsub.u16 q1, q0, q4 // .............|...............*.............. - // vadd.u16 q0, q0, q4 // .............|.............*................ - // vmul.s16 q4, q3, r6 // .............|............*................. - // vqrdmulh.s16 q3, q3, r7 // .............|..............*............... - // vmla.s16 q4, q3, r12 // .............|................*............. 
- // vsub.u16 q3, q2, q4 // ...*.........|....................*......... - // vadd.u16 q2, q2, q4 // .*...........|..................*........... - // vst40.u32 {q0, q1, q2, q3}, [r0] // ......*......|.......................*...... - // vst41.u32 {q0, q1, q2, q3}, [r0] // ........*....|.........................*.... - // vst42.u32 {q0, q1, q2, q3}, [r0] // ..........*..|...........................*.. - // vst43.u32 {q0, q1, q2, q3}, [r0]! // ............*|.............................* - - le lr, layer45_loop - vmla.s16 q6, q0, r12 // ..*..................... - ldrd r8, r10, [r11, #-8] // *....................... - vmla.s16 q5, q7, r12 // ....*................... - vldrw.u32 q7, [r0, #16] // .....*.................. - vsub.u16 q2, q7, q5 // ............*........... - vqrdmulh.s16 q1, q2, r10 // ...............*........ - vldrw.u32 q0, [r0] // ...*.................... - vadd.u16 q3, q0, q6 // ........*............... - vmul.s16 q4, q2, r8 // .............*.......... - vsub.u16 q2, q0, q6 // ..........*............. - vmla.s16 q4, q1, r12 // .................*...... - ldrd r9, r1, [r11, #-16] // .*...................... - vadd.u16 q7, q7, q5 // ......*................. - vmul.s16 q1, q7, r9 // .........*.............. - vadd.u16 q6, q2, q4 // ..................*..... - vqrdmulh.s16 q0, q7, r1 // .......*................ - vsub.u16 q7, q2, q4 // ...................*.... - vmla.s16 q1, q0, r12 // ...........*............ - // gap // ........................ - vadd.u16 q4, q3, q1 // ..............*......... - // gap // ........................ - vsub.u16 q5, q3, q1 // ................*....... - // gap // ........................ - // gap // ........................ - vst40.u32 {q4,q5,q6,q7}, [r0] // ....................*... - // gap // ........................ - vst41.u32 {q4,q5,q6,q7}, [r0] // .....................*.. - // gap // ........................ - vst42.u32 {q4,q5,q6,q7}, [r0] // ......................*. - // gap // ........................ 
- vst43.u32 {q4,q5,q6,q7}, [r0]! // .......................* - - // original source code - // ldrd r2, r3, [r11, #-8] // .*...................... - // ldrd r8, r10, [r11, #-16] // ...........*............ - // vmla.s16 q6, q0, r12 // *....................... - // vldrw.u32 q2, [r0] // ......*................. - // vmla.s16 q5, q7, r12 // ..*..................... - // vldrw.u32 q4, [r0, #16] // ...*.................... - // vadd.u16 q0, q4, q5 // ............*........... - // vqrdmulh.s16 q7, q0, r10 // ...............*........ - // vadd.u16 q3, q2, q6 // .......*................ - // vmul.s16 q0, q0, r8 // .............*.......... - // vsub.u16 q6, q2, q6 // .........*.............. - // vmla.s16 q0, q7, r12 // .................*...... - // vsub.u16 q7, q4, q5 // ....*................... - // vmul.s16 q4, q7, r2 // ........*............... - // vadd.u16 q1, q3, q0 // ..................*..... - // vqrdmulh.s16 q7, q7, r3 // .....*.................. - // vsub.u16 q2, q3, q0 // ...................*.... - // vmla.s16 q4, q7, r12 // ..........*............. - // vadd.u16 q3, q6, q4 // ..............*......... - // vsub.u16 q4, q6, q4 // ................*....... - // vst40.u32 {q1,q2,q3,q4}, [r0] // ....................*... - // vst41.u32 {q1,q2,q3,q4}, [r0] // .....................*.. - // vst42.u32 {q1,q2,q3,q4}, [r0] // ......................*. - // vst43.u32 {q1,q2,q3,q4}, [r0]! // .......................* - - - sub in, in, #(4*128) - - /* Layers 6,7 */ - - .unreq root0 - .unreq root0_twisted - .unreq root1 - .unreq root1_twisted - .unreq root2 - .unreq root2_twisted - - root0 .req q5 - root0_twisted .req q6 - root1 .req q5 - root1_twisted .req q6 - root2 .req q5 - root2_twisted .req q6 - - mov lr, #8 - vldrh.u16 q1, [r11] , #96 // * - - // original source code - // vldrh.u16 q1, [r11] , #96 // * - - sub lr, lr, #1 -.p2align 2 -layer67_loop: - vldrw.u32 q6, [r0, #32] // ..*............................... - vmul.s16 q2, q6, q1 // ......*........................... 
- vldrh.u16 q5, [r11, #-80] // .....*............................ - vqrdmulh.s16 q0, q6, q5 // .......*.......................... - vldrw.u32 q7, [r0, #48] // ...*.............................. - vmla.s16 q2, q0, r12 // ........*......................... - vldrh.u16 q3, [r11, #-32] // .......................*.......... - vmul.s16 q0, q7, q1 // ...........*...................... - vldrw.u32 q4, [r0] , #64 // *................................. - vsub.u16 q1, q4, q2 // .........*........................ - vqrdmulh.s16 q7, q7, q5 // ............*..................... - vadd.u16 q5, q4, q2 // ..........*....................... - vmla.s16 q0, q7, r12 // .............*.................... - vldrw.u32 q6, [r0, #-48] // .*................................ - vsub.u16 q4, q6, q0 // ..............*................... - vmul.s16 q2, q4, q3 // .........................*........ - vldrh.u16 q3, [r11, #-16] // ........................*......... - vqrdmulh.s16 q3, q4, q3 // ..........................*....... - vadd.u16 q6, q6, q0 // ...............*.................. - vmla.s16 q2, q3, r12 // ...........................*...... - vldrh.u16 q4, [r11, #-64] // ................*................. - vsub.u16 q7, q1, q2 // ............................*..... - vstrw.u32 q7, [r0, #-16] // .................................* - vadd.u16 q3, q1, q2 // .............................*.... - vmul.s16 q4, q6, q4 // ..................*............... - vldrh.u16 q7, [r11, #-48] // .................*................ - vqrdmulh.s16 q0, q6, q7 // ...................*.............. - vldrh.u16 q1, [r11] , #96 // ....e............................. - vmla.s16 q4, q0, r12 // ....................*............. - vstrw.u32 q3, [r0, #-32] // ................................*. - vadd.u16 q7, q5, q4 // ......................*........... - vstrw.u32 q7, [r0, #-64] // ..............................*... - vsub.u16 q0, q5, q4 // .....................*............ 
- vstrw.u32 q0, [r0, #-48] // ...............................*.. - - // original source code - // vldrw.u32 q0, [r0], #64 // .......|.......*......................... - // vldrw.u32 q1, [r0, #(16 - 64)] // .......|............*.................... - // vldrw.u32 q2, [r0, #(32 - 64)] // .......*................................. - // vldrw.u32 q3, [r0, #(48 - 64)] // .......|...*............................. - // vldrh.u16 q5, [r11], #+96 // e......|..........................e...... - // vldrh.u16 q6, [r11, #(+16-96)] // .......|.*............................... - // vmul.s16 q4, q2, q5 // .......|*................................ - // vqrdmulh.s16 q2, q2, q6 // .......|..*.............................. - // vmla.s16 q4, q2, r12 // .......|....*............................ - // vsub.u16 q2, q0, q4 // .......|........*........................ - // vadd.u16 q0, q0, q4 // .......|..........*...................... - // vmul.s16 q4, q3, q5 // .......|......*.......................... - // vqrdmulh.s16 q3, q3, q6 // .......|.........*....................... - // vmla.s16 q4, q3, r12 // .......|...........*..................... - // vsub.u16 q3, q1, q4 // .......|.............*................... - // vadd.u16 q1, q1, q4 // .......|.................*............... - // vldrh.u16 q5, [r11, #(32 - 96)] // .......|...................*............. - // vldrh.u16 q6, [r11, #(48 - 96)] // .......|........................*........ - // vmul.s16 q4, q1, q5 // .......|.......................*......... - // vqrdmulh.s16 q1, q1, q6 // .......|.........................*....... - // vmla.s16 q4, q1, r12 // .*.....|...........................*..... - // vsub.u16 q1, q0, q4 // .....*.|...............................*. - // vadd.u16 q0, q0, q4 // ...*...|.............................*... - // vldrh.u16 q5, [r11, #(64-96)] // .......|.....*........................... - // vldrh.u16 q6, [r11, #(80-96)] // .......|...............*................. 
- // vmul.s16 q4, q3, q5 // .......|..............*.................. - // vqrdmulh.s16 q3, q3, q6 // .......|................*................ - // vmla.s16 q4, q3, r12 // .......|..................*.............. - // vsub.u16 q3, q2, q4 // .......|....................*............ - // vadd.u16 q2, q2, q4 // .......|......................*.......... - // vstrw.32 q0, [r0, #( 0 - 64)] // ....*..|..............................*.. - // vstrw.32 q1, [r0, #(16 - 64)] // ......*|................................* - // vstrw.32 q2, [r0, #(32 - 64)] // ..*....|............................*.... - // vstrw.32 q3, [r0, #(48 - 64)] // .......|.....................*........... - - le lr, layer67_loop - vldrw.u32 q4, [r0, #32] // *................................ - vmul.s16 q7, q4, q1 // .*............................... - vldrw.u32 q0, [r0, #48] // ....*............................ - vmul.s16 q5, q0, q1 // .......*......................... - vldrh.u16 q6, [r11, #-80] // ..*.............................. - vqrdmulh.s16 q4, q4, q6 // ...*............................. - vldrh.u16 q2, [r11, #-32] // ......*.......................... - vmla.s16 q7, q4, r12 // .....*........................... - vldrw.u32 q4, [r0] , #64 // ........*........................ - vqrdmulh.s16 q0, q0, q6 // ..........*...................... - vsub.u16 q6, q4, q7 // .........*....................... - vmla.s16 q5, q0, r12 // ............*.................... - vldrw.u32 q0, [r0, #-48] // .............*................... - vsub.u16 q3, q0, q5 // ..............*.................. - vmul.s16 q2, q3, q2 // ...............*................. - vadd.u16 q4, q4, q7 // ...........*..................... - vldrh.u16 q7, [r11, #-16] // ................*................ - vqrdmulh.s16 q7, q3, q7 // .................*............... - vadd.u16 q0, q0, q5 // ..................*.............. - vmla.s16 q2, q7, r12 // ...................*............. - vldrh.u16 q7, [r11, #-64] // ....................*............ 
- vsub.u16 q5, q6, q2 // .....................*........... - vmul.s16 q7, q0, q7 // ........................*........ - vadd.u16 q6, q6, q2 // .......................*......... - vldrh.u16 q2, [r11, #-48] // .........................*....... - vqrdmulh.s16 q0, q0, q2 // ..........................*...... - vstrw.u32 q5, [r0, #-16] // ......................*.......... - vmla.s16 q7, q0, r12 // ...........................*..... - vstrw.u32 q6, [r0, #-32] // ............................*.... - vadd.u16 q0, q4, q7 // .............................*... - vstrw.u32 q0, [r0, #-64] // ..............................*.. - vsub.u16 q4, q4, q7 // ...............................*. - vstrw.u32 q4, [r0, #-48] // ................................* - - // original source code - // vldrw.u32 q6, [r0, #32] // *................................ - // vmul.s16 q2, q6, q1 // .*............................... - // vldrh.u16 q5, [r11, #-80] // ....*............................ - // vqrdmulh.s16 q0, q6, q5 // .....*........................... - // vldrw.u32 q7, [r0, #48] // ..*.............................. - // vmla.s16 q2, q0, r12 // .......*......................... - // vldrh.u16 q3, [r11, #-32] // ......*.......................... - // vmul.s16 q0, q7, q1 // ...*............................. - // vldrw.u32 q4, [r0] , #64 // ........*........................ - // vsub.u16 q1, q4, q2 // ..........*...................... - // vqrdmulh.s16 q7, q7, q5 // .........*....................... - // vadd.u16 q5, q4, q2 // ...............*................. - // vmla.s16 q0, q7, r12 // ...........*..................... - // vldrw.u32 q6, [r0, #-48] // ............*.................... - // vsub.u16 q4, q6, q0 // .............*................... - // vmul.s16 q2, q4, q3 // ..............*.................. - // vldrh.u16 q3, [r11, #-16] // ................*................ - // vqrdmulh.s16 q3, q4, q3 // .................*............... - // vadd.u16 q6, q6, q0 // ..................*.............. 
- // vmla.s16 q2, q3, r12 // ...................*............. - // vldrh.u16 q4, [r11, #-64] // ....................*............ - // vsub.u16 q7, q1, q2 // .....................*........... - // vstrw.u32 q7, [r0, #-16] // ..........................*...... - // vadd.u16 q3, q1, q2 // .......................*......... - // vmul.s16 q4, q6, q4 // ......................*.......... - // vldrh.u16 q7, [r11, #-48] // ........................*........ - // vqrdmulh.s16 q0, q6, q7 // .........................*....... - // vmla.s16 q4, q0, r12 // ...........................*..... - // vstrw.u32 q3, [r0, #-32] // ............................*.... - // vadd.u16 q7, q5, q4 // .............................*... - // vstrw.u32 q7, [r0, #-64] // ..............................*.. - // vsub.u16 q0, q5, q4 // ...............................*. - // vstrw.u32 q0, [r0, #-48] // ................................* - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4.s deleted file mode 100644 index cef949c..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4.s +++ /dev/null @@ -1,217 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and 
this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -roots: -#include "ntt_kyber_1_23_45_67_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_twisted - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -.macro load_first_root root0, root0_twisted - ldrd root0, root0_twisted, [root_ptr], #+8 -.endm - -.macro load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - ldrd root0, root0_twisted, [root_ptr], #+24 - ldrd root1, root1_twisted, [root_ptr, #(-16)] - ldrd root2, root2_twisted, [root_ptr, #(-8)] -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_1_23_45_67_no_trans_vld4, %function -.global ntt_kyber_1_23_45_67_no_trans_vld4 -ntt_kyber_1_23_45_67_no_trans_vld4: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -3329 - movw modulus, #:lower16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*64) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp 
.req q4 - - /* Layers 1 */ - - load_first_root root0, root0_twisted - - mov lr, #16 -layer1_loop: - vldrw.u32 data0, [in_low] - vldrw.u32 data1, [in_high] - - ct_butterfly data0, data1, root0, root0_twisted - - vstrw.u32 data0, [in_low], #16 - vstrw.u32 data1, [in_high], #16 - - le lr, layer1_loop - .unreq in_high - .unreq in_low - - in .req r0 - sub in, in, #(4*64) - - /* Layers 2,3 */ - - count .req r1 - mov count, #2 - -out_start: - load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - - mov lr, #4 -layer23_loop: - vldrw.u32 data0, [in] - vldrw.u32 data1, [in, #(4*1*16)] - vldrw.u32 data2, [in, #(4*2*16)] - vldrw.u32 data3, [in, #(4*3*16)] - - ct_butterfly data0, data2, root0, root0_twisted - ct_butterfly data1, data3, root0, root0_twisted - ct_butterfly data0, data1, root1, root1_twisted - ct_butterfly data2, data3, root2, root2_twisted - - vstrw.u32 data0, [in], #16 - vstrw.u32 data1, [in, #(4*1*16 - 16)] - vstrw.u32 data2, [in, #(4*2*16 - 16)] - vstrw.u32 data3, [in, #(4*3*16 - 16)] - - le lr, layer23_loop - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - sub in, in, #(4*128) - - /* Layers 4,5 */ - - mov lr, #8 -layer45_loop: - load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - - vldrw.u32 data0, [in] - vldrw.u32 data1, [in, #16] - vldrw.u32 data2, [in, #32] - vldrw.u32 data3, [in, #48] - - ct_butterfly data0, data2, root0, root0_twisted - ct_butterfly data1, data3, root0, root0_twisted - ct_butterfly data0, data1, root1, root1_twisted - ct_butterfly data2, data3, root2, root2_twisted - - vstrw.u32 data0, [in], #64 - vstrw.u32 data1, [in, #(-64+16)] - vstrw.u32 data2, [in, #(-64+32)] - vstrw.u32 data3, [in, #(-64+48)] - - le lr, layer45_loop - - sub in, in, #(4*128) - - /* Layers 6,7 */ - - .unreq root0 - .unreq root0_twisted - .unreq root1 - .unreq root1_twisted - .unreq root2 - .unreq root2_twisted - - root0 .req q5 - root0_twisted .req q6 - root1 .req q5 - root1_twisted 
.req q6 - root2 .req q5 - root2_twisted .req q6 - - mov lr, #8 -layer67_loop: - vld40.u32 {data0, data1, data2, data3}, [in] - vld41.u32 {data0, data1, data2, data3}, [in] - vld42.u32 {data0, data1, data2, data3}, [in] - vld43.u32 {data0, data1, data2, data3}, [in]! - - vldrh.u16 root0, [root_ptr], #+96 - vldrh.u16 root0_twisted, [root_ptr, #(+16-96)] - ct_butterfly data0, data2, root0, root0_twisted - ct_butterfly data1, data3, root0, root0_twisted - - vldrh.u16 root1, [root_ptr, #(32 - 96)] - vldrh.u16 root1_twisted, [root_ptr, #(48 - 96)] - ct_butterfly data0, data1, root1, root1_twisted - - vldrh.u16 root2, [root_ptr, #(64-96)] - vldrh.u16 root2_twisted, [root_ptr, #(80-96)] - ct_butterfly data2, data3, root2, root2_twisted - - vstrw.32 data0, [in, #( 0 - 64)] - vstrw.32 data1, [in, #(16 - 64)] - vstrw.32 data2, [in, #(32 - 64)] - vstrw.32 data3, [in, #(48 - 64)] - le lr, layer67_loop - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55.s deleted file mode 100644 index 0d1cf03..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55.s +++ /dev/null @@ -1,622 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this 
permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -roots: -#include "ntt_kyber_1_23_45_67_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_twisted - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -.macro load_first_root root0, root0_twisted - ldrd root0, root0_twisted, [root_ptr], #+8 -.endm - -.macro load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - ldrd root0, root0_twisted, [root_ptr], #+24 - ldrd root1, root1_twisted, [root_ptr, #(-16)] - ldrd root2, root2_twisted, [root_ptr, #(-8)] -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55, %function -.global ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55 -ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -3329 - movw modulus, #:lower16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*64) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - 
data3 .req q3 - - tmp .req q4 - - /* Layers 1 */ - - load_first_root root0, root0_twisted - - mov lr, #16 - vldrw.u32 q0, [r1] // * - - // original source code - // vldrw.u32 q0, [r1] // * - - sub lr, lr, #1 -.p2align 2 -layer1_loop: - vmul.s16 q7, q0, r2 // ..*...... - // gap // ......... - vqrdmulh.s16 q4, q0, r3 // ...*..... - vldrw.u32 q5, [r0] // *........ - vmla.s16 q7, q4, r12 // ....*.... - vldrw.u32 q0, [r1, #16] // .e....... - vsub.u16 q4, q5, q7 // .....*... - vstrw.u32 q4, [r1] , #16 // ........* - vadd.u16 q4, q5, q7 // ......*.. - vstrw.u32 q4, [r0] , #16 // .......*. - - // original source code - // vldrw.u32 q0, [r0] // .....|.*...... - // vldrw.u32 q1, [r1] // e....|...e.... - // vmul.s16 q4, q1, r2 // .....*........ - // vqrdmulh.s16 q1, q1, r3 // .....|*....... - // vmla.s16 q4, q1, r12 // .....|..*..... - // vsub.u16 q1, q0, q4 // .*...|....*... - // vadd.u16 q0, q0, q4 // ...*.|......*. - // vstrw.u32 q0, [r0], #16 // ....*|.......* - // vstrw.u32 q1, [r1], #16 // ..*..|.....*.. - - le lr, layer1_loop - vqrdmulh.s16 q4, q0, r3 // .*...... - vldrw.u32 q7, [r0] // ..*..... - vmul.s16 q0, q0, r2 // *....... - // gap // ........ - vmla.s16 q0, q4, r12 // ...*.... - // gap // ........ - vsub.u16 q4, q7, q0 // ....*... - vstrw.u32 q4, [r1] , #16 // .....*.. - vadd.u16 q4, q7, q0 // ......*. - vstrw.u32 q4, [r0] , #16 // .......* - - // original source code - // vmul.s16 q7, q0, r2 // ..*..... - // vqrdmulh.s16 q4, q0, r3 // *....... - // vldrw.u32 q5, [r0] // .*...... - // vmla.s16 q7, q4, r12 // ...*.... - // vsub.u16 q4, q5, q7 // ....*... - // vstrw.u32 q4, [r1] , #16 // .....*.. - // vadd.u16 q4, q5, q7 // ......*. - // vstrw.u32 q4, [r0] , #16 // .......* - - .unreq in_high - .unreq in_low - - in .req r0 - sub in, in, #(4*64) - - /* Layers 2,3 */ - - count .req r1 - mov count, #2 - -out_start: - load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - - mov lr, #4 - vldrw.u32 q4, [r0, #192] // *.... 
- vqrdmulh.s16 q6, q4, r3 // ..*.. - vldrw.u32 q2, [r0, #64] // .*... - vmul.s16 q7, q4, r2 // ...*. - // gap // ..... - vmla.s16 q7, q6, r12 // ....* - - // original source code - // vldrw.u32 q7, [r0, #192] // *.... - // vldrw.u32 q2, [r0, #64] // ..*.. - // vqrdmulh.s16 q3, q7, r3 // .*... - // vmul.s16 q7, q7, r2 // ...*. - // vmla.s16 q7, q3, r12 // ....* - - sub lr, lr, #1 -.p2align 2 -layer23_loop: - vsub.u16 q6, q2, q7 // ............*............... - vmul.s16 q4, q6, r6 // ...................*........ - vldrw.u32 q5, [r0, #128] // ..*......................... - vmul.s16 q0, q5, r2 // ....*....................... - vldrw.u32 q1, [r0] // *........................... - vqrdmulh.s16 q5, q5, r3 // .....*...................... - vadd.u16 q3, q2, q7 // .............*.............. - vmla.s16 q0, q5, r12 // ......*..................... - vldrw.u32 q7, [r0, #208] // ...e........................ - vsub.u16 q5, q1, q0 // .......*.................... - vqrdmulh.s16 q2, q6, r7 // ....................*....... - vadd.u16 q0, q1, q0 // ........*................... - vmla.s16 q4, q2, r12 // .....................*...... - vldrw.u32 q2, [r0, #80] // .e.......................... - vadd.u16 q6, q5, q4 // .......................*.... - vmul.s16 q1, q3, r4 // ..............*............. - vsub.u16 q4, q5, q4 // ......................*..... - vqrdmulh.s16 q3, q3, r5 // ...............*............ - vstrw.u32 q4, [r0, #192] // ...........................* - vmla.s16 q1, q3, r12 // ................*........... - vstrw.u32 q6, [r0, #128] // ..........................*. - vadd.u16 q5, q0, q1 // ..................*......... - vqrdmulh.s16 q3, q7, r3 // ..........e................. - vstrw.u32 q5, [r0] , #16 // ........................*... - vmul.s16 q7, q7, r2 // .........e.................. - vsub.u16 q5, q0, q1 // .................*.......... - vmla.s16 q7, q3, r12 // ...........e................ - vstrw.u32 q5, [r0, #48] // .........................*.. 
- - // original source code - // vldrw.u32 q0, [r0] // ....................|...*....................... - // vldrw.u32 q1, [r0, #(4*1*16)] // .....e..............|............e.............. - // vldrw.u32 q2, [r0, #(4*2*16)] // ....................|.*......................... - // vldrw.u32 q3, [r0, #(4*3*16)] // e...................|.......e................... - // vmul.s16 q4, q2, r2 // ....................|..*........................ - // vqrdmulh.s16 q2, q2, r3 // ....................|....*...................... - // vmla.s16 q4, q2, r12 // ....................|......*.................... - // vsub.u16 q2, q0, q4 // .*..................|........*.................. - // vadd.u16 q0, q0, q4 // ...*................|..........*................ - // vmul.s16 q4, q3, r2 // ................e...|.......................e... - // vqrdmulh.s16 q3, q3, r3 // ..............e.....|.....................e..... - // vmla.s16 q4, q3, r12 // ..................e.|.........................e. - // vsub.u16 q3, q1, q4 // ....................*........................... - // vadd.u16 q1, q1, q4 // ....................|.....*..................... - // vmul.s16 q4, q1, r4 // .......*............|..............*............ - // vqrdmulh.s16 q1, q1, r5 // .........*..........|................*.......... - // vmla.s16 q4, q1, r12 // ...........*........|..................*........ - // vsub.u16 q1, q0, q4 // .................*..|........................*.. - // vadd.u16 q0, q0, q4 // .............*......|....................*...... - // vmul.s16 q4, q3, r6 // ....................|*.......................... - // vqrdmulh.s16 q3, q3, r7 // ..*.................|.........*................. - // vmla.s16 q4, q3, r12 // ....*...............|...........*............... - // vsub.u16 q3, q2, q4 // ........*...........|...............*........... - // vadd.u16 q2, q2, q4 // ......*.............|.............*............. 
- // vstrw.u32 q0, [r0], #16 // ...............*....|......................*.... - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ...................*|..........................* - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // ............*.......|...................*....... - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // ..........*.........|.................*......... - - le lr, layer23_loop - vldrw.u32 q4, [r0, #128] // ..*.................... - vmul.s16 q0, q4, r2 // ...*................... - vsub.u16 q5, q2, q7 // *...................... - vqrdmulh.s16 q4, q4, r3 // .....*................. - vadd.u16 q3, q2, q7 // ......*................ - vmla.s16 q0, q4, r12 // .......*............... - vldrw.u32 q4, [r0] // ....*.................. - vmul.s16 q1, q3, r4 // .............*......... - vsub.u16 q6, q4, q0 // ........*.............. - vqrdmulh.s16 q2, q3, r5 // ...............*....... - vadd.u16 q0, q4, q0 // ..........*............ - vmla.s16 q1, q2, r12 // .................*..... - // gap // ....................... - vmul.s16 q2, q5, r6 // .*..................... - vsub.u16 q7, q0, q1 // .....................*. - vqrdmulh.s16 q4, q5, r7 // .........*............. - vstrw.u32 q7, [r0, #64] // ......................* - vmla.s16 q2, q4, r12 // ...........*........... - vadd.u16 q4, q0, q1 // ...................*... - vstrw.u32 q4, [r0] , #16 // ....................*.. - vadd.u16 q3, q6, q2 // ............*.......... - vstrw.u32 q3, [r0, #112] // ..................*.... - vsub.u16 q5, q6, q2 // ..............*........ - vstrw.u32 q5, [r0, #176] // ................*...... - - // original source code - // vsub.u16 q6, q2, q7 // ..*.................... - // vmul.s16 q4, q6, r6 // ............*.......... - // vldrw.u32 q5, [r0, #128] // *...................... - // vmul.s16 q0, q5, r2 // .*..................... - // vldrw.u32 q1, [r0] // ......*................ - // vqrdmulh.s16 q5, q5, r3 // ...*................... - // vadd.u16 q3, q2, q7 // ....*.................. 
- // vmla.s16 q0, q5, r12 // .....*................. - // vsub.u16 q5, q1, q0 // ........*.............. - // vqrdmulh.s16 q2, q6, r7 // ..............*........ - // vadd.u16 q0, q1, q0 // ..........*............ - // vmla.s16 q4, q2, r12 // ................*...... - // vadd.u16 q6, q5, q4 // ...................*... - // vmul.s16 q1, q3, r4 // .......*............... - // vsub.u16 q4, q5, q4 // .....................*. - // vqrdmulh.s16 q3, q3, r5 // .........*............. - // vstrw.u32 q4, [r0, #192] // ......................* - // vmla.s16 q1, q3, r12 // ...........*........... - // vstrw.u32 q6, [r0, #128] // ....................*.. - // vadd.u16 q5, q0, q1 // .................*..... - // vstrw.u32 q5, [r0] , #16 // ..................*.... - // vsub.u16 q5, q0, q1 // .............*......... - // vstrw.u32 q5, [r0, #48] // ...............*....... - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - sub in, in, #(4*128) - - /* Layers 4,5 */ - - mov lr, #8 - ldrd r9, r4, [r11] , #24 // *...... - vldrw.u32 q4, [r0, #32] // .*..... - vqrdmulh.s16 q7, q4, r4 // ..*.... - vldrw.u32 q1, [r0, #48] // ......* - vmul.s16 q0, q4, r9 // ...*... - vldrw.u32 q4, [r0] // .....*. - vmla.s16 q0, q7, r12 // ....*.. - - // original source code - // ldrd r9, r4, [r11] , #24 // *...... - // vldrw.u32 q2, [r0, #32] // .*..... - // vqrdmulh.s16 q5, q2, r4 // ..*.... - // vmul.s16 q0, q2, r9 // ....*.. - // vmla.s16 q0, q5, r12 // ......* - // vldrw.u32 q4, [r0] // .....*. - // vldrw.u32 q1, [r0, #48] // ...*... - - sub lr, lr, #1 -.p2align 2 -layer45_loop: - vmul.s16 q7, q1, r9 // ............*.................. - vsub.u16 q6, q4, q0 // ..........*.................... - vqrdmulh.s16 q3, q1, r4 // .............*................. - ldrd r9, r4, [r11] , #24 // e.............................. - vmla.s16 q7, q3, r12 // ..............*................ - vldrw.u32 q2, [r0, #96] // .....e......................... 
- vqrdmulh.s16 q5, q2, r4 // ........e...................... - vadd.u16 q3, q4, q0 // ...........*................... - vmul.s16 q0, q2, r9 // .......e....................... - ldrd r6, r5, [r11, #-32] // ..*............................ - vmla.s16 q0, q5, r12 // .........e..................... - vldrw.u32 q5, [r0, #16] // ....*.......................... - vsub.u16 q2, q5, q7 // ...............*............... - vqrdmulh.s16 q4, q2, r5 // .......................*....... - vadd.u16 q1, q5, q7 // ................*.............. - vmul.s16 q7, q2, r6 // ......................*........ - ldrd r7, r2, [r11, #-40] // .*............................. - vmla.s16 q7, q4, r12 // ........................*...... - vldrw.u32 q4, [r0, #64] // ...e........................... - vadd.u16 q2, q6, q7 // ..........................*.... - vmul.s16 q5, q1, r7 // .................*............. - vstrw.u32 q2, [r0, #32] // .............................*. - vsub.u16 q6, q6, q7 // .........................*..... - vqrdmulh.s16 q7, q1, r2 // ..................*............ - vldrw.u32 q1, [r0, #112] // ......e........................ - vmla.s16 q5, q7, r12 // ...................*........... - vstrw.u32 q6, [r0, #48] // ..............................* - vadd.u16 q2, q3, q5 // .....................*......... - vstrw.u32 q2, [r0] , #64 // ...........................*... - vsub.u16 q7, q3, q5 // ....................*.......... - vstrw.u32 q7, [r0, #-48] // ............................*.. - - // original source code - // ldrd r2, r3, [r11], #+24 // e...........................|..e........................... - // ldrd r4, r5, [r11, #(-16)] // .............*..............|...............*.............. - // ldrd r6, r7, [r11, #(-8)] // ......*.....................|........*..................... - // vldrw.u32 q0, [r0] // ...............e............|.................e............ - // vldrw.u32 q1, [r0, #16] // ........*...................|..........*................... 
- // vldrw.u32 q2, [r0, #32] // ..e.........................|....e......................... - // vldrw.u32 q3, [r0, #48] // .....................e......|.......................e...... - // vmul.s16 q4, q2, r2 // .....e......................|.......e...................... - // vqrdmulh.s16 q2, q2, r3 // ...e........................|.....e........................ - // vmla.s16 q4, q2, r12 // .......e....................|.........e.................... - // vsub.u16 q2, q0, q4 // ............................|*............................. - // vadd.u16 q0, q0, q4 // ....*.......................|......*....................... - // vmul.s16 q4, q3, r2 // ............................*.............................. - // vqrdmulh.s16 q3, q3, r3 // ............................|.*............................ - // vmla.s16 q4, q3, r12 // .*..........................|...*.......................... - // vsub.u16 q3, q1, q4 // .........*..................|...........*.................. - // vadd.u16 q1, q1, q4 // ...........*................|.............*................ - // vmul.s16 q4, q1, r4 // .................*..........|...................*.......... - // vqrdmulh.s16 q1, q1, r5 // ....................*.......|......................*....... - // vmla.s16 q4, q1, r12 // ......................*.....|........................*..... - // vsub.u16 q1, q0, q4 // ..........................*.|............................*. - // vadd.u16 q0, q0, q4 // ........................*...|..........................*... - // vmul.s16 q4, q3, r6 // ............*...............|..............*............... - // vqrdmulh.s16 q3, q3, r7 // ..........*.................|............*................. - // vmla.s16 q4, q3, r12 // ..............*.............|................*............. - // vsub.u16 q3, q2, q4 // ...................*........|.....................*........ - // vadd.u16 q2, q2, q4 // ................*...........|..................*........... 
- // vstrw.u32 q0, [r0], #64 // .........................*..|...........................*.. - // vstrw.u32 q1, [r0, #(-64+16)] // ...........................*|.............................* - // vstrw.u32 q2, [r0, #(-64+32)] // ..................*.........|....................*......... - // vstrw.u32 q3, [r0, #(-64+48)] // .......................*....|.........................*.... - - le lr, layer45_loop - vqrdmulh.s16 q5, q1, r4 // ..*..................... - ldrd r1, r5, [r11, #-16] // ...........*............ - vmul.s16 q1, q1, r9 // *....................... - ldrd r3, r9, [r11, #-8] // .....*.................. - vmla.s16 q1, q5, r12 // ...*.................... - vldrw.u32 q3, [r0, #16] // ......*................. - vsub.u16 q7, q3, q1 // .......*................ - vmul.s16 q5, q7, r3 // ..........*............. - vadd.u16 q6, q3, q1 // .........*.............. - vqrdmulh.s16 q3, q7, r9 // ........*............... - vadd.u16 q7, q4, q0 // ....*................... - vmla.s16 q5, q3, r12 // ............*........... - vsub.u16 q3, q4, q0 // .*...................... - vqrdmulh.s16 q4, q6, r5 // .................*...... - vsub.u16 q1, q3, q5 // ................*....... - vmul.s16 q6, q6, r1 // ..............*......... - vadd.u16 q5, q3, q5 // .............*.......... - vstrw.u32 q1, [r0, #48] // ...................*.... - vmla.s16 q6, q4, r12 // ..................*..... - vstrw.u32 q5, [r0, #32] // ...............*........ - vadd.u16 q4, q7, q6 // ....................*... - vstrw.u32 q4, [r0] , #64 // .....................*.. - vsub.u16 q7, q7, q6 // ......................*. - vstrw.u32 q7, [r0, #-48] // .......................* - - // original source code - // vmul.s16 q7, q1, r9 // ..*..................... - // vsub.u16 q6, q4, q0 // ............*........... - // vqrdmulh.s16 q3, q1, r4 // *....................... - // vmla.s16 q7, q3, r12 // ....*................... - // vadd.u16 q3, q4, q0 // ..........*............. 
- // ldrd r6, r5, [r11, #-8] // ...*.................... - // vldrw.u32 q5, [r0, #16] // .....*.................. - // vsub.u16 q2, q5, q7 // ......*................. - // vqrdmulh.s16 q4, q2, r5 // .........*.............. - // vadd.u16 q1, q5, q7 // ........*............... - // vmul.s16 q7, q2, r6 // .......*................ - // ldrd r7, r2, [r11, #-16] // .*...................... - // vmla.s16 q7, q4, r12 // ...........*............ - // vadd.u16 q2, q6, q7 // ................*....... - // vmul.s16 q5, q1, r7 // ...............*........ - // vstrw.u32 q2, [r0, #32] // ...................*.... - // vsub.u16 q6, q6, q7 // ..............*......... - // vqrdmulh.s16 q7, q1, r2 // .............*.......... - // vmla.s16 q5, q7, r12 // ..................*..... - // vstrw.u32 q6, [r0, #48] // .................*...... - // vadd.u16 q2, q3, q5 // ....................*... - // vstrw.u32 q2, [r0] , #64 // .....................*.. - // vsub.u16 q7, q3, q5 // ......................*. - // vstrw.u32 q7, [r0, #-48] // .......................* - - - sub in, in, #(4*128) - - /* Layers 6,7 */ - - .unreq root0 - .unreq root0_twisted - .unreq root1 - .unreq root1_twisted - .unreq root2 - .unreq root2_twisted - - root0 .req q5 - root0_twisted .req q6 - root1 .req q5 - root1_twisted .req q6 - root2 .req q5 - root2_twisted .req q6 - - mov lr, #8 - vld40.u32 {q0,q1,q2,q3}, [r0] // * - - // original source code - // vld40.u32 {q0,q1,q2,q3}, [r0] // * - - sub lr, lr, #1 -.p2align 2 -layer67_loop: - vld41.u32 {q0,q1,q2,q3}, [r0] // .*................................ - // gap // .................................. - vld42.u32 {q0,q1,q2,q3}, [r0] // ..*............................... - // gap // .................................. - vld43.u32 {q0,q1,q2,q3}, [r0]! // ...*.............................. - // gap // .................................. - vldrh.u16 q5, [r11, #16] // .....*............................ - vqrdmulh.s16 q4, q2, q5 // .......*.......................... 
- vldrh.u16 q7, [r11] , #96 // ....*............................. - vmul.s16 q6, q3, q7 // ...........*...................... - // gap // .................................. - vqrdmulh.s16 q3, q3, q5 // ............*..................... - vldrh.u16 q5, [r11, #-64] // ................*................. - vmla.s16 q6, q3, r12 // .............*.................... - vldrh.u16 q3, [r11, #-16] // ........................*......... - vmul.s16 q7, q2, q7 // ......*........................... - vadd.u16 q2, q1, q6 // ...............*.................. - vmla.s16 q7, q4, r12 // ........*......................... - vsub.u16 q1, q1, q6 // ..............*................... - vmul.s16 q5, q2, q5 // ..................*............... - vadd.u16 q4, q0, q7 // ..........*....................... - vldrh.u16 q6, [r11, #-48] // .................*................ - vqrdmulh.s16 q6, q2, q6 // ...................*.............. - vsub.u16 q7, q0, q7 // .........*........................ - vmla.s16 q5, q6, r12 // ....................*............. - vldrh.u16 q0, [r11, #-32] // .......................*.......... - vadd.u16 q6, q4, q5 // ......................*........... - vstrw.u32 q6, [r0, #-64] // ..............................*... - vmul.s16 q6, q1, q0 // .........................*........ - vsub.u16 q4, q4, q5 // .....................*............ - vqrdmulh.s16 q5, q1, q3 // ..........................*....... - vld40.u32 {q0,q1,q2,q3}, [r0] // e................................. - vmla.s16 q6, q5, r12 // ...........................*...... - vstrw.u32 q4, [r0, #-48] // ...............................*.. - vadd.u16 q4, q7, q6 // .............................*.... - vstrw.u32 q4, [r0, #-32] // ................................*. - vsub.u16 q6, q7, q6 // ............................*..... - vstrw.u32 q6, [r0, #-16] // .................................* - // gap // .................................. - // gap // .................................. 
- - // original source code - // vld40.u32 {q0, q1, q2, q3}, [r0] // e......|..........................e...... - // vld41.u32 {q0, q1, q2, q3}, [r0] // .......*................................. - // vld42.u32 {q0, q1, q2, q3}, [r0] // .......|*................................ - // vld43.u32 {q0, q1, q2, q3}, [r0]! // .......|.*............................... - // vldrh.u16 q5, [r11], #+96 // .......|....*............................ - // vldrh.u16 q6, [r11, #(+16-96)] // .......|..*.............................. - // vmul.s16 q4, q2, q5 // .......|..........*...................... - // vqrdmulh.s16 q2, q2, q6 // .......|...*............................. - // vmla.s16 q4, q2, r12 // .......|............*.................... - // vsub.u16 q2, q0, q4 // .......|..................*.............. - // vadd.u16 q0, q0, q4 // .......|...............*................. - // vmul.s16 q4, q3, q5 // .......|.....*........................... - // vqrdmulh.s16 q3, q3, q6 // .......|......*.......................... - // vmla.s16 q4, q3, r12 // .......|........*........................ - // vsub.u16 q3, q1, q4 // .......|.............*................... - // vadd.u16 q1, q1, q4 // .......|...........*..................... - // vldrh.u16 q5, [r11, #(32 - 96)] // .......|.......*......................... - // vldrh.u16 q6, [r11, #(48 - 96)] // .......|................*................ - // vmul.s16 q4, q1, q5 // .......|..............*.................. - // vqrdmulh.s16 q1, q1, q6 // .......|.................*............... - // vmla.s16 q4, q1, r12 // .......|...................*............. - // vsub.u16 q1, q0, q4 // .......|........................*........ - // vadd.u16 q0, q0, q4 // .......|.....................*........... - // vldrh.u16 q5, [r11, #(64-96)] // .......|....................*............ - // vldrh.u16 q6, [r11, #(80-96)] // .......|.........*....................... - // vmul.s16 q4, q3, q5 // .......|.......................*......... 
- // vqrdmulh.s16 q3, q3, q6 // .......|.........................*....... - // vmla.s16 q4, q3, r12 // .*.....|...........................*..... - // vsub.u16 q3, q2, q4 // .....*.|...............................*. - // vadd.u16 q2, q2, q4 // ...*...|.............................*... - // vstrw.32 q0, [r0, #( 0 - 64)] // .......|......................*.......... - // vstrw.32 q1, [r0, #(16 - 64)] // ..*....|............................*.... - // vstrw.32 q2, [r0, #(32 - 64)] // ....*..|..............................*.. - // vstrw.32 q3, [r0, #(48 - 64)] // ......*|................................* - - le lr, layer67_loop - vld41.u32 {q0,q1,q2,q3}, [r0] // *................................ - // gap // ................................. - vld42.u32 {q0,q1,q2,q3}, [r0] // .*............................... - // gap // ................................. - vld43.u32 {q0,q1,q2,q3}, [r0]! // ..*.............................. - // gap // ................................. - vldrh.u16 q6, [r11] , #96 // .....*........................... - vmul.s16 q4, q2, q6 // ...........*..................... - vldrh.u16 q5, [r11, #-80] // ...*............................. - vqrdmulh.s16 q2, q2, q5 // ....*............................ - vldrh.u16 q7, [r11, #-32] // .....................*........... - vmla.s16 q4, q2, r12 // .............*................... - // gap // ................................. - vmul.s16 q2, q3, q6 // ......*.......................... - vadd.u16 q6, q0, q4 // ................*................ - vqrdmulh.s16 q5, q3, q5 // .......*......................... - vsub.u16 q0, q0, q4 // ...................*............. - vmla.s16 q2, q5, r12 // .........*....................... - vldrh.u16 q4, [r11, #-16] // ..........*...................... - vsub.u16 q5, q1, q2 // ..............*.................. - vqrdmulh.s16 q4, q5, q4 // ..........................*...... - vadd.u16 q3, q1, q2 // ............*.................... 
- vmul.s16 q7, q5, q7 // ........................*........ - vldrh.u16 q2, [r11, #-64] // ........*........................ - vmla.s16 q7, q4, r12 // ...........................*..... - vldrh.u16 q5, [r11, #-48] // .................*............... - vsub.u16 q4, q0, q7 // ...............................*. - vmul.s16 q2, q3, q2 // ...............*................. - vadd.u16 q7, q0, q7 // .............................*... - vqrdmulh.s16 q0, q3, q5 // ..................*.............. - vstrw.u32 q4, [r0, #-16] // ................................* - vmla.s16 q2, q0, r12 // ....................*............ - vstrw.u32 q7, [r0, #-32] // ..............................*.. - vadd.u16 q4, q6, q2 // ......................*.......... - vstrw.u32 q4, [r0, #-64] // .......................*......... - vsub.u16 q4, q6, q2 // .........................*....... - vstrw.u32 q4, [r0, #-48] // ............................*.... - - // original source code - // vld41.u32 {q0,q1,q2,q3}, [r0] // *................................ - // vld42.u32 {q0,q1,q2,q3}, [r0] // .*............................... - // vld43.u32 {q0,q1,q2,q3}, [r0]! // ..*.............................. - // vldrh.u16 q5, [r11, #16] // .....*........................... - // vqrdmulh.s16 q4, q2, q5 // ......*.......................... - // vldrh.u16 q7, [r11] , #96 // ...*............................. - // vmul.s16 q6, q3, q7 // .........*....................... - // vqrdmulh.s16 q3, q3, q5 // ...........*..................... - // vldrh.u16 q5, [r11, #-64] // ...................*............. - // vmla.s16 q6, q3, r12 // .............*................... - // vldrh.u16 q3, [r11, #-16] // ..............*.................. - // vmul.s16 q7, q2, q7 // ....*............................ - // vadd.u16 q2, q1, q6 // .................*............... - // vmla.s16 q7, q4, r12 // ........*........................ - // vsub.u16 q1, q1, q6 // ...............*................. 
- // vmul.s16 q5, q2, q5 // .......................*......... - // vadd.u16 q4, q0, q7 // ..........*...................... - // vldrh.u16 q6, [r11, #-48] // .....................*........... - // vqrdmulh.s16 q6, q2, q6 // .........................*....... - // vsub.u16 q7, q0, q7 // ............*.................... - // vmla.s16 q5, q6, r12 // ...........................*..... - // vldrh.u16 q0, [r11, #-32] // .......*......................... - // vadd.u16 q6, q4, q5 // .............................*... - // vstrw.u32 q6, [r0, #-64] // ..............................*.. - // vmul.s16 q6, q1, q0 // ..................*.............. - // vsub.u16 q4, q4, q5 // ...............................*. - // vqrdmulh.s16 q5, q1, q3 // ................*................ - // vmla.s16 q6, q5, r12 // ....................*............ - // vstrw.u32 q4, [r0, #-48] // ................................* - // vadd.u16 q4, q7, q6 // ........................*........ - // vstrw.u32 q4, [r0, #-32] // ............................*.... - // vsub.u16 q6, q7, q6 // ......................*.......... - // vstrw.u32 q6, [r0, #-16] // ..........................*...... 
- - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85.s deleted file mode 100644 index aace032..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85.s +++ /dev/null @@ -1,545 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.data -roots: -#include "ntt_kyber_1_23_45_67_twiddles.s" -.text - -// Barrett multiplication -.macro mulmod dst, src, const, const_twisted - vmul.s16 \dst, \src, \const - vqrdmulh.s16 \src, \src, \const_twisted - vmla.s16 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u16 \b, \a, tmp - vadd.u16 \a, \a, tmp -.endm - -.macro load_first_root root0, root0_twisted - ldrd root0, root0_twisted, [root_ptr], #+8 -.endm - -.macro load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - ldrd root0, root0_twisted, [root_ptr], #+24 - ldrd root1, root1_twisted, [root_ptr, #(-16)] - ldrd root2, root2_twisted, [root_ptr, #(-8)] -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85, %function -.global ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85 -ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, -3329 - movw modulus, #:lower16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, #(4*64) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 1 */ - - load_first_root root0, root0_twisted - - mov lr, #16 - vldrw.u32 q4, [r1] // *. - vqrdmulh.s16 q2, q4, r3 // .* - - // original source code - // vldrw.u32 q4, [r1] // *. - // vqrdmulh.s16 q2, q4, r3 // .* - - sub lr, lr, #1 -.p2align 2 -layer1_loop: - vmul.s16 q6, q4, r2 // ..*...... - vldrw.u32 q7, [r0] // *........ - vmla.s16 q6, q2, r12 // ....*.... - vldrw.u32 q4, [r1, #16] // .e....... - vsub.u16 q1, q7, q6 // .....*... - vstrw.u32 q1, [r1] , #16 // ........* - vadd.u16 q1, q7, q6 // ......*.. - vqrdmulh.s16 q2, q4, r3 // ...e..... 
- vstrw.u32 q1, [r0] , #16 // .......*. - - // original source code - // vldrw.u32 q0, [r0] // ......|*....... - // vldrw.u32 q1, [r1] // e.....|..e..... - // vmul.s16 q4, q1, r2 // ......*........ - // vqrdmulh.s16 q1, q1, r3 // ....e.|......e. - // vmla.s16 q4, q1, r12 // ......|.*...... - // vsub.u16 q1, q0, q4 // .*....|...*.... - // vadd.u16 q0, q0, q4 // ...*..|.....*.. - // vstrw.u32 q0, [r0], #16 // .....*|.......* - // vstrw.u32 q1, [r1], #16 // ..*...|....*... - - le lr, layer1_loop - vmul.s16 q3, q4, r2 // *...... - vldrw.u32 q0, [r0] // .*..... - vmla.s16 q3, q2, r12 // ..*.... - // gap // ....... - vsub.u16 q6, q0, q3 // ...*... - vstrw.u32 q6, [r1] , #16 // ....*.. - vadd.u16 q0, q0, q3 // .....*. - vstrw.u32 q0, [r0] , #16 // ......* - - // original source code - // vmul.s16 q6, q4, r2 // *...... - // vldrw.u32 q7, [r0] // .*..... - // vmla.s16 q6, q2, r12 // ..*.... - // vsub.u16 q1, q7, q6 // ...*... - // vstrw.u32 q1, [r1] , #16 // ....*.. - // vadd.u16 q1, q7, q6 // .....*. - // vstrw.u32 q1, [r0] , #16 // ......* - - .unreq in_high - .unreq in_low - - in .req r0 - sub in, in, #(4*64) - - /* Layers 2,3 */ - - count .req r1 - mov count, #2 - -out_start: - load_next_roots root0, root0_twisted, root1, root1_twisted, root2, root2_twisted - - mov lr, #4 - vldrw.u32 q7, [r0, #128] // *. - vmul.s16 q6, q7, r2 // .* - - // original source code - // vldrw.u32 q7, [r0, #128] // *. - // vmul.s16 q6, q7, r2 // .* - - sub lr, lr, #1 -.p2align 2 -layer23_loop: - vqrdmulh.s16 q3, q7, r3 // .....*...................... - vldrw.u32 q1, [r0, #192] // ...*........................ - vqrdmulh.s16 q2, q1, r3 // ..........*................. - vldrw.u32 q7, [r0, #144] // ..e......................... - vmul.s16 q5, q1, r2 // .........*.................. - vldrw.u32 q1, [r0, #64] // .*.......................... - vmla.s16 q5, q2, r12 // ...........*................ - vldrw.u32 q2, [r0] // *........................... - vmla.s16 q6, q3, r12 // ......*..................... 
- vadd.u16 q3, q1, q5 // .............*.............. - vmul.s16 q4, q3, r4 // ..............*............. - vsub.u16 q0, q2, q6 // .......*.................... - vqrdmulh.s16 q3, q3, r5 // ...............*............ - vadd.u16 q6, q2, q6 // ........*................... - vmla.s16 q4, q3, r12 // ................*........... - vsub.u16 q5, q1, q5 // ............*............... - vqrdmulh.s16 q1, q5, r7 // ....................*....... - vsub.u16 q2, q6, q4 // .................*.......... - vmul.s16 q5, q5, r6 // ...................*........ - vstrw.u32 q2, [r0, #64] // .........................*.. - vmla.s16 q5, q1, r12 // .....................*...... - vadd.u16 q2, q6, q4 // ..................*......... - vmul.s16 q6, q7, r2 // ....e....................... - vstrw.u32 q2, [r0] , #16 // ........................*... - vsub.u16 q2, q0, q5 // ......................*..... - vstrw.u32 q2, [r0, #176] // ...........................* - vadd.u16 q5, q0, q5 // .......................*.... - vstrw.u32 q5, [r0, #112] // ..........................*. - - // original source code - // vldrw.u32 q0, [r0] // ....*....................|......*.................... - // vldrw.u32 q1, [r0, #(4*1*16)] // ..*......................|....*...................... - // vldrw.u32 q2, [r0, #(4*2*16)] // e........................|..e........................ - // vldrw.u32 q3, [r0, #(4*3*16)] // .........................|*.......................... - // vmul.s16 q4, q2, r2 // ...................e.....|.....................e..... - // vqrdmulh.s16 q2, q2, r3 // .........................*........................... - // vmla.s16 q4, q2, r12 // .....*...................|.......*................... - // vsub.u16 q2, q0, q4 // ........*................|..........*................ - // vadd.u16 q0, q0, q4 // ..........*..............|............*.............. - // vmul.s16 q4, q3, r2 // .*.......................|...*....................... 
- // vqrdmulh.s16 q3, q3, r3 // .........................|.*......................... - // vmla.s16 q4, q3, r12 // ...*.....................|.....*..................... - // vsub.u16 q3, q1, q4 // ............*............|..............*............ - // vadd.u16 q1, q1, q4 // ......*..................|........*.................. - // vmul.s16 q4, q1, r4 // .......*.................|.........*................. - // vqrdmulh.s16 q1, q1, r5 // .........*...............|...........*............... - // vmla.s16 q4, q1, r12 // ...........*.............|.............*............. - // vsub.u16 q1, q0, q4 // ..............*..........|................*.......... - // vadd.u16 q0, q0, q4 // ..................*......|....................*...... - // vmul.s16 q4, q3, r6 // ...............*.........|.................*......... - // vqrdmulh.s16 q3, q3, r7 // .............*...........|...............*........... - // vmla.s16 q4, q3, r12 // .................*.......|...................*....... - // vsub.u16 q3, q2, q4 // .....................*...|.......................*... - // vadd.u16 q2, q2, q4 // .......................*.|.........................*. - // vstrw.u32 q0, [r0], #16 // ....................*....|......................*.... - // vstrw.u32 q1, [r0, #(4*1*16 - 16)] // ................*........|..................*........ - // vstrw.u32 q2, [r0, #(4*2*16 - 16)] // ........................*|..........................* - // vstrw.u32 q3, [r0, #(4*3*16 - 16)] // ......................*..|........................*.. - - le lr, layer23_loop - vqrdmulh.s16 q1, q7, r3 // *......................... - vldrw.u32 q5, [r0, #192] // .*........................ - vmul.s16 q3, q5, r2 // ...*...................... - vldrw.u32 q0, [r0] // ......*................... - vqrdmulh.s16 q4, q5, r3 // ..*....................... - vldrw.u32 q2, [r0, #64] // ....*..................... - vmla.s16 q3, q4, r12 // .....*.................... - // gap // .......................... 
- vmla.s16 q6, q1, r12 // .......*.................. - vsub.u16 q4, q2, q3 // ..............*........... - vqrdmulh.s16 q5, q4, r7 // ...............*.......... - vadd.u16 q7, q2, q3 // ........*................. - vmul.s16 q3, q4, r6 // .................*........ - vsub.u16 q2, q0, q6 // ..........*............... - vmla.s16 q3, q5, r12 // ...................*...... - vadd.u16 q1, q0, q6 // ............*............. - vqrdmulh.s16 q4, q7, r5 // ...........*.............. - vadd.u16 q5, q2, q3 // ........................*. - vstrw.u32 q5, [r0, #128] // .........................* - vmul.s16 q5, q7, r4 // .........*................ - vsub.u16 q3, q2, q3 // ......................*... - vmla.s16 q5, q4, r12 // .............*............ - vstrw.u32 q3, [r0, #192] // .......................*.. - vsub.u16 q0, q1, q5 // ................*......... - vstrw.u32 q0, [r0, #64] // ..................*....... - vadd.u16 q2, q1, q5 // ....................*..... - vstrw.u32 q2, [r0] , #16 // .....................*.... - - // original source code - // vqrdmulh.s16 q3, q7, r3 // *......................... - // vldrw.u32 q1, [r0, #192] // .*........................ - // vqrdmulh.s16 q2, q1, r3 // ....*..................... - // vmul.s16 q5, q1, r2 // ..*....................... - // vldrw.u32 q1, [r0, #64] // .....*.................... - // vmla.s16 q5, q2, r12 // ......*................... - // vldrw.u32 q2, [r0] // ...*...................... - // vmla.s16 q6, q3, r12 // .......*.................. - // vadd.u16 q3, q1, q5 // ..........*............... - // vmul.s16 q4, q3, r4 // ..................*....... - // vsub.u16 q0, q2, q6 // ............*............. - // vqrdmulh.s16 q3, q3, r5 // ...............*.......... - // vadd.u16 q6, q2, q6 // ..............*........... - // vmla.s16 q4, q3, r12 // ....................*..... - // vsub.u16 q5, q1, q5 // ........*................. - // vqrdmulh.s16 q1, q5, r7 // .........*................ 
- // vsub.u16 q2, q6, q4 // ......................*... - // vmul.s16 q5, q5, r6 // ...........*.............. - // vstrw.u32 q2, [r0, #64] // .......................*.. - // vmla.s16 q5, q1, r12 // .............*............ - // vadd.u16 q2, q6, q4 // ........................*. - // vstrw.u32 q2, [r0] , #16 // .........................* - // vsub.u16 q2, q0, q5 // ...................*...... - // vstrw.u32 q2, [r0, #176] // .....................*.... - // vadd.u16 q5, q0, q5 // ................*......... - // vstrw.u32 q5, [r0, #112] // .................*........ - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - sub in, in, #(4*128) - - /* Layers 4,5 */ - - mov lr, #8 -.p2align 2 -layer45_loop: - ldrd r4, r5, [r11] , #24 // *.............................. - vldrw.u32 q5, [r0, #32] // .....*......................... - vqrdmulh.s16 q3, q5, r5 // ........*...................... - vldrw.u32 q7, [r0, #48] // ......*........................ - vqrdmulh.s16 q0, q7, r5 // .............*................. - ldrd r8, r6, [r11, #-8] // ..*............................ - vmul.s16 q4, q7, r4 // ............*.................. - vldrw.u32 q1, [r0] // ...*........................... - vmla.s16 q4, q0, r12 // ..............*................ - vldrw.u32 q2, [r0, #16] // ....*.......................... - vmul.s16 q7, q5, r4 // .......*....................... - vsub.u16 q6, q2, q4 // ...............*............... - vmul.s16 q0, q6, r8 // ......................*........ - ldrd r2, r1, [r11, #-16] // .*............................. - vmla.s16 q7, q3, r12 // .........*..................... - vadd.u16 q5, q2, q4 // ................*.............. - vqrdmulh.s16 q4, q6, r6 // .......................*....... - vadd.u16 q3, q1, q7 // ...........*................... - vmla.s16 q0, q4, r12 // ........................*...... - vsub.u16 q7, q1, q7 // ..........*.................... - vqrdmulh.s16 q1, q5, r1 // ..................*............ 
- vadd.u16 q2, q7, q0 // ..........................*.... - vstrw.u32 q2, [r0, #32] // .............................*. - vmul.s16 q2, q5, r2 // .................*............. - vsub.u16 q5, q7, q0 // .........................*..... - vmla.s16 q2, q1, r12 // ...................*........... - vstrw.u32 q5, [r0, #48] // ..............................* - vsub.u16 q5, q3, q2 // ....................*.......... - vstrw.u32 q5, [r0, #16] // ............................*.. - vadd.u16 q6, q3, q2 // .....................*......... - vstrw.u32 q6, [r0] , #64 // ...........................*... - - // original source code - // ldrd r2, r3, [r11], #+24 // *.............................. - // ldrd r4, r5, [r11, #(-16)] // .............*................. - // ldrd r6, r7, [r11, #(-8)] // .....*......................... - // vldrw.u32 q0, [r0] // .......*....................... - // vldrw.u32 q1, [r0, #16] // .........*..................... - // vldrw.u32 q2, [r0, #32] // .*............................. - // vldrw.u32 q3, [r0, #48] // ...*........................... - // vmul.s16 q4, q2, r2 // ..........*.................... - // vqrdmulh.s16 q2, q2, r3 // ..*............................ - // vmla.s16 q4, q2, r12 // ..............*................ - // vsub.u16 q2, q0, q4 // ...................*........... - // vadd.u16 q0, q0, q4 // .................*............. - // vmul.s16 q4, q3, r2 // ......*........................ - // vqrdmulh.s16 q3, q3, r3 // ....*.......................... - // vmla.s16 q4, q3, r12 // ........*...................... - // vsub.u16 q3, q1, q4 // ...........*................... - // vadd.u16 q1, q1, q4 // ...............*............... - // vmul.s16 q4, q1, r4 // .......................*....... - // vqrdmulh.s16 q1, q1, r5 // ....................*.......... - // vmla.s16 q4, q1, r12 // .........................*..... - // vsub.u16 q1, q0, q4 // ...........................*... - // vadd.u16 q0, q0, q4 // .............................*. 
- // vmul.s16 q4, q3, r6 // ............*.................. - // vqrdmulh.s16 q3, q3, r7 // ................*.............. - // vmla.s16 q4, q3, r12 // ..................*............ - // vsub.u16 q3, q2, q4 // ........................*...... - // vadd.u16 q2, q2, q4 // .....................*......... - // vstrw.u32 q0, [r0], #64 // ..............................* - // vstrw.u32 q1, [r0, #(-64+16)] // ............................*.. - // vstrw.u32 q2, [r0, #(-64+32)] // ......................*........ - // vstrw.u32 q3, [r0, #(-64+48)] // ..........................*.... - - le lr, layer45_loop - - sub in, in, #(4*128) - - /* Layers 6,7 */ - - .unreq root0 - .unreq root0_twisted - .unreq root1 - .unreq root1_twisted - .unreq root2 - .unreq root2_twisted - - root0 .req q5 - root0_twisted .req q6 - root1 .req q5 - root1_twisted .req q6 - root2 .req q5 - root2_twisted .req q6 - - mov lr, #8 - vld40.u32 {q1,q2,q3,q4}, [r0] // *..... - // gap // ...... - vld41.u32 {q1,q2,q3,q4}, [r0] // .*.... - // gap // ...... - vld42.u32 {q1,q2,q3,q4}, [r0] // ..*... - // gap // ...... - vld43.u32 {q1,q2,q3,q4}, [r0]! // ...*.. - // gap // ...... - vldrh.u16 q7, [r11] , #96 // ....*. - // gap // ...... - vmul.s16 q5, q3, q7 // .....* - - // original source code - // vld40.u32 {q1,q2,q3,q4}, [r0] // *..... - // vld41.u32 {q1,q2,q3,q4}, [r0] // .*.... - // vld42.u32 {q1,q2,q3,q4}, [r0] // ..*... - // vld43.u32 {q1,q2,q3,q4}, [r0]! // ...*.. - // vldrh.u16 q7, [r11] , #96 // ....*. - // vmul.s16 q5, q3, q7 // .....* - - sub lr, lr, #1 -.p2align 2 -layer67_loop: - vmul.s16 q7, q4, q7 // ...........*...................... - vldrh.u16 q0, [r11, #-80] // .....*............................ - vqrdmulh.s16 q4, q4, q0 // ............*..................... - vldrh.u16 q6, [r11, #-64] // ................*................. - vmla.s16 q7, q4, r12 // .............*.................... - vldrh.u16 q4, [r11, #-48] // .................*................ 
- vqrdmulh.s16 q3, q3, q0 // .......*.......................... - vadd.u16 q0, q2, q7 // ...............*.................. - vmla.s16 q5, q3, r12 // ........*......................... - vsub.u16 q7, q2, q7 // ..............*................... - vqrdmulh.s16 q3, q0, q4 // ...................*.............. - vadd.u16 q4, q1, q5 // ..........*....................... - vmul.s16 q0, q0, q6 // ..................*............... - vsub.u16 q6, q1, q5 // .........*........................ - vmla.s16 q0, q3, r12 // ....................*............. - vldrh.u16 q5, [r11, #-32] // .......................*.......... - vsub.u16 q2, q4, q0 // .....................*............ - vstrw.u32 q2, [r0, #-48] // ...............................*.. - vadd.u16 q0, q4, q0 // ......................*........... - vld40.u32 {q1,q2,q3,q4}, [r0] // e................................. - vstrw.u32 q0, [r0, #-64] // ..............................*... - vmul.s16 q5, q7, q5 // .........................*........ - vldrh.u16 q0, [r11, #-16] // ........................*......... - vqrdmulh.s16 q0, q7, q0 // ..........................*....... - vld41.u32 {q1,q2,q3,q4}, [r0] // .e................................ - vmla.s16 q5, q0, r12 // ...........................*...... - vld42.u32 {q1,q2,q3,q4}, [r0] // ..e............................... - vsub.u16 q0, q6, q5 // ............................*..... - vld43.u32 {q1,q2,q3,q4}, [r0]! // ...e.............................. - vadd.u16 q6, q6, q5 // .............................*.... - vstrw.u32 q0, [r0, #-80] // .................................* - vldrh.u16 q7, [r11] , #96 // ....e............................. - vmul.s16 q5, q3, q7 // ......e........................... - vstrw.u32 q6, [r0, #-96] // ................................*. - - // original source code - // vld40.u32 {q0, q1, q2, q3}, [r0] // e..............|..................e.............. - // vld41.u32 {q0, q1, q2, q3}, [r0] // .....e.........|.......................e......... 
- // vld42.u32 {q0, q1, q2, q3}, [r0] // .......e.......|.........................e....... - // vld43.u32 {q0, q1, q2, q3}, [r0]! // .........e.....|...........................e..... - // vldrh.u16 q5, [r11], #+96 // ............e..|..............................e.. - // vldrh.u16 q6, [r11, #(+16-96)] // ...............|*................................ - // vmul.s16 q4, q2, q5 // .............e.|...............................e. - // vqrdmulh.s16 q2, q2, q6 // ...............|.....*........................... - // vmla.s16 q4, q2, r12 // ...............|.......*......................... - // vsub.u16 q2, q0, q4 // ...............|............*.................... - // vadd.u16 q0, q0, q4 // ...............|..........*...................... - // vmul.s16 q4, q3, q5 // ...............*................................. - // vqrdmulh.s16 q3, q3, q6 // ...............|.*............................... - // vmla.s16 q4, q3, r12 // ...............|...*............................. - // vsub.u16 q3, q1, q4 // ...............|........*........................ - // vadd.u16 q1, q1, q4 // ...............|......*.......................... - // vldrh.u16 q5, [r11, #(32 - 96)] // ...............|..*.............................. - // vldrh.u16 q6, [r11, #(48 - 96)] // ...............|....*............................ - // vmul.s16 q4, q1, q5 // ...............|...........*..................... - // vqrdmulh.s16 q1, q1, q6 // ...............|.........*....................... - // vmla.s16 q4, q1, r12 // ...............|.............*................... - // vsub.u16 q1, q0, q4 // ...............|...............*................. - // vadd.u16 q0, q0, q4 // ...............|.................*............... - // vldrh.u16 q5, [r11, #(64-96)] // ...............|..............*.................. - // vldrh.u16 q6, [r11, #(80-96)] // ...*...........|.....................*........... - // vmul.s16 q4, q3, q5 // ..*............|....................*............ 
- // vqrdmulh.s16 q3, q3, q6 // ....*..........|......................*.......... - // vmla.s16 q4, q3, r12 // ......*........|........................*........ - // vsub.u16 q3, q2, q4 // ........*......|..........................*...... - // vadd.u16 q2, q2, q4 // ..........*....|............................*.... - // vstrw.32 q0, [r0, #( 0 - 64)] // .*.............|...................*............. - // vstrw.32 q1, [r0, #(16 - 64)] // ...............|................*................ - // vstrw.32 q2, [r0, #(32 - 64)] // ..............*|................................* - // vstrw.32 q3, [r0, #(48 - 64)] // ...........*...|.............................*... - - le lr, layer67_loop - vmul.s16 q6, q4, q7 // *........................... - vldrh.u16 q7, [r11, #-80] // .*.......................... - vqrdmulh.s16 q4, q4, q7 // ..*......................... - vldrh.u16 q0, [r11, #-48] // .....*...................... - vmla.s16 q6, q4, r12 // ....*....................... - vldrh.u16 q4, [r11, #-64] // ...*........................ - vqrdmulh.s16 q7, q3, q7 // ......*..................... - vsub.u16 q3, q2, q6 // .........*.................. - vmla.s16 q5, q7, r12 // ........*................... - vadd.u16 q2, q2, q6 // .......*.................... - vmul.s16 q4, q2, q4 // ............*............... - vadd.u16 q7, q1, q5 // ...........*................ - vqrdmulh.s16 q2, q2, q0 // ..........*................. - vsub.u16 q6, q1, q5 // .............*.............. - vmla.s16 q4, q2, r12 // ..............*............. - vldrh.u16 q2, [r11, #-32] // ...............*............ - vadd.u16 q1, q7, q4 // ..................*......... - vstrw.u32 q1, [r0, #-64] // ...................*........ - vmul.s16 q2, q3, q2 // ....................*....... - vldrh.u16 q1, [r11, #-16] // .....................*...... - vqrdmulh.s16 q1, q3, q1 // ......................*..... - vsub.u16 q4, q7, q4 // ................*........... - vmla.s16 q2, q1, r12 // .......................*.... 
- vstrw.u32 q4, [r0, #-48] // .................*.......... - vsub.u16 q1, q6, q2 // ........................*... - vstrw.u32 q1, [r0, #-16] // ..........................*. - vadd.u16 q1, q6, q2 // .........................*.. - vstrw.u32 q1, [r0, #-32] // ...........................* - - // original source code - // vmul.s16 q7, q4, q7 // *........................... - // vldrh.u16 q0, [r11, #-80] // .*.......................... - // vqrdmulh.s16 q4, q4, q0 // ..*......................... - // vldrh.u16 q6, [r11, #-64] // .....*...................... - // vmla.s16 q7, q4, r12 // ....*....................... - // vldrh.u16 q4, [r11, #-48] // ...*........................ - // vqrdmulh.s16 q3, q3, q0 // ......*..................... - // vadd.u16 q0, q2, q7 // .........*.................. - // vmla.s16 q5, q3, r12 // ........*................... - // vsub.u16 q7, q2, q7 // .......*.................... - // vqrdmulh.s16 q3, q0, q4 // ............*............... - // vadd.u16 q4, q1, q5 // ...........*................ - // vmul.s16 q0, q0, q6 // ..........*................. - // vsub.u16 q6, q1, q5 // .............*.............. - // vmla.s16 q0, q3, r12 // ..............*............. - // vldrh.u16 q5, [r11, #-32] // ...............*............ - // vsub.u16 q2, q4, q0 // .....................*...... - // vstrw.u32 q2, [r0, #-48] // .......................*.... - // vadd.u16 q0, q4, q0 // ................*........... - // vstrw.u32 q0, [r0, #-64] // .................*.......... - // vmul.s16 q5, q7, q5 // ..................*......... - // vldrh.u16 q0, [r11, #-16] // ...................*........ - // vqrdmulh.s16 q0, q7, q0 // ....................*....... - // vmla.s16 q5, q0, r12 // ......................*..... - // vsub.u16 q0, q6, q5 // ........................*... - // vadd.u16 q6, q6, q5 // ..........................*. - // vstrw.u32 q0, [r0, #-16] // .........................*.. 
- // vstrw.u32 q6, [r0, #-32] // ...........................* - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_twiddles.s b/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_twiddles.s deleted file mode 100644 index 9cfc2ee..0000000 --- a/tests/ntt_kyber/manual/ntt_kyber_1_23_45_67_twiddles.s +++ /dev/null @@ -1,473 +0,0 @@ - -/// -/// Copyright (c) 2022 Arm Limited -/// Copyright (c) 2022 Hanno Becker -/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.word -1600 -.word -15749 -.word -749 -.word -7373 -.word -687 -.word -6762 -.word 630 -.word 6201 -.word -40 -.word -394 -.word -1432 -.word -14095 -.word 848 -.word 8347 -.word 1062 -.word 10453 -.word 296 -.word 2914 -.word -882 -.word -8682 -.word -1410 -.word -13879 -.word 1339 -.word 13180 -.word 1476 -.word 14529 -.word 193 -.word 1900 -.word -283 -.word -2786 -.word 56 -.word 551 -.word 797 -.word 7845 -.word -1089 -.word -10719 -.word 1333 -.word 13121 -.word -543 -.word -5345 -.word 1426 -.word 14036 -.word -1235 -.word -12156 -.word -69 -.word -679 -.word 535 -.word 5266 -.word -447 -.word -4400 -.word 569 -.word 5601 -.word -936 -.word -9213 -.word -450 -.word -4429 -.word -1583 -.word -15582 -.word -1355 -.word -13338 -.word 821 -.word 8081 -// Blocked layers start -.short 289 -.short 289 -.short 331 -.short 331 -.short -76 -.short -76 -.short -1573 -.short -1573 -.short 2845 -.short 2845 -.short 3258 -.short 3258 -.short -748 -.short -748 -.short -15483 -.short -15483 -.short 17 -.short 17 -.short 583 -.short 583 -.short 1637 -.short 1637 -.short -1041 -.short -1041 -.short 167 -.short 167 -.short 5739 -.short 5739 -.short 16113 -.short 16113 -.short -10247 -.short -10247 -.short -568 -.short -568 -.short -680 -.short -680 -.short 723 -.short 723 -.short 1100 -.short 1100 -.short -5591 -.short -5591 -.short -6693 -.short -6693 -.short 7117 -.short 7117 -.short 10828 -.short 10828 -.short 1197 -.short 1197 -.short -1025 -.short -1025 -.short -1052 -.short -1052 -.short -1274 -.short -1274 -.short 11782 -.short 11782 -.short -10089 -.short -10089 -.short -10355 -.short -10355 -.short -12540 -.short -12540 -.short 1409 -.short 1409 -.short -48 -.short -48 -.short 756 -.short 756 -.short -314 -.short -314 -.short 13869 -.short 13869 -.short -472 -.short -472 -.short 7441 -.short 7441 -.short -3091 -.short -3091 -.short -667 -.short -667 -.short 233 -.short 233 -.short -1173 -.short -1173 -.short -279 -.short -279 -.short -6565 -.short -6565 
-.short 2293 -.short 2293 -.short -11546 -.short -11546 -.short -2746 -.short -2746 -.short 650 -.short 650 -.short -1352 -.short -1352 -.short -816 -.short -816 -.short 632 -.short 632 -.short 6398 -.short 6398 -.short -13308 -.short -13308 -.short -8032 -.short -8032 -.short 6221 -.short 6221 -.short -1626 -.short -1626 -.short -540 -.short -540 -.short -1482 -.short -1482 -.short 1461 -.short 1461 -.short -16005 -.short -16005 -.short -5315 -.short -5315 -.short -14588 -.short -14588 -.short 14381 -.short 14381 -.short 1651 -.short 1651 -.short -1540 -.short -1540 -.short 952 -.short 952 -.short -642 -.short -642 -.short 16251 -.short 16251 -.short -15159 -.short -15159 -.short 9371 -.short 9371 -.short -6319 -.short -6319 -.short -464 -.short -464 -.short 33 -.short 33 -.short 1320 -.short 1320 -.short -1414 -.short -1414 -.short -4567 -.short -4567 -.short 325 -.short 325 -.short 12993 -.short 12993 -.short -13918 -.short -13918 -.short 939 -.short 939 -.short -892 -.short -892 -.short 733 -.short 733 -.short 268 -.short 268 -.short 9243 -.short 9243 -.short -8780 -.short -8780 -.short 7215 -.short 7215 -.short 2638 -.short 2638 -.short -1021 -.short -1021 -.short -941 -.short -941 -.short -992 -.short -992 -.short 641 -.short 641 -.short -10050 -.short -10050 -.short -9262 -.short -9262 -.short -9764 -.short -9764 -.short 6309 -.short 6309 -.short -1010 -.short -1010 -.short 1435 -.short 1435 -.short 807 -.short 807 -.short 452 -.short 452 -.short -9942 -.short -9942 -.short 14125 -.short 14125 -.short 7943 -.short 7943 -.short 4449 -.short 4449 -.short 1584 -.short 1584 -.short -1292 -.short -1292 -.short 375 -.short 375 -.short -1239 -.short -1239 -.short 15592 -.short 15592 -.short -12717 -.short -12717 -.short 3691 -.short 3691 -.short -12196 -.short -12196 -.short -1031 -.short -1031 -.short -109 -.short -109 -.short -780 -.short -780 -.short 1645 -.short 1645 -.short -10148 -.short -10148 -.short -1073 -.short -1073 -.short -7678 -.short -7678 -.short 
16192 -.short 16192 -.short 1438 -.short 1438 -.short -461 -.short -461 -.short 1534 -.short 1534 -.short -927 -.short -927 -.short 14155 -.short 14155 -.short -4538 -.short -4538 -.short 15099 -.short 15099 -.short -9125 -.short -9125 -.short 1063 -.short 1063 -.short -556 -.short -556 -.short -1230 -.short -1230 -.short -863 -.short -863 -.short 10463 -.short 10463 -.short -5473 -.short -5473 -.short -12107 -.short -12107 -.short -8495 -.short -8495 -.short 319 -.short 319 -.short 757 -.short 757 -.short 561 -.short 561 -.short -735 -.short -735 -.short 3140 -.short 3140 -.short 7451 -.short 7451 -.short 5522 -.short 5522 -.short -7235 -.short -7235 -.short -682 -.short -682 -.short -712 -.short -712 -.short 1481 -.short 1481 -.short 648 -.short 648 -.short -6713 -.short -6713 -.short -7008 -.short -7008 -.short 14578 -.short 14578 -.short 6378 -.short 6378 -.short -525 -.short -525 -.short 403 -.short 403 -.short 1143 -.short 1143 -.short -554 -.short -554 -.short -5168 -.short -5168 -.short 3967 -.short 3967 -.short 11251 -.short 11251 -.short -5453 -.short -5453 -.short 1092 -.short 1092 -.short 1026 -.short 1026 -.short -1179 -.short -1179 -.short 886 -.short 886 -.short 10749 -.short 10749 -.short 10099 -.short 10099 -.short -11605 -.short -11605 -.short 8721 -.short 8721 -.short -855 -.short -855 -.short -219 -.short -219 -.short 1227 -.short 1227 -.short 910 -.short 910 -.short -8416 -.short -8416 -.short -2156 -.short -2156 -.short 12078 -.short 12078 -.short 8957 -.short 8957 -.short -1607 -.short -1607 -.short -1455 -.short -1455 -.short -1219 -.short -1219 -.short 885 -.short 885 -.short -15818 -.short -15818 -.short -14322 -.short -14322 -.short -11999 -.short -11999 -.short 8711 -.short 8711 -.short 1212 -.short 1212 -.short 1029 -.short 1029 -.short -394 -.short -394 -.short -1175 -.short -1175 -.short 11930 -.short 11930 -.short 10129 -.short 10129 -.short -3878 -.short -3878 -.short -11566 -.short -11566 \ No newline at end of file diff --git 
a/tests/ntt_kyber/ntt_kyber.mk b/tests/ntt_kyber/ntt_kyber.mk new file mode 100644 index 0000000..14af270 --- /dev/null +++ b/tests/ntt_kyber/ntt_kyber.mk @@ -0,0 +1,17 @@ +TESTS += ntt_kyber + +NTT_KYBER_PLATFORMS += m55-an547 +NTT_KYBER_PLATFORMS += m85-an555 + +NTT_KYBER_SOURCES += main.c + +NTT_KYBER_ASM_DIR=../../asm/manual/ntt_kyber +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_opt_m55.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_opt_m85.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m55.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_vld4_opt_m85.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_vld4.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_opt_size_m55.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_opt_size_m85.s +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67.s \ No newline at end of file From 70cb440366404509d85d5bab10d0a139edb49291 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Thu, 11 Jul 2024 16:44:08 +0800 Subject: [PATCH 02/32] update slothy --- slothy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slothy b/slothy index 5552c48..8e702a7 160000 --- a/slothy +++ b/slothy @@ -1 +1 @@ -Subproject commit 5552c48e591d3af5805fece3d7eb1acbec218a6c +Subproject commit 8e702a7e5b70fb9cb7f02799f64b63d679a4bf89 From 441e13148f7e3c2727a266ebdaa10cf169719480 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Fri, 12 Jul 2024 10:36:51 +0800 Subject: [PATCH 03/32] add CI build --- .github/workflows/build.yaml | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) create mode 100644 .github/workflows/build.yaml diff --git a/.github/workflows/build.yaml b/.github/workflows/build.yaml new file mode 100644 index 0000000..1484595 --- /dev/null +++ b/.github/workflows/build.yaml @@ -0,0 +1,16 @@ +name: Build everything +on: + pull_request: + branches: [ "main" ] +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + with: + submodules: true + - name: install dependencies + run: sudo apt install gcc-arm-none-eabi + - name: Make all + run: | + make \ No newline at end of file From d22a52d0ded366a8a89a3cbf878eea26840230f3 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 12 Jul 2024 10:47:43 +0800 Subject: [PATCH 04/32] fix nproc --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index aa532c8..3aca7ac 100644 --- a/Makefile +++ b/Makefile @@ -22,7 +22,7 @@ test = $(lastword $(subst --, ,$*)) .PHONY: ${builds} ${builds}: build-%: - make -j$(nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' + make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' .PHONY: ${runs} ${runs}: run-%: From 8465efbf127fa63798ed67a7e82610c827951b38 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 12 Jul 2024 10:55:18 +0800 Subject: [PATCH 05/32] git ignore build dirs --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..954f187 --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +**/build/ +**/*.elf From 4831124180e7050497bba00d1c041583071315eb Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Fri, 12 Jul 2024 10:55:43 +0800 Subject: [PATCH 06/32] remove obsolete init scripts --- envsetup.sh | 13 ------------- init.sh | 12 ------------ 2 files changed, 25 deletions(-) delete mode 100644 envsetup.sh delete mode 100755 init.sh diff --git a/envsetup.sh b/envsetup.sh deleted file mode 100644 index 489512d..0000000 --- a/envsetup.sh +++ /dev/null @@ -1,13 +0,0 @@ -echo "Setting up Python environment" - -PATH_ROOT=$(builtin cd $(dirname $0) && pwd) - -# Path to root of specification -export PYTHONPATH=$PATH_ROOT/asm/gen:$PYTHONPATH - -if [[ -d ${PATH_ROOT}/venv ]]; then - source ${PATH_ROOT}/venv/bin/activate -else - echo "Error: ${PATH_ROOT}/venv not found - have you run ./init.sh ?" - return 1 -fi diff --git a/init.sh b/init.sh deleted file mode 100755 index 43bc16e..0000000 --- a/init.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/bin/bash - -set -e - -PATH_ROOT=$(builtin cd $(dirname $0) && pwd) -PATH_VENV=${PATH_ROOT}/venv - -if [[ ! -d ${PATH_VENV} ]]; then - python3 -mvenv ${PATH_VENV} -fi - -${PATH_VENV}/bin/pip install -r ${PATH_ROOT}/requirements.txt From 61e0a7656f33acc8a159761f8847051b445a157c Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Fri, 12 Jul 2024 11:00:01 +0800 Subject: [PATCH 07/32] add CI running tests on qemu platforms --- .github/workflows/run.yaml | 16 ++++++++++++++++ Makefile | 3 +++ 2 files changed, 19 insertions(+) create mode 100644 .github/workflows/run.yaml diff --git a/.github/workflows/run.yaml b/.github/workflows/run.yaml new file mode 100644 index 0000000..bfac3e7 --- /dev/null +++ b/.github/workflows/run.yaml @@ -0,0 +1,16 @@ +name: Run tests on qemu platforms +on: + pull_request: + branches: [ "main" ] +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + with: + submodules: true + - name: install dependencies + run: sudo apt install gcc-arm-none-eabi qemu-system + - name: Make all + run: | + make run \ No newline at end of file diff --git a/Makefile b/Makefile index 3aca7ac..d0b8b7b 100644 --- a/Makefile +++ b/Makefile @@ -28,6 +28,9 @@ ${builds}: build-%: ${runs}: run-%: make -C envs/$(platform) run SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' +.PHONY: run +run: ${runs} + .PHONY: ${cleans} ${cleans}: clean-%: make -C envs/$(platform) clean From e2bce9e6bee065ed41588f26d61f68a5d46101b5 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 12 Jul 2024 11:03:08 +0800 Subject: [PATCH 08/32] break test --- tests/ntt_dilithium/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/ntt_dilithium/main.c b/tests/ntt_dilithium/main.c index de905fe..d54f176 100644 --- a/tests/ntt_dilithium/main.c +++ b/tests/ntt_dilithium/main.c @@ -271,7 +271,7 @@ int main(void) // base ret |= test_ntt_l2222(); - if( ret != 0 ) + if( ret == 0 ) return( 1 ); ret |= test_ntt_l2222_no_trans_vld4(); if( ret != 0 ) From 59df8196bcb9280086df7c05a9f76c3ff9771bce Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Fri, 12 Jul 2024 13:09:32 +0800 Subject: [PATCH 09/32] build one elf for each test/platform combination --- Makefile | 5 +++-- envs/m55-an547/Makefile | 7 ++----- envs/m85-an555/Makefile | 7 ++----- 3 files changed, 7 insertions(+), 12 deletions(-) diff --git a/Makefile b/Makefile index d0b8b7b..fcd689d 100644 --- a/Makefile +++ b/Makefile @@ -6,6 +6,7 @@ totestname = $(shell echo $(1) | tr '[a-z]' '[A-Z]') totestsources = $(addprefix ../../tests/$(1)/,$($(call totestname,$(1))_SOURCES)) totestasm = $(addprefix ../../tests/$(1)/,$($(call totestname,$(1))_ASMS)) toplatform = $(addsuffix --$(1),$($(call totestname,$(1))_PLATFORMS)) +toelfname = $(addsuffix -test.elf,$(1)) platformtests := $(foreach test,$(TESTS), $(call toplatform,$(test))) @@ -22,11 +23,11 @@ test = $(lastword $(subst --, ,$*)) .PHONY: ${builds} ${builds}: build-%: - make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' + make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' TARGET=$(call toelfname,$(test)) .PHONY: ${runs} ${runs}: run-%: - make -C envs/$(platform) run SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' + make -C envs/$(platform) run SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' TARGET=$(call toelfname,$(test)) .PHONY: run run: ${runs} diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile index adaca36..8478183 100644 --- a/envs/m55-an547/Makefile +++ b/envs/m55-an547/Makefile @@ -3,8 +3,6 @@ CC = arm-none-eabi-gcc LD := $(CC) -TARGET=test.elf - SRC_DIR=./src BUILD_DIR=./build @@ -67,8 +65,7 @@ $(OBJECTS_ASM): $(BUILD_DIR)/%.o: % mkdir -p $(@D) $(CC) -x assembler-with-cpp $(CFLAGS) -c -o $@ $< -.PHONY: test.elf -test.elf: $(OBJECTS) $(LDSCRIPT) +$(TARGET): $(OBJECTS) $(LDSCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJECTS) .PHONY: build @@ -78,5 +75,5 @@ run: $(TARGET) qemu-system-arm -M 
mps3-an547 -nographic -semihosting -kernel $(TARGET) clean: - rm -f $(TARGET) + rm -f *.elf rm -rf $(BUILD_DIR) \ No newline at end of file diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile index 1c72a81..6f4c6cf 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -3,8 +3,6 @@ CC = arm-none-eabi-gcc LD := $(CC) -TARGET=test.elf - SRC_DIR=./src BUILD_DIR=./build @@ -67,8 +65,7 @@ $(OBJECTS_ASM): $(BUILD_DIR)/%.o: % mkdir -p $(@D) $(CC) -x assembler-with-cpp $(CFLAGS) -c -o $@ $< -.PHONY: test.elf -test.elf: $(OBJECTS) $(LDSCRIPT) +$(TARGET): $(OBJECTS) $(LDSCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJECTS) .PHONY: build @@ -78,5 +75,5 @@ run: @echo "WARNING: AN555 is not supported by qemu. Skipping" clean: - rm -f $(TARGET) + rm -f *.elf rm -rf $(BUILD_DIR) \ No newline at end of file From 56ec5995253216f51c5cb20f1d1331ead02958db Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 12 Jul 2024 13:14:01 +0800 Subject: [PATCH 10/32] CI: use more threads for make --- .github/workflows/build.yaml | 2 +- .github/workflows/run.yaml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build.yaml b/.github/workflows/build.yaml index 1484595..89e8422 100644 --- a/.github/workflows/build.yaml +++ b/.github/workflows/build.yaml @@ -13,4 +13,4 @@ jobs: run: sudo apt install gcc-arm-none-eabi - name: Make all run: | - make \ No newline at end of file + make -j$(nproc) \ No newline at end of file diff --git a/.github/workflows/run.yaml b/.github/workflows/run.yaml index bfac3e7..b49ac5a 100644 --- a/.github/workflows/run.yaml +++ b/.github/workflows/run.yaml @@ -13,4 +13,4 @@ jobs: run: sudo apt install gcc-arm-none-eabi qemu-system - name: Make all run: | - make run \ No newline at end of file + make -j$(nproc) run \ No newline at end of file From 70a9aa2290f6d3efdac045f9a04218a4f6e9efac Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Fri, 12 Jul 2024 14:02:02 +0800 Subject: [PATCH 11/32] add support to build standalone artifacts --- .github/workflows/run.yaml | 2 +- .github/workflows/standalone.yaml | 19 +++++++++++++++++++ .gitignore | 1 + Makefile | 28 ++++++++++++++++++++-------- envs/m55-an547/Makefile | 9 +++++++-- envs/m55-an547/common | 1 + envs/m85-an555/Makefile | 9 +++++++-- envs/m85-an555/common | 1 + tests/ntt_dilithium/ntt_dilithium.mk | 12 +++++++++++- tests/ntt_kyber/ntt_kyber.mk | 12 +++++++++++- 10 files changed, 79 insertions(+), 15 deletions(-) create mode 100644 .github/workflows/standalone.yaml create mode 120000 envs/m55-an547/common create mode 120000 envs/m85-an555/common diff --git a/.github/workflows/run.yaml b/.github/workflows/run.yaml index b49ac5a..fa75f90 100644 --- a/.github/workflows/run.yaml +++ b/.github/workflows/run.yaml @@ -11,6 +11,6 @@ jobs: submodules: true - name: install dependencies run: sudo apt install gcc-arm-none-eabi qemu-system - - name: Make all + - name: Make all and run run: | make -j$(nproc) run \ No newline at end of file diff --git a/.github/workflows/standalone.yaml b/.github/workflows/standalone.yaml new file mode 100644 index 0000000..fd633a0 --- /dev/null +++ b/.github/workflows/standalone.yaml @@ -0,0 +1,19 @@ +name: Package standalone artifacts and build +on: + pull_request: + branches: [ "main" ] +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + with: + submodules: true + - name: install dependencies + run: sudo apt install gcc-arm-none-eabi qemu-system + - name: Package standalone artificats + run: | + make -j$(nproc) standalone + - name: Make sure all standalone artifacts build + run: | + for standalone in standalone/*; do make -C $standalone; done \ No newline at end of file diff --git a/.gitignore b/.gitignore index 954f187..a0c61b5 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,3 @@ **/build/ **/*.elf +standalone diff --git a/Makefile b/Makefile index fcd689d..d9a678e 
100644 --- a/Makefile +++ b/Makefile @@ -3,16 +3,18 @@ include tests/ntt_dilithium/ntt_dilithium.mk include tests/ntt_kyber/ntt_kyber.mk totestname = $(shell echo $(1) | tr '[a-z]' '[A-Z]') -totestsources = $(addprefix ../../tests/$(1)/,$($(call totestname,$(1))_SOURCES)) -totestasm = $(addprefix ../../tests/$(1)/,$($(call totestname,$(1))_ASMS)) +totestsources = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call totestname,$(1))_SOURCES))) +totestasm = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call totestname,$(1))_ASMS))) +totestother = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call totestname,$(1))_OTHER))) toplatform = $(addsuffix --$(1),$($(call totestname,$(1))_PLATFORMS)) toelfname = $(addsuffix -test.elf,$(1)) platformtests := $(foreach test,$(TESTS), $(call toplatform,$(test))) -builds := $(addprefix build-, $(platformtests)) -runs := $(addprefix run-, $(platformtests)) -cleans := $(addprefix clean-, $(platformtests)) +builds := $(addprefix build-, $(platformtests)) +runs := $(addprefix run-, $(platformtests)) +cleans := $(addprefix clean-, $(platformtests)) +standalones := $(addprefix standalone-, $(platformtests)) .PHONY: all all: ${builds} @@ -20,21 +22,31 @@ all: ${builds} platform = $(firstword $(subst --, ,$*)) test = $(lastword $(subst --, ,$*)) - .PHONY: ${builds} ${builds}: build-%: - make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' TARGET=$(call toelfname,$(test)) + make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test),../../)' ASMS='$(call totestasm,$(test),../../)' TARGET=$(call toelfname,$(test)) .PHONY: ${runs} ${runs}: run-%: - make -C envs/$(platform) run SOURCES='$(call totestsources,$(test))' ASMS='$(call totestasm,$(test))' TARGET=$(call toelfname,$(test)) + make -C envs/$(platform) run SOURCES='$(call totestsources,$(test),../../)' ASMS='$(call totestasm,$(test),../../)' TARGET=$(call toelfname,$(test)) .PHONY: run run: ${runs} 
+.PHONY: ${standalones} +${standalones}: standalone-%: + make -C envs/$(platform) clean + mkdir -p standalone/$@/test_src/ + cp -rL envs/$(platform)/* standalone/$@/ + cp -rL $(call totestsources,$(test),./) $(call totestasm,$(test),./) $(call totestother,$(test),./) standalone/$@/test_src/ + +.PHONY: standalone +standalone: ${standalones} + .PHONY: ${cleans} ${cleans}: clean-%: make -C envs/$(platform) clean .PHONY: clean clean: ${cleans} + rm -rf standalone diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile index 8478183..0d2110a 100644 --- a/envs/m55-an547/Makefile +++ b/envs/m55-an547/Makefile @@ -1,12 +1,17 @@ # Makefile for images for AN547 +# default values (when not called from the root Makefile - used in standalone artifacts) +TARGET ?= test.elf +SOURCES ?= $(wildcard test_src/*.c) +ASMS ?= $(wildcard test_src/*.s) + CC = arm-none-eabi-gcc LD := $(CC) SRC_DIR=./src BUILD_DIR=./build -COMMON_INC=../common/inc/ +COMMON_INC=common/inc/ ENV_INC=./inc/ SYSROOT := $(shell $(CC) --print-sysroot) @@ -49,7 +54,7 @@ LDFLAGS += \ all: $(TARGET) -HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c) +HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard common/src/*.c) OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES))) OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES))) OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) diff --git a/envs/m55-an547/common b/envs/m55-an547/common new file mode 120000 index 0000000..60d3b0a --- /dev/null +++ b/envs/m55-an547/common @@ -0,0 +1 @@ +../common \ No newline at end of file diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile index 6f4c6cf..b4adaa1 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -1,12 +1,17 @@ # Makefile for images for AN555 +# default values (when not called from the root Makefile - used in standalone artifacts) +TARGET ?= test.elf +SOURCES ?= 
$(wildcard test_src/*.c) +ASMS ?= $(wildcard test_src/*.s) + CC = arm-none-eabi-gcc LD := $(CC) SRC_DIR=./src BUILD_DIR=./build -COMMON_INC=../common/inc/ +COMMON_INC=common/inc/ ENV_INC=./inc/ SYSROOT := $(shell $(CC) --print-sysroot) @@ -49,7 +54,7 @@ LDFLAGS += \ all: $(TARGET) -HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c) +HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard common/src/*.c) OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES))) OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES))) OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) diff --git a/envs/m85-an555/common b/envs/m85-an555/common new file mode 120000 index 0000000..8332399 --- /dev/null +++ b/envs/m85-an555/common @@ -0,0 +1 @@ +../common/ \ No newline at end of file diff --git a/tests/ntt_dilithium/ntt_dilithium.mk b/tests/ntt_dilithium/ntt_dilithium.mk index fd3261f..e3702ec 100644 --- a/tests/ntt_dilithium/ntt_dilithium.mk +++ b/tests/ntt_dilithium/ntt_dilithium.mk @@ -1,10 +1,16 @@ +# Test name - needs to match the directory name TESTS += ntt_dilithium +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) NTT_DILITHIUM_PLATFORMS += m55-an547 NTT_DILITHIUM_PLATFORMS += m85-an555 +# C sources required for this test NTT_DILITHIUM_SOURCES += main.c +# Assembly sources required for this test NTT_DILITHIUM_ASM_DIR = ../../asm/manual/ntt_dilithium NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m55.s NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_no_trans_vld4_opt_m85.s @@ -14,4 +20,8 @@ NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_opt_m85 NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78.s NTT_DILITHIUM_ASMS += 
$(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_opt_size_m55.s NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_opt_size_m85.s -NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78.s \ No newline at end of file +NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78.s + +# Additional required files (needed for packaging a standalone artifact) +NTT_DILITHIUM_OTHER += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_twiddles.s +NTT_DILITHIUM_OTHER += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_twiddles.s diff --git a/tests/ntt_kyber/ntt_kyber.mk b/tests/ntt_kyber/ntt_kyber.mk index 14af270..f9f32e8 100644 --- a/tests/ntt_kyber/ntt_kyber.mk +++ b/tests/ntt_kyber/ntt_kyber.mk @@ -1,10 +1,16 @@ +# Test name - needs to match the directory name TESTS += ntt_kyber +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) NTT_KYBER_PLATFORMS += m55-an547 NTT_KYBER_PLATFORMS += m85-an555 +# C sources required for this test NTT_KYBER_SOURCES += main.c +# Assembly sources required for this test NTT_KYBER_ASM_DIR=../../asm/manual/ntt_kyber NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_opt_m55.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_opt_m85.s @@ -14,4 +20,8 @@ NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_vld4.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_opt_size_m55.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_opt_size_m85.s -NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67.s \ No newline at end of file +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67.s + +# Additional required files (needed for packaging a standalone artifact) +NTT_KYBER_OTHER += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_twiddles.s +NTT_KYBER_OTHER += 
$(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_twiddles.s From 6ee5cbfd7be530ab27125493cbb7ffcb1c59719b Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 12 Jul 2024 14:04:48 +0800 Subject: [PATCH 12/32] don't need qemu here --- .github/workflows/standalone.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/standalone.yaml b/.github/workflows/standalone.yaml index fd633a0..5e823ad 100644 --- a/.github/workflows/standalone.yaml +++ b/.github/workflows/standalone.yaml @@ -10,7 +10,7 @@ jobs: with: submodules: true - name: install dependencies - run: sudo apt install gcc-arm-none-eabi qemu-system + run: sudo apt install gcc-arm-none-eabi - name: Package standalone artifacts run: | make -j$(nproc) standalone From 85f29a4faee70be9576e7febdfe8330abad6330f Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 12 Jul 2024 14:21:13 +0800 Subject: [PATCH 13/32] refactor names --- Makefile | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/Makefile b/Makefile index d9a678e..84b3bb4 100644 --- a/Makefile +++ b/Makefile @@ -2,14 +2,14 @@ include tests/ntt_dilithium/ntt_dilithium.mk include tests/ntt_kyber/ntt_kyber.mk -totestname = $(shell echo $(1) | tr '[a-z]' '[A-Z]') -totestsources = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call totestname,$(1))_SOURCES))) -totestasm = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call totestname,$(1))_ASMS))) -totestother = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call totestname,$(1))_OTHER))) -toplatform = $(addsuffix --$(1),$($(call totestname,$(1))_PLATFORMS)) -toelfname = $(addsuffix -test.elf,$(1)) +testname = $(shell echo $(1) | tr '[a-z]' '[A-Z]') +testsources = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_SOURCES))) +testasms = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_ASMS))) +testother = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_OTHER))) +testplatforms =
$(addsuffix --$(1),$($(call testname,$(1))_PLATFORMS)) +elfname = $(addsuffix -test.elf,$(1)) -platformtests := $(foreach test,$(TESTS), $(call toplatform,$(test))) +platformtests := $(foreach test,$(TESTS), $(call testplatforms,$(test))) builds := $(addprefix build-, $(platformtests)) runs := $(addprefix run-, $(platformtests)) @@ -24,11 +24,11 @@ test = $(lastword $(subst --, ,$*)) .PHONY: ${builds} ${builds}: build-%: - make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call totestsources,$(test),../../)' ASMS='$(call totestasm,$(test),../../)' TARGET=$(call toelfname,$(test)) + make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) .PHONY: ${runs} ${runs}: run-%: - make -C envs/$(platform) run SOURCES='$(call totestsources,$(test),../../)' ASMS='$(call totestasm,$(test),../../)' TARGET=$(call toelfname,$(test)) + make -C envs/$(platform) run SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) .PHONY: run run: ${runs} @@ -38,7 +38,7 @@ ${standalones}: standalone-%: make -C envs/$(platform) clean mkdir -p standalone/$@/test_src/ cp -rL envs/$(platform)/* standalone/$@/ - cp -rL $(call totestsources,$(test),./) $(call totestasm,$(test),./) $(call totestother,$(test),./) standalone/$@/test_src/ + cp -rL $(call testsources,$(test),./) $(call testasms,$(test),./) $(call testother,$(test),./) standalone/$@/test_src/ .PHONY: standalone standalone: ${standalones} From 06d488fa91b9a2ee813f65e89631e2943f9c0271 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Mon, 15 Jul 2024 10:00:53 +0800 Subject: [PATCH 14/32] tests checking for last line of output --- .github/workflows/{run.yaml => check.yaml} | 2 +- Makefile | 8 ++++++++ envs/m55-an547/Makefile | 3 +++ envs/m85-an555/Makefile | 4 ++++ tests/ntt_dilithium/main.c | 4 ++-- tests/ntt_kyber/main.c | 4 ++-- 6 files changed, 20 insertions(+), 5 deletions(-) rename .github/workflows/{run.yaml => check.yaml} (91%) diff --git a/.github/workflows/run.yaml b/.github/workflows/check.yaml similarity index 91% rename from .github/workflows/run.yaml rename to .github/workflows/check.yaml index fa75f90..24eb09b 100644 --- a/.github/workflows/run.yaml +++ b/.github/workflows/check.yaml @@ -13,4 +13,4 @@ jobs: run: sudo apt install gcc-arm-none-eabi qemu-system - name: Make all and run run: | - make -j$(nproc) run \ No newline at end of file + make -j$(nproc) check \ No newline at end of file diff --git a/Makefile b/Makefile index 84b3bb4..2745c4e 100644 --- a/Makefile +++ b/Makefile @@ -13,6 +13,7 @@ platformtests := $(foreach test,$(TESTS), $(call testplatforms,$(test))) builds := $(addprefix build-, $(platformtests)) runs := $(addprefix run-, $(platformtests)) +checks := $(addprefix check-, $(platformtests)) cleans := $(addprefix clean-, $(platformtests)) standalones := $(addprefix standalone-, $(platformtests)) @@ -33,6 +34,13 @@ ${runs}: run-%: .PHONY: run run: ${runs} +.PHONY: ${checks} +${checks}: check-%: + make -C envs/$(platform) check SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) + +.PHONY: check +check: ${checks} + .PHONY: ${standalones} ${standalones}: standalone-%: make -C envs/$(platform) clean diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile index 0d2110a..d3caf45 100644 --- a/envs/m55-an547/Makefile +++ b/envs/m55-an547/Makefile @@ -79,6 +79,9 @@ build: $(TARGET) run: $(TARGET) qemu-system-arm -M mps3-an547 -nographic -semihosting -kernel $(TARGET) +check: $(TARGET) 
+ qemu-system-arm -M mps3-an547 -nographic -semihosting -kernel $(TARGET) | tail -n 1 | grep "ALL GOOD!" || exit 1 + clean: rm -f *.elf rm -rf $(BUILD_DIR) \ No newline at end of file diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile index b4adaa1..1e042eb 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -79,6 +79,10 @@ build: $(TARGET) run: @echo "WARNING: AN555 is not supported by qemu. Skipping" +check: + @echo "WARNING: AN555 is not supported by qemu. Skipping" + + clean: rm -f *.elf rm -rf $(BUILD_DIR) \ No newline at end of file diff --git a/tests/ntt_dilithium/main.c b/tests/ntt_dilithium/main.c index d54f176..d36aaaa 100644 --- a/tests/ntt_dilithium/main.c +++ b/tests/ntt_dilithium/main.c @@ -315,8 +315,8 @@ int main(void) bench_ntt_l2222_no_trans_vld4_opt_m85(); bench_ntt_l332_opt_size_m85(); - debug_printf( "Done!\n:" ); + debug_printf( "Done!\n" ); hal_pmu_disable(); - + debug_printf( "ALL GOOD!\n" ); return( ret ); } diff --git a/tests/ntt_kyber/main.c b/tests/ntt_kyber/main.c index a298f78..05b17e6 100644 --- a/tests/ntt_kyber/main.c +++ b/tests/ntt_kyber/main.c @@ -320,8 +320,8 @@ int main(void) bench_ntt_l1222_no_trans_vld4_opt_m85(); bench_ntt_l232_opt_size_m85(); - debug_printf( "Done!\n:" ); + debug_printf( "Done!\n" ); hal_pmu_disable(); - + debug_printf( "ALL GOOD!\n" ); return( ret ); } From fdb0b8953a9bacfbd2ca807d1e981bed7c48d865 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Mon, 15 Jul 2024 10:04:01 +0800 Subject: [PATCH 15/32] change back dilithium test --- tests/ntt_dilithium/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/ntt_dilithium/main.c b/tests/ntt_dilithium/main.c index d36aaaa..6784ca9 100644 --- a/tests/ntt_dilithium/main.c +++ b/tests/ntt_dilithium/main.c @@ -271,7 +271,7 @@ int main(void) // base ret |= test_ntt_l2222(); - if( ret == 0 ) + if( ret != 0 ) return( 1 ); ret |= test_ntt_l2222_no_trans_vld4(); if( ret != 0 ) From 2efc615912de49632ca0cd5108aac2a1e74f48de Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Mon, 15 Jul 2024 10:36:11 +0800 Subject: [PATCH 16/32] move common test files back to the tests folder --- Makefile | 7 ++++--- envs/m55-an547/Makefile | 2 ++ envs/m85-an555/Makefile | 2 ++ {envs/common/src => tests/common}/misc.c | 0 {envs/common/inc => tests/common}/misc.h | 0 {envs/common/src => tests/common}/poly.c | 0 {envs/common/inc => tests/common}/poly.h | 0 tests/ntt_dilithium/misc.c | 1 + tests/ntt_dilithium/misc.h | 1 + tests/ntt_dilithium/ntt_dilithium.mk | 4 ++++ tests/ntt_dilithium/poly.c | 1 + tests/ntt_dilithium/poly.h | 1 + tests/ntt_kyber/misc.c | 1 + tests/ntt_kyber/misc.h | 1 + tests/ntt_kyber/ntt_kyber.mk | 4 ++++ tests/ntt_kyber/poly.c | 1 + tests/ntt_kyber/poly.h | 1 + 17 files changed, 24 insertions(+), 3 deletions(-) rename {envs/common/src => tests/common}/misc.c (100%) rename {envs/common/inc => tests/common}/misc.h (100%) rename {envs/common/src => tests/common}/poly.c (100%) rename {envs/common/inc => tests/common}/poly.h (100%) create mode 120000 tests/ntt_dilithium/misc.c create mode 120000 tests/ntt_dilithium/misc.h create mode 120000 tests/ntt_dilithium/poly.c create mode 120000 tests/ntt_dilithium/poly.h create mode 120000 tests/ntt_kyber/misc.c create mode 120000 tests/ntt_kyber/misc.h create mode 120000 tests/ntt_kyber/poly.c create mode 120000 tests/ntt_kyber/poly.h diff --git a/Makefile b/Makefile index 
2745c4e..f4337c0 100644 --- a/Makefile +++ b/Makefile @@ -3,6 +3,7 @@ include tests/ntt_dilithium/ntt_dilithium.mk include tests/ntt_kyber/ntt_kyber.mk testname = $(shell echo $(1) | tr '[a-z]' '[A-Z]') +testdir = $(addprefix $(2),tests/$(1)/) testsources = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_SOURCES))) testasms = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_ASMS))) testother = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_OTHER))) @@ -25,18 +26,18 @@ test = $(lastword $(subst --, ,$*)) .PHONY: ${builds} ${builds}: build-%: - make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) + make -j$(shell nproc) -C envs/$(platform) build SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) TESTDIR=$(call testdir,$(test),../../) .PHONY: ${runs} ${runs}: run-%: - make -C envs/$(platform) run SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) + make -C envs/$(platform) run SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) TESTDIR=$(call testdir,$(test),../../) .PHONY: run run: ${runs} .PHONY: ${checks} ${checks}: check-%: - make -C envs/$(platform) check SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) + make -C envs/$(platform) check SOURCES='$(call testsources,$(test),../../)' ASMS='$(call testasms,$(test),../../)' TARGET=$(call elfname,$(test)) TESTDIR=$(call testdir,$(test),../../) .PHONY: check check: ${checks} diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile index d3caf45..704d459 100644 --- a/envs/m55-an547/Makefile +++ b/envs/m55-an547/Makefile @@ -4,6 +4,7 @@ TARGET ?= test.elf SOURCES ?= $(wildcard test_src/*.c) ASMS ?= 
$(wildcard test_src/*.s) +TESTDIR ?= test_src CC = arm-none-eabi-gcc LD := $(CC) @@ -26,6 +27,7 @@ CFLAGS += \ -I$(COMMON_INC) \ -I$(ENV_INC) \ -I$(SRC_DIR) \ + -I$(TESTDIR) \ -I$(SRC_DIR)/platform ARCH_FLAGS += \ diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile index 1e042eb..09d543f 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -4,6 +4,7 @@ TARGET ?= test.elf SOURCES ?= $(wildcard test_src/*.c) ASMS ?= $(wildcard test_src/*.s) +TESTDIR ?= test_src CC = arm-none-eabi-gcc LD := $(CC) @@ -26,6 +27,7 @@ CFLAGS += \ -I$(COMMON_INC) \ -I$(ENV_INC) \ -I$(SRC_DIR) \ + -I$(TESTDIR) \ -I$(SRC_DIR)/platform ARCH_FLAGS += \ diff --git a/envs/common/src/misc.c b/tests/common/misc.c similarity index 100% rename from envs/common/src/misc.c rename to tests/common/misc.c diff --git a/envs/common/inc/misc.h b/tests/common/misc.h similarity index 100% rename from envs/common/inc/misc.h rename to tests/common/misc.h diff --git a/envs/common/src/poly.c b/tests/common/poly.c similarity index 100% rename from envs/common/src/poly.c rename to tests/common/poly.c diff --git a/envs/common/inc/poly.h b/tests/common/poly.h similarity index 100% rename from envs/common/inc/poly.h rename to tests/common/poly.h diff --git a/tests/ntt_dilithium/misc.c b/tests/ntt_dilithium/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/ntt_dilithium/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/ntt_dilithium/misc.h b/tests/ntt_dilithium/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/ntt_dilithium/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/ntt_dilithium/ntt_dilithium.mk b/tests/ntt_dilithium/ntt_dilithium.mk index e3702ec..82153d2 100644 --- a/tests/ntt_dilithium/ntt_dilithium.mk +++ b/tests/ntt_dilithium/ntt_dilithium.mk @@ -9,6 +9,8 @@ NTT_DILITHIUM_PLATFORMS += m85-an555 # C sources required for this test NTT_DILITHIUM_SOURCES += 
main.c +NTT_DILITHIUM_SOURCES += poly.c +NTT_DILITHIUM_SOURCES += misc.c # Assembly sources required for this test NTT_DILITHIUM_ASM_DIR = ../../asm/manual/ntt_dilithium @@ -25,3 +27,5 @@ NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78.s # Additional required files (needed for packaging a standalone artifact) NTT_DILITHIUM_OTHER += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_twiddles.s NTT_DILITHIUM_OTHER += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_twiddles.s +NTT_DILITHIUM_OTHER += misc.h +NTT_DILITHIUM_OTHER += poly.h diff --git a/tests/ntt_dilithium/poly.c b/tests/ntt_dilithium/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/ntt_dilithium/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/ntt_dilithium/poly.h b/tests/ntt_dilithium/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/ntt_dilithium/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/ntt_kyber/misc.c b/tests/ntt_kyber/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/ntt_kyber/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/ntt_kyber/misc.h b/tests/ntt_kyber/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/ntt_kyber/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/ntt_kyber/ntt_kyber.mk b/tests/ntt_kyber/ntt_kyber.mk index f9f32e8..21a3e7f 100644 --- a/tests/ntt_kyber/ntt_kyber.mk +++ b/tests/ntt_kyber/ntt_kyber.mk @@ -9,6 +9,8 @@ NTT_KYBER_PLATFORMS += m85-an555 # C sources required for this test NTT_KYBER_SOURCES += main.c +NTT_KYBER_SOURCES += poly.c +NTT_KYBER_SOURCES += misc.c # Assembly sources required for this test NTT_KYBER_ASM_DIR=../../asm/manual/ntt_kyber @@ -25,3 +27,5 @@ NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67.s # Additional required files (needed for packaging a standalone 
artifact) NTT_KYBER_OTHER += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_twiddles.s NTT_KYBER_OTHER += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_twiddles.s +NTT_KYBER_OTHER += poly.h +NTT_KYBER_OTHER += misc.h diff --git a/tests/ntt_kyber/poly.c b/tests/ntt_kyber/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/ntt_kyber/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/ntt_kyber/poly.h b/tests/ntt_kyber/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/ntt_kyber/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file From d2c37d57d319e4c36dec1ca773d76b8ecf8b3a52 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Mon, 15 Jul 2024 13:21:03 +0800 Subject: [PATCH 17/32] change targets to _ + disallow _ in test names; before we had run-<platform>--<test> - the double dash was needed because platform names contain -, which made splitting the platform and the test back out of the target name too tedious. I changed it to run-<platform>_<test> now, but that means we can't allow _ in either platform or test names. There are some tests with _ in their names, but I think that's bad practice anyway, so I'll rename them.
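For reference, the new run-<platform>_<test> stems split unambiguously on the single underscore. A minimal shell sketch of the same split the Makefile performs with $(firstword $(subst _, ,$*)) and $(lastword $(subst _, ,$*)) — the stem value is hypothetical:

```shell
#!/bin/sh
# Hypothetical target stem as produced by the new naming scheme:
stem="m55-an547_ntt-kyber"

# Mirror of $(firstword $(subst _, ,$*)): everything before the first '_'
platform="${stem%%_*}"

# Mirror of $(lastword $(subst _, ,$*)): everything after the last '_'
test_name="${stem##*_}"

echo "$platform $test_name"   # -> m55-an547 ntt-kyber
```

With a single reserved separator both expansions recover the original components, which is exactly why _ must now be disallowed inside both platform and test names.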
--- Makefile | 12 ++++++------ tests/{ntt_dilithium => ntt-dilithium}/main.c | 0 tests/{ntt_dilithium => ntt-dilithium}/misc.c | 0 tests/{ntt_dilithium => ntt-dilithium}/misc.h | 0 .../ntt-dilithium.mk} | 2 +- tests/{ntt_dilithium => ntt-dilithium}/poly.c | 0 tests/{ntt_dilithium => ntt-dilithium}/poly.h | 0 tests/{ntt_kyber => ntt-kyber}/main.c | 0 tests/{ntt_kyber => ntt-kyber}/misc.c | 0 tests/{ntt_kyber => ntt-kyber}/misc.h | 0 .../ntt_kyber.mk => ntt-kyber/ntt-kyber.mk} | 2 +- tests/{ntt_kyber => ntt-kyber}/poly.c | 0 tests/{ntt_kyber => ntt-kyber}/poly.h | 0 13 files changed, 8 insertions(+), 8 deletions(-) rename tests/{ntt_dilithium => ntt-dilithium}/main.c (100%) rename tests/{ntt_dilithium => ntt-dilithium}/misc.c (100%) rename tests/{ntt_dilithium => ntt-dilithium}/misc.h (100%) rename tests/{ntt_dilithium/ntt_dilithium.mk => ntt-dilithium/ntt-dilithium.mk} (98%) rename tests/{ntt_dilithium => ntt-dilithium}/poly.c (100%) rename tests/{ntt_dilithium => ntt-dilithium}/poly.h (100%) rename tests/{ntt_kyber => ntt-kyber}/main.c (100%) rename tests/{ntt_kyber => ntt-kyber}/misc.c (100%) rename tests/{ntt_kyber => ntt-kyber}/misc.h (100%) rename tests/{ntt_kyber/ntt_kyber.mk => ntt-kyber/ntt-kyber.mk} (98%) rename tests/{ntt_kyber => ntt-kyber}/poly.c (100%) rename tests/{ntt_kyber => ntt-kyber}/poly.h (100%) diff --git a/Makefile b/Makefile index f4337c0..335eb02 100644 --- a/Makefile +++ b/Makefile @@ -1,13 +1,13 @@ # Tests -include tests/ntt_dilithium/ntt_dilithium.mk -include tests/ntt_kyber/ntt_kyber.mk +include tests/ntt-dilithium/ntt-dilithium.mk +include tests/ntt-kyber/ntt-kyber.mk -testname = $(shell echo $(1) | tr '[a-z]' '[A-Z]') +testname = $(shell echo $(1) | tr '[a-z]' '[A-Z]' | tr '-' '_') testdir = $(addprefix $(2),tests/$(1)/) testsources = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_SOURCES))) testasms = $(addprefix $(2),$(addprefix tests/$(1)/,$($(call testname,$(1))_ASMS))) testother = $(addprefix $(2),$(addprefix 
tests/$(1)/,$($(call testname,$(1))_OTHER))) -testplatforms = $(addsuffix --$(1),$($(call testname,$(1))_PLATFORMS)) +testplatforms = $(addsuffix _$(1),$($(call testname,$(1))_PLATFORMS)) elfname = $(addsuffix -test.elf,$(1)) platformtests := $(foreach test,$(TESTS), $(call testplatforms,$(test))) @@ -21,8 +21,8 @@ standalones := $(addprefix standalone-, $(platformtests)) .PHONY: all all: ${builds} -platform = $(firstword $(subst --, ,$*)) -test = $(lastword $(subst --, ,$*)) +platform = $(firstword $(subst _, ,$*)) +test = $(lastword $(subst _, ,$*)) .PHONY: ${builds} ${builds}: build-%: diff --git a/tests/ntt_dilithium/main.c b/tests/ntt-dilithium/main.c similarity index 100% rename from tests/ntt_dilithium/main.c rename to tests/ntt-dilithium/main.c diff --git a/tests/ntt_dilithium/misc.c b/tests/ntt-dilithium/misc.c similarity index 100% rename from tests/ntt_dilithium/misc.c rename to tests/ntt-dilithium/misc.c diff --git a/tests/ntt_dilithium/misc.h b/tests/ntt-dilithium/misc.h similarity index 100% rename from tests/ntt_dilithium/misc.h rename to tests/ntt-dilithium/misc.h diff --git a/tests/ntt_dilithium/ntt_dilithium.mk b/tests/ntt-dilithium/ntt-dilithium.mk similarity index 98% rename from tests/ntt_dilithium/ntt_dilithium.mk rename to tests/ntt-dilithium/ntt-dilithium.mk index 82153d2..40689e8 100644 --- a/tests/ntt_dilithium/ntt_dilithium.mk +++ b/tests/ntt-dilithium/ntt-dilithium.mk @@ -1,5 +1,5 @@ # Test name - needs to match the directory name -TESTS += ntt_dilithium +TESTS += ntt-dilithium # All further variables must be prefixed with the capitalized test name diff --git a/tests/ntt_dilithium/poly.c b/tests/ntt-dilithium/poly.c similarity index 100% rename from tests/ntt_dilithium/poly.c rename to tests/ntt-dilithium/poly.c diff --git a/tests/ntt_dilithium/poly.h b/tests/ntt-dilithium/poly.h similarity index 100% rename from tests/ntt_dilithium/poly.h rename to tests/ntt-dilithium/poly.h diff --git a/tests/ntt_kyber/main.c b/tests/ntt-kyber/main.c 
similarity index 100% rename from tests/ntt_kyber/main.c rename to tests/ntt-kyber/main.c diff --git a/tests/ntt_kyber/misc.c b/tests/ntt-kyber/misc.c similarity index 100% rename from tests/ntt_kyber/misc.c rename to tests/ntt-kyber/misc.c diff --git a/tests/ntt_kyber/misc.h b/tests/ntt-kyber/misc.h similarity index 100% rename from tests/ntt_kyber/misc.h rename to tests/ntt-kyber/misc.h diff --git a/tests/ntt_kyber/ntt_kyber.mk b/tests/ntt-kyber/ntt-kyber.mk similarity index 98% rename from tests/ntt_kyber/ntt_kyber.mk rename to tests/ntt-kyber/ntt-kyber.mk index 21a3e7f..cbcda90 100644 --- a/tests/ntt_kyber/ntt_kyber.mk +++ b/tests/ntt-kyber/ntt-kyber.mk @@ -1,5 +1,5 @@ # Test name - needs to match the directory name -TESTS += ntt_kyber +TESTS += ntt-kyber # All further variables must be prefixed with the capitalized test name diff --git a/tests/ntt_kyber/poly.c b/tests/ntt-kyber/poly.c similarity index 100% rename from tests/ntt_kyber/poly.c rename to tests/ntt-kyber/poly.c diff --git a/tests/ntt_kyber/poly.h b/tests/ntt-kyber/poly.h similarity index 100% rename from tests/ntt_kyber/poly.h rename to tests/ntt-kyber/poly.h From 725e7cfcd9fc216d4f8d0f90aaf236b8d4be6985 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Thu, 18 Jul 2024 10:24:03 +0800 Subject: [PATCH 18/32] Revert "add support to build standalone artifacts" This reverts commit 70a9aa2290f6d3efdac045f9a04218a4f6e9efac. 
--- .github/workflows/check.yaml | 2 +- .github/workflows/standalone.yaml | 19 ------------------- .gitignore | 1 - Makefile | 12 ------------ envs/m55-an547/Makefile | 10 ++-------- envs/m55-an547/common | 1 - envs/m85-an555/Makefile | 10 ++-------- envs/m85-an555/common | 1 - tests/ntt-dilithium/ntt-dilithium.mk | 6 ------ tests/ntt-kyber/ntt-kyber.mk | 8 +------- 10 files changed, 6 insertions(+), 64 deletions(-) delete mode 100644 .github/workflows/standalone.yaml delete mode 120000 envs/m55-an547/common delete mode 120000 envs/m85-an555/common diff --git a/.github/workflows/check.yaml b/.github/workflows/check.yaml index 24eb09b..6e0d885 100644 --- a/.github/workflows/check.yaml +++ b/.github/workflows/check.yaml @@ -11,6 +11,6 @@ jobs: submodules: true - name: install dependencies run: sudo apt install gcc-arm-none-eabi qemu-system - - name: Make all and run + - name: Make all run: | make -j$(nproc) check \ No newline at end of file diff --git a/.github/workflows/standalone.yaml b/.github/workflows/standalone.yaml deleted file mode 100644 index 5e823ad..0000000 --- a/.github/workflows/standalone.yaml +++ /dev/null @@ -1,19 +0,0 @@ -name: Package standalone artifacts and build -on: - pull_request: - branches: [ "main" ] -jobs: - test: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v3 - with: - submodules: true - - name: install dependencies - run: sudo apt install gcc-arm-none-eabi - - name: Package standalone artificats - run: | - make -j$(nproc) standalone - - name: Make sure all standalone artifacts build - run: | - for standalone in standalone/*; do make -C $standalone; done \ No newline at end of file diff --git a/.gitignore b/.gitignore index a0c61b5..954f187 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,2 @@ **/build/ **/*.elf -standalone diff --git a/Makefile b/Makefile index 335eb02..022b6d7 100644 --- a/Makefile +++ b/Makefile @@ -16,7 +16,6 @@ builds := $(addprefix build-, $(platformtests)) runs := $(addprefix run-, 
$(platformtests)) checks := $(addprefix check-, $(platformtests)) cleans := $(addprefix clean-, $(platformtests)) -standalones := $(addprefix standalone-, $(platformtests)) .PHONY: all all: ${builds} @@ -42,20 +41,9 @@ ${checks}: check-%: .PHONY: check check: ${checks} -.PHONY: ${standalones} -${standalones}: standalone-%: - make -C envs/$(platform) clean - mkdir -p standalone/$@/test_src/ - cp -rL envs/$(platform)/* standalone/$@/ - cp -rL $(call testsources,$(test),./) $(call testasms,$(test),./) $(call testother,$(test),./) standalone/$@/test_src/ - -.PHONY: standalone -standalone: ${standalones} - .PHONY: ${cleans} ${cleans}: clean-%: make -C envs/$(platform) clean .PHONY: clean clean: ${cleans} - rm -rf standalone diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile index 704d459..e27efa1 100644 --- a/envs/m55-an547/Makefile +++ b/envs/m55-an547/Makefile @@ -1,18 +1,12 @@ # Makefile for images for AN547 -# default values (when not called from the root Makefile - used in standalone artifacts) -TARGET ?= test.elf -SOURCES ?= $(wildcard test_src/*.c) -ASMS ?= $(wildcard test_src/*.s) -TESTDIR ?= test_src - CC = arm-none-eabi-gcc LD := $(CC) SRC_DIR=./src BUILD_DIR=./build -COMMON_INC=common/inc/ +COMMON_INC=../common/inc/ ENV_INC=./inc/ SYSROOT := $(shell $(CC) --print-sysroot) @@ -56,7 +50,7 @@ LDFLAGS += \ all: $(TARGET) -HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard common/src/*.c) +HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c) OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES))) OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES))) OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) diff --git a/envs/m55-an547/common b/envs/m55-an547/common deleted file mode 120000 index 60d3b0a..0000000 --- a/envs/m55-an547/common +++ /dev/null @@ -1 +0,0 @@ -../common \ No newline at end of file diff --git a/envs/m85-an555/Makefile 
b/envs/m85-an555/Makefile index 09d543f..13e0851 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -1,18 +1,12 @@ # Makefile for images for AN555 -# default values (when not called from the root Makefile - used in standalone artifacts) -TARGET ?= test.elf -SOURCES ?= $(wildcard test_src/*.c) -ASMS ?= $(wildcard test_src/*.s) -TESTDIR ?= test_src - CC = arm-none-eabi-gcc LD := $(CC) SRC_DIR=./src BUILD_DIR=./build -COMMON_INC=common/inc/ +COMMON_INC=../common/inc/ ENV_INC=./inc/ SYSROOT := $(shell $(CC) --print-sysroot) @@ -56,7 +50,7 @@ LDFLAGS += \ all: $(TARGET) -HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard common/src/*.c) +HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c) OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES))) OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES))) OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) diff --git a/envs/m85-an555/common b/envs/m85-an555/common deleted file mode 120000 index 8332399..0000000 --- a/envs/m85-an555/common +++ /dev/null @@ -1 +0,0 @@ -../common/ \ No newline at end of file diff --git a/tests/ntt-dilithium/ntt-dilithium.mk b/tests/ntt-dilithium/ntt-dilithium.mk index 40689e8..560b09f 100644 --- a/tests/ntt-dilithium/ntt-dilithium.mk +++ b/tests/ntt-dilithium/ntt-dilithium.mk @@ -23,9 +23,3 @@ NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78.s NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_opt_size_m55.s NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_opt_size_m85.s NTT_DILITHIUM_ASMS += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78.s - -# Additional required files (needed for packaging a standalone artifact) -NTT_DILITHIUM_OTHER += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_12_34_56_78_twiddles.s -NTT_DILITHIUM_OTHER += $(NTT_DILITHIUM_ASM_DIR)/ntt_dilithium_123_456_78_twiddles.s 
-NTT_DILITHIUM_OTHER += misc.h -NTT_DILITHIUM_OTHER += poly.h diff --git a/tests/ntt-kyber/ntt-kyber.mk b/tests/ntt-kyber/ntt-kyber.mk index cbcda90..1bf86d0 100644 --- a/tests/ntt-kyber/ntt-kyber.mk +++ b/tests/ntt-kyber/ntt-kyber.mk @@ -22,10 +22,4 @@ NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans_vld4.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_no_trans.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_opt_size_m55.s NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_opt_size_m85.s -NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67.s - -# Additional required files (needed for packaging a standalone artifact) -NTT_KYBER_OTHER += $(NTT_KYBER_ASM_DIR)/ntt_kyber_1_23_45_67_twiddles.s -NTT_KYBER_OTHER += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67_twiddles.s -NTT_KYBER_OTHER += poly.h -NTT_KYBER_OTHER += misc.h +NTT_KYBER_ASMS += $(NTT_KYBER_ASM_DIR)/ntt_kyber_12_345_67.s \ No newline at end of file From 0cf72f45ddcdfe394bcf0ab97d0523d296751ec3 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Thu, 18 Jul 2024 10:35:03 +0800 Subject: [PATCH 19/32] add other tests --- Makefile | 2 ++ tests/chunk/chunk.mk | 16 ++++++++++++++++ tests/chunk/main.c | 32 ++++++++++++++++++-------------- tests/chunk/misc.c | 1 + tests/chunk/misc.h | 1 + tests/crt/crt.mk | 16 ++++++++++++++++ tests/crt/main.c | 1 + tests/crt/misc.c | 1 + tests/crt/misc.h | 1 + 9 files changed, 57 insertions(+), 14 deletions(-) create mode 100644 tests/chunk/chunk.mk create mode 120000 tests/chunk/misc.c create mode 120000 tests/chunk/misc.h create mode 100644 tests/crt/crt.mk create mode 120000 tests/crt/misc.c create mode 120000 tests/crt/misc.h diff --git a/Makefile b/Makefile index 022b6d7..041a3b1 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,6 @@ # Tests +#include tests/chunk/chunk.mk # TODO: this test is failing +include tests/crt/crt.mk include tests/ntt-dilithium/ntt-dilithium.mk include tests/ntt-kyber/ntt-kyber.mk diff --git a/tests/chunk/chunk.mk b/tests/chunk/chunk.mk new file mode 100644 index 0000000..e3bca21 --- /dev/null +++ b/tests/chunk/chunk.mk @@ -0,0 +1,16 @@ +# Test name - needs to match the directory name +TESTS += chunk + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +CHUNK_PLATFORMS += m55-an547 +CHUNK_PLATFORMS += m85-an555 + +# C sources required for this test +CHUNK_SOURCES += main.c +CHUNK_SOURCES += misc.c + +# Assembly sources required for this test +CHUNK_ASMS += chunk.s + diff --git a/tests/chunk/main.c b/tests/chunk/main.c index 7a25eff..e04e00a 100644 --- a/tests/chunk/main.c +++ b/tests/chunk/main.c @@ -97,22 +97,26 @@ int main(void) { int ret = 0; - test_radix11_reduce_x4_m4(); - test_radix11_reduce_x4_m4_v2(); - test_radix11_reduce_x4_m4_v3(); - test_radix11_reduce_x4_lob(); - test_radix11_reduce_x4_lob_64bit(); + ret |= test_radix11_reduce_x4_m4(); + ret |= test_radix11_reduce_x4_m4_v2(); + ret |= test_radix11_reduce_x4_m4_v3(); + 
ret |= test_radix11_reduce_x4_lob(); + ret |= test_radix11_reduce_x4_lob_64bit(); - test_radix11_reduce_x4_mve_basic(); - test_radix11_reduce_x4_mve_vmla(); - test_radix11_reduce_x4_mve_vmla_v2(); - test_radix11_reduce_x4_mve_vmla_v3(); - test_radix11_reduce_x4_mve_vmla_v4(); + ret |= test_radix11_reduce_x4_mve_basic(); + ret |= test_radix11_reduce_x4_mve_vmla(); + ret |= test_radix11_reduce_x4_mve_vmla_v2(); + ret |= test_radix11_reduce_x4_mve_vmla_v3(); + ret |= test_radix11_reduce_x4_mve_vmla_v4(); - test_radix11_reduce_x4_mve_vqdmlah(); - test_radix11_reduce_x4_mve_vqdmlah_v3(); - test_radix11_reduce_x4_mve_vqdmlah_v4(); - test_radix11_reduce_x4_mve_vqdmlah_v5(); + ret |= test_radix11_reduce_x4_mve_vqdmlah(); + ret |= test_radix11_reduce_x4_mve_vqdmlah_v3(); + ret |= test_radix11_reduce_x4_mve_vqdmlah_v4(); + ret |= test_radix11_reduce_x4_mve_vqdmlah_v5(); + + if(ret == 0){ + debug_printf( "ALL GOOD!\n" ); + } return( ret ); } diff --git a/tests/chunk/misc.c b/tests/chunk/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/chunk/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/chunk/misc.h b/tests/chunk/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/chunk/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/crt/crt.mk b/tests/crt/crt.mk new file mode 100644 index 0000000..b99df4b --- /dev/null +++ b/tests/crt/crt.mk @@ -0,0 +1,16 @@ +# Test name - needs to match the directory name +TESTS += crt + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +CRT_PLATFORMS += m55-an547 +CRT_PLATFORMS += m85-an555 + +# C sources required for this test +CRT_SOURCES += main.c +CRT_SOURCES += misc.c + +# Assembly sources required for this test +CRT_ASMS += crt.s + diff --git a/tests/crt/main.c b/tests/crt/main.c index ac03223..dccf858 100644 --- 
a/tests/crt/main.c +++ b/tests/crt/main.c @@ -664,5 +664,6 @@ int main(void) } } + debug_printf( "ALL GOOD!\n" ); return( 0 ); } diff --git a/tests/crt/misc.c b/tests/crt/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/crt/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/crt/misc.h b/tests/crt/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/crt/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file From fb467bec4662bdeef5c5f3ac14a6e16554227a8d Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Thu, 18 Jul 2024 10:39:29 +0800 Subject: [PATCH 20/32] add other tests --- Makefile | 25 +- tests/ct/ct.mk | 15 + tests/ct/main.c | 5 +- .../{flt_fft => flt-fft}/base_concrete.s.old | 0 tests/{flt_fft => flt-fft}/base_ref.s | 0 .../floatingpoint_radix4_fft_opt_M55.s | 0 .../floatingpoint_radix4_fft_opt_M85.s | 0 tests/flt-fft/flt-fft.mk | 18 + tests/{flt_fft => flt-fft}/flt_fft.h | 0 tests/{flt_fft => flt-fft}/main.c | 8 +- tests/flt-fft/misc.c | 1 + tests/flt-fft/misc.h | 1 + tests/{fx_fft => fx-fft}/base_concrete.s | 0 tests/{fx_fft => fx-fft}/base_symbolic.s | 0 .../fixedpoint_radix4_fft_opt_M55.s | 0 .../fixedpoint_radix4_fft_opt_M85.s | 0 tests/fx-fft/fx-fft.mk | 22 + tests/{fx_fft => fx-fft}/fx_fft.h | 0 tests/{fx_fft => fx-fft}/main.c | 8 +- tests/fx-fft/misc.c | 1 + tests/fx-fft/misc.h | 1 + .../{fx_fft => fx-fft}/ref_handwritten_asm.s | 0 tests/{fx_fft => fx-fft}/ref_intrinsics.s | 0 tests/helloworld/helloworld.mk | 16 + tests/helloworld/main.c | 1 + tests/helloworld/misc.c | 1 + tests/helloworld/misc.h | 1 + tests/intmulntt/intmulntt.mk | 43 + tests/intmulntt/main.c | 1 + tests/intmulntt/misc.c | 1 + tests/intmulntt/misc.h | 1 + tests/intmulntt/poly.c | 1 + tests/intmulntt/poly.h | 1 + tests/karatsuba/karatsuba.mk | 16 + tests/karatsuba/main.c | 4 + tests/karatsuba/misc.c | 1 + tests/karatsuba/misc.h | 1 + tests/karatsuba/poly.h | 1 + 
tests/montgomery/main.c | 5 + tests/montgomery/misc.c | 1 + tests/montgomery/misc.h | 1 + tests/montgomery/montgomery.mk | 17 + tests/montgomery/poly.c | 1 + tests/montgomery/poly.h | 1 + tests/{ntt_1024 => ntt-1024}/main.c | 0 tests/ntt-1024/misc.c | 1 + tests/ntt-1024/misc.h | 1 + tests/{ntt_1024 => ntt-1024}/montgomery.h | 0 tests/{ntt_1024 => ntt-1024}/montgomery.s | 0 .../{ntt_1024 => ntt-1024}/montgomery_const.h | 0 tests/ntt-1024/ntt-1024.mk | 25 + tests/ntt-1024/poly.c | 1 + tests/ntt-1024/poly.h | 1 + tests/{ntt_192 => ntt-192}/main.c | 4 + tests/ntt-192/misc.c | 1 + tests/ntt-192/misc.h | 1 + tests/{ntt_192 => ntt-192}/montgomery.h | 0 tests/{ntt_192 => ntt-192}/montgomery.s | 0 tests/{ntt_192 => ntt-192}/montgomery_const.h | 0 tests/ntt-192/ntt-192.mk | 37 + tests/{ntt_192 => ntt-192}/ntt_const.h | 0 tests/ntt-192/poly.c | 1 + tests/ntt-192/poly.h | 1 + tests/{ntt_256 => ntt-256}/main.c | 2 + tests/ntt-256/misc.c | 1 + tests/ntt-256/misc.h | 1 + tests/ntt-256/ntt-256.mk | 18 + tests/ntt-256/poly.c | 1 + tests/ntt-256/poly.h | 1 + tests/{ntt_384 => ntt-384}/main.c | 2 + tests/ntt-384/misc.c | 1 + tests/ntt-384/misc.h | 1 + tests/{ntt_384 => ntt-384}/montgomery.h | 0 tests/{ntt_384 => ntt-384}/montgomery.s | 0 tests/{ntt_384 => ntt-384}/montgomery_const.h | 0 tests/ntt-384/ntt-384.mk | 39 + tests/{ntt_384 => ntt-384}/ntt_const.h | 0 tests/ntt-384/poly.c | 1 + tests/ntt-384/poly.h | 1 + tests/{ntt_512 => ntt-512}/main.c | 2 + tests/ntt-512/misc.c | 1 + tests/ntt-512/misc.h | 1 + tests/ntt-512/ntt-512.mk | 19 + tests/ntt-512/poly.c | 1 + tests/ntt-512/poly.h | 1 + tests/{ntt_768 => ntt-768}/main.c | 2 + tests/ntt-768/misc.c | 1 + tests/ntt-768/misc.h | 1 + tests/{ntt_768 => ntt-768}/montgomery.s | 0 tests/{ntt_768 => ntt-768}/montgomery_const.h | 0 tests/ntt-768/ntt-768.mk | 24 + tests/ntt-768/poly.c | 1 + tests/ntt-768/poly.h | 1 + tests/{ntt_n256 => ntt-n256}/main.c | 4 +- tests/ntt-n256/misc.c | 1 + tests/ntt-n256/misc.h | 1 + tests/ntt-n256/ntt-n256.mk | 
21 + tests/ntt-n256/poly.c | 1 + tests/ntt-n256/poly.h | 1 + .../inv_ntt_u32_33556993_28678040_complete.s | 3468 ---- ...inv_ntt_u32_33556993_28678040_incomplete.s | 2535 --- .../auto/ntt_u32_33556993_28678040_complete.s | 2915 ---- .../ntt_u32_33556993_28678040_incomplete.s | 2035 --- ..._u32_33556993_28678040_incomplete_double.s | 2342 --- .../ntt_1024_u32_33564673_286215_complete.s | 13759 ---------------- .../ntt_1024_u32_33564673_286215_incomplete.s | 10303 ------------ ...24_u32_33564673_286215_incomplete_bitrev.s | 10303 ------------ ...64673_286215_incomplete_bitrev_skipfirst.s | 9471 ----------- ...24_u32_33564673_286215_incomplete_double.s | 11458 ------------- ...2_33564673_286215_incomplete_double_rev4.s | 11463 ------------- ...1024_u32_33564673_286215_incomplete_rev4.s | 10308 ------------ ...u32_33564673_286215_incomplete_skipfirst.s | 9471 ----------- ...2_u32_106117153_62524596_incomplete_good.s | 1390 -- ...06117153_62524596_incomplete_good_bitrev.s | 1285 -- ...92_u32_108643009_1793055_incomplete_good.s | 1390 -- ...108643009_1793055_incomplete_good_bitrev.s | 1285 -- ...9_1793055_incomplete_good_oop_half_input.s | 1237 -- ..._u32_114826273_107284677_incomplete_good.s | 1390 -- ...4826273_107284677_incomplete_good_bitrev.s | 1285 -- ..._114826273_107284677_incomplete_good_oop.s | 1395 -- ...107284677_incomplete_good_oop_half_input.s | 1237 -- ..._u32_128919937_120423310_incomplete_good.s | 1390 -- ...8919937_120423310_incomplete_good_bitrev.s | 1285 -- ..._128919937_120423310_incomplete_good_oop.s | 1395 -- ...120423310_incomplete_good_oop_half_input.s | 1237 -- ...92_u32_33556993_27792935_incomplete_good.s | 1390 -- ...33556993_27792935_incomplete_good_bitrev.s | 1285 -- ...92_u32_45387457_16877098_incomplete_good.s | 1390 -- ...45387457_16877098_incomplete_good_bitrev.s | 1285 -- ...192_u32_88299073_9670361_incomplete_good.s | 1390 -- ..._88299073_9670361_incomplete_good_bitrev.s | 1285 -- ...3_9670361_incomplete_good_oop_half_input.s | 1237 -- 
.../ntt_256_u32_33556993_26036764_complete.s | 2907 ---- ...ntt_256_u32_33556993_26036764_incomplete.s | 2027 --- ...92_u32_33556993_27792935_incomplete_good.s | 1484 -- ...33556993_27792935_incomplete_good_bitrev.s | 1383 -- ...84_u32_106117153_1392340_incomplete_good.s | 3383 ---- ...106117153_1392340_incomplete_good_bitrev.s | 3182 ---- ...384_u32_108643009_640922_incomplete_good.s | 3383 ---- ..._108643009_640922_incomplete_good_bitrev.s | 3182 ---- ...u32_108643009_640922_incomplete_good_oop.s | 3388 ---- ...09_640922_incomplete_good_oop_half_input.s | 3075 ---- ...84_u32_114826273_2551686_incomplete_good.s | 3383 ---- ...114826273_2551686_incomplete_good_bitrev.s | 3182 ---- ...32_114826273_2551686_incomplete_good_oop.s | 3388 ---- ...3_2551686_incomplete_good_oop_half_input.s | 3075 ---- ...84_u32_128919937_4666088_incomplete_good.s | 3383 ---- ...128919937_4666088_incomplete_good_bitrev.s | 3182 ---- ...32_128919937_4666088_incomplete_good_oop.s | 3388 ---- ...7_4666088_incomplete_good_oop_half_input.s | 3075 ---- ...84_u32_33556993_15047299_incomplete_good.s | 3383 ---- ...33556993_15047299_incomplete_good_bitrev.s | 3182 ---- ..._384_u32_45387457_923104_incomplete_good.s | 3383 ---- ...2_45387457_923104_incomplete_good_bitrev.s | 3182 ---- ...384_u32_88299073_4883425_incomplete_good.s | 3383 ---- ..._88299073_4883425_incomplete_good_bitrev.s | 3182 ---- ...u32_88299073_4883425_incomplete_good_oop.s | 3388 ---- ...3_4883425_incomplete_good_oop_half_input.s | 3075 ---- .../ntt_512_u32_33564673_21224105_complete.s | 5720 ------- ...ntt_512_u32_33564673_21224105_incomplete.s | 3992 ----- ..._u32_33564673_21224105_incomplete_double.s | 4571 ----- .../ntt_768_u32_33556993_299353_incomplete.s | 7499 --------- ...68_u32_33556993_299353_incomplete_bitrev.s | 7499 --------- ...68_u32_33556993_299353_incomplete_double.s | 8366 ---------- ..._768_u32_33556993_299353_incomplete_good.s | 7130 -------- ...2_33556993_299353_incomplete_good_bitrev.s | 6737 -------- 
...2_33556993_299353_incomplete_good_double.s | 8101 --------- ..._768_u32_33556993_299353_incomplete_rev4.s | 7378 --------- ..._ntt_n256_u32_33556993_28678040_complete.s | 3394 ---- ...tt_n256_u32_33556993_28678040_incomplete.s | 2526 --- .../ntt_n256_u32_33556993_28678040_complete.s | 2889 ---- ...tt_n256_u32_33556993_28678040_incomplete.s | 2025 --- ..._u32_33556993_28678040_incomplete_double.s | 2316 --- .../manual/intt_n256_l6_s32_twiddles.s | 150 - .../manual/intt_n256_l8_s32_twiddles.s | 534 - ...vntt_n256_u32_33556993_28678040_complete.s | 1352 -- ...tt_n256_u32_33556993_28678040_incomplete.s | 685 - .../manual/ntt_n256_l6_s32_twiddles.s | 150 - .../manual/ntt_n256_l8_s32_twiddles.s | 534 - .../ntt_n256_u32_33556993_28678040_complete.s | 1226 -- ...tt_n256_u32_33556993_28678040_incomplete.s | 659 - tests/permute/main.c | 1 + tests/permute/misc.c | 1 + tests/permute/misc.h | 1 + tests/permute/permute.mk | 18 + ..._ntt_n256_u32_33556993_28678040_complete.s | 3394 ---- ...tt_n256_u32_33556993_28678040_incomplete.s | 2526 --- .../ntt_n256_u32_33556993_28678040_complete.s | 2889 ---- ...tt_n256_u32_33556993_28678040_incomplete.s | 2025 --- ..._u32_33556993_28678040_incomplete_double.s | 2316 --- .../poly_u16_mul_16_anticyclic_mve_simd.s | 108 - .../poly_u16_mul_16_anticyclic_opt_mve_simd.s | 425 - tests/poly/auto/poly_u16_mul_16_mve_simd.s | 178 - tests/poly/auto/poly_u16_mul_256_toom4_mve.s | 1287 -- ...mul_32_anticyclic_acc_karatsuba_mve_simd.s | 773 - ...mul_32_anticyclic_karatsuba_fwd_mve_simd.s | 268 - ...u16_mul_32_anticyclic_karatsuba_mve_simd.s | 749 - .../poly_u16_mul_32_anticyclic_mve_simd.s | 274 - .../poly_u16_mul_32_anticyclic_opt_mve_simd.s | 274 - tests/poly/auto/poly_u16_mul_32_mve_simd.s | 386 - tests/poly/auto/poly_u16_mul_512_toom4_mve.s | 2501 --- tests/poly/auto/poly_u16_mul_64_toom4_mve.s | 379 - tests/poly/auto/poly_u16_mul_768_toom4_mve.s | 3759 ----- tests/poly/auto/poly_u16_mul_832_toom4_mve.s | 4065 ----- 
tests/poly/auto/poly_u16_toom4_fwd_256.s | 182 - .../auto/poly_u16_toom4_fwd_256_dual_bottom.s | 198 - ...d_256_dual_packed_limbs_karatsuba_x1_oop.s | 198 - ...d_256_dual_packed_limbs_karatsuba_x2_oop.s | 199 - .../poly_u16_toom4_fwd_256_dual_packed_oop.s | 198 - .../auto/poly_u16_toom4_fwd_256_dual_top.s | 198 - .../poly_u16_toom4_fwd_256_dual_top_oop.s | 198 - tests/poly/auto/poly_u16_toom4_fwd_512.s | 351 - tests/poly/auto/poly_u16_toom4_fwd_768.s | 520 - tests/poly/auto/poly_u16_toom4_fwd_832.s | 562 - .../poly_u16_toom4_fwd_karatsuba_x1_oop_256.s | 199 - .../poly_u16_toom4_fwd_karatsuba_x2_oop_256.s | 200 - tests/poly/auto/poly_u16_toom4_fwd_oop_256.s | 199 - .../auto/poly_u16_toom4_inv_dual_bottom_256.s | 381 - .../poly_u16_toom4_inv_dual_bottom_oop_256.s | 380 - .../auto/poly_u16_toom4_inv_dual_top_256.s | 381 - .../poly_u16_toom4_inv_dual_top_oop_256.s | 380 - tests/poly/auto/poly_u16_toom4_inv_full_256.s | 765 - tests/poly/auto/poly_u16_toom4_inv_full_512.s | 1511 -- tests/poly/auto/poly_u16_toom4_inv_full_768.s | 2303 --- tests/poly/auto/poly_u16_toom4_inv_full_832.s | 2493 --- tests/poly/auto/poly_u16_toom4_inv_half_256.s | 340 - tests/poly/auto/poly_u16_toom4_inv_half_512.s | 661 - tests/poly/auto/poly_u16_toom4_inv_half_768.s | 982 -- tests/poly/auto/poly_u16_toom4_inv_half_832.s | 1062 -- tests/poly/{manual => }/karatsuba.h | 0 tests/poly/{manual => }/karatsuba.s | 0 tests/poly/{manual => }/karatsuba_const.h | 0 tests/poly/main.c | 4 + tests/poly/misc.c | 1 + tests/poly/misc.h | 1 + tests/poly/{manual => }/montgomery.h | 0 tests/poly/{manual => }/montgomery.s | 0 tests/poly/{manual => }/montgomery_const.h | 0 tests/poly/poly.c | 1 + tests/poly/poly.h | 1 + tests/poly/poly.mk | 27 + tests/poly/{manual => }/poly_u16_32.s | 0 tests/poly/{manual => }/poly_u16_32_acc.s | 0 ..._ntt_n256_u32_33556993_28678040_complete.s | 3396 ---- ...tt_n256_u32_33556993_28678040_incomplete.s | 2527 --- .../inv_ntt_u32_33556993_28678040_complete.s | 3468 ---- 
.../ntt_n256_u32_33556993_28678040_complete.s | 2907 ---- ...tt_n256_u32_33556993_28678040_incomplete.s | 2027 --- ..._u32_33556993_28678040_incomplete_double.s | 2334 --- .../auto/ntt_u32_33556993_28678040_complete.s | 2915 ---- .../poly_u16_mul_16_anticyclic_mve_simd.s | 108 - .../poly_u16_mul_16_anticyclic_opt_mve_simd.s | 425 - tests/saber/auto/poly_u16_mul_16_mve_simd.s | 178 - tests/saber/auto/poly_u16_mul_256_toom4_mve.s | 1287 -- ...mul_32_anticyclic_acc_karatsuba_mve_simd.s | 773 - ...mul_32_anticyclic_karatsuba_fwd_mve_simd.s | 268 - ...u16_mul_32_anticyclic_karatsuba_mve_simd.s | 749 - .../poly_u16_mul_32_anticyclic_mve_simd.s | 274 - .../poly_u16_mul_32_anticyclic_opt_mve_simd.s | 274 - tests/saber/auto/poly_u16_mul_32_mve_simd.s | 386 - tests/saber/auto/poly_u16_mul_512_toom4_mve.s | 2501 --- tests/saber/auto/poly_u16_mul_64_toom4_mve.s | 379 - tests/saber/auto/poly_u16_mul_768_toom4_mve.s | 3759 ----- tests/saber/auto/poly_u16_mul_832_toom4_mve.s | 4065 ----- tests/saber/auto/poly_u16_toom4_fwd_256.s | 182 - .../auto/poly_u16_toom4_fwd_256_dual_bottom.s | 198 - ...d_256_dual_packed_limbs_karatsuba_x1_oop.s | 198 - ...d_256_dual_packed_limbs_karatsuba_x2_oop.s | 199 - ..._u16_toom4_fwd_256_dual_packed_limbs_oop.s | 198 - .../poly_u16_toom4_fwd_256_dual_packed_oop.s | 198 - .../auto/poly_u16_toom4_fwd_256_dual_top.s | 198 - .../poly_u16_toom4_fwd_256_dual_top_oop.s | 198 - tests/saber/auto/poly_u16_toom4_fwd_512.s | 351 - tests/saber/auto/poly_u16_toom4_fwd_768.s | 520 - tests/saber/auto/poly_u16_toom4_fwd_832.s | 562 - .../poly_u16_toom4_fwd_karatsuba_x1_oop_256.s | 199 - .../poly_u16_toom4_fwd_karatsuba_x2_oop_256.s | 200 - tests/saber/auto/poly_u16_toom4_fwd_oop_256.s | 199 - .../auto/poly_u16_toom4_inv_dual_bottom_256.s | 381 - .../poly_u16_toom4_inv_dual_bottom_oop_256.s | 380 - ..._u16_toom4_inv_dual_packed_limbs_oop_256.s | 380 - .../auto/poly_u16_toom4_inv_dual_top_256.s | 381 - .../poly_u16_toom4_inv_dual_top_oop_256.s | 380 - 
.../saber/auto/poly_u16_toom4_inv_full_256.s | 765 - .../saber/auto/poly_u16_toom4_inv_full_512.s | 1511 -- .../saber/auto/poly_u16_toom4_inv_full_768.s | 2303 --- .../saber/auto/poly_u16_toom4_inv_full_832.s | 2493 --- .../saber/auto/poly_u16_toom4_inv_half_256.s | 340 - .../saber/auto/poly_u16_toom4_inv_half_512.s | 661 - .../saber/auto/poly_u16_toom4_inv_half_768.s | 982 -- .../saber/auto/poly_u16_toom4_inv_half_832.s | 1062 -- tests/saber/{manual => }/karatsuba.h | 0 tests/saber/{manual => }/karatsuba.s | 0 tests/saber/{manual => }/karatsuba_const.h | 0 tests/saber/misc.c | 1 + tests/saber/misc.h | 1 + tests/saber/{manual => }/montgomery.h | 0 tests/saber/{manual => }/montgomery.s | 0 tests/saber/{manual => }/montgomery_const.h | 0 tests/saber/poly.c | 1 + tests/saber/poly.h | 1 + tests/saber/rng.c | 0 tests/saber/rng.h | 1 + tests/saber/saber.mk | 28 + .../auto/poly_u16_mul_128_mve_comba.s | 6664 -------- .../auto/poly_u16_mul_128_mve_schoolbook.s | 8713 ---------- .../poly_u16_mul_16_anticyclic_mve_simd.s | 108 - .../poly_u16_mul_16_anticyclic_opt_mve_simd.s | 425 - .../auto/poly_u16_mul_16_mve_comba.s | 140 - .../auto/poly_u16_mul_16_mve_schoolbook.s | 169 - .../auto/poly_u16_mul_16_mve_simd.s | 178 - ...mul_32_anticyclic_acc_karatsuba_mve_simd.s | 773 - ...u16_mul_32_anticyclic_karatsuba_mve_simd.s | 749 - .../poly_u16_mul_32_anticyclic_opt_mve_simd.s | 274 - .../auto/poly_u16_mul_32_mve_comba.s | 446 - .../auto/poly_u16_mul_32_mve_schoolbook.s | 593 - .../auto/poly_u16_mul_64_mve_comba.s | 1760 -- .../auto/poly_u16_mul_64_mve_schoolbook.s | 2241 --- tests/schoolbook/main.c | 6 +- tests/schoolbook/misc.c | 1 + tests/schoolbook/misc.h | 1 + tests/schoolbook/poly.c | 1 + tests/schoolbook/poly.h | 1 + tests/schoolbook/poly_u16_32.s | 1050 -- .../poly_u16_mul_16_anticyclic_opt_mve_simd.s | 423 - ..._32_anticyclic_karatsuba_mve_simd_manual.s | 276 - ...nticyclic_karatsuba_mve_simd_manual_loop.s | 747 - tests/schoolbook/schoolbook.mk | 19 + tests/sqmag/main.c | 3 
+- tests/sqmag/misc.c | 1 + tests/sqmag/misc.h | 1 + tests/sqmag/sqmag.mk | 21 + tests/toom/main.c | 42 +- tests/toom/misc.c | 1 + tests/toom/misc.h | 1 + tests/toom/poly.c | 1 + tests/toom/poly.h | 1 + tests/toom/toom.mk | 58 + tests/transpose/main.c | 2 +- tests/transpose/misc.c | 1 + tests/transpose/misc.h | 1 + tests/transpose/transpose.mk | 15 + tools/README.md | 3 - 343 files changed, 684 insertions(+), 411334 deletions(-) create mode 100644 tests/ct/ct.mk rename tests/{flt_fft => flt-fft}/base_concrete.s.old (100%) rename tests/{flt_fft => flt-fft}/base_ref.s (100%) rename tests/{flt_fft => flt-fft}/floatingpoint_radix4_fft_opt_M55.s (100%) rename tests/{flt_fft => flt-fft}/floatingpoint_radix4_fft_opt_M85.s (100%) create mode 100644 tests/flt-fft/flt-fft.mk rename tests/{flt_fft => flt-fft}/flt_fft.h (100%) rename tests/{flt_fft => flt-fft}/main.c (94%) create mode 120000 tests/flt-fft/misc.c create mode 120000 tests/flt-fft/misc.h rename tests/{fx_fft => fx-fft}/base_concrete.s (100%) rename tests/{fx_fft => fx-fft}/base_symbolic.s (100%) rename tests/{fx_fft => fx-fft}/fixedpoint_radix4_fft_opt_M55.s (100%) rename tests/{fx_fft => fx-fft}/fixedpoint_radix4_fft_opt_M85.s (100%) create mode 100644 tests/fx-fft/fx-fft.mk rename tests/{fx_fft => fx-fft}/fx_fft.h (100%) rename tests/{fx_fft => fx-fft}/main.c (93%) create mode 120000 tests/fx-fft/misc.c create mode 120000 tests/fx-fft/misc.h rename tests/{fx_fft => fx-fft}/ref_handwritten_asm.s (100%) rename tests/{fx_fft => fx-fft}/ref_intrinsics.s (100%) create mode 100644 tests/helloworld/helloworld.mk create mode 120000 tests/helloworld/misc.c create mode 120000 tests/helloworld/misc.h create mode 100644 tests/intmulntt/intmulntt.mk create mode 120000 tests/intmulntt/misc.c create mode 120000 tests/intmulntt/misc.h create mode 120000 tests/intmulntt/poly.c create mode 120000 tests/intmulntt/poly.h create mode 100644 tests/karatsuba/karatsuba.mk create mode 120000 tests/karatsuba/misc.c create mode 120000 
tests/karatsuba/misc.h create mode 120000 tests/karatsuba/poly.h create mode 120000 tests/montgomery/misc.c create mode 120000 tests/montgomery/misc.h create mode 100644 tests/montgomery/montgomery.mk create mode 120000 tests/montgomery/poly.c create mode 120000 tests/montgomery/poly.h rename tests/{ntt_1024 => ntt-1024}/main.c (100%) create mode 120000 tests/ntt-1024/misc.c create mode 120000 tests/ntt-1024/misc.h rename tests/{ntt_1024 => ntt-1024}/montgomery.h (100%) rename tests/{ntt_1024 => ntt-1024}/montgomery.s (100%) rename tests/{ntt_1024 => ntt-1024}/montgomery_const.h (100%) create mode 100644 tests/ntt-1024/ntt-1024.mk create mode 120000 tests/ntt-1024/poly.c create mode 120000 tests/ntt-1024/poly.h rename tests/{ntt_192 => ntt-192}/main.c (99%) create mode 120000 tests/ntt-192/misc.c create mode 120000 tests/ntt-192/misc.h rename tests/{ntt_192 => ntt-192}/montgomery.h (100%) rename tests/{ntt_192 => ntt-192}/montgomery.s (100%) rename tests/{ntt_192 => ntt-192}/montgomery_const.h (100%) create mode 100644 tests/ntt-192/ntt-192.mk rename tests/{ntt_192 => ntt-192}/ntt_const.h (100%) create mode 120000 tests/ntt-192/poly.c create mode 120000 tests/ntt-192/poly.h rename tests/{ntt_256 => ntt-256}/main.c (99%) create mode 120000 tests/ntt-256/misc.c create mode 120000 tests/ntt-256/misc.h create mode 100644 tests/ntt-256/ntt-256.mk create mode 120000 tests/ntt-256/poly.c create mode 120000 tests/ntt-256/poly.h rename tests/{ntt_384 => ntt-384}/main.c (99%) create mode 120000 tests/ntt-384/misc.c create mode 120000 tests/ntt-384/misc.h rename tests/{ntt_384 => ntt-384}/montgomery.h (100%) rename tests/{ntt_384 => ntt-384}/montgomery.s (100%) rename tests/{ntt_384 => ntt-384}/montgomery_const.h (100%) create mode 100644 tests/ntt-384/ntt-384.mk rename tests/{ntt_384 => ntt-384}/ntt_const.h (100%) create mode 120000 tests/ntt-384/poly.c create mode 120000 tests/ntt-384/poly.h rename tests/{ntt_512 => ntt-512}/main.c (99%) create mode 120000 
tests/ntt-512/misc.c create mode 120000 tests/ntt-512/misc.h create mode 100644 tests/ntt-512/ntt-512.mk create mode 120000 tests/ntt-512/poly.c create mode 120000 tests/ntt-512/poly.h rename tests/{ntt_768 => ntt-768}/main.c (99%) create mode 120000 tests/ntt-768/misc.c create mode 120000 tests/ntt-768/misc.h rename tests/{ntt_768 => ntt-768}/montgomery.s (100%) rename tests/{ntt_768 => ntt-768}/montgomery_const.h (100%) create mode 100644 tests/ntt-768/ntt-768.mk create mode 120000 tests/ntt-768/poly.c create mode 120000 tests/ntt-768/poly.h rename tests/{ntt_n256 => ntt-n256}/main.c (99%) create mode 120000 tests/ntt-n256/misc.c create mode 120000 tests/ntt-n256/misc.h create mode 100644 tests/ntt-n256/ntt-n256.mk create mode 120000 tests/ntt-n256/poly.c create mode 120000 tests/ntt-n256/poly.h delete mode 100644 tests/ntt/auto/inv_ntt_u32_33556993_28678040_complete.s delete mode 100644 tests/ntt/auto/inv_ntt_u32_33556993_28678040_incomplete.s delete mode 100644 tests/ntt/auto/ntt_u32_33556993_28678040_complete.s delete mode 100644 tests/ntt/auto/ntt_u32_33556993_28678040_incomplete.s delete mode 100644 tests/ntt/auto/ntt_u32_33556993_28678040_incomplete_double.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_complete.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double_rev4.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_rev4.s delete mode 100644 tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_skipfirst.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good.s delete mode 100644 
tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s delete mode 100644 tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_256/auto/ntt_256_u32_33556993_26036764_complete.s delete mode 100644 tests/ntt_256/auto/ntt_256_u32_33556993_26036764_incomplete.s delete mode 100644 tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good.s delete 
mode 100644 tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s delete mode 100644 tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop.s delete 
mode 100644 tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s delete mode 100644 tests/ntt_512/auto/ntt_512_u32_33564673_21224105_complete.s delete mode 100644 tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete.s delete mode 100644 tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete_double.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_bitrev.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_double.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_bitrev.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_double.s delete mode 100644 tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_rev4.s delete mode 100644 tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_complete.s delete mode 100644 tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s delete mode 100644 tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_complete.s delete mode 100644 tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete.s delete mode 100644 tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s delete mode 100644 tests/ntt_n256/manual/intt_n256_l6_s32_twiddles.s delete mode 100644 tests/ntt_n256/manual/intt_n256_l8_s32_twiddles.s delete mode 100644 tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_complete.s delete mode 100644 tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_incomplete.s delete mode 100644 tests/ntt_n256/manual/ntt_n256_l6_s32_twiddles.s delete mode 100644 tests/ntt_n256/manual/ntt_n256_l8_s32_twiddles.s delete mode 100644 tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_complete.s delete mode 100644 tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_incomplete.s create mode 
120000 tests/permute/misc.c create mode 120000 tests/permute/misc.h create mode 100644 tests/permute/permute.mk delete mode 100644 tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_complete.s delete mode 100644 tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s delete mode 100644 tests/poly/auto/ntt_n256_u32_33556993_28678040_complete.s delete mode 100644 tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete.s delete mode 100644 tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s delete mode 100644 tests/poly/auto/poly_u16_mul_16_anticyclic_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_16_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_256_toom4_mve.s delete mode 100644 tests/poly/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_32_anticyclic_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_32_mve_simd.s delete mode 100644 tests/poly/auto/poly_u16_mul_512_toom4_mve.s delete mode 100644 tests/poly/auto/poly_u16_mul_64_toom4_mve.s delete mode 100644 tests/poly/auto/poly_u16_mul_768_toom4_mve.s delete mode 100644 tests/poly/auto/poly_u16_mul_832_toom4_mve.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_bottom.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_top.s delete mode 
100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_top_oop.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_512.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_768.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_832.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_oop_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_dual_bottom_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_dual_top_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_dual_top_oop_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_full_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_full_512.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_full_768.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_full_832.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_half_256.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_half_512.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_half_768.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_half_832.s rename tests/poly/{manual => }/karatsuba.h (100%) rename tests/poly/{manual => }/karatsuba.s (100%) rename tests/poly/{manual => }/karatsuba_const.h (100%) create mode 120000 tests/poly/misc.c create mode 120000 tests/poly/misc.h rename tests/poly/{manual => }/montgomery.h (100%) rename tests/poly/{manual => }/montgomery.s (100%) rename tests/poly/{manual => }/montgomery_const.h (100%) create mode 120000 tests/poly/poly.c create mode 120000 tests/poly/poly.h create mode 100644 tests/poly/poly.mk rename tests/poly/{manual => }/poly_u16_32.s (100%) rename tests/poly/{manual => }/poly_u16_32_acc.s (100%) delete mode 100644 tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_complete.s delete mode 100644 
tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s
 delete mode 100644 tests/saber/auto/inv_ntt_u32_33556993_28678040_complete.s
 delete mode 100644 tests/saber/auto/ntt_n256_u32_33556993_28678040_complete.s
 delete mode 100644 tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete.s
 delete mode 100644 tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s
 delete mode 100644 tests/saber/auto/ntt_u32_33556993_28678040_complete.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_16_anticyclic_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_16_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_256_toom4_mve.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_32_anticyclic_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_32_mve_simd.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_512_toom4_mve.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_64_toom4_mve.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_768_toom4_mve.s
 delete mode 100644 tests/saber/auto/poly_u16_mul_832_toom4_mve.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_bottom.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_top.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_256_dual_top_oop.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_512.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_768.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_832.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_fwd_oop_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_dual_bottom_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_dual_top_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_dual_top_oop_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_full_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_full_512.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_full_768.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_full_832.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_half_256.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_half_512.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_half_768.s
 delete mode 100644 tests/saber/auto/poly_u16_toom4_inv_half_832.s
 rename tests/saber/{manual => }/karatsuba.h (100%)
 rename tests/saber/{manual => }/karatsuba.s (100%)
 rename tests/saber/{manual => }/karatsuba_const.h (100%)
 create mode 120000 tests/saber/misc.c
 create mode 120000 tests/saber/misc.h
 rename tests/saber/{manual => }/montgomery.h (100%)
 rename tests/saber/{manual => }/montgomery.s (100%)
 rename tests/saber/{manual => }/montgomery_const.h (100%)
 create mode 120000 tests/saber/poly.c
 create mode 120000 tests/saber/poly.h
 mode change 100755 => 100644 tests/saber/rng.c
 mode change 100755 => 100644 tests/saber/rng.h
 create mode 100644 tests/saber/saber.mk
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_128_mve_comba.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_128_mve_schoolbook.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_16_anticyclic_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_16_mve_comba.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_16_mve_schoolbook.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_16_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_mve_comba.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_mve_schoolbook.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_64_mve_comba.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_64_mve_schoolbook.s
 create mode 120000 tests/schoolbook/misc.c
 create mode 120000 tests/schoolbook/misc.h
 create mode 120000 tests/schoolbook/poly.c
 create mode 120000 tests/schoolbook/poly.h
 delete mode 100644 tests/schoolbook/poly_u16_32.s
 delete mode 100644 tests/schoolbook/poly_u16_mul_16_anticyclic_opt_mve_simd.s
 delete mode 100644 tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual.s
 delete mode 100644 tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop.s
 create mode 100644 tests/schoolbook/schoolbook.mk
 create mode 120000 tests/sqmag/misc.c
 create mode 120000 tests/sqmag/misc.h
 create mode 100644 tests/sqmag/sqmag.mk
 create mode 120000 tests/toom/misc.c
 create mode 120000 tests/toom/misc.h
 create mode 120000 tests/toom/poly.c
 create mode 120000 tests/toom/poly.h
 create mode 100644 tests/toom/toom.mk
 create mode 120000 tests/transpose/misc.c
 create mode 120000 tests/transpose/misc.h
 create mode 100644 tests/transpose/transpose.mk
 delete mode 100644 tools/README.md

diff --git a/Makefile b/Makefile
index 041a3b1..41fdbdf 100644
--- a/Makefile
+++ b/Makefile
@@ -1,8 +1,31 @@
 # Tests
-#include tests/chunk/chunk.mk # TODO: this test is failing
+# TODO: commented out tests are failing; need to look into it
+
+#include tests/chunk/chunk.mk
 include tests/crt/crt.mk
+include tests/ct/ct.mk
+include tests/flt-fft/flt-fft.mk
+include tests/fx-fft/fx-fft.mk
+include tests/helloworld/helloworld.mk
+include tests/intmulntt/intmulntt.mk
+include tests/karatsuba/karatsuba.mk
+#include tests/montgomery/montgomery.mk
+include tests/ntt-192/ntt-192.mk
+include tests/ntt-256/ntt-256.mk
+include tests/ntt-384/ntt-384.mk
+include tests/ntt-512/ntt-512.mk
+include tests/ntt-768/ntt-768.mk
+#include tests/ntt-1024/ntt-1024.mk
+include tests/ntt-n256/ntt-n256.mk
 include tests/ntt-dilithium/ntt-dilithium.mk
 include tests/ntt-kyber/ntt-kyber.mk
+include tests/permute/permute.mk
+include tests/poly/poly.mk
+#include tests/saber/saber.mk
+#include tests/schoolbook/schoolbook.mk
+include tests/sqmag/sqmag.mk
+include tests/toom/toom.mk
+include tests/transpose/transpose.mk
 
 testname = $(shell echo $(1) | tr '[a-z]' '[A-Z]' | tr '-' '_')
 testdir = $(addprefix $(2),tests/$(1)/)
diff --git a/tests/ct/ct.mk b/tests/ct/ct.mk
new file mode 100644
index 0000000..6b22e34
--- /dev/null
+++ b/tests/ct/ct.mk
@@ -0,0 +1,15 @@
+# Test name - needs to match the directory name
+TESTS += ct
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+CT_PLATFORMS += m55-an547
+CT_PLATFORMS += m85-an555
+
+# C sources required for this test
+CT_SOURCES += main.c
+
+# Assembly sources required for this test
+CT_ASMS += ct.s
+
diff --git a/tests/ct/main.c b/tests/ct/main.c
index 48285de..b89c8d9 100644
--- a/tests/ct/main.c
+++ b/tests/ct/main.c
@@ -23,7 +23,6 @@
  */
 
 #include
-#include
 #include
 #include
@@ -84,5 +83,9 @@ int main(void)
     ret |= test_ct_table_lookup();
 #endif /* TEST_CT_LOOKUP */
 
+    if (ret == 0){
+        debug_printf( "ALL GOOD!\n" );
+    }
+
     return( 0 );
 }
diff --git a/tests/flt_fft/base_concrete.s.old b/tests/flt-fft/base_concrete.s.old
similarity index 100%
rename from tests/flt_fft/base_concrete.s.old
rename to tests/flt-fft/base_concrete.s.old
diff --git a/tests/flt_fft/base_ref.s b/tests/flt-fft/base_ref.s
similarity index 100%
rename from tests/flt_fft/base_ref.s
rename to tests/flt-fft/base_ref.s
diff --git a/tests/flt_fft/floatingpoint_radix4_fft_opt_M55.s b/tests/flt-fft/floatingpoint_radix4_fft_opt_M55.s
similarity index 100%
rename from tests/flt_fft/floatingpoint_radix4_fft_opt_M55.s
rename to tests/flt-fft/floatingpoint_radix4_fft_opt_M55.s
diff --git a/tests/flt_fft/floatingpoint_radix4_fft_opt_M85.s b/tests/flt-fft/floatingpoint_radix4_fft_opt_M85.s
similarity index 100%
rename from tests/flt_fft/floatingpoint_radix4_fft_opt_M85.s
rename to tests/flt-fft/floatingpoint_radix4_fft_opt_M85.s
diff --git a/tests/flt-fft/flt-fft.mk b/tests/flt-fft/flt-fft.mk
new file mode 100644
index 0000000..e4a80dc
--- /dev/null
+++ b/tests/flt-fft/flt-fft.mk
@@ -0,0 +1,18 @@
+# Test name - needs to match the directory name
+TESTS += flt-fft
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+FLT_FFT_PLATFORMS += m55-an547
+FLT_FFT_PLATFORMS += m85-an555
+
+# C sources required for this test
+FLT_FFT_SOURCES += main.c
+FLT_FFT_SOURCES += misc.c
+
+# Assembly sources required for this test
+FLT_FFT_ASMS += base_ref.s
+FLT_FFT_ASMS += floatingpoint_radix4_fft_opt_M55.s
+FLT_FFT_ASMS += floatingpoint_radix4_fft_opt_M85.s
+
diff --git a/tests/flt_fft/flt_fft.h b/tests/flt-fft/flt_fft.h
similarity index 100%
rename from tests/flt_fft/flt_fft.h
rename to tests/flt-fft/flt_fft.h
diff --git a/tests/flt_fft/main.c b/tests/flt-fft/main.c
similarity index 94%
rename from tests/flt_fft/main.c
rename to tests/flt-fft/main.c
index e80d279..a5e95fa 100644
--- a/tests/flt_fft/main.c
+++ b/tests/flt-fft/main.c
@@ -54,6 +54,7 @@ void hal_pmu_send_stats( char *s, pmu_stats const *stats );
             debug_print_buf_s32( src, SIZE, "src" ); \
             debug_print_buf_s32( res, SIZE, "res" ); \
             debug_printf( "FUNCTIONAL ERROR!\n" );   \
+            return 1;                                \
         }                                            \
     } while( 0 )
 
@@ -89,8 +90,13 @@ int main(void)
 {
     hal_pmu_enable();
     debug_printf( "FLT FFT test!\n" );
-    bench_fft();
+    int ret = bench_fft();
     debug_printf( "Done!\n:" );
+
+    if(ret == 0){
+        debug_printf( "ALL GOOD!\n" );
+    }
+
     hal_pmu_disable();
     return( 0 );
 }
diff --git a/tests/flt-fft/misc.c b/tests/flt-fft/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/flt-fft/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/flt-fft/misc.h b/tests/flt-fft/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/flt-fft/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/fx_fft/base_concrete.s b/tests/fx-fft/base_concrete.s
similarity index 100%
rename from tests/fx_fft/base_concrete.s
rename to tests/fx-fft/base_concrete.s
diff --git a/tests/fx_fft/base_symbolic.s b/tests/fx-fft/base_symbolic.s
similarity index 100%
rename from tests/fx_fft/base_symbolic.s
rename to tests/fx-fft/base_symbolic.s
diff --git a/tests/fx_fft/fixedpoint_radix4_fft_opt_M55.s b/tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s
similarity index 100%
rename from tests/fx_fft/fixedpoint_radix4_fft_opt_M55.s
rename to tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s
diff --git a/tests/fx_fft/fixedpoint_radix4_fft_opt_M85.s b/tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s
similarity index 100%
rename from tests/fx_fft/fixedpoint_radix4_fft_opt_M85.s
rename to tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s
diff --git a/tests/fx-fft/fx-fft.mk b/tests/fx-fft/fx-fft.mk
new file mode 100644
index 0000000..793a1b2
--- /dev/null
+++ b/tests/fx-fft/fx-fft.mk
@@ -0,0 +1,22 @@
+# Test name - needs to match the directory name
+TESTS += fx-fft
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+FX_FFT_PLATFORMS += m55-an547
+FX_FFT_PLATFORMS += m85-an555
+
+# C sources required for this test
+FX_FFT_SOURCES += main.c
+FX_FFT_SOURCES += misc.c
+
+
+# Assembly sources required for this test
+FX_FFT_ASMS += base_concrete.s
+FX_FFT_ASMS += base_symbolic.s
+FX_FFT_ASMS += fixedpoint_radix4_fft_opt_M55.s
+FX_FFT_ASMS += fixedpoint_radix4_fft_opt_M85.s
+FX_FFT_ASMS += ref_handwritten_asm.s
+FX_FFT_ASMS += ref_intrinsics.s
+
diff --git a/tests/fx_fft/fx_fft.h b/tests/fx-fft/fx_fft.h
similarity index 100%
rename from tests/fx_fft/fx_fft.h
rename to tests/fx-fft/fx_fft.h
diff --git a/tests/fx_fft/main.c b/tests/fx-fft/main.c
similarity index 93%
rename from tests/fx_fft/main.c
rename to tests/fx-fft/main.c
index e963fca..96ce6e6 100644
--- a/tests/fx_fft/main.c
+++ b/tests/fx-fft/main.c
@@ -54,6 +54,7 @@ void hal_pmu_send_stats( char *s, pmu_stats const *stats );
             debug_print_buf_s32( src, SIZE, "src" ); \
             debug_print_buf_s32( res, SIZE, "res" ); \
             debug_printf( "FUNCTIONAL ERROR!\n" );   \
+            return 1;                                \
         }                                            \
     } while( 0 )
 
@@ -93,8 +94,11 @@ int main(void)
 {
     hal_pmu_enable();
     debug_printf( "FX FFT test!\n" );
-    bench_fft();
-    debug_printf( "Done!\n:" );
+    int ret = bench_fft();
+    debug_printf( "Done!\n" );
+    if(ret == 0){
+        debug_printf( "ALL GOOD!\n" );
+    }
     hal_pmu_disable();
     return( 0 );
 }
diff --git a/tests/fx-fft/misc.c b/tests/fx-fft/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/fx-fft/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/fx-fft/misc.h b/tests/fx-fft/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/fx-fft/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/fx_fft/ref_handwritten_asm.s b/tests/fx-fft/ref_handwritten_asm.s
similarity index 100%
rename from tests/fx_fft/ref_handwritten_asm.s
rename to tests/fx-fft/ref_handwritten_asm.s
diff --git a/tests/fx_fft/ref_intrinsics.s b/tests/fx-fft/ref_intrinsics.s
similarity index 100%
rename from tests/fx_fft/ref_intrinsics.s
rename to tests/fx-fft/ref_intrinsics.s
diff --git a/tests/helloworld/helloworld.mk b/tests/helloworld/helloworld.mk
new file mode 100644
index 0000000..f41021c
--- /dev/null
+++ b/tests/helloworld/helloworld.mk
@@ -0,0 +1,16 @@
+# Test name - needs to match the directory name
+TESTS += helloworld
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+HELLOWORLD_PLATFORMS += m55-an547
+HELLOWORLD_PLATFORMS += m85-an555
+
+# C sources required for this test
+HELLOWORLD_SOURCES += main.c
+HELLOWORLD_SOURCES += misc.c
+
+# Assembly sources required for this test
+HELLOWORLD_ASMS += mve_test.s
+
diff --git a/tests/helloworld/main.c b/tests/helloworld/main.c
index 3dec1ab..00b3a80 100644
--- a/tests/helloworld/main.c
+++ b/tests/helloworld/main.c
@@ -53,5 +53,6 @@
     }
 
     debug_test_ok();
+    debug_printf( "ALL GOOD!\n" );
     return( 0 );
 }
diff --git a/tests/helloworld/misc.c b/tests/helloworld/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/helloworld/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/helloworld/misc.h b/tests/helloworld/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/helloworld/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/intmulntt/intmulntt.mk b/tests/intmulntt/intmulntt.mk
new file mode 100644
index 0000000..c978e10
--- /dev/null
+++ b/tests/intmulntt/intmulntt.mk
@@ -0,0 +1,43 @@
+# Test name - needs to match the directory name
+TESTS += intmulntt
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+INTMULNTT_PLATFORMS += m55-an547
+INTMULNTT_PLATFORMS += m85-an555
+
+# C sources required for this test
+INTMULNTT_SOURCES += main.c
+INTMULNTT_SOURCES += misc.c
+INTMULNTT_SOURCES += poly.c
+
+# Assembly sources required for this test
+INTMULNTT_ASMS += crt.s
+INTMULNTT_ASMS += montgomery.s
+INTMULNTT_ASMS += ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_192_u32_33556993_27792935_incomplete_good.s
+INTMULNTT_ASMS += ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_192_u32_45387457_16877098_incomplete_good.s
+INTMULNTT_ASMS += ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s
+INTMULNTT_ASMS += ntt_192_u32_88299073_9670361_incomplete_good.s
+INTMULNTT_ASMS += ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_192_u32_106117153_62524596_incomplete_good.s
+INTMULNTT_ASMS += ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
+INTMULNTT_ASMS += ntt_192_u32_108643009_1793055_incomplete_good.s
+INTMULNTT_ASMS += ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_384_u32_33556993_15047299_incomplete_good.s
+INTMULNTT_ASMS += ntt_384_u32_45387457_923104_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_384_u32_45387457_923104_incomplete_good.s
+INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s
+INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good_oop.s
+INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good.s
+INTMULNTT_ASMS += ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_384_u32_106117153_1392340_incomplete_good.s
+INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good_bitrev.s
+INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s
+INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good_oop.s
+INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good.s
diff --git a/tests/intmulntt/main.c b/tests/intmulntt/main.c
index 666e170..2c961ec 100644
--- a/tests/intmulntt/main.c
+++ b/tests/intmulntt/main.c
@@ -583,5 +583,6 @@ int main(void)
         return( ret );
     }
 
+    debug_printf( "ALL GOOD!\n" );
     return( ret );
 }
diff --git a/tests/intmulntt/misc.c b/tests/intmulntt/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/intmulntt/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/intmulntt/misc.h b/tests/intmulntt/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/intmulntt/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/intmulntt/poly.c b/tests/intmulntt/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/intmulntt/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/intmulntt/poly.h b/tests/intmulntt/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/intmulntt/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/karatsuba/karatsuba.mk b/tests/karatsuba/karatsuba.mk
new file mode 100644
index 0000000..2abccb0
--- /dev/null
+++ b/tests/karatsuba/karatsuba.mk
@@ -0,0 +1,16 @@
+# Test name - needs to match the directory name
+TESTS += karatsuba
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+KARATSUBA_PLATFORMS += m55-an547
+KARATSUBA_PLATFORMS += m85-an555
+
+# C sources required for this test
+KARATSUBA_SOURCES += main.c
+KARATSUBA_SOURCES += misc.c
+
+# Assembly sources required for this test
+KARATSUBA_ASMS += karatsuba.s
+
diff --git a/tests/karatsuba/main.c b/tests/karatsuba/main.c
index 0541954..5a9a6ca 100644
--- a/tests/karatsuba/main.c
+++ b/tests/karatsuba/main.c
@@ -348,5 +348,9 @@ int main(void)
     ret |= test_karatsuba_fwd();
 #endif /* TEST_KARATSUBA_FWD */
 
+    if(ret == 0){
+        debug_printf( "ALL GOOD!\n" );
+    }
+
     return( ret );
 }
diff --git a/tests/karatsuba/misc.c b/tests/karatsuba/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/karatsuba/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/karatsuba/misc.h b/tests/karatsuba/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/karatsuba/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/karatsuba/poly.h b/tests/karatsuba/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/karatsuba/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/montgomery/main.c b/tests/montgomery/main.c
index c90d8dd..7063320 100644
--- a/tests/montgomery/main.c
+++ b/tests/montgomery/main.c
@@ -24,6 +24,7 @@
 
 #include
 #include
+#include
 #include
 #include
@@ -2181,5 +2182,9 @@ int main()
     ret |= test_montgomery_pt_u16_round();
 #endif /* TEST_MONTGOMERY_PT_U16_ROUND */
 
+    if(ret == 0){
+        debug_printf( "ALL GOOD!\n" );
+    }
+
     return( ret );
 }
diff --git a/tests/montgomery/misc.c b/tests/montgomery/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/montgomery/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/montgomery/misc.h b/tests/montgomery/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/montgomery/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/montgomery/montgomery.mk b/tests/montgomery/montgomery.mk
new file mode 100644
index 0000000..39dd4b4
--- /dev/null
+++ b/tests/montgomery/montgomery.mk
@@ -0,0 +1,17 @@
+# Test name - needs to match the directory name
+TESTS += montgomery
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+MONTGOMERY_PLATFORMS += m55-an547
+MONTGOMERY_PLATFORMS += m85-an555
+
+# C sources required for this test
+MONTGOMERY_SOURCES += main.c
+MONTGOMERY_SOURCES += misc.c
+MONTGOMERY_SOURCES += poly.c
+
+# Assembly sources required for this test
+MONTGOMERY_ASMS += montgomery.s
+
diff --git a/tests/montgomery/poly.c b/tests/montgomery/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/montgomery/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/montgomery/poly.h b/tests/montgomery/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/montgomery/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt_1024/main.c b/tests/ntt-1024/main.c
similarity index 100%
rename from tests/ntt_1024/main.c
rename to tests/ntt-1024/main.c
diff --git a/tests/ntt-1024/misc.c b/tests/ntt-1024/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/ntt-1024/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-1024/misc.h b/tests/ntt-1024/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/ntt-1024/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt_1024/montgomery.h b/tests/ntt-1024/montgomery.h
similarity index 100%
rename from tests/ntt_1024/montgomery.h
rename to tests/ntt-1024/montgomery.h
diff --git a/tests/ntt_1024/montgomery.s b/tests/ntt-1024/montgomery.s
similarity index 100%
rename from tests/ntt_1024/montgomery.s
rename to tests/ntt-1024/montgomery.s
diff --git a/tests/ntt_1024/montgomery_const.h b/tests/ntt-1024/montgomery_const.h
similarity index 100%
rename from tests/ntt_1024/montgomery_const.h
rename to tests/ntt-1024/montgomery_const.h
diff --git a/tests/ntt-1024/ntt-1024.mk b/tests/ntt-1024/ntt-1024.mk
new file mode 100644
index 0000000..9b6ad8b
--- /dev/null
+++ b/tests/ntt-1024/ntt-1024.mk
@@ -0,0 +1,25 @@
+# Test name - needs to match the directory name
+TESTS += ntt-1024
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+NTT_1024_PLATFORMS += m55-an547
+NTT_1024_PLATFORMS += m85-an555
+
+# C sources required for this test
+NTT_1024_SOURCES += main.c
+NTT_1024_SOURCES += misc.c
+NTT_1024_SOURCES += poly.c
+
+# Assembly sources required for this test
+NTT_1024_ASM_DIR = ../../asm/auto/ntt_1024
+NTT_1024_ASMS += montgomery.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_complete.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_bitrev.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_double_rev4.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_double.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_rev4.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_skipfirst.s
+NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete.s
diff --git a/tests/ntt-1024/poly.c b/tests/ntt-1024/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/ntt-1024/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-1024/poly.h b/tests/ntt-1024/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/ntt-1024/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt_192/main.c b/tests/ntt-192/main.c
similarity index 99%
rename from tests/ntt_192/main.c
rename to tests/ntt-192/main.c
index 2a65c8f..7514e46 100644
--- a/tests/ntt_192/main.c
+++ b/tests/ntt-192/main.c
@@ -499,5 +499,9 @@ int main(void)
     return( 1 );
 #endif /* TEST_POLY_MUL */
 
+    if(ret == 0){
+        debug_printf( "ALL GOOD!\n" );
+    }
+
     return( ret );
 }
diff --git a/tests/ntt-192/misc.c b/tests/ntt-192/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/ntt-192/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-192/misc.h b/tests/ntt-192/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/ntt-192/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt_192/montgomery.h b/tests/ntt-192/montgomery.h
similarity index 100%
rename from tests/ntt_192/montgomery.h
rename to tests/ntt-192/montgomery.h
diff --git a/tests/ntt_192/montgomery.s b/tests/ntt-192/montgomery.s
similarity index 100%
rename from tests/ntt_192/montgomery.s
rename to tests/ntt-192/montgomery.s
diff --git a/tests/ntt_192/montgomery_const.h b/tests/ntt-192/montgomery_const.h
similarity index 100%
rename from tests/ntt_192/montgomery_const.h
rename to tests/ntt-192/montgomery_const.h
diff --git a/tests/ntt-192/ntt-192.mk b/tests/ntt-192/ntt-192.mk
new file mode 100644
index 0000000..72561bb
--- /dev/null
+++ b/tests/ntt-192/ntt-192.mk
@@ -0,0 +1,37 @@
+# Test name - needs to match the directory name
+TESTS += ntt-192
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+NTT_192_PLATFORMS += m55-an547
+NTT_192_PLATFORMS += m85-an555
+
+# C sources required for this test
+NTT_192_SOURCES += main.c
+NTT_192_SOURCES += misc.c
+NTT_192_SOURCES += poly.c
+
+# Assembly sources required for this test
+NTT_192_ASM_DIR = ../../asm/auto/ntt_192
+NTT_192_ASMS += montgomery.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_33556993_27792935_incomplete_good.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_45387457_16877098_incomplete_good.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_88299073_9670361_incomplete_good.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_106117153_62524596_incomplete_good.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_108643009_1793055_incomplete_good.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_114826273_107284677_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_114826273_107284677_incomplete_good_oop.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_114826273_107284677_incomplete_good.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_128919937_120423310_incomplete_good_bitrev.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_128919937_120423310_incomplete_good_oop.s
+NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_128919937_120423310_incomplete_good.s
\ No newline at end of file
diff --git a/tests/ntt_192/ntt_const.h b/tests/ntt-192/ntt_const.h
similarity index 100%
rename from tests/ntt_192/ntt_const.h
rename to tests/ntt-192/ntt_const.h
diff --git a/tests/ntt-192/poly.c b/tests/ntt-192/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/ntt-192/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-192/poly.h b/tests/ntt-192/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/ntt-192/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt_256/main.c b/tests/ntt-256/main.c
similarity index 99%
rename from tests/ntt_256/main.c
rename to tests/ntt-256/main.c
index c9084eb..7226534 100644
--- a/tests/ntt_256/main.c
+++ b/tests/ntt-256/main.c
@@ -190,5 +190,7 @@ int main(void)
     return( 1 );
 #endif /* TEST_NTT */
 
+    debug_printf( "ALL GOOD!\n" );
+
     return( ret );
 }
diff --git a/tests/ntt-256/misc.c b/tests/ntt-256/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/ntt-256/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-256/misc.h b/tests/ntt-256/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/ntt-256/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-256/ntt-256.mk b/tests/ntt-256/ntt-256.mk
new file mode 100644
index 0000000..946d81f
--- /dev/null
+++ b/tests/ntt-256/ntt-256.mk
@@ -0,0 +1,18 @@
+# Test name - needs to match the directory name
+TESTS += ntt-256
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+NTT_256_PLATFORMS += m55-an547
+NTT_256_PLATFORMS += m85-an555
+
+# C sources required for this test
+NTT_256_SOURCES += main.c
+NTT_256_SOURCES += misc.c
+NTT_256_SOURCES += poly.c
+
+# Assembly sources required for this test
+NTT_256_ASM_DIR = ../../asm/auto/ntt_256
+NTT_256_ASMS += $(NTT_256_ASM_DIR)/ntt_256_u32_33556993_26036764_complete.s
+NTT_256_ASMS += $(NTT_256_ASM_DIR)/ntt_256_u32_33556993_26036764_incomplete.s
diff --git a/tests/ntt-256/poly.c b/tests/ntt-256/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/ntt-256/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-256/poly.h b/tests/ntt-256/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/ntt-256/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt_384/main.c b/tests/ntt-384/main.c
similarity index 99%
rename from tests/ntt_384/main.c
rename to tests/ntt-384/main.c
index 8fc807f..d12fd06 100644
--- a/tests/ntt_384/main.c
+++ b/tests/ntt-384/main.c
@@ -638,5 +638,7 @@ int main(void)
     return( 1 );
 #endif /* TEST_POLY_MUL */
 
+    debug_printf( "ALL GOOD!\n" );
+
     return( ret );
 }
diff --git a/tests/ntt-384/misc.c b/tests/ntt-384/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/ntt-384/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-384/misc.h b/tests/ntt-384/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/ntt-384/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt_384/montgomery.h b/tests/ntt-384/montgomery.h
similarity index 100%
rename from tests/ntt_384/montgomery.h
rename to tests/ntt-384/montgomery.h
diff --git a/tests/ntt_384/montgomery.s b/tests/ntt-384/montgomery.s
similarity index 100%
rename from tests/ntt_384/montgomery.s
rename to tests/ntt-384/montgomery.s
diff --git a/tests/ntt_384/montgomery_const.h b/tests/ntt-384/montgomery_const.h
similarity index 100%
rename from tests/ntt_384/montgomery_const.h
rename to tests/ntt-384/montgomery_const.h
diff --git a/tests/ntt-384/ntt-384.mk b/tests/ntt-384/ntt-384.mk
new file mode 100644
index 0000000..1b43ef1
--- /dev/null
+++ b/tests/ntt-384/ntt-384.mk
@@ -0,0 +1,39 @@
+# Test name - needs to match the directory name
+TESTS += ntt-384
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+NTT_384_PLATFORMS += m55-an547
+NTT_384_PLATFORMS += m85-an555
+
+# C sources required for this test
+NTT_384_SOURCES += main.c
+NTT_384_SOURCES += misc.c
+NTT_384_SOURCES += poly.c
+
+# Assembly sources required for this test
+NTT_384_ASM_DIR = ../../asm/auto/ntt_384
+NTT_384_ASMS += montgomery.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_33556993_15047299_incomplete_good.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_45387457_923104_incomplete_good.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_oop.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_106117153_1392340_incomplete_good.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good_oop.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_114826273_2551686_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_114826273_2551686_incomplete_good_oop.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_114826273_2551686_incomplete_good.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_128919937_4666088_incomplete_good_bitrev.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_128919937_4666088_incomplete_good_oop.s
+NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_128919937_4666088_incomplete_good.s
\ No newline at end of file
diff --git a/tests/ntt_384/ntt_const.h b/tests/ntt-384/ntt_const.h
similarity index 100%
rename from tests/ntt_384/ntt_const.h
rename to tests/ntt-384/ntt_const.h
diff --git a/tests/ntt-384/poly.c b/tests/ntt-384/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/ntt-384/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-384/poly.h b/tests/ntt-384/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/ntt-384/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt_512/main.c b/tests/ntt-512/main.c
similarity index 99%
rename from tests/ntt_512/main.c
rename to tests/ntt-512/main.c
index 307adc4..3fb49cd 100644
--- a/tests/ntt_512/main.c
+++ b/tests/ntt-512/main.c
@@ -417,5 +417,7 @@ int main(void)
     return( 1 );
 #endif /* TEST_NTT_DOUBLE */
 
+    debug_printf( "ALL GOOD!\n" );
+
     return( ret );
 }
diff --git a/tests/ntt-512/misc.c b/tests/ntt-512/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/ntt-512/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-512/misc.h b/tests/ntt-512/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/ntt-512/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-512/ntt-512.mk b/tests/ntt-512/ntt-512.mk
new file mode 100644
index 0000000..d240852
--- /dev/null
+++ b/tests/ntt-512/ntt-512.mk
@@ -0,0 +1,19 @@
+# Test name - needs to match the directory name
+TESTS += ntt-512
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+NTT_512_PLATFORMS += m55-an547 +NTT_512_PLATFORMS += m85-an555 + +# C sources required for this test +NTT_512_SOURCES += main.c +NTT_512_SOURCES += misc.c +NTT_512_SOURCES += poly.c + +# Assembly sources required for this test +NTT_512_ASM_DIR = ../../asm/auto/ntt_512 +NTT_512_ASMS += $(NTT_512_ASM_DIR)/ntt_512_u32_33564673_21224105_complete.s +NTT_512_ASMS += $(NTT_512_ASM_DIR)/ntt_512_u32_33564673_21224105_incomplete_double.s +NTT_512_ASMS += $(NTT_512_ASM_DIR)/ntt_512_u32_33564673_21224105_incomplete.s \ No newline at end of file diff --git a/tests/ntt-512/poly.c b/tests/ntt-512/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/ntt-512/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/ntt-512/poly.h b/tests/ntt-512/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/ntt-512/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/ntt_768/main.c b/tests/ntt-768/main.c similarity index 99% rename from tests/ntt_768/main.c rename to tests/ntt-768/main.c index 236329a..d4f2f66 100644 --- a/tests/ntt_768/main.c +++ b/tests/ntt-768/main.c @@ -813,5 +813,7 @@ int main(void) return( 1 ); #endif /* TEST_POLY_MUL */ + debug_printf( "ALL GOOD!\n" ); + return( ret ); } diff --git a/tests/ntt-768/misc.c b/tests/ntt-768/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/ntt-768/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/ntt-768/misc.h b/tests/ntt-768/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/ntt-768/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/ntt_768/montgomery.s b/tests/ntt-768/montgomery.s similarity index 100% rename from tests/ntt_768/montgomery.s rename to tests/ntt-768/montgomery.s diff --git a/tests/ntt_768/montgomery_const.h b/tests/ntt-768/montgomery_const.h similarity index 100% rename from 
tests/ntt_768/montgomery_const.h rename to tests/ntt-768/montgomery_const.h diff --git a/tests/ntt-768/ntt-768.mk b/tests/ntt-768/ntt-768.mk new file mode 100644 index 0000000..d38fd5b --- /dev/null +++ b/tests/ntt-768/ntt-768.mk @@ -0,0 +1,24 @@ +# Test name - needs to match the directory name +TESTS += ntt-768 + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +NTT_768_PLATFORMS += m55-an547 +NTT_768_PLATFORMS += m85-an555 + +# C sources required for this test +NTT_768_SOURCES += main.c +NTT_768_SOURCES += misc.c +NTT_768_SOURCES += poly.c + +# Assembly sources required for this test +NTT_768_ASM_DIR = ../../asm/auto/ntt_768 +NTT_768_ASMS += montgomery.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_bitrev.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_double.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_good_bitrev.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_good_double.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_good.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_rev4.s +NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete.s \ No newline at end of file diff --git a/tests/ntt-768/poly.c b/tests/ntt-768/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/ntt-768/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/ntt-768/poly.h b/tests/ntt-768/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/ntt-768/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/ntt_n256/main.c b/tests/ntt-n256/main.c similarity index 99% rename from tests/ntt_n256/main.c rename to tests/ntt-n256/main.c index 2514bdd..8558717 100644 --- a/tests/ntt_n256/main.c +++ 
b/tests/ntt-n256/main.c @@ -33,7 +33,7 @@ // #define TEST_NTT_FWD_INV /* NOTE: Need to set `inverse_scaling=0` for this */ // #define TEST_CORE_ONLY /* Enable to build for minimal image // * for performance analysis. */ -// #define NTT_INCOMPLETE /* Enable to compute 6-layer incomplete NTT. */ +#define NTT_INCOMPLETE /* Enable to compute 6-layer incomplete NTT. */ //#define USE_MANUAL_VARIANTS // #define ENABLE_PMU_STATS /* Do not enable when benching for cycle count */ @@ -499,5 +499,7 @@ int main(void) return( 1 ); #endif /* TEST_NTT_FWD_INV */ + debug_printf( "ALL GOOD!\n" ); + return( ret ); } diff --git a/tests/ntt-n256/misc.c b/tests/ntt-n256/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/ntt-n256/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/ntt-n256/misc.h b/tests/ntt-n256/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/ntt-n256/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/ntt-n256/ntt-n256.mk b/tests/ntt-n256/ntt-n256.mk new file mode 100644 index 0000000..924a601 --- /dev/null +++ b/tests/ntt-n256/ntt-n256.mk @@ -0,0 +1,21 @@ +# Test name - needs to match the directory name +TESTS += ntt-n256 + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +NTT_N256_PLATFORMS += m55-an547 +NTT_N256_PLATFORMS += m85-an555 + +# C sources required for this test +NTT_N256_SOURCES += main.c +NTT_N256_SOURCES += misc.c +NTT_N256_SOURCES += poly.c + +# Assembly sources required for this test +NTT_N256_ASM_DIR = ../../asm/auto/ntt_n256 +NTT_N256_ASMS += $(NTT_N256_ASM_DIR)/inv_ntt_n256_u32_33556993_28678040_complete.s +NTT_N256_ASMS += $(NTT_N256_ASM_DIR)/inv_ntt_n256_u32_33556993_28678040_incomplete.s +NTT_N256_ASMS += $(NTT_N256_ASM_DIR)/ntt_n256_u32_33556993_28678040_complete.s +NTT_N256_ASMS += 
$(NTT_N256_ASM_DIR)/ntt_n256_u32_33556993_28678040_incomplete_double.s +NTT_N256_ASMS += $(NTT_N256_ASM_DIR)/ntt_n256_u32_33556993_28678040_incomplete.s \ No newline at end of file diff --git a/tests/ntt-n256/poly.c b/tests/ntt-n256/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/ntt-n256/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/ntt-n256/poly.h b/tests/ntt-n256/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/ntt-n256/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/ntt/auto/inv_ntt_u32_33556993_28678040_complete.s b/tests/ntt/auto/inv_ntt_u32_33556993_28678040_complete.s deleted file mode 100644 index f7b1e83..0000000 --- a/tests/ntt/auto/inv_ntt_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,3468 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots_inv: -.word 20558213 // zeta^510 * 2^31 = 28678040^510 * 2^31 -.word 66424611 // zeta^382 * 2^31 = 28678040^382 * 2^31 -.word 59465515 // zeta^446 * 2^31 = 28678040^446 * 2^31 -.word 39560591 // zeta^318 * 2^31 = 28678040^318 * 2^31 -.word 2042724475 // zeta^510 * (q^(-1) mod 2^32) * 2^31 = 28678040^510 * 375649793 * 2^31 -.word 2817904349 // zeta^382 * (q^(-1) mod 2^32) * 2^31 = 28678040^382 * 375649793 * 2^31 -.word 2405453525 // zeta^446 * (q^(-1) mod 2^32) * 2^31 = 28678040^446 * 375649793 * 2^31 -.word 2621436017 // zeta^318 * (q^(-1) mod 2^32) * 2^31 = 28678040^318 * 375649793 * 2^31 -.word 35339857 // zeta^511 * 2^31 = 28678040^511 * 2^31 -.word 13377101 // zeta^447 * 2^31 = 28678040^447 * 2^31 -.word 33252123 // zeta^479 * 2^31 = 28678040^479 * 2^31 -.word 16713319 // zeta^415 * 2^31 = 28678040^415 * 2^31 -.word 3232754607 // zeta^511 * (q^(-1) mod 2^32) * 2^31 = 28678040^511 * 375649793 * 2^31 -.word 2219762611 // zeta^447 * (q^(-1) mod 2^32) * 2^31 = 28678040^447 * 375649793 * 2^31 -.word 3344411365 // zeta^479 * (q^(-1) mod 2^32) * 2^31 = 28678040^479 * 375649793 * 2^31 -.word 2600796057 // zeta^415 * (q^(-1) mod 2^32) * 2^31 = 28678040^415 * 375649793 * 2^31 -.word 10815985 // zeta^383 * 2^31 = 28678040^383 * 2^31 -.word 56247925 // zeta^319 * 2^31 = 28678040^319 * 2^31 -.word 26943959 // zeta^351 * 2^31 = 28678040^351 * 2^31 -.word 51316823 // zeta^287 * 2^31 = 28678040^287 * 2^31 -.word 3650773007 // zeta^383 * (q^(-1) mod 2^32) * 2^31 = 28678040^383 * 375649793 * 2^31 -.word 4021439371 // zeta^319 * (q^(-1) mod 2^32) * 2^31 = 28678040^319 * 375649793 * 2^31 
-.word 1538999337 // zeta^351 * (q^(-1) mod 2^32) * 2^31 = 28678040^351 * 375649793 * 2^31
-.word 3611844009 // zeta^287 * (q^(-1) mod 2^32) * 2^31 = 28678040^287 * 375649793 * 2^31
[... remaining ~3,400 auto-generated twiddle-constant `-.word` lines of the deleted inv_ntt_u32_33556993_28678040_complete.s elided ...]
zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // 
zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // 
zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // 
zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // 
zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31 -.word 17514581 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 4460971 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_u32_33556993_28678040, %function -.global inv_ntt_u32_33556993_28678040 -inv_ntt_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r4, #:lower16:modulus_inv -movt r4, #:upper16:modulus_inv -vldrw.s32 Q4, [r0, #0] 
-vldrw.s32 Q5, [r0, #16] -vsub.s32 Q6, Q4, Q5 -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vqrdmulh.s32 Q5, Q6, Q5 -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vmul.u32 Q6, Q6, Q5 -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vst41.s32 {Q0,Q1,Q2,Q3}, [r0] -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vst43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-sub r0, r0, #1024 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #16] -vsub.s32 Q0, Q2, Q3 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #32] -vadd.s32 Q2, Q2, Q3 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #48] -vqrdmulh.s32 Q3, Q0, r8 -vsub.s32 Q1, Q4, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q4, Q4, Q5 -vqrdmlah.s32 Q3, Q0, r12 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q2, Q4 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q5, Q1, r12 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #64] -vqrdmulh.s32 Q4, Q0, r10 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #80] -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q4, Q0, r12 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[16]: Already loaded as Q6 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #96] -vadd.s32 Q6, Q6, Q7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q6, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #128] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #144] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] 
-// Release input[20] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[32]: Already loaded as Q4
-// input[36]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q4, Q4, Q5
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #176]
-vqrdmulh.s32 Q5, Q0, r8
-vsub.s32 Q1, Q2, Q6
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q6
-vqrdmlah.s32 Q5, Q0, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q6, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q6, Q1, r12
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q5, Q6
-vmul.u32 Q0, Q0, r9
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #208]
-vadd.s32 Q5, Q5, Q6
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q6, Q1, r10
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q5, [r0,#(144)]
-// Release input[36] from Q5
-vqrdmlah.s32 Q6, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q3
-// input[52]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #224]
-vadd.s32 Q3, Q3, Q7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q6, [r0,#(176)]
-// Release input[44] from Q6
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q6
-vldrw.u32 Q6, [r0, #272]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(208)]
-// Release input[52] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q5
-// input[68]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #288]
-vadd.s32 Q5, Q5, Q6
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(240)]
-// Release input[60] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(256)]
-// Release input[64] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(272)]
-// Release input[68] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[80]: Already loaded as Q4
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #352]
-vadd.s32 Q4, Q4, Q7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #368]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #384]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q6
-vldrw.u32 Q6, [r0, #400]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[96]: Already loaded as Q3
-// input[100]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q3, Q3, Q6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #464]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(400)]
-// Release input[100] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #480]
-vadd.s32 Q5, Q5, Q7
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #-480]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(464)]
-// Release input[116] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q4
-// input[132]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #-464]
-vadd.s32 Q4, Q4, Q6
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #-448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #-432]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #-400]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #-368]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #-352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #-288]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #-272]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #-224]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #-208]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #-192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #-176]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #-144]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #-128]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #-112]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #-96]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #-64]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #-32]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #-16]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #64]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #80]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(0)]
-// Release input[0] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Already loaded as Q4
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #144]
-vadd.s32 Q4, Q4, Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #208]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #96]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Already loaded as Q3
-// input[24]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #224]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #112]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #384]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #400]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #288]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #480]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #368]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #432]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #-432]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #-368]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #-480]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #-352]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #-464]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #-400]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #-272]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #-384]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #-320]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #-176]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #-112]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #-96]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #-32]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #-144]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #-16]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #-192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #-128]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #-64]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #256]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0 // XXXXX
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 50631221
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 2147319755
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #256]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #-240]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1, Q2, r12
-vqrdmulh.s32 Q4, Q5, r6
-vsub.s32 Q2, Q0, Q3
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q4, Q5, r12
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #16]
-vqrdmulh.s32 Q3, Q2, r10
-vsub.s32 Q6, Q1, Q4
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q3, Q2, r12
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q0, Q0, r2
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmlah.s32 Q2, Q0, r12
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q1, Q1, r2
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #-480]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(0)]
-vqrdmulh.s32 Q4, Q6, r10
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q1
-// input[4]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[132]: Already loaded as Q3
-vqrdmlah.s32 Q4, Q6, r12
-vadd.s32 Q5, Q5, Q7
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #-224]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q2, Q3, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q4, Q4, #1
-vpt.s32 LT, Q4, r11
-vaddt.s32 Q4, Q4, r12
-vpt.s32 GE, Q4, r4
-vsubt.s32 Q4, Q4, r12
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vqrdmulh.s32 Q1, Q2, r6
-vsub.s32 Q0, Q5, Q3
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q2, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #32]
-vqrdmulh.s32 Q3, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q3, Q0, r12
-// input[72]: Load as Q6
-vldrw.u32 Q6, [r0, #288]
-vqrdmulh.s32 Q0, Q5, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q5, Q5, r2
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmlah.s32 Q0, Q5, r12
-// Release input[4] from Q5
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #-464]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(16)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q7
-// input[8]: Already loaded as Q2
-// input[72]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[136]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #304]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-464)]
-// Release input[136] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #-448]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(32)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q6
-// input[12]: Already loaded as Q1
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[140]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #-192]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #64]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #-432]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(48)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q7
-// input[16]: Already loaded as Q3
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[144]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #-176]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #80]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-432)]
-// Release input[144] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[16] from Q3
-vqrdmulh.s32 Q3, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #-416]
-vqrdmlah.s32 Q3, Q6, r12
-vstrw.u32 Q0, [r0,#(64)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q6
-// input[20]: Already loaded as Q2
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q2, Q7
-// input[148]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q7
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #-160]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #96]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #352]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-416)]
-// Release input[148] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[152]: Load as Q5
-vldrw.u32 Q5, [r14, #-400]
-vqrdmlah.s32 Q2, Q7, r12
-vstrw.u32 Q0, [r0,#(80)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(336)]
-// Release input[84] from Q7
-// input[24]: Already loaded as Q1
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q1, Q6
-// input[152]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q6
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #-144]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #112]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #368]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #-384] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #384] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, 
[r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #400] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #-352] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// 
input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #416] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #-336] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 
-vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #432] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #-320] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #448] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 
-vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #-304] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #208] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #-288] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] 
-vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #480] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: 
Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #496] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 
-vqrdmlah.s32 Q2, Q4, r12
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q3, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(240)]
-vqrdmulh.s32 Q2, Q3, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q3, Q3, r9
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q7
-vqrdmlah.s32 Q2, Q3, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-// Restore MVE vector registers
-vpop {d0-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-.align 4
-barrett_offsets_addr: .word barrett_offsets
-
-// Line count: 3244
-// Instruction count: 2742
\ No newline at end of file
diff --git a/tests/ntt/auto/inv_ntt_u32_33556993_28678040_incomplete.s b/tests/ntt/auto/inv_ntt_u32_33556993_28678040_incomplete.s
deleted file mode 100644
index a65f906..0000000
--- a/tests/ntt/auto/inv_ntt_u32_33556993_28678040_incomplete.s
+++ /dev/null
@@ -1,2535 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots_inv:
-.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31
-.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31
-.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31
-.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31
-.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31
-.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31
-.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31
-.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31
-.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31
-.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31
-.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31
-.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31
-.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31
-.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793
* 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 
2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 
-.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 
-.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 
-.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31
-.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31
-.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31
-.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31
-.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31
-.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31
-.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31
-.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31
-.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31
-.word 36501331 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31
-.word 17843885 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31
-.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31
-.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31
-.align 4
-barrett_offsets:
-.byte 0
-.byte 64
-.byte 128
-.byte 192
-.text
-.align 4
-roots_addr: .word roots_inv
-.syntax unified
-.type inv_ntt_u32_33556993_28678040_incomplete, %function
-.global inv_ntt_u32_33556993_28678040_incomplete
-inv_ntt_u32_33556993_28678040_incomplete:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d0-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Using modulus 33556993
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #0]
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #16]
-vsub.s32 Q0, Q2, Q3
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #32]
-vadd.s32 Q2, Q2, Q3
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #48]
-vqrdmulh.s32 Q3, Q0, r8
-vsub.s32 Q1, Q4, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q4, Q4, Q5
-vqrdmlah.s32 Q3, Q0, r12
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q2, Q4
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q5, Q1, r12
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #64]
-vqrdmulh.s32 Q4, Q0, r10
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #80]
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q4, Q0, r12
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[16]: Already loaded as Q6
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #96]
-vadd.s32 Q6, Q6, Q7
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #112]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q6, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #128]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[36]: Load as Q5
-vldrw.u32 Q5, [r0, #144]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[32]: Already loaded as Q4
-// input[36]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q4, Q4, Q5
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #176]
-vqrdmulh.s32 Q5, Q0, r8
-vsub.s32 Q1, Q2, Q6
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q6
-vqrdmlah.s32 Q5, Q0, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q6, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q6, Q1, r12
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q5, Q6
-vmul.u32 Q0, Q0, r9
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #208]
-vadd.s32 Q5, Q5, Q6
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q6, Q1, r10
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q5, [r0,#(144)]
-// Release input[36] from Q5
-vqrdmlah.s32 Q6, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q3
-// input[52]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #224]
-vadd.s32 Q3, Q3, Q7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q6, [r0,#(176)]
-// Release input[44] from Q6
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q6
-vldrw.u32 Q6, [r0, #272]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(208)]
-// Release input[52] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q5
-// input[68]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #288]
-vadd.s32 Q5, Q5, Q6
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(240)]
-// Release input[60] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(256)]
-// Release input[64] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(272)]
-// Release input[68] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[80]: Already loaded as Q4
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #352]
-vadd.s32 Q4, Q4, Q7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #368]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #384]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q6
-vldrw.u32 Q6, [r0, #400]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[96]: Already loaded as Q3
-// input[100]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q3, Q3, Q6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #464]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(400)]
-// Release input[100] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #480]
-vadd.s32 Q5, Q5, Q7
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #-480]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(464)]
-// Release input[116] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q4
-// input[132]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #-464]
-vadd.s32 Q4, Q4, Q6
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #-448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #-432]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #-400]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #-368]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #-352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #-288]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #-272]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #-224]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #-208]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #-192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #-176]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #-144]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #-128]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #-112]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #-96]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #-64]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #-32]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #-16]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #64]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #80]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(0)]
-// Release input[0] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Already loaded as Q4
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #144]
-vadd.s32 Q4, Q4, Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #208]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #96]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Already loaded as Q3
-// input[24]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #224]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #112]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #384]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #400]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #288]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #480]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #368]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #432]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #-432]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #-368]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #-480]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #-352]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #-464]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #-400]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #-272]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #-384]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #-320]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #-176]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #-112]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #-96]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #-32]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #-144]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #-16]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #-192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #-128]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #-64]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #256]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0 // XXXXX
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 34739919
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 4294311729
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #256]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #-240]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1, Q2, r12
-vqrdmulh.s32 Q4, Q5, r6
-vsub.s32 Q2, Q0, Q3
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q4, Q5, r12
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #16]
-vqrdmulh.s32 Q3, Q2, r10
-vsub.s32 Q6, Q1, Q4
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q3, Q2, r12
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q0, Q0, r2
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmlah.s32 Q2, Q0, r12
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q1, Q1, r2
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #-480]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(0)]
-vqrdmulh.s32 Q4, Q6, r10
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q1
-// input[4]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[132]: Already loaded as Q3
-vqrdmlah.s32 Q4, Q6, r12
-vadd.s32 Q5, Q5, Q7
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #-224]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q2, Q3, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q4, Q4, #1
-vpt.s32 LT, Q4, r11
-vaddt.s32 Q4, Q4, r12
-vpt.s32 GE, Q4, r4
-vsubt.s32 Q4, Q4, r12
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vqrdmulh.s32 Q1, Q2, r6
-vsub.s32 Q0, Q5, Q3
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q2, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #32]
-vqrdmulh.s32 Q3, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q3, Q0, r12
-// input[72]: Load as Q6
-vldrw.u32 Q6, [r0, #288]
-vqrdmulh.s32 Q0, Q5, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q5, Q5, r2
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmlah.s32 Q0, Q5, r12
-// Release input[4] from Q5
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #-464]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(16)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q7
-// input[8]: Already loaded as Q2
-// input[72]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[136]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #304]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-464)]
-// Release input[136] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #-448]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(32)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q6
-// input[12]: Already loaded as Q1
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[140]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #-192]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #64]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #-432]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(48)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q7
-// input[16]: Already loaded as Q3
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[144]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #-176]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32
Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #-416] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(64)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q6 -// input[20]: Already loaded as Q2 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[148]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #-160] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, 
#96] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #352] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #-400] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(80)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q7 -// input[24]: Already loaded as Q1 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[152]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #-144] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #368] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, 
r12 -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #-384] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #384] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 
-vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #400] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #-352] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 
-vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #416] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #-336] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q0, r10 
-vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #432] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #-320] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #448] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] 
from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #-304] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #208] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #-288] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 
-vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #480] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 
-vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #496] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 
Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 2311 -// Instruction count: 1810 \ No newline at end of file diff --git a/tests/ntt/auto/ntt_u32_33556993_28678040_complete.s b/tests/ntt/auto/ntt_u32_33556993_28678040_complete.s deleted file mode 100644 index 0443f24..0000000 --- a/tests/ntt/auto/ntt_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,2915 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be 
included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 
80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 
* 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 
2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 
2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 
-.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31
-.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31
-.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31
-.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31
-.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31
-.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31
-.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31
-.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31
-.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31
-.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31
-.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31
-.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31
-.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31
-.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31
-.word 13704133 // zeta^ 2 * 2^31 = 28678040^ 2 * 2^31
-.word 41177999 // zeta^130 * 2^31 = 28678040^130 * 2^31
-.word 26703739 // zeta^ 66 * 2^31 = 28678040^ 66 * 2^31
-.word 65289035 // zeta^194 * 2^31 = 28678040^194 * 2^31
-.word 1666225723 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 2 * 375649793 * 2^31
-.word 2599633521 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 28678040^130 * 375649793 * 2^31
-.word 2869384837 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 66 * 375649793 * 2^31
-.word 1260434101 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 28678040^194 * 375649793 * 2^31
-.word 50326315 // zeta^ 1 * 2^31 = 28678040^ 1 * 2^31
-.word 37746191 // zeta^ 65 * 2^31 = 28678040^ 65 * 2^31
-.word 49080301 // zeta^ 33 * 2^31 = 28678040^ 33 * 2^31
-.word 34232193 // zeta^ 97 * 2^31 = 28678040^ 97 * 2^31
-.word 1835254485 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 1 * 375649793 * 2^31
-.word 360751089 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 65 * 375649793 * 2^31
-.word 1200511507 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 33 * 375649793 * 2^31
-.word 553431679 // zeta^ 97 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 97 * 375649793 * 2^31
-.word 22955837 // zeta^129 * 2^31 = 28678040^129 * 2^31
-.word 31411079 // zeta^193 * 2^31 = 28678040^193 * 2^31
-.word 492607 // zeta^161 * 2^31 = 28678040^161 * 2^31
-.word 22217509 // zeta^225 * 2^31 = 28678040^225 * 2^31
-.word 5481609 // zeta^ 34 * 2^31 = 28678040^ 34 * 2^31
-.word 12552175 // zeta^162 * 2^31 = 28678040^162 * 2^31
-.word 54494203 // zeta^ 98 * 2^31 = 28678040^ 98 * 2^31
-.word 32704019 // zeta^226 * 2^31 = 28678040^226 * 2^31
-.word 949335415 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 34 * 375649793 * 2^31
-.word 3610496529 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 28678040^162 * 375649793 * 2^31
-.word 1474054661 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 98 * 375649793 * 2^31
-.word 2061350893 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 28678040^226 * 375649793 * 2^31
-.word 48767307 // zeta^ 17 * 2^31 = 28678040^ 17 * 2^31
-.word 39600285 // zeta^ 81 * 2^31 = 28678040^ 81 * 2^31
-.word 31654617 // zeta^ 49 * 2^31 = 28678040^ 49 * 2^31
-.word 4736231 // zeta^113 * 2^31 = 28678040^113 * 2^31
-.word 2602093749 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 17 * 375649793 * 2^31
-.word 3705004387 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 81 * 375649793 * 2^31
-.word 427128615 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 49 * 375649793 * 2^31
-.word 237814041 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 28678040^113 * 375649793 * 2^31
-.word 18965555 // zeta^145 * 2^31 = 28678040^145 * 2^31
-.word 50771049 // zeta^209 * 2^31 = 28678040^209 * 2^31
-.word 8794671 // zeta^177 * 2^31 = 28678040^177 * 2^31
-.word 59508707 // zeta^241 * 2^31 = 28678040^241 * 2^31
-.word 43973433 // zeta^ 18 * 2^31 = 28678040^ 18 * 2^31
-.word 14453865 // zeta^146 * 2^31 = 28678040^146 * 2^31
-.word 14937153 // zeta^ 82 * 2^31 = 28678040^ 82 * 2^31
-.word 39701997 // zeta^210 * 2^31 = 28678040^210 * 2^31
-.word 720191175 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 18 * 375649793 * 2^31
-.word 3181088151 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 28678040^146 * 375649793 * 2^31
-.word 116563391 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 82 * 375649793 * 2^31
-.word 3642323987 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 28678040^210 * 375649793 * 2^31
-.word 53455571 // zeta^ 9 * 2^31 = 28678040^ 9 * 2^31
-.word 35877127 // zeta^ 73 * 2^31 = 28678040^ 73 * 2^31
-.word 681755 // zeta^ 41 * 2^31 = 28678040^ 41 * 2^31
-.word 63245537 // zeta^105 * 2^31 = 28678040^105 * 2^31
-.word 4245721901 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 9 * 375649793 * 2^31
-.word 2676675833 // zeta^ 73 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 73 * 375649793 * 2^31
-.word 3480266469 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 41 * 375649793 * 2^31
-.word 1356315935 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 28678040^105 * 375649793 * 2^31
-.word 11718751 // zeta^137 * 2^31 = 28678040^137 * 2^31
-.word 41885553 // zeta^201 * 2^31 = 28678040^201 * 2^31
-.word 54210213 // zeta^169 * 2^31 = 28678040^169 * 2^31
-.word 16838301 // zeta^233 * 2^31 = 28678040^233 * 2^31
-.word 40841465 // zeta^ 50 * 2^31 = 28678040^ 50 * 2^31
-.word 3577749 // zeta^178 * 2^31 = 28678040^178 * 2^31
-.word 33845545 // zeta^114 * 2^31 = 28678040^114 * 2^31
-.word 19555165 // zeta^242 * 2^31 = 28678040^242 * 2^31
-.word 3459680519 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 50 * 375649793 * 2^31
-.word 495008363 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 28678040^178 * 375649793 * 2^31
-.word 1885546711 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 28678040^114 * 375649793 * 2^31
-.word 3630382755 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 28678040^242 * 375649793 * 2^31
-.word 62758213 // zeta^ 25 * 2^31 = 28678040^ 25 * 2^31
-.word 8005843 // zeta^ 89 * 2^31 = 28678040^ 89 * 2^31
-.word 51922779 // zeta^ 57 * 2^31 = 28678040^ 57 * 2^31
-.word 7245689 // zeta^121 * 2^31 = 28678040^121 * 2^31
-.word 124982459 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 25 * 375649793 * 2^31
-.word 2964460845 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 89 * 375649793 * 2^31
-.word 1042630309 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 57 * 375649793 * 2^31
-.word 3756534407 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 28678040^121 * 375649793 * 2^31
-.word 30225471 // zeta^153 * 2^31 = 28678040^153 * 2^31
-.word 44151511 // zeta^217 * 2^31 = 28678040^217 * 2^31
-.word 64890121 // zeta^185 * 2^31 = 28678040^185 * 2^31
-.word 65259669 // zeta^249 * 2^31 = 28678040^249 * 2^31
-.word 12974361 // zeta^ 10 * 2^31 = 28678040^ 10 * 2^31
-.word 41807515 // zeta^138 * 2^31 = 28678040^138 * 2^31
-.word 56379967 // zeta^ 74 * 2^31 = 28678040^ 74 * 2^31
-.word 13380915 // zeta^202 * 2^31 = 28678040^202 * 2^31
-.word 1194393831 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 10 * 375649793 * 2^31
-.word 1648893797 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 28678040^138 * 375649793 * 2^31
-.word 753806273 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 74 * 375649793 * 2^31
-.word 4010528973 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 28678040^202 * 375649793 * 2^31
-.word 16772797 // zeta^ 5 * 2^31 = 28678040^ 5 * 2^31
-.word 58675875 // zeta^ 69 * 2^31 = 28678040^ 69 * 2^31
-.word 59974505 // zeta^ 37 * 2^31 = 28678040^ 37 * 2^31
-.word 33980107 // zeta^101 * 2^31 = 28678040^101 * 2^31
-.word 2122281795 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 5 * 375649793 * 2^31
-.word 2886667101 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 69 * 375649793 * 2^31
-.word 3771397783 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 37 * 375649793 * 2^31
-.word 1168207669 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 28678040^101 * 375649793 * 2^31
-.word 28448893 // zeta^133 * 2^31 = 28678040^133 * 2^31
-.word 24378249 // zeta^197 * 2^31 = 28678040^197 * 2^31
-.word 62687027 // zeta^165 * 2^31 = 28678040^165 * 2^31
-.word 65645595 // zeta^229 * 2^31 = 28678040^229 * 2^31
-.word 52771617 // zeta^ 42 * 2^31 = 28678040^ 42 * 2^31
-.word 23396495 // zeta^170 * 2^31 = 28678040^170 * 2^31
-.word 51483005 // zeta^106 * 2^31 = 28678040^106 * 2^31
-.word 11487943 // zeta^234 * 2^31 = 28678040^234 * 2^31
-.word 2185629407 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 42 * 375649793 * 2^31
-.word 1858377073 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 28678040^170 * 375649793 * 2^31
-.word 432623747 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 28678040^106 * 375649793 * 2^31
-.word 2290121529 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 28678040^234 * 375649793 * 2^31
-.word 63287737 // zeta^ 21 * 2^31 = 28678040^ 21 * 2^31
-.word 56338313 // zeta^ 85 * 2^31 = 28678040^ 85 * 2^31
-.word 19445427 // zeta^ 53 * 2^31 = 28678040^ 53 * 2^31
-.word 29167561 // zeta^117 * 2^31 = 28678040^117 * 2^31
-.word 1659340871 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 21 * 375649793 * 2^31
-.word 1504424567 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 85 * 375649793 * 2^31
-.word 3591259981 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 53 * 375649793 * 2^31
-.word 4032612919 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 28678040^117 * 375649793 * 2^31
-.word 7740335 // zeta^149 * 2^31 = 28678040^149 * 2^31
-.word 23515783 // zeta^213 * 2^31 = 28678040^213 * 2^31
-.word 33583453 // zeta^181 * 2^31 = 28678040^181 * 2^31
-.word 60337403 // zeta^245 * 2^31 = 28678040^245 * 2^31
-.word 35192755 // zeta^ 26 * 2^31 = 28678040^ 26 * 2^31
-.word 36544119 // zeta^154 * 2^31 = 28678040^154 * 2^31
-.word 6787663 // zeta^ 90 * 2^31 = 28678040^ 90 * 2^31
-.word 63484749 // zeta^218 * 2^31 = 28678040^218 * 2^31
-.word 3019374157 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 26 * 375649793 * 2^31
-.word 2777089929 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 28678040^154 * 375649793 * 2^31
-.word 443777969 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 90 * 375649793 * 2^31
-.word 723799731 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 28678040^218 * 375649793 * 2^31
-.word 61997615 // zeta^ 13 * 2^31 = 28678040^ 13 * 2^31
-.word 4479011 // zeta^ 77 * 2^31 = 28678040^ 77 * 2^31
-.word 38089877 // zeta^ 45 * 2^31 = 28678040^ 45 * 2^31
-.word 16590903 // zeta^109 * 2^31 = 28678040^109 * 2^31
-.word 201839569 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 13 * 375649793 * 2^31
-.word 998311389 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 77 * 375649793 * 2^31
-.word 1502911851 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 45 * 375649793 * 2^31
-.word 1931017673 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 28678040^109 * 375649793 * 2^31
-.word 43852787 // zeta^141 * 2^31 = 28678040^141 * 2^31
-.word 24597857 // zeta^205 * 2^31 = 28678040^205 * 2^31
-.word 43936833 // zeta^173 * 2^31 = 28678040^173 * 2^31
-.word 15636061 // zeta^237 * 2^31 = 28678040^237 * 2^31
-.word 55869129 // zeta^ 58 * 2^31 = 28678040^ 58 * 2^31
-.word 16038683 // zeta^186 * 2^31 = 28678040^186 * 2^31
-.word 43560065 // zeta^122 * 2^31 = 28678040^122 * 2^31
-.word 25949329 // zeta^250 * 2^31 = 28678040^250 * 2^31
-.word 2098944823 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 58 * 375649793 * 2^31
-.word 634278629 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 28678040^186 * 375649793 * 2^31
-.word 2076204415 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 28678040^122 * 375649793 * 2^31
-.word 2002629999 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 28678040^250 * 375649793 * 2^31
-.word 6591765 // zeta^ 29 * 2^31 = 28678040^ 29 * 2^31
-.word 1696249 // zeta^ 93 * 2^31 = 28678040^ 93 * 2^31
-.word 21795289 // zeta^ 61 * 2^31 = 28678040^ 61 * 2^31
-.word 17734591 // zeta^125 * 2^31 = 28678040^125 * 2^31
-.word 3812244715 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 29 * 375649793 * 2^31
-.word 1467340807 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 93 * 375649793 * 2^31
-.word 1570891815 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 61 * 375649793 * 2^31
-.word 1349179969 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 28678040^125 * 375649793 * 2^31
-.word 66853037 // zeta^157 * 2^31 = 28678040^157 * 2^31
-.word 24930199 // zeta^221 * 2^31 = 28678040^221 * 2^31
-.word 54854635 // zeta^189 * 2^31 = 28678040^189 * 2^31
-.word 39952565 // zeta^253 * 2^31 = 28678040^253 * 2^31
-.word 5623923 // zeta^ 6 * 2^31 = 28678040^ 6 * 2^31
-.word 38701067 // zeta^134 * 2^31 = 28678040^134 * 2^31
-.word 18571677 // zeta^ 70 * 2^31 = 28678040^ 70 * 2^31
-.word 14491707 // zeta^198 * 2^31 = 28678040^198 * 2^31
-.word 182627725 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 6 * 375649793 * 2^31
-.word 4172670453 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 28678040^134 * 375649793 * 2^31
-.word 1902166115 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 70 * 375649793 * 2^31
-.word 4183371205 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 28678040^198 * 375649793 * 2^31
-.word 17941849 // zeta^ 3 * 2^31 = 28678040^ 3 * 2^31
-.word 12982967 // zeta^ 67 * 2^31 = 28678040^ 67 * 2^31
-.word 8061707 // zeta^ 35 * 2^31 = 28678040^ 35 * 2^31
-.word 17774995 // zeta^ 99 * 2^31 = 28678040^ 99 * 2^31
-.word 4091524263 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 3 * 375649793 * 2^31
-.word 2462649161 // zeta^ 67 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 67 * 375649793 * 2^31
-.word 2874632949 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 35 * 375649793 * 2^31
-.word 2009367661 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 99 * 375649793 * 2^31
-.word 61107981 // zeta^131 * 2^31 = 28678040^131 * 2^31
-.word 38975641 // zeta^195 * 2^31 = 28678040^195 * 2^31
-.word 40352225 // zeta^163 * 2^31 = 28678040^163 * 2^31
-.word 49569327 // zeta^227 * 2^31 = 28678040^227 * 2^31
-.word 26799603 // zeta^ 38 * 2^31 = 28678040^ 38 * 2^31
-.word 33463463 // zeta^166 * 2^31 = 28678040^166 * 2^31
-.word 39332725 // zeta^102 * 2^31 = 28678040^102 * 2^31
-.word 61125067 // zeta^230 * 2^31 = 28678040^230 * 2^31
-.word 583438349 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 38 * 375649793 * 2^31
-.word 1692658009 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 28678040^166 * 375649793 * 2^31
-.word 1738958475 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 28678040^102 * 375649793 * 2^31
-.word 2248227893 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 28678040^230 * 375649793 * 2^31
-.word 40014327 // zeta^ 19 * 2^31 = 28678040^ 19 * 2^31
-.word 562885 // zeta^ 83 * 2^31 = 28678040^ 83 * 2^31
-.word 51009393 // zeta^ 51 * 2^31 = 28678040^ 51 * 2^31
-.word 51995259 // zeta^115 * 2^31 = 28678040^115 * 2^31
-.word 2564101129 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 19 * 375649793 * 2^31
-.word 2196183867 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 83 * 375649793 * 2^31
-.word 2252083855 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 51 * 375649793 * 2^31
-.word 4038290309 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 28678040^115 * 375649793 * 2^31
-.word 24330211 // zeta^147 * 2^31 = 28678040^147 * 2^31
-.word 7682101 // zeta^211 * 2^31 = 28678040^211 * 2^31
-.word 7401943 // zeta^179 * 2^31 = 28678040^179 * 2^31
-.word 41757453 // zeta^243 * 2^31 = 28678040^243 * 2^31
-.word 65375453 // zeta^ 22 * 2^31 = 28678040^ 22 * 2^31
-.word 40797001 // zeta^150 * 2^31 = 28678040^150 * 2^31
-.word 59835311 // zeta^ 86 * 2^31 = 28678040^ 86 * 2^31
-.word 32875577 // zeta^214 * 2^31 = 28678040^214 * 2^31
-.word 4014413091 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 22 * 375649793 * 2^31
-.word 3224262327 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 28678040^150 * 375649793 * 2^31
-.word 741855825 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 86 * 375649793 * 2^31
-.word 2318439879 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 28678040^214 * 375649793 * 2^31
-.word 10045293 // zeta^ 11 * 2^31 = 28678040^ 11 * 2^31
-.word 53076657 // zeta^ 75 * 2^31 = 28678040^ 75 * 2^31
-.word 17896617 // zeta^ 43 * 2^31 = 28678040^ 43 * 2^31
-.word 58413331 // zeta^107 * 2^31 = 28678040^107 * 2^31
-.word 3080518291 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 11 * 375649793 * 2^31
-.word 3700229967 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 75 * 375649793 * 2^31
-.word 297370967 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 43 * 375649793 * 2^31
-.word 2151902445 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 28678040^107 * 375649793 * 2^31
-.word 19472551 // zeta^139 * 2^31 = 28678040^139 * 2^31
-.word 6043561 // zeta^203 * 2^31 = 28678040^203 * 2^31
-.word 20934449 // zeta^171 * 2^31 = 28678040^171 * 2^31
-.word 37620445 // zeta^235 * 2^31 = 28678040^235 * 2^31
-.word 12921459 // zeta^ 54 * 2^31 = 28678040^ 54 * 2^31
-.word 63769677 // zeta^182 * 2^31 = 28678040^182 * 2^31
-.word 61505033 // zeta^118 * 2^31 = 28678040^118 * 2^31
-.word 65692461 // zeta^246 * 2^31 = 28678040^246 * 2^31
-.word 1006064525 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 54 * 375649793 * 2^31
-.word 2459563443 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 28678040^182 * 375649793 * 2^31
-.word 2747128823 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 28678040^118 * 375649793 * 2^31
-.word 2288082643 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 28678040^246 * 375649793 * 2^31
-.word 20171011 // zeta^ 27 * 2^31 = 28678040^ 27 * 2^31
-.word 36495001 // zeta^ 91 * 2^31 = 28678040^ 91 * 2^31
-.word 62685175 // zeta^ 59 * 2^31 = 28678040^ 59 * 2^31
-.word 664745 // zeta^123 * 2^31 = 28678040^123 * 2^31
-.word 1031427325 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 27 * 375649793 * 2^31
-.word 2764118887 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 91 * 375649793 * 2^31
-.word 583476745 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 59 * 375649793 * 2^31
-.word 2371908951 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 28678040^123 * 375649793 * 2^31
-.word 56713759 // zeta^155 * 2^31 = 28678040^155 * 2^31
-.word 59594509 // zeta^219 * 2^31 = 28678040^219 * 2^31
-.word 41235703 // zeta^187 * 2^31 = 28678040^187 * 2^31
-.word 11581499 // zeta^251 * 2^31 = 28678040^251 * 2^31
-.word 23458751 // zeta^ 14 * 2^31 = 28678040^ 14 * 2^31
-.word 9406759 // zeta^142 * 2^31 = 28678040^142 * 2^31
-.word 33711991 // zeta^ 78 * 2^31 = 28678040^ 78 * 2^31
-.word 32167773 // zeta^206 * 2^31 = 28678040^206 * 2^31
-.word 1501790785 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 14 * 375649793 * 2^31
-.word 2911894745 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 28678040^142 * 375649793 * 2^31
-.word 1905016457 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 78 * 375649793 * 2^31
-.word 204130979 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 28678040^206 * 375649793 * 2^31
-.word 26043621 // zeta^ 7 * 2^31 = 28678040^ 7 * 2^31
-.word 51942461 // zeta^ 71 * 2^31 = 28678040^ 71 * 2^31
-.word 14401009 // zeta^ 39 * 2^31 = 28678040^ 39 * 2^31
-.word 60574133 // zeta^103 * 2^31 = 28678040^103 * 2^31
-.word 1827638555 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 7 * 375649793 * 2^31
-.word 3437088195 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 71 * 375649793 * 2^31
-.word 2892737551 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 39 * 375649793 * 2^31
-.word 3197159499 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 28678040^103 * 375649793 * 2^31
-.word 16031087 // zeta^135 * 2^31 = 28678040^135 * 2^31
-.word 25566271 // zeta^199 * 2^31 = 28678040^199 * 2^31
-.word 54040269 // zeta^167 * 2^31 = 28678040^167 * 2^31
-.word 36895029 // zeta^231 * 2^31 = 28678040^231 * 2^31
-.word 41803191 // zeta^ 46 * 2^31 = 28678040^ 46 * 2^31
-.word 19377381 // zeta^174 * 2^31 = 28678040^174 * 2^31
-.word 9664027 // zeta^110 * 2^31 = 28678040^110 * 2^31
-.word 55794235 // zeta^238 * 2^31 = 28678040^238 * 2^31
-.word 2460960841 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 46 * 375649793 * 2^31
-.word 1411728667 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 28678040^174 * 375649793 * 2^31
-.word 1300076517 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 28678040^110 * 375649793 * 2^31
-.word 3978752965 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 28678040^238 * 375649793 * 2^31
-.word 19675339 // zeta^ 23 * 2^31 = 28678040^ 23 * 2^31
-.word 21359151 // zeta^ 87 * 2^31 = 28678040^ 87 * 2^31
-.word 63140729 // zeta^ 55 * 2^31 = 28678040^ 55 * 2^31
-.word 23160723 // zeta^119 * 2^31 = 28678040^119 * 2^31
-.word 398439733 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 23 * 375649793 * 2^31
-.word 897838033 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 87 * 375649793 * 2^31
-.word 494618247 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 55 * 375649793 * 2^31
-.word 3040761453 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 28678040^119 * 375649793 * 2^31
-.word 9258847 // zeta^151 * 2^31 = 28678040^151 * 2^31
-.word 4669959 // zeta^215 * 2^31 = 28678040^215 * 2^31
-.word 41266143 // zeta^183 * 2^31 = 28678040^183 * 2^31
-.word 61464071 // zeta^247 * 2^31 = 28678040^247 * 2^31
-.word 43355169 // zeta^ 30 * 2^31 = 28678040^ 30 * 2^31
-.word 5591977 // zeta^158 * 2^31 = 28678040^158 * 2^31
-.word 40694335 // zeta^ 94 * 2^31 = 28678040^ 94 * 2^31
-.word 25071607 // zeta^222 * 2^31 = 28678040^222 * 2^31
-.word 1107279327 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 30 * 375649793 * 2^31
-.word 552289879 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 28678040^158 * 375649793 * 2^31
-.word 879592385 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 94 * 375649793 * 2^31
-.word 2040862217 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 28678040^222 * 375649793 * 2^31
-.word 34737117 // zeta^ 15 * 2^31 = 28678040^ 15 * 2^31
-.word 45994147 // zeta^ 79 * 2^31 = 28678040^ 79 * 2^31
-.word 42273719 // zeta^ 47 * 2^31 = 28678040^ 47 * 2^31
-.word 60428681 // zeta^111 * 2^31 = 28678040^111 * 2^31
-.word 303076899 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 15 * 375649793 * 2^31
-.word 3854339421 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 79 * 375649793 * 2^31
-.word 3799259721 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 47 * 375649793 * 2^31
-.word 1636911223 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 28678040^111 * 375649793 * 2^31
-.word 26028927 // zeta^143 * 2^31 = 28678040^143 * 2^31
-.word 64083527 // zeta^207 * 2^31 = 28678040^207 * 2^31
-.word 60382541 // zeta^175 * 2^31 = 28678040^175 * 2^31
-.word 31337387 // zeta^239 * 2^31 = 28678040^239 * 2^31
-.word 27553395 // zeta^ 62 * 2^31 = 28678040^ 62 * 2^31
-.word 7648471 // zeta^190 * 2^31 = 28678040^190 * 2^31
-.word 689375 // zeta^126 * 2^31 = 28678040^126 * 2^31
-.word 46555773 // zeta^254 * 2^31 = 28678040^254 * 2^31
-.word 1673531277 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 62 * 375649793 * 2^31
-.word 1889513769 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 28678040^190 * 375649793 * 2^31
-.word 1477062945 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 28678040^126 * 375649793 * 2^31
-.word 2252242819 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 28678040^254 * 375649793 * 2^31
-.word 15797163 // zeta^ 31 * 2^31 = 28678040^ 31 * 2^31
-.word 40170027 // zeta^ 95 * 2^31 = 28678040^ 95 * 2^31
-.word 10866061 // zeta^ 63 * 2^31 = 28678040^ 63 * 2^31
-.word 56298001 // zeta^127 * 2^31 = 28678040^127 * 2^31
-.word 683123285 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 31 * 375649793 * 2^31
-.word 2755967957 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 95 * 375649793 * 2^31
-.word 273527923 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 63 * 375649793 * 2^31
-.word 644194287 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 28678040^127 * 375649793 * 2^31
-.word 50400667 // zeta^159 * 2^31 = 28678040^159 * 2^31
-.word 33861863 // zeta^223 * 2^31 = 28678040^223 * 2^31
-.word 53736885 // zeta^191 * 2^31 = 28678040^191 * 2^31
-.word 31774129 // zeta^255 * 2^31 = 28678040^255 * 2^31
-.align 4
-barrett_offsets:
-.byte 0
-.byte 64
-.byte 128
-.byte 192
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_u32_33556993_28678040, %function
-.global ntt_u32_33556993_28678040
-ntt_u32_33556993_28678040:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d0-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Using modulus 33556993
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #-240]
-vqrdmulh.s32 Q1, Q0, r10
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #-496]
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #256]
-vqrdmlah.s32 Q1, Q0, r12
-vqrdmulh.s32 Q4, Q2, r10
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r12
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #0]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #-480]
-vmul.u32 Q4, Q4, r9
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r10
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r12
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #16]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #-208]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #-464]
-vmul.u32 Q1, Q1, r9
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #288]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #32]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #-192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vmul.u32 Q0, Q0, r9
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #-176]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #-432]
-vmul.u32 Q2, Q2, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #64]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #-160]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #-416]
-vmul.u32 Q1, Q1, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #80]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #-144]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #-400]
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #352]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #96]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #-128]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #-384]
-vmul.u32 Q2, Q2, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #368]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #112]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #-112]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #-368]
-vmul.u32 Q1, Q1, r9
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #384]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #-96]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #-352]
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #400]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #144]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #416]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 
Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// 
input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] 
-vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 
-vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release 
input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// 
input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 
-vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] 
-vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, 
r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, 
[r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 
-vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #144] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #224] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #208] -vqrdmlah.s32 Q2, 
Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #288] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #272] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #368] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #352] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] 
-vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #432] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 
-vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #-448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #-480] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #-384] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] 
-vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #-432] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #-320] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #-352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #-368] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, 
Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #-288] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #-240] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, 
[r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #-144] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #-96] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, 
r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #-32] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
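(Editor's note, not part of the patch.) The deleted code above fixes the prime q = 33556993 and loads `.equ modulus_inv, 3919317503` for the on-the-fly `vmul`/`vqrdmlah` reductions that follow. As the assembly comment states, 375649793 is the inverse of q modulo 2^32; the constant actually loaded is its negation mod 2^32. A minimal Python sanity check of these two constants:

```python
# Sanity-check the Montgomery-style constants from the deleted NTT code.
Q = 33556993  # NTT-friendly prime used throughout this file

# Inverse of Q modulo 2^32, as stated in the assembly comment.
q_inv = pow(Q, -1, 2**32)
assert q_inv == 375649793

# The .equ modulus_inv value is the *negated* inverse mod 2^32,
# which is the form the vmul.u32 / vqrdmlah.s32 sequence consumes.
modulus_inv = (-q_inv) % 2**32
assert modulus_inv == 3919317503
```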
-vldrw.s32 Q5, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q5 -vldrw.s32 Q6, [r11, #-64] -vmul.u32 Q3, Q3, Q6 -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q5, Q2, Q5 -vsub.s32 Q7, Q1, Q4 -vmul.u32 Q2, Q2, Q6 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q5 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q5 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q5, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q6, Q5, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q7, Q5 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q7, Q7, Q6 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q5, Q7, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q7, Q4, Q5 -vstrw.s32 Q7, [r0, #-80] -vadd.s32 Q4, Q4, Q5 -// Butterfly [0, 1, 2, 3] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [16, 17, 18, 19] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [32, 33, 34, 35] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [48, 49, 50, 51] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [64, 65, 66, 67] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [80, 81, 82, 83] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [96, 97, 98, 99] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [112, 113, 114, 115] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [128, 129, 130, 131] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [144, 145, 146, 147] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [160, 161, 162, 163] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [176, 177, 178, 179] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [192, 193, 194, 195] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
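(Editor's note, not part of the patch.) The `vld40.s32`/`vld41.s32`/`vld42.s32`/`vld43.s32` quadruples above load 16 consecutive 32-bit words and de-interleave them across Q0–Q3, so that register Qj ends up holding elements j, j+4, j+8, j+12. In effect this is a 4x4 transpose: lane i of (Q0, Q1, Q2, Q3) holds the contiguous group 4i..4i+3, which lets the final butterfly layers process four groups per vector instruction. A small model of that load pattern:

```python
def vld4_deinterleave(mem):
    """Model MVE vld4{0..3}: load 16 consecutive 32-bit words and
    de-interleave so that register j holds elements j, j+4, j+8, j+12."""
    assert len(mem) == 16
    return [mem[j::4] for j in range(4)]

q0, q1, q2, q3 = vld4_deinterleave(list(range(16)))
# Lane i of (q0, q1, q2, q3) now holds the contiguous group 4*i .. 4*i+3.
```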
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [208, 209, 210, 211] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [224, 225, 226, 227] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vqrdmulh.s32 Q6, Q4, Q6 -vmul.u32 Q4, Q4, Q7 -vqrdmlah.s32 Q6, Q4, r12 -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-16] -vadd.s32 Q5, Q5, Q6 -vstrw.s32 Q5, [r0, #-32] -// Butterfly [240, 241, 242, 243] -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 2883 -// Instruction count: 2429 \ No newline at end of file diff --git 
a/tests/ntt/auto/ntt_u32_33556993_28678040_incomplete.s b/tests/ntt/auto/ntt_u32_33556993_28678040_incomplete.s deleted file mode 100644 index e93e64f..0000000 --- a/tests/ntt/auto/ntt_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,2035 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_u32_33556993_28678040_incomplete, %function -.global ntt_u32_33556993_28678040_incomplete -ntt_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 
-vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 
Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q2, Q1, r12 
-vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, 
Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] 
-vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 
-// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, 
Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, 
Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, 
r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] 
-vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 
-vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// 
Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release 
input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 
Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 
-// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #144] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// 
input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #224] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #208] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #288] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #272] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, 
Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #368] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #352] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #432] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #-448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #-480] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 
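The loads and stores above alternate between `r0`- and `r14`-relative addressing: the MVE `vldrw`/`vstrw` immediate offset only spans a limited range (multiples of 4 up to roughly ±508 bytes), so the code keeps a second base pointer `r14 = r0 + 1008` to reach the upper half of the 256-coefficient, 1024-byte buffer. A minimal sketch of that mapping (`base_for_index` is our own illustrative helper, not part of the source):

```python
# Model of the two-base addressing used in the assembly above: coefficients
# are 32-bit, so input[i] lives at byte offset 4*i from r0.  Offsets past
# the immediate range are reached via the second base r14 = r0 + 1008.
# (base_for_index is a hypothetical helper, not part of the source.)

def base_for_index(i):
    byte = 4 * i
    if byte <= 508:              # reachable from r0 directly
        return ("r0", byte)
    return ("r14", byte - 1008)  # r14 marks r0 + 1008

assert base_for_index(64)  == ("r0", 256)    # "input[64]: ... [r0, #256]"
assert base_for_index(140) == ("r14", -448)  # "input[140]: ... [r14, #-448]"
assert base_for_index(252) == ("r14", 0)     # "input[252]: ... [r14, #0]"
```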
-vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #-384] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #-432] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #-320] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #-352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release 
input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #-368] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #-288] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] 
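Each `vqrdmulh`/`vmul`/`vqrdmlah` triple above multiplies a vector of coefficients by a twiddle factor modulo q = 33556993 via rounding Montgomery multiplication: the low product is scaled by a precomputed factor so that the `vqrdmlah` with the modulus cancels it, leaving a value congruent to the true product. A functional model, under the assumptions that `vqrdmulh`/`vqrdmlah` round to nearest (saturation ignored, as these operands stay in range) and that the low twiddle word is `-b * q^(-1) mod 2^32`:

```python
# Sketch of the three-instruction sequence vqrdmulh / vmul.u32 / vqrdmlah
# (a model under the stated assumptions, not a definitive spec).
q = 33556993

def sgn32(x):                       # reinterpret a 32-bit value as signed
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

def vqrdmulh(a, b):                 # round(2*a*b / 2^32); Python >> floors
    return (2 * a * b + 2**31) >> 32

def montmul(a, zeta):
    b = (zeta * 2**31) % q                   # "hi" twiddle word
    b_p = sgn32(-b * pow(q, -1, 2**32))      # "lo" twiddle word (assumption)
    h = vqrdmulh(a, b)                       # vqrdmulh.s32
    t = sgn32(a * b_p)                       # vmul.u32 keeps the low 32 bits
    return h + vqrdmulh(t, q)                # vqrdmlah.s32 with r12 = q

# The result is congruent to a*zeta mod q and stays within +/- 2q:
for a, zeta in [(1, 1), (2, 1), (-5, 2)]:
    r = montmul(a, zeta)
    assert (r - a * zeta) % q == 0 and abs(r) < 2 * q
```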
-vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #-240] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #-144] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q2, 
Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #-96] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #-32] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// 
Restore MVE vector registers
-vpop {d0-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-.align 4
-barrett_offsets_addr: .word barrett_offsets
-
-// Line count: 2003
-// Instruction count: 1565
\ No newline at end of file
diff --git a/tests/ntt/auto/ntt_u32_33556993_28678040_incomplete_double.s b/tests/ntt/auto/ntt_u32_33556993_28678040_incomplete_double.s
deleted file mode 100644
index ffb5c99..0000000
--- a/tests/ntt/auto/ntt_u32_33556993_28678040_incomplete_double.s
+++ /dev/null
@@ -1,2342 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
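The epilogue above hard-codes the modulus constants, and the twiddle table that follows stores each zeta as a pair of words consumed two at a time by `ldrd r10, r9, [r11], #+8`. Their arithmetic relationships can be spot-checked in Python, using the first table pair (29095681, 3280343807) and the value zeta^128 = 17702291 quoted in the table comments:

```python
# Sanity-check the constants appearing in the assembly above.
q = 33556993                  # ".equ modulus, 33556993"
q_inv = 375649793             # "Modular inverse of 33556993 mod 2^32"
mod_inv = 3919317503          # ".equ modulus_inv, 3919317503"

assert (q * q_inv) % 2**32 == 1       # q_inv really is q^-1 mod 2^32
assert mod_inv == (-q_inv) % 2**32    # modulus_inv is its negation

# First twiddle pair of the roots table: hi = zeta^128 * 2^31 mod q, and
# lo satisfies lo * q = -hi (mod 2^32), which is what lets the trailing
# vqrdmlah cancel the low product exactly.
hi, lo, zeta128 = 29095681, 3280343807, 17702291
assert hi == (zeta128 * 2**31) % q
assert (lo * q + hi) % 2**32 == 0
```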
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_u32_33556993_28678040_incomplete_double, %function -.global ntt_u32_33556993_28678040_incomplete_double -ntt_u32_33556993_28678040_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// 
Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// 
input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, 
#336] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 
-vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as 
Q1
-vldrw.u32 Q1, [r0, #144]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #416]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #160]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #-64]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vmul.u32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #-48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #-304]
-vmul.u32 Q0, Q0, r9
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #448]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #192]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #-32]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vmul.u32 Q2, Q2, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #208]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #-16]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #-272]
-vmul.u32 Q1, Q1, r9
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #480]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #224]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #-256]
-vmul.u32 Q0, Q0, r9
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #496]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #240]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #128]
-vmul.u32 Q2, Q2, r9
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #64]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #208]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #144]
-vmul.u32 Q1, Q1, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #80]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #16]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #224]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #160]
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #96]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #32]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #240]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #176]
-vmul.u32 Q2, Q2, r9
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #112]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #48]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #448]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #384]
-vmul.u32 Q1, Q1, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #256]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #464]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #400]
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[68]: Load as Q1
-vldrw.u32 Q1, [r0, #272]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #480]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #416]
-vmul.u32 Q2, Q2, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #352]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(272)]
-// Release input[68] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #288]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #496]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #432]
-vmul.u32 Q1, Q1, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #368]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #304]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #-304]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #-368]
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #-432]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #-496]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #-288]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #-352]
-vmul.u32 Q2, Q2, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #-416]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #-480]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #-272]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q1, Q1, r9
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #-400]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #-464]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #-256]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #-448]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #-48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #-112]
-vmul.u32 Q2, Q2, r9
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #-176]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #-240]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #-32]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #-96]
-vmul.u32 Q1, Q1, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #-160]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #-224]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #-16]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #-80]
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #-144]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #-208]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #-64]
-vmul.u32 Q2, Q2, r9
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #-128]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #-192]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[12]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vmul.u32 Q1, Q1, r9
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #0]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #112]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vstrw.u32 Q1, [r1, #96]
-vqrdmulh.s32 Q7, Q1, r6
-vadd.s32 Q3, Q3, Q5
-vmul.u32 Q1, Q1, r5
-/// Twist in[8] by r6
-vstrw.u32 Q3, [r1, #64]
-vqrdmlah.s32 Q7, Q1, r12
-// Release input[12] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q3, r6
-vsub.s32 Q4, Q2, Q6
-vmul.u32 Q3, Q3, r5
-vstrw.u32 Q4, [r1,#32]
-vqrdmlah.s32 Q7, Q3, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[8] from Q3
-vqrdmulh.s32 Q7, Q4, r8
-vadd.s32 Q2, Q2, Q6
-vmul.u32 Q4, Q4, r7
-vstrw.u32 Q2, [r1], #128
-vqrdmlah.s32 Q7, Q4, r12
-vneg.s32 Q7, Q7
-// Release input[4] from Q4
-vqrdmulh.s32 Q1, Q2, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q2, Q2, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q2, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[0] from Q2
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[28]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #96]
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #80]
-vqrdmlah.s32 Q1, Q0, r12
-vqrdmulh.s32 Q4, Q2, r10
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r12
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #64]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r12
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #176]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vstrw.u32 Q0, [r1, #96]
-vqrdmulh.s32 Q7, Q0, r6
-vadd.s32 Q2, Q2, Q5
-vmul.u32 Q0, Q0, r5
-/// Twist in[24] by r6
-vstrw.u32 Q2, [r1, #64]
-vqrdmlah.s32 Q7, Q0, r12
-// Release input[28] from Q0
-vqrdmlah.s32 Q6, Q3, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q2, r6
-vsub.s32 Q3, Q1, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#32]
-vqrdmlah.s32 Q7, Q2, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[24] from Q2
-vqrdmulh.s32 Q7, Q3, r8
-vadd.s32 Q1, Q1, Q6
-vmul.u32 Q3, Q3, r7
-vstrw.u32 Q1, [r1], #128
-vqrdmlah.s32 Q7, Q3, r12
-vneg.s32 Q7, Q7
-// Release input[20] from Q3
-vqrdmulh.s32 Q0, Q1, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q1, Q1, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q0, [r1,#-112]
-// Release input[16] from Q1
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[44]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #160]
-vmul.u32 Q4, Q4, r9
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #144]
-vqrdmlah.s32 Q0, Q4, r12
-vqrdmulh.s32 Q3, Q1, r10
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r12
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #128]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #240]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[40] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[44] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[40] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[36] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[32] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r10
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #224]
-vmul.u32 Q3, Q3, r9
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #208]
-vqrdmlah.s32 Q0, Q3, r12
-vqrdmulh.s32 Q4, Q1, r10
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r12
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #192]
-vqrdmulh.s32 Q5, Q3, r6
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q3, [r1, #96]
-vqrdmulh.s32 Q7, Q3, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r5
-/// Twist in[56] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q3, r12
-// Release input[60] from Q3
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[56] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[52] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[48] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[76]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #288]
-vmul.u32 Q4, Q4, r9
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #272]
-vqrdmlah.s32 Q0, Q4, r12
-vqrdmulh.s32 Q3, Q1, r10
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #256]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #368]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[72] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[76] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[72] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[68] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[64] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[92]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r10
-// input[88]: Load as Q1
-vldrw.u32 Q1, [r0, #352]
-vmul.u32 Q3, Q3, r9
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #336]
-vqrdmlah.s32 Q0, Q3, r12
-vqrdmulh.s32 Q4, Q1, r10
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r12
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #320]
-vqrdmulh.s32 Q5, Q3, r6
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r12
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q3, [r1, #96]
-vqrdmulh.s32 Q7, Q3, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r5
-/// Twist in[88] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q3, r12
-// Release input[92] from Q3
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[88] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[84] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[80] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[108]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #416]
-vmul.u32 Q4, Q4, r9
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #400]
-vqrdmlah.s32 Q0, Q4, r12
-vqrdmulh.s32 Q3, Q1, r10
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r12
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #384]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #496]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[104] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[108] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[104] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[100] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[96] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[124]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r10
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #480]
-vmul.u32 Q3, Q3, r9
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #464]
-vqrdmlah.s32 Q0, Q3, r12
-vqrdmulh.s32 Q4, Q1, r10
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r12
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #448]
-vqrdmulh.s32 Q5, Q3, r6
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r12
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #-448]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q3, [r1, #96]
-vqrdmulh.s32 Q7, Q3, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r5
-/// Twist in[120] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q3, r12
-// Release input[124] from Q3
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[120] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[116] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[112] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[140]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #-464]
-vmul.u32 Q4, Q4, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #-480]
-vqrdmlah.s32 Q0, Q4, r12
-vqrdmulh.s32 Q3, Q1, r10
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #-496]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #-384]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[136] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[140] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[136] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[132] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[128] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[156]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r10
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #-400]
-vmul.u32 Q3, Q3, r9
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #-416]
-vqrdmlah.s32 Q0, Q3, r12
-vqrdmulh.s32 Q4, Q1, r10
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r12
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r14, #-432]
-vqrdmulh.s32 Q5, Q3, r6
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r12
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #-320]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q3, [r1, #96]
-vqrdmulh.s32 Q7, Q3, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r5
-/// Twist in[152] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q3, r12
-// Release input[156] from Q3
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[152] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[148] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[144] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[172]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #-336]
-vmul.u32 Q4, Q4, r9
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #-352]
-vqrdmlah.s32 Q0, Q4, r12
-vqrdmulh.s32 Q3, Q1, r10
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r12
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #-368]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #-256]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[168] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[172] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[168] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[164] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[160] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[188]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r10
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #-272]
-vmul.u32 Q3, Q3, r9
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #-288]
-vqrdmlah.s32 Q0, Q3, r12
-vqrdmulh.s32 Q4, Q1, r10
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2,
Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #-192] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[184] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[180] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[176] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vmul.u32 Q4, Q4, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #-128] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[200] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[204] from Q4 
-vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[200] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[196] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[192] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #-144] -vmul.u32 Q3, Q3, r9 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #-160] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #-176] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #-64] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[216] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[220] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[216] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 
-vstrw.u32 Q1, [r1,#-112] -// Release input[208] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vmul.u32 Q4, Q4, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #0] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[232] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[236] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[224] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vmul.u32 Q3, Q3, r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 
-vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q6, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[248] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q6, Q3, r12 -// Release input[252] from Q3 -vqrdmlah.s32 Q4, Q2, r12 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r1, #112] -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q6, Q1, r12 -vstrw.u32 Q6, [r1, #80] -// Release input[248] from Q1 -vqrdmulh.s32 Q6, Q2, r8 -vadd.s32 Q0, Q0, Q4 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q6, Q6 -// Release input[244] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q6, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[240] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 2310 -// Instruction count: 1856 \ No newline at end of file diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_complete.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_complete.s deleted file mode 100644 index 4467b89..0000000 --- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_complete.s +++ /dev/null @@ -1,13759 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, 
distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 
* 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 
2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 
286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 
= 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // 
zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // 
zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 
1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 
286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 
2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31 -.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31 -.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 72 
* f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31 -.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31 -.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31 -.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31 -.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31 -.word 
16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31
-.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31
-.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31
-.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31
-.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31
-.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31
-.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31
-.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31
-.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31
-.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31
-.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31
-.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31
-.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31
-.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31
-.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31
-.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31
-.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31
-.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31
-.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31
-.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31
-.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31
-.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31
-.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31
-.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31
-.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31
-.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31
-.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31
-.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31
-.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31
-.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31
-.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31
-.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31
-.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31
-.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31 = 17791697 * 2^31
-.word 3285804833 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31
-.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31 = 29333180 * 2^31
-.word 1876750725 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 286215^260 * 71292929 * 2^31
-.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31
-.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31
-.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31 = 16027071 * 2^31
-.word 3172903227 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31
-.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31 = 27246749 * 2^31
-.word 3890743531 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 286215^388 * 71292929 * 2^31
-.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31
-.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31
-.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31 = 19153009 * 2^31
-.word 3372902219 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31
-.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31 = 14378180 * 2^31
-.word 919922753 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 286215^324 * 71292929 * 2^31
-.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31
-.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31
-.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31 = 23328838 * 2^31
-.word 1492590085 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31
-.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31 = 26950707 * 2^31
-.word 3871802623 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 286215^452 * 71292929 * 2^31
-.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31
-.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31
-.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31 = 31812506 * 2^31
-.word 2035379173 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31
-.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31 = 17437883 * 2^31
-.word 3263167647 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 286215^292 * 71292929 * 2^31
-.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31
-.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31
-.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31 = 8357758 * 2^31
-.word 534733307 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31
-.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31 = 22422281 * 2^31
-.word 3582071787 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 286215^420 * 71292929 * 2^31
-.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31
-.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31
-.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31 = 29650081 * 2^31
-.word 4044509849 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31
-.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31 = 9686916 * 2^31
-.word 619773465 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 286215^356 * 71292929 * 2^31
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31
-.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31
-.word 50433925 // zeta^228 * 2^31 = 286215^228 * 2^31 = 18399952 * 2^31
-.word 1177237627 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31
-.word 12737503 // zeta^484 * 2^31 = 286215^484 * 2^31 = 27755269 * 2^31
-.word 3923278881 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 286215^484 * 71292929 * 2^31
-.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31
-.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31
-.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31 = 22563934 * 2^31
-.word 1443651165 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31
-.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31 = 2438403 * 2^31
-.word 2303493823 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 286215^276 * 71292929 * 2^31
-.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31
-.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31
-.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31 = 31481843 * 2^31
-.word 4161706847 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31
-.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31 = 32076751 * 2^31
-.word 4199769341 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 286215^404 * 71292929 * 2^31
-.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31
-.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31
-.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31 = 18223844 * 2^31
-.word 1165970155 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31
-.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31 = 3973412 * 2^31
-.word 254220777 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 286215^340 * 71292929 * 2^31
-.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31
-.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31
-.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31 = 7405458 * 2^31
-.word 473804703 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31
-.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31 = 33156191 * 2^31
-.word 4268832423 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 286215^468 * 71292929 * 2^31
-.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31
-.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31
-.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 * 2^31 = 22859934 * 2^31
-.word 1462589385 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31
-.word 13659269 // zeta^308 * 2^31 = 286215^308 * 2^31 = 23834070 * 2^31
-.word 1524915067 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 286215^308 * 71292929 * 2^31
-.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31
-.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31
-.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31 = 25149579 * 2^31
-.word 3756565603 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31
-.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31 = 13976724 * 2^31
-.word 894237409 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 286215^436 * 71292929 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31
-.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31 = 15349951 * 2^31
-.word 3129580769 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31
-.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31 = 6932474 * 2^31
-.word 443542963 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 286215^372 * 71292929 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31
-.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31 = 12503729 * 2^31
-.word 2947478141 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31
-.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31 = 10586616 * 2^31
-.word 677336697 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 286215^500 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31
-.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31 = 15322485 * 2^31
-.word 3127823481 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31
-.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31 = 6173403 * 2^31
-.word 2542460889 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 286215^268 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31
-.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31 = 14374018 * 2^31
-.word 919656467 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31
-.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31 = 9325363 * 2^31
-.word 2744124781 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 286215^396 * 71292929 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31
-.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31 = 5605608 * 2^31
-.word 358649449 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31
-.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31 = 25200773 * 2^31
-.word 3759841019 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 286215^332 * 71292929 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31
-.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31 = 31727447 * 2^31
-.word 4177420707 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31
-.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31 = 6658688 * 2^31
-.word 426026005 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 286215^460 * 71292929 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31
-.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31 = 33297705 * 2^31
-.word 4277886557 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31
-.word 7758757 // zeta^300 * 2^31 = 286215^300 * 2^31 = 486950 * 2^31
-.word 31155291 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 286215^300 * 71292929 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31
-.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31 = 13215161 * 2^31
-.word 2992995895 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31
-.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31 = 16752026 * 2^31
-.word 1071802543 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 286215^428 * 71292929 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31
-.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31 = 14102887 * 2^31
-.word 3049793025 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31
-.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31 = 32232983 * 2^31
-.word 4209765139 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 286215^364 * 71292929 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31
-.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31 = 16009575 * 2^31
-.word 3171783825 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31
-.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31 = 5365218 * 2^31
-.word 343269183 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 286215^492 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31
-.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31 = 24042369 * 2^31
-.word 3685725783 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31
-.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 = 27221548 * 2^31
-.word 1741647511 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 286215^284 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31
-.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31 = 7233695 * 2^31
-.word 2610298873 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 286215^156 * 71292929 * 2^31
-.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 = 15385892 * 2^31
-.word 984396643 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 286215^412 * 71292929 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31
-.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31 = 15700554 * 2^31
-.word 1004528867 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31
-.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 = 17178032 * 2^31
-.word 1099058609 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 286215^348 * 71292929 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31
-.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31 = 20482112 * 2^31
-.word 1310455209 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31
-.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 = 31908284 * 2^31
-.word 2041507095 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 286215^476 * 71292929 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31
-.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 = 4869100 * 2^31
-.word 311527319 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31
-.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 = 29810009 * 2^31
-.word 4054742117 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 286215^316 * 71292929 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31
-.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31
-.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 = 11135584 * 2^31
-.word 712459929 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31
-.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 = 18505659 * 2^31
-.word 3331484459 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31
-.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 = 17484839 * 2^31
-.word 3266171913 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31
-.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 = 20168277 * 2^31
-.word 3437859545 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31
-.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31
-.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 = 31283961 * 2^31
-.word 4149046263 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31
-.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 = 17056436 * 2^31
-.word 1091278839 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31
-.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31
-.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31
-.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31
-.word 2147483711 // zeta^ 0 * (q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 387574637 // zeta^256 * (q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31
-.word 1034331227 // zeta^128 * (q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31
-.word 260443775 // zeta^384 * (q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31
-.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31
-.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31
-.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31
-.word 2147483711 // zeta^ 0 * (q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 1034331227 // zeta^128 * (q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31
-.word 225927717 // zeta^ 64 * (q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31
-.word 1061213519 // zeta^192 * (q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31
-.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31
-.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31
-.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31
-.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31
-.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31
-.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31
-.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31
-.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31
-.word 225927717 // zeta^ 64 * (q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31
-.word 2867950541 // zeta^320 * (q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31
-.word 1061213519 // zeta^192 * (q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31
-.word 1485640269 // zeta^448 * (q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31
-.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31
-.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31
-.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31
-.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31
-.word 510244013 // zeta^ 32 * (q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31
-.word 2068958813 // zeta^160 * (q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31
-.word 3798698919 // zeta^ 96 * (q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31
-.word 647594733 // zeta^224 * (q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31
-.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31
-.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31
-.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31
-.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31
-.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31
-.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31
-.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31
-.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31
-.word 510244013 // zeta^ 32 * (q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31
-.word 2908863877 // zeta^288 * (q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31
-.word 2068958813 // zeta^160 * (q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31
-.word 470281299 // zeta^416 * (q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31
-.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31
-.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31
-.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31
-.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31
-.word 2011732375 // zeta^ 16 * (q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31
-.word 72843559 // zeta^144 * (q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31
-.word 974853061 // zeta^ 80 * (q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31
-.word 3736072529 // zeta^208 * (q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31
-.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31
-.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31
-.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31
-.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31
-.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31
-.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31
-.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31
-.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31
-.word 3798698919 // zeta^ 96 * (q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31
-.word 3539370451 // zeta^352 * (q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31
-.word 647594733 // zeta^224 * (q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31
-.word 2984342153 // zeta^480 * (q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31
-.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31
-.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31
-.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31
-.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31
-.word 3670574183 // zeta^ 48 * (q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31
-.word 2705987941 // zeta^176 * (q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31
-.word 113764189 // zeta^112 * (q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31
-.word 390962787 // zeta^240 * (q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31
-.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31
-.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31
-.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31
-.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31
-.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31
-.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31
-.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31
-.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31
-.word 2011732375 // zeta^ 16 * (q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31
-.word 3751646015 // zeta^272 * (q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31
-.word 72843559 // zeta^144 * (q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31
-.word 3932493349 // zeta^400 * (q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31
-.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31
-.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31
-.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31
-.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31
-.word 2896581291 // zeta^ 8 * (q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31
-.word 695081809 // zeta^136 * (q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31
-.word 1871467089 // zeta^ 72 * (q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31
-.word 763860817 // zeta^200 * (q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31
-.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31
-.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31
-.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31
-.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31
-.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31
-.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31
-.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31
-.word 974853061 // zeta^ 80 * (q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31
-.word 4056128829 // zeta^336 * (q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31
-.word 3736072529 // zeta^208 * (q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31
-.word 1697242247 // zeta^464 * (q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31
-.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31
-.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31
-.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31
-.word 307629373 // zeta^ 40 * (q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31
-.word 876871829 // zeta^168 * (q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31
-.word 4216088585 // zeta^104 * (q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31
-.word 2919278235 // zeta^232 * (q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31
-.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31
-.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31
-.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31
-.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31
-.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31
-.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31
-.word 3670574183 // zeta^ 48 * (q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31
-.word 1317277063 // zeta^304 * (q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31
-.word 2705987941 // zeta^176 * (q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31
-.word 3756689085 // zeta^432 * (q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31
-.word 170406997 // zeta^ 24 * (q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 4100667557 // zeta^152 * (q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 3050391755 // zeta^ 88 * (q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 1587133389 // zeta^216 * (q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31
-.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31
-.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31
-.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31
-.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31
-.word 113764189 // zeta^112 * (q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31
-.word 587045533 // zeta^368 * (q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31
-.word 390962787 // zeta^240 * (q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31
-.word 901308915 // zeta^496 * (q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31
-.word 3057097163 // zeta^ 56 * (q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 2957603945 // zeta^184 * (q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 4074479965 // zeta^120 * (q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 2146473971 // zeta^248 * (q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31
-.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31
-.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31
-.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31
-.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31
-.word 2896581291 // zeta^ 8 * (q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31
-.word 1249625647 // zeta^264 * (q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31
-.word 695081809 // zeta^136 * (q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31
-.word 1507019089 // zeta^392 * (q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31
-.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31
-.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31
-.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31
-.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31
-.word 3285804833 // zeta^ 4 * (q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31
-.word 3172903227 // zeta^132 * (q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31
-.word 3372902219 // zeta^ 68 * (q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31
-.word 1492590085 // zeta^196 * (q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31
-.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31
-.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31
-.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31
-.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31
-.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31
-.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31
-.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31
-.word 1871467089 // zeta^ 72 * (q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31
-.word 416692279 // zeta^328 * (q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31
-.word 763860817 // zeta^200 * (q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31
-.word 2350446661 // zeta^456 * (q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31
-.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31
-.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31
-.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31
-.word 50433925 // zeta^228 * 2^31 = 286215^228 * 2^31
-.word 2035379173 // zeta^ 36 * (q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31
-.word 534733307 // zeta^164 * (q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31
-.word 4044509849 // zeta^100 * (q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31
-.word 1177237627 // zeta^228 * (q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31
-.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31
-.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31
-.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31
-.word 12737503 // zeta^484 * 2^31 = 286215^484 * 2^31
-.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31
-.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31
-.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31
-.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31
-.word 307629373 // zeta^ 40 * (q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31
-.word 892719281 // zeta^296 * (q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31
-.word 876871829 // zeta^168 * (q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31
-.word 1664121347 // zeta^424 * (q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31
-.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31
-.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31
-.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31
-.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31
-.word 1443651165 // zeta^ 20 * (q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31
-.word 4161706847 // zeta^148 * (q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31
-.word 1165970155 // zeta^ 84 * (q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31
-.word 473804703 // zeta^212 * (q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31
-.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31
-.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31
-.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31
-.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31
-.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31
-.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31
-.word 4216088585 // zeta^104 * (q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31
-.word 4278606209 // zeta^360 * (q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31
-.word 2919278235 // zeta^232 * (q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 2084247331 // zeta^488 * (q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 * 2^31
-.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31
-.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31
-.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31
-.word 1462589385 // zeta^ 52 * (q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31
-.word 3756565603 // zeta^180 * (q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31
-.word 3129580769 // zeta^116 * (q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31
-.word 2947478141 // zeta^244 * (q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31
-.word 13659269 // zeta^308 * 2^31 = 286215^308 * 2^31
-.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31
-.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31
-.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31
-.word 170406997 // zeta^ 24 * (q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 902884369 // zeta^280 * (q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 4100667557 // zeta^152 * (q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 102337021 // zeta^408 * (q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31
-.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31
-.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31
-.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31
-.word 3127823481 // zeta^ 12 * (q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31
-.word 919656467 // zeta^140 * (q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31
-.word 358649449 // zeta^ 76 * (q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31
-.word 4177420707 // zeta^204 * (q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31
-.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31
-.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31
-.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31
-.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31
-.word 3050391755 // zeta^ 88 * (q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 1885862951 // zeta^344 * (q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 1587133389 // zeta^216 * (q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 2329659789 // zeta^472 * (q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31
-.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31
-.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31
-.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31
-.word 4277886557 // zeta^ 44 * (q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31
-.word 2992995895 // zeta^172 * (q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31
-.word 3049793025 // zeta^108 * (q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31
-.word 3171783825 // zeta^236 * (q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31
-.word 7758757 // zeta^300 * 2^31 = 286215^300 * 2^31
-.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31
-.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31
-.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31
-.word 3057097163 // zeta^ 56 * (q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 3752511543 // zeta^312 * (q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 2957603945 // zeta^184 * (q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 3361899881 // zeta^440 * (q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31
-.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31
-.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31
-.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31
-.word 3685725783 // zeta^ 28 * (q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31
-.word 2610298873 // zeta^156 * (q^(-1) mod 2^32) * 2^31 = 286215^156 *
71292929 * 2^31 -.word 1004528867 // zeta^ 92 * (q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31 -.word 1310455209 // zeta^220 * (q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31 -.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 -.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 -.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 -.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 -.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 -.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 -.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 -.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 -.word 4074479965 // zeta^120 * (q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31 -.word 2389577759 // zeta^376 * (q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31 -.word 2146473971 // zeta^248 * (q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31 -.word 4013033631 // zeta^504 * (q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31 -.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 -.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 -.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 -.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 -.word 311527319 // zeta^ 60 * (q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31 -.word 712459929 // zeta^188 * (q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31 -.word 3266171913 // zeta^124 * (q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31 -.word 4149046263 // zeta^252 * (q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31 -.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 -.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 -.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 -.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 -.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31 -.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31 -.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31 -.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31 -.word 
3285804833 // zeta^ 4 * (q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31 -.word 1876750725 // zeta^260 * (q^(-1) mod 2^32) * 2^31 = 286215^260 * 71292929 * 2^31 -.word 3172903227 // zeta^132 * (q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31 -.word 3890743531 // zeta^388 * (q^(-1) mod 2^32) * 2^31 = 286215^388 * 71292929 * 2^31 -.word 57364657 // zeta^ 2 * 2^31 = 286215^ 2 * 2^31 -.word 65863923 // zeta^130 * 2^31 = 286215^130 * 2^31 -.word 38999497 // zeta^ 66 * 2^31 = 286215^ 66 * 2^31 -.word 39314409 // zeta^194 * 2^31 = 286215^194 * 2^31 -.word 3505411919 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 286215^ 2 * 71292929 * 2^31 -.word 4219008781 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 286215^130 * 71292929 * 2^31 -.word 1658081847 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 286215^ 66 * 71292929 * 2^31 -.word 453280791 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 286215^194 * 71292929 * 2^31 -.word 17742663 // zeta^258 * 2^31 = 286215^258 * 2^31 -.word 45275813 // zeta^386 * 2^31 = 286215^386 * 2^31 -.word 64102957 // zeta^322 * 2^31 = 286215^322 * 2^31 -.word 25400553 // zeta^450 * 2^31 = 286215^450 * 2^31 -.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31 -.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31 -.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31 -.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31 -.word 3372902219 // zeta^ 68 * (q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31 -.word 919922753 // zeta^324 * (q^(-1) mod 2^32) * 2^31 = 286215^324 * 71292929 * 2^31 -.word 1492590085 // zeta^196 * (q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31 -.word 3871802623 // zeta^452 * (q^(-1) mod 2^32) * 2^31 = 286215^452 * 71292929 * 2^31 -.word 8546383 // zeta^ 34 * 2^31 = 286215^ 34 * 2^31 -.word 46173583 // zeta^162 * 2^31 = 286215^162 * 2^31 -.word 66816363 // zeta^ 98 * 2^31 = 286215^ 98 * 2^31 -.word 45664163 // zeta^226 * 2^31 = 286215^226 * 2^31 -.word 269086641 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 286215^ 34 * 71292929 
* 2^31 -.word 1939720817 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 286215^162 * 71292929 * 2^31 -.word 1119694485 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 286215^ 98 * 71292929 * 2^31 -.word 1740157021 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 286215^226 * 71292929 * 2^31 -.word 47863765 // zeta^290 * 2^31 = 286215^290 * 2^31 -.word 7553119 // zeta^418 * 2^31 = 286215^418 * 2^31 -.word 9938685 // zeta^354 * 2^31 = 286215^354 * 2^31 -.word 29035899 // zeta^482 * 2^31 = 286215^482 * 2^31 -.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31 -.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31 -.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31 -.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31 -.word 2035379173 // zeta^ 36 * (q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31 -.word 3263167647 // zeta^292 * (q^(-1) mod 2^32) * 2^31 = 286215^292 * 71292929 * 2^31 -.word 534733307 // zeta^164 * (q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31 -.word 3582071787 // zeta^420 * (q^(-1) mod 2^32) * 2^31 = 286215^420 * 71292929 * 2^31 -.word 4853311 // zeta^ 18 * 2^31 = 286215^ 18 * 2^31 -.word 19847287 // zeta^146 * 2^31 = 286215^146 * 2^31 -.word 47886143 // zeta^ 82 * 2^31 = 286215^ 82 * 2^31 -.word 18266849 // zeta^210 * 2^31 = 286215^210 * 2^31 -.word 103795137 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 286215^ 18 * 71292929 * 2^31 -.word 1457766281 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 286215^146 * 71292929 * 2^31 -.word 1556555969 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 286215^ 82 * 71292929 * 2^31 -.word 1339845919 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 286215^210 * 71292929 * 2^31 -.word 44071181 // zeta^274 * 2^31 = 286215^274 * 2^31 -.word 56515763 // zeta^402 * 2^31 = 286215^402 * 2^31 -.word 16559399 // zeta^338 * 2^31 = 286215^338 * 2^31 -.word 20020769 // zeta^466 * 2^31 = 286215^466 * 2^31 -.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31 -.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31 -.word 50433925 // zeta^228 * 2^31 = 
286215^228 * 2^31 -.word 12737503 // zeta^484 * 2^31 = 286215^484 * 2^31 -.word 4044509849 // zeta^100 * (q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31 -.word 619773465 // zeta^356 * (q^(-1) mod 2^32) * 2^31 = 286215^356 * 71292929 * 2^31 -.word 1177237627 // zeta^228 * (q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31 -.word 3923278881 // zeta^484 * (q^(-1) mod 2^32) * 2^31 = 286215^484 * 71292929 * 2^31 -.word 66706479 // zeta^ 50 * 2^31 = 286215^ 50 * 2^31 -.word 65464889 // zeta^178 * 2^31 = 286215^178 * 2^31 -.word 31632431 // zeta^114 * 2^31 = 286215^114 * 2^31 -.word 16224217 // zeta^242 * 2^31 = 286215^242 * 2^31 -.word 1051556817 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 286215^ 50 * 71292929 * 2^31 -.word 2658270663 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 286215^178 * 71292929 * 2^31 -.word 2705632209 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 286215^114 * 71292929 * 2^31 -.word 1396856871 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 286215^242 * 71292929 * 2^31 -.word 20007053 // zeta^306 * 2^31 = 286215^306 * 2^31 -.word 9706713 // zeta^434 * 2^31 = 286215^434 * 2^31 -.word 28622733 // zeta^370 * 2^31 = 286215^370 * 2^31 -.word 47899901 // zeta^498 * 2^31 = 286215^498 * 2^31 -.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31 -.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31 -.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31 -.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31 -.word 1443651165 // zeta^ 20 * (q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31 -.word 2303493823 // zeta^276 * (q^(-1) mod 2^32) * 2^31 = 286215^276 * 71292929 * 2^31 -.word 4161706847 // zeta^148 * (q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31 -.word 4199769341 // zeta^404 * (q^(-1) mod 2^32) * 2^31 = 286215^404 * 71292929 * 2^31 -.word 27282801 // zeta^ 10 * 2^31 = 286215^ 10 * 2^31 -.word 61894293 // zeta^138 * 2^31 = 286215^138 * 2^31 -.word 56460987 // zeta^ 74 * 2^31 = 286215^ 74 * 2^31 -.word 37053313 // zeta^202 * 2^31 = 286215^202 
* 2^31 -.word 3929627279 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 286215^ 10 * 71292929 * 2^31 -.word 2488719723 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 286215^138 * 71292929 * 2^31 -.word 4277121349 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 286215^ 74 * 71292929 * 2^31 -.word 1897317503 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 286215^202 * 71292929 * 2^31 -.word 4482895 // zeta^266 * 2^31 = 286215^266 * 2^31 -.word 15492289 // zeta^394 * 2^31 = 286215^394 * 2^31 -.word 50954585 // zeta^330 * 2^31 = 286215^330 * 2^31 -.word 51397001 // zeta^458 * 2^31 = 286215^458 * 2^31 -.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31 -.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31 -.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31 -.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31 -.word 1165970155 // zeta^ 84 * (q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31 -.word 254220777 // zeta^340 * (q^(-1) mod 2^32) * 2^31 = 286215^340 * 71292929 * 2^31 -.word 473804703 // zeta^212 * (q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31 -.word 4268832423 // zeta^468 * (q^(-1) mod 2^32) * 2^31 = 286215^468 * 71292929 * 2^31 -.word 21989155 // zeta^ 42 * 2^31 = 286215^ 42 * 2^31 -.word 59599627 // zeta^170 * 2^31 = 286215^170 * 2^31 -.word 49109585 // zeta^106 * 2^31 = 286215^106 * 2^31 -.word 31901721 // zeta^234 * 2^31 = 286215^234 * 2^31 -.word 386789597 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 286215^ 42 * 71292929 * 2^31 -.word 644631797 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 286215^170 * 71292929 * 2^31 -.word 988761519 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 286215^106 * 71292929 * 2^31 -.word 2736594919 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 286215^234 * 71292929 * 2^31 -.word 27458873 // zeta^298 * 2^31 = 286215^298 * 2^31 -.word 19221631 // zeta^426 * 2^31 = 286215^426 * 2^31 -.word 49552591 // zeta^362 * 2^31 = 286215^362 * 2^31 -.word 64086513 // zeta^490 * 2^31 = 286215^490 * 2^31 -.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 * 2^31 -.word 
13659269 // zeta^308 * 2^31 = 286215^308 * 2^31 -.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31 -.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31 -.word 1462589385 // zeta^ 52 * (q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31 -.word 1524915067 // zeta^308 * (q^(-1) mod 2^32) * 2^31 = 286215^308 * 71292929 * 2^31 -.word 3756565603 // zeta^180 * (q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31 -.word 894237409 // zeta^436 * (q^(-1) mod 2^32) * 2^31 = 286215^436 * 71292929 * 2^31 -.word 65803619 // zeta^ 26 * 2^31 = 286215^ 26 * 2^31 -.word 41181789 // zeta^154 * 2^31 = 286215^154 * 2^31 -.word 28235729 // zeta^ 90 * 2^31 = 286215^ 90 * 2^31 -.word 57735669 // zeta^218 * 2^31 = 286215^218 * 2^31 -.word 4205535901 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 286215^ 26 * 71292929 * 2^31 -.word 564798883 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 286215^154 * 71292929 * 2^31 -.word 399101999 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 286215^ 90 * 71292929 * 2^31 -.word 1381846539 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 286215^218 * 71292929 * 2^31 -.word 59515337 // zeta^282 * 2^31 = 286215^282 * 2^31 -.word 23737507 // zeta^410 * 2^31 = 286215^410 * 2^31 -.word 38742465 // zeta^346 * 2^31 = 286215^346 * 2^31 -.word 7373007 // zeta^474 * 2^31 = 286215^474 * 2^31 -.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31 -.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31 -.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31 -.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31 -.word 3129580769 // zeta^116 * (q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31 -.word 443542963 // zeta^372 * (q^(-1) mod 2^32) * 2^31 = 286215^372 * 71292929 * 2^31 -.word 2947478141 // zeta^244 * (q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31 -.word 677336697 // zeta^500 * (q^(-1) mod 2^32) * 2^31 = 286215^500 * 71292929 * 2^31 -.word 36649543 // zeta^ 58 * 2^31 = 286215^ 58 * 2^31 -.word 16801927 // zeta^186 * 2^31 = 286215^186 * 2^31 -.word 39975475 // 
zeta^122 * 2^31 = 286215^122 * 2^31 -.word 34708039 // zeta^250 * 2^31 = 286215^250 * 2^31 -.word 2972442041 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 286215^ 58 * 71292929 * 2^31 -.word 3495212921 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 286215^186 * 71292929 * 2^31 -.word 4092984781 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 286215^122 * 71292929 * 2^31 -.word 273251769 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 286215^250 * 71292929 * 2^31 -.word 57911317 // zeta^314 * 2^31 = 286215^314 * 2^31 -.word 55856649 // zeta^442 * 2^31 = 286215^442 * 2^31 -.word 32224601 // zeta^378 * 2^31 = 286215^378 * 2^31 -.word 66666709 // zeta^506 * 2^31 = 286215^506 * 2^31 -.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31 -.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31 -.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31 -.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31 -.word 3127823481 // zeta^ 12 * (q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31 -.word 2542460889 // zeta^268 * (q^(-1) mod 2^32) * 2^31 = 286215^268 * 71292929 * 2^31 -.word 919656467 // zeta^140 * (q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31 -.word 2744124781 // zeta^396 * (q^(-1) mod 2^32) * 2^31 = 286215^396 * 71292929 * 2^31 -.word 23814037 // zeta^ 6 * 2^31 = 286215^ 6 * 2^31 -.word 18856687 // zeta^134 * 2^31 = 286215^134 * 2^31 -.word 54338297 // zeta^ 70 * 2^31 = 286215^ 70 * 2^31 -.word 56618763 // zeta^198 * 2^31 = 286215^198 * 2^31 -.word 2353260651 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 286215^ 6 * 71292929 * 2^31 -.word 2085985553 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 286215^134 * 71292929 * 2^31 -.word 3891905799 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 286215^ 70 * 71292929 * 2^31 -.word 188336373 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 286215^198 * 71292929 * 2^31 -.word 26618141 // zeta^262 * 2^31 = 286215^262 * 2^31 -.word 56282849 // zeta^390 * 2^31 = 286215^390 * 2^31 -.word 53722505 // zeta^326 * 2^31 = 286215^326 * 2^31 -.word 23316989 // zeta^454 * 2^31 
= 286215^454 * 2^31 -.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31 -.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31 -.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31 -.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31 -.word 358649449 // zeta^ 76 * (q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31 -.word 3759841019 // zeta^332 * (q^(-1) mod 2^32) * 2^31 = 286215^332 * 71292929 * 2^31 -.word 4177420707 // zeta^204 * (q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31 -.word 426026005 // zeta^460 * (q^(-1) mod 2^32) * 2^31 = 286215^460 * 71292929 * 2^31 -.word 42286697 // zeta^ 38 * 2^31 = 286215^ 38 * 2^31 -.word 17252607 // zeta^166 * 2^31 = 286215^166 * 2^31 -.word 14807241 // zeta^102 * 2^31 = 286215^102 * 2^31 -.word 39617057 // zeta^230 * 2^31 = 286215^230 * 2^31 -.word 2432379287 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 286215^ 38 * 71292929 * 2^31 -.word 3848312577 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 286215^166 * 71292929 * 2^31 -.word 4135417655 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 286215^102 * 71292929 * 2^31 -.word 1706599903 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 286215^230 * 71292929 * 2^31 -.word 28566647 // zeta^294 * 2^31 = 286215^294 * 2^31 -.word 57621535 // zeta^422 * 2^31 = 286215^422 * 2^31 -.word 57635731 // zeta^358 * 2^31 = 286215^358 * 2^31 -.word 7226843 // zeta^486 * 2^31 = 286215^486 * 2^31 -.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31 -.word 7758757 // zeta^300 * 2^31 = 286215^300 * 2^31 -.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31 -.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31 -.word 4277886557 // zeta^ 44 * (q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31 -.word 31155291 // zeta^300 * (q^(-1) mod 2^32) * 2^31 = 286215^300 * 71292929 * 2^31 -.word 2992995895 // zeta^172 * (q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31 -.word 1071802543 // zeta^428 * (q^(-1) mod 2^32) * 2^31 = 286215^428 * 71292929 * 2^31 -.word 26740275 // zeta^ 22 * 2^31 = 286215^ 22 * 
2^31 -.word 42796923 // zeta^150 * 2^31 = 286215^150 * 2^31 -.word 27010987 // zeta^ 86 * 2^31 = 286215^ 86 * 2^31 -.word 39320695 // zeta^214 * 2^31 = 286215^214 * 2^31 -.word 1721758157 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 286215^ 22 * 71292929 * 2^31 -.word 1004417157 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 286215^150 * 71292929 * 2^31 -.word 3453390933 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 286215^ 86 * 71292929 * 2^31 -.word 3277495177 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 286215^214 * 71292929 * 2^31 -.word 62169111 // zeta^278 * 2^31 = 286215^278 * 2^31 -.word 56278767 // zeta^406 * 2^31 = 286215^406 * 2^31 -.word 51999501 // zeta^342 * 2^31 = 286215^342 * 2^31 -.word 41776143 // zeta^470 * 2^31 = 286215^470 * 2^31 -.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31 -.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31 -.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31 -.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31 -.word 3049793025 // zeta^108 * (q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31 -.word 4209765139 // zeta^364 * (q^(-1) mod 2^32) * 2^31 = 286215^364 * 71292929 * 2^31 -.word 3171783825 // zeta^236 * (q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31 -.word 343269183 // zeta^492 * (q^(-1) mod 2^32) * 2^31 = 286215^492 * 71292929 * 2^31 -.word 67046997 // zeta^ 54 * 2^31 = 286215^ 54 * 2^31 -.word 59339603 // zeta^182 * 2^31 = 286215^182 * 2^31 -.word 47267443 // zeta^118 * 2^31 = 286215^118 * 2^31 -.word 30867557 // zeta^246 * 2^31 = 286215^246 * 2^31 -.word 3976083883 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 286215^ 54 * 71292929 * 2^31 -.word 1438352557 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 286215^182 * 71292929 * 2^31 -.word 1177598349 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 286215^118 * 71292929 * 2^31 -.word 3908618139 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 286215^246 * 71292929 * 2^31 -.word 26032801 // zeta^310 * 2^31 = 286215^310 * 2^31 -.word 61420673 // zeta^438 * 2^31 = 286215^438 * 2^31 
-.word 14848525 // zeta^374 * 2^31 = 286215^374 * 2^31 -.word 51582797 // zeta^502 * 2^31 = 286215^502 * 2^31 -.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31 -.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 -.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31 -.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 -.word 3685725783 // zeta^ 28 * (q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31 -.word 1741647511 // zeta^284 * (q^(-1) mod 2^32) * 2^31 = 286215^284 * 71292929 * 2^31 -.word 2610298873 // zeta^156 * (q^(-1) mod 2^32) * 2^31 = 286215^156 * 71292929 * 2^31 -.word 984396643 // zeta^412 * (q^(-1) mod 2^32) * 2^31 = 286215^412 * 71292929 * 2^31 -.word 62478247 // zeta^ 14 * 2^31 = 286215^ 14 * 2^31 -.word 13974447 // zeta^142 * 2^31 = 286215^142 * 2^31 -.word 14999777 // zeta^ 78 * 2^31 = 286215^ 78 * 2^31 -.word 59134963 // zeta^206 * 2^31 = 286215^206 * 2^31 -.word 1815658585 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 286215^ 14 * 71292929 * 2^31 -.word 2831031377 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 286215^142 * 71292929 * 2^31 -.word 100550431 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 286215^ 78 * 71292929 * 2^31 -.word 819438605 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 286215^206 * 71292929 * 2^31 -.word 53918527 // zeta^270 * 2^31 = 286215^270 * 2^31 -.word 10108243 // zeta^398 * 2^31 = 286215^398 * 2^31 -.word 10961253 // zeta^334 * 2^31 = 286215^334 * 2^31 -.word 23786629 // zeta^462 * 2^31 = 286215^462 * 2^31 -.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31 -.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 -.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31 -.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 -.word 1004528867 // zeta^ 92 * (q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31 -.word 1099058609 // zeta^348 * (q^(-1) mod 2^32) * 2^31 = 286215^348 * 71292929 * 2^31 -.word 1310455209 // zeta^220 * (q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31 -.word 2041507095 // zeta^476 * (q^(-1) mod 
2^32) * 2^31 = 286215^476 * 71292929 * 2^31 -.word 59164407 // zeta^ 46 * 2^31 = 286215^ 46 * 2^31 -.word 66143065 // zeta^174 * 2^31 = 286215^174 * 2^31 -.word 43155485 // zeta^110 * 2^31 = 286215^110 * 2^31 -.word 17669861 // zeta^238 * 2^31 = 286215^238 * 2^31 -.word 1909444873 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 286215^ 46 * 71292929 * 2^31 -.word 1951704231 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 286215^174 * 71292929 * 2^31 -.word 1714554851 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 286215^110 * 71292929 * 2^31 -.word 3532007707 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 286215^238 * 71292929 * 2^31 -.word 24091995 // zeta^302 * 2^31 = 286215^302 * 2^31 -.word 16101757 // zeta^430 * 2^31 = 286215^430 * 2^31 -.word 13774769 // zeta^366 * 2^31 = 286215^366 * 2^31 -.word 36746905 // zeta^494 * 2^31 = 286215^494 * 2^31 -.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 -.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 -.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 -.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 -.word 311527319 // zeta^ 60 * (q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31 -.word 4054742117 // zeta^316 * (q^(-1) mod 2^32) * 2^31 = 286215^316 * 71292929 * 2^31 -.word 712459929 // zeta^188 * (q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31 -.word 3331484459 // zeta^444 * (q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31 -.word 36697917 // zeta^ 30 * 2^31 = 286215^ 30 * 2^31 -.word 58452265 // zeta^158 * 2^31 = 286215^158 * 2^31 -.word 13961957 // zeta^ 94 * 2^31 = 286215^ 94 * 2^31 -.word 61179875 // zeta^222 * 2^31 = 286215^222 * 2^31 -.word 3107033283 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 286215^ 30 * 71292929 * 2^31 -.word 1790082775 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 286215^158 * 71292929 * 2^31 -.word 4221484315 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 286215^ 94 * 71292929 * 2^31 -.word 1423306781 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 286215^222 * 71292929 * 2^31 -.word 6463229 // 
zeta^286 * 2^31 = 286215^286 * 2^31 -.word 13236309 // zeta^414 * 2^31 = 286215^414 * 2^31 -.word 4183205 // zeta^350 * 2^31 = 286215^350 * 2^31 -.word 45952127 // zeta^478 * 2^31 = 286215^478 * 2^31 -.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 -.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 -.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 -.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 -.word 3266171913 // zeta^124 * (q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31 -.word 3437859545 // zeta^380 * (q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31 -.word 4149046263 // zeta^252 * (q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31 -.word 1091278839 // zeta^508 * (q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31 -.word 18340771 // zeta^ 62 * 2^31 = 286215^ 62 * 2^31 -.word 29457983 // zeta^190 * 2^31 = 286215^190 * 2^31 -.word 11263143 // zeta^126 * 2^31 = 286215^126 * 2^31 -.word 47890357 // zeta^254 * 2^31 = 286215^254 * 2^31 -.word 1148820573 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 286215^ 62 * 71292929 * 2^31 -.word 2922928577 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 286215^190 * 71292929 * 2^31 -.word 336477017 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 286215^126 * 71292929 * 2^31 -.word 1775863883 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 286215^254 * 71292929 * 2^31 -.word 64730193 // zeta^318 * 2^31 = 286215^318 * 2^31 -.word 58025703 // zeta^446 * 2^31 = 286215^446 * 2^31 -.word 7013271 // zeta^382 * 2^31 = 286215^382 * 2^31 -.word 34564147 // zeta^510 * 2^31 = 286215^510 * 2^31 -.word 57364657 // zeta^ 2 * 2^31 = 286215^ 2 * 2^31 -.word 17742663 // zeta^258 * 2^31 = 286215^258 * 2^31 -.word 65863923 // zeta^130 * 2^31 = 286215^130 * 2^31 -.word 45275813 // zeta^386 * 2^31 = 286215^386 * 2^31 -.word 3505411919 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 286215^ 2 * 71292929 * 2^31 -.word 1584684217 // zeta^258 * (q^(-1) mod 2^32) * 2^31 = 286215^258 * 71292929 * 2^31 -.word 4219008781 // zeta^130 * (q^(-1) mod 
2^32) * 2^31 = 286215^130 * 71292929 * 2^31 -.word 2989944155 // zeta^386 * (q^(-1) mod 2^32) * 2^31 = 286215^386 * 71292929 * 2^31 -.word 777237 // zeta^ 1 * 2^31 = 286215^ 1 * 2^31 -.word 55561277 // zeta^129 * 2^31 = 286215^129 * 2^31 -.word 27993389 // zeta^ 65 * 2^31 = 286215^ 65 * 2^31 -.word 38299855 // zeta^193 * 2^31 = 286215^193 * 2^31 -.word 2165795819 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 286215^ 1 * 71292929 * 2^31 -.word 1901706179 // zeta^129 * (q^(-1) mod 2^32) * 2^31 = 286215^129 * 71292929 * 2^31 -.word 3169051347 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 286215^ 65 * 71292929 * 2^31 -.word 3730304817 // zeta^193 * (q^(-1) mod 2^32) * 2^31 = 286215^193 * 71292929 * 2^31 -.word 52753645 // zeta^257 * 2^31 = 286215^257 * 2^31 -.word 40360559 // zeta^385 * 2^31 = 286215^385 * 2^31 -.word 66907459 // zeta^321 * 2^31 = 286215^321 * 2^31 -.word 52619983 // zeta^449 * 2^31 = 286215^449 * 2^31 -.word 38999497 // zeta^ 66 * 2^31 = 286215^ 66 * 2^31 -.word 64102957 // zeta^322 * 2^31 = 286215^322 * 2^31 -.word 39314409 // zeta^194 * 2^31 = 286215^194 * 2^31 -.word 25400553 // zeta^450 * 2^31 = 286215^450 * 2^31 -.word 1658081847 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 286215^ 66 * 71292929 * 2^31 -.word 2453988819 // zeta^322 * (q^(-1) mod 2^32) * 2^31 = 286215^322 * 71292929 * 2^31 -.word 453280791 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 286215^194 * 71292929 * 2^31 -.word 2944455447 // zeta^450 * (q^(-1) mod 2^32) * 2^31 = 286215^450 * 71292929 * 2^31 -.word 40238469 // zeta^ 33 * 2^31 = 286215^ 33 * 2^31 -.word 52734981 // zeta^161 * 2^31 = 286215^161 * 2^31 -.word 33316175 // zeta^ 97 * 2^31 = 286215^ 97 * 2^31 -.word 14516413 // zeta^225 * 2^31 = 286215^225 * 2^31 -.word 2012662395 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 286215^ 33 * 71292929 * 2^31 -.word 2726042619 // zeta^161 * (q^(-1) mod 2^32) * 2^31 = 286215^161 * 71292929 * 2^31 -.word 3705141937 // zeta^ 97 * (q^(-1) mod 2^32) * 2^31 = 286215^ 97 * 71292929 * 2^31 -.word 2013267779 // 
zeta^225 * (q^(-1) mod 2^32) * 2^31 = 286215^225 * 71292929 * 2^31 -.word 57865117 // zeta^289 * 2^31 = 286215^289 * 2^31 -.word 51220645 // zeta^417 * 2^31 = 286215^417 * 2^31 -.word 15187781 // zeta^353 * 2^31 = 286215^353 * 2^31 -.word 50834899 // zeta^481 * 2^31 = 286215^481 * 2^31 -.word 8546383 // zeta^ 34 * 2^31 = 286215^ 34 * 2^31 -.word 47863765 // zeta^290 * 2^31 = 286215^290 * 2^31 -.word 46173583 // zeta^162 * 2^31 = 286215^162 * 2^31 -.word 7553119 // zeta^418 * 2^31 = 286215^418 * 2^31 -.word 269086641 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 286215^ 34 * 71292929 * 2^31 -.word 3516854315 // zeta^290 * (q^(-1) mod 2^32) * 2^31 = 286215^290 * 71292929 * 2^31 -.word 1939720817 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 286215^162 * 71292929 * 2^31 -.word 1843107745 // zeta^418 * (q^(-1) mod 2^32) * 2^31 = 286215^418 * 71292929 * 2^31 -.word 55488825 // zeta^ 17 * 2^31 = 286215^ 17 * 2^31 -.word 40612837 // zeta^145 * 2^31 = 286215^145 * 2^31 -.word 25879965 // zeta^ 81 * 2^31 = 286215^ 81 * 2^31 -.word 22638261 // zeta^209 * 2^31 = 286215^209 * 2^31 -.word 371340999 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 286215^ 17 * 71292929 * 2^31 -.word 1148195867 // zeta^145 * (q^(-1) mod 2^32) * 2^31 = 286215^145 * 71292929 * 2^31 -.word 3608519267 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 286215^ 81 * 71292929 * 2^31 -.word 1991432523 // zeta^209 * (q^(-1) mod 2^32) * 2^31 = 286215^209 * 71292929 * 2^31 -.word 39399403 // zeta^273 * 2^31 = 286215^273 * 2^31 -.word 36951535 // zeta^401 * 2^31 = 286215^401 * 2^31 -.word 13079509 // zeta^337 * 2^31 = 286215^337 * 2^31 -.word 16731335 // zeta^465 * 2^31 = 286215^465 * 2^31 -.word 66816363 // zeta^ 98 * 2^31 = 286215^ 98 * 2^31 -.word 9938685 // zeta^354 * 2^31 = 286215^354 * 2^31 -.word 45664163 // zeta^226 * 2^31 = 286215^226 * 2^31 -.word 29035899 // zeta^482 * 2^31 = 286215^482 * 2^31 -.word 1119694485 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 286215^ 98 * 71292929 * 2^31 -.word 4265599235 // zeta^354 * (q^(-1) mod 
2^32) * 2^31 = 286215^354 * 71292929 * 2^31 -.word 1740157021 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 286215^226 * 71292929 * 2^31 -.word 3986696837 // zeta^482 * (q^(-1) mod 2^32) * 2^31 = 286215^482 * 71292929 * 2^31 -.word 30605811 // zeta^ 49 * 2^31 = 286215^ 49 * 2^31 -.word 33632567 // zeta^177 * 2^31 = 286215^177 * 2^31 -.word 53046167 // zeta^113 * 2^31 = 286215^113 * 2^31 -.word 61827119 // zeta^241 * 2^31 = 286215^241 * 2^31 -.word 2914711053 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 286215^ 49 * 71292929 * 2^31 -.word 66021065 // zeta^177 * (q^(-1) mod 2^32) * 2^31 = 286215^177 * 71292929 * 2^31 -.word 870722665 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 286215^113 * 71292929 * 2^31 -.word 2631397329 // zeta^241 * (q^(-1) mod 2^32) * 2^31 = 286215^241 * 71292929 * 2^31 -.word 338279 // zeta^305 * 2^31 = 286215^305 * 2^31 -.word 13681319 // zeta^433 * 2^31 = 286215^433 * 2^31 -.word 64110191 // zeta^369 * 2^31 = 286215^369 * 2^31 -.word 33513593 // zeta^497 * 2^31 = 286215^497 * 2^31 -.word 4853311 // zeta^ 18 * 2^31 = 286215^ 18 * 2^31 -.word 44071181 // zeta^274 * 2^31 = 286215^274 * 2^31 -.word 19847287 // zeta^146 * 2^31 = 286215^146 * 2^31 -.word 56515763 // zeta^402 * 2^31 = 286215^402 * 2^31 -.word 103795137 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 286215^ 18 * 71292929 * 2^31 -.word 2567540467 // zeta^274 * (q^(-1) mod 2^32) * 2^31 = 286215^274 * 71292929 * 2^31 -.word 1457766281 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 286215^146 * 71292929 * 2^31 -.word 3260914509 // zeta^402 * (q^(-1) mod 2^32) * 2^31 = 286215^402 * 71292929 * 2^31 -.word 9976091 // zeta^ 9 * 2^31 = 286215^ 9 * 2^31 -.word 46674479 // zeta^137 * 2^31 = 286215^137 * 2^31 -.word 52121925 // zeta^ 73 * 2^31 = 286215^ 73 * 2^31 -.word 17479817 // zeta^201 * 2^31 = 286215^201 * 2^31 -.word 362020581 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 286215^ 9 * 71292929 * 2^31 -.word 4249822673 // zeta^137 * (q^(-1) mod 2^32) * 2^31 = 286215^137 * 71292929 * 2^31 -.word 3696719547 // zeta^ 73 * 
(q^(-1) mod 2^32) * 2^31 = 286215^ 73 * 71292929 * 2^31 -.word 1703587703 // zeta^201 * (q^(-1) mod 2^32) * 2^31 = 286215^201 * 71292929 * 2^31 -.word 33171245 // zeta^265 * 2^31 = 286215^265 * 2^31 -.word 39814911 // zeta^393 * 2^31 = 286215^393 * 2^31 -.word 3213399 // zeta^329 * 2^31 = 286215^329 * 2^31 -.word 7982571 // zeta^457 * 2^31 = 286215^457 * 2^31 -.word 47886143 // zeta^ 82 * 2^31 = 286215^ 82 * 2^31 -.word 16559399 // zeta^338 * 2^31 = 286215^338 * 2^31 -.word 18266849 // zeta^210 * 2^31 = 286215^210 * 2^31 -.word 20020769 // zeta^466 * 2^31 = 286215^466 * 2^31 -.word 1556555969 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 286215^ 82 * 71292929 * 2^31 -.word 2488363737 // zeta^338 * (q^(-1) mod 2^32) * 2^31 = 286215^338 * 71292929 * 2^31 -.word 1339845919 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 286215^210 * 71292929 * 2^31 -.word 2923669983 // zeta^466 * (q^(-1) mod 2^32) * 2^31 = 286215^466 * 71292929 * 2^31 -.word 35162565 // zeta^ 41 * 2^31 = 286215^ 41 * 2^31 -.word 16918403 // zeta^169 * 2^31 = 286215^169 * 2^31 -.word 16550605 // zeta^105 * 2^31 = 286215^105 * 2^31 -.word 33023345 // zeta^233 * 2^31 = 286215^233 * 2^31 -.word 1311653435 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 286215^ 41 * 71292929 * 2^31 -.word 1751797885 // zeta^169 * (q^(-1) mod 2^32) * 2^31 = 286215^169 * 71292929 * 2^31 -.word 2373156147 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 286215^105 * 71292929 * 2^31 -.word 2577515151 // zeta^233 * (q^(-1) mod 2^32) * 2^31 = 286215^233 * 71292929 * 2^31 -.word 5341079 // zeta^297 * 2^31 = 286215^297 * 2^31 -.word 36198995 // zeta^425 * 2^31 = 286215^425 * 2^31 -.word 48512539 // zeta^361 * 2^31 = 286215^361 * 2^31 -.word 31279171 // zeta^489 * 2^31 = 286215^489 * 2^31 -.word 66706479 // zeta^ 50 * 2^31 = 286215^ 50 * 2^31 -.word 20007053 // zeta^306 * 2^31 = 286215^306 * 2^31 -.word 65464889 // zeta^178 * 2^31 = 286215^178 * 2^31 -.word 9706713 // zeta^434 * 2^31 = 286215^434 * 2^31 -.word 1051556817 // zeta^ 50 * (q^(-1) mod 2^32) * 
2^31 = 286215^ 50 * 71292929 * 2^31 -.word 1524940659 // zeta^306 * (q^(-1) mod 2^32) * 2^31 = 286215^306 * 71292929 * 2^31 -.word 2658270663 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 286215^178 * 71292929 * 2^31 -.word 2309868327 // zeta^434 * (q^(-1) mod 2^32) * 2^31 = 286215^434 * 71292929 * 2^31 -.word 8917739 // zeta^ 25 * 2^31 = 286215^ 25 * 2^31 -.word 8543559 // zeta^153 * 2^31 = 286215^153 * 2^31 -.word 26606741 // zeta^ 89 * 2^31 = 286215^ 89 * 2^31 -.word 29856419 // zeta^217 * 2^31 = 286215^217 * 2^31 -.word 3685524757 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 286215^ 25 * 71292929 * 2^31 -.word 4031822521 // zeta^153 * (q^(-1) mod 2^32) * 2^31 = 286215^153 * 71292929 * 2^31 -.word 4104211307 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 286215^ 89 * 71292929 * 2^31 -.word 4167165277 // zeta^217 * (q^(-1) mod 2^32) * 2^31 = 286215^217 * 71292929 * 2^31 -.word 40333871 // zeta^281 * 2^31 = 286215^281 * 2^31 -.word 58896547 // zeta^409 * 2^31 = 286215^409 * 2^31 -.word 61609543 // zeta^345 * 2^31 = 286215^345 * 2^31 -.word 28944345 // zeta^473 * 2^31 = 286215^473 * 2^31 -.word 31632431 // zeta^114 * 2^31 = 286215^114 * 2^31 -.word 28622733 // zeta^370 * 2^31 = 286215^370 * 2^31 -.word 16224217 // zeta^242 * 2^31 = 286215^242 * 2^31 -.word 47899901 // zeta^498 * 2^31 = 286215^498 * 2^31 -.word 2705632209 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 286215^114 * 71292929 * 2^31 -.word 620316787 // zeta^370 * (q^(-1) mod 2^32) * 2^31 = 286215^370 * 71292929 * 2^31 -.word 1396856871 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 286215^242 * 71292929 * 2^31 -.word 4255949571 // zeta^498 * (q^(-1) mod 2^32) * 2^31 = 286215^498 * 71292929 * 2^31 -.word 56256591 // zeta^ 57 * 2^31 = 286215^ 57 * 2^31 -.word 5941703 // zeta^185 * 2^31 = 286215^185 * 2^31 -.word 6746731 // zeta^121 * 2^31 = 286215^121 * 2^31 -.word 19093221 // zeta^249 * 2^31 = 286215^249 * 2^31 -.word 3442601905 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 286215^ 57 * 71292929 * 2^31 -.word 2624351801 // zeta^185 * 
(q^(-1) mod 2^32) * 2^31 = 286215^185 * 71292929 * 2^31 -.word 3468281237 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 286215^121 * 71292929 * 2^31 -.word 925921563 // zeta^249 * (q^(-1) mod 2^32) * 2^31 = 286215^249 * 71292929 * 2^31 -.word 14554217 // zeta^313 * 2^31 = 286215^313 * 2^31 -.word 22619629 // zeta^441 * 2^31 = 286215^441 * 2^31 -.word 31005115 // zeta^377 * 2^31 = 286215^377 * 2^31 -.word 32877347 // zeta^505 * 2^31 = 286215^505 * 2^31 -.word 27282801 // zeta^ 10 * 2^31 = 286215^ 10 * 2^31 -.word 4482895 // zeta^266 * 2^31 = 286215^266 * 2^31 -.word 61894293 // zeta^138 * 2^31 = 286215^138 * 2^31 -.word 15492289 // zeta^394 * 2^31 = 286215^394 * 2^31 -.word 3929627279 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 286215^ 10 * 71292929 * 2^31 -.word 2686447793 // zeta^266 * (q^(-1) mod 2^32) * 2^31 = 286215^266 * 71292929 * 2^31 -.word 2488719723 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 286215^138 * 71292929 * 2^31 -.word 3130114879 // zeta^394 * (q^(-1) mod 2^32) * 2^31 = 286215^394 * 71292929 * 2^31 -.word 55571919 // zeta^ 5 * 2^31 = 286215^ 5 * 2^31 -.word 42621683 // zeta^133 * 2^31 = 286215^133 * 2^31 -.word 24875211 // zeta^ 69 * 2^31 = 286215^ 69 * 2^31 -.word 30494603 // zeta^197 * 2^31 = 286215^197 * 2^31 -.word 3411567153 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 286215^ 5 * 71292929 * 2^31 -.word 317431053 // zeta^133 * (q^(-1) mod 2^32) * 2^31 = 286215^133 * 71292929 * 2^31 -.word 3999541045 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 286215^ 69 * 71292929 * 2^31 -.word 2749130869 // zeta^197 * (q^(-1) mod 2^32) * 2^31 = 286215^197 * 71292929 * 2^31 -.word 62253083 // zeta^261 * 2^31 = 286215^261 * 2^31 -.word 20928585 // zeta^389 * 2^31 = 286215^389 * 2^31 -.word 63436675 // zeta^325 * 2^31 = 286215^325 * 2^31 -.word 45530719 // zeta^453 * 2^31 = 286215^453 * 2^31 -.word 56460987 // zeta^ 74 * 2^31 = 286215^ 74 * 2^31 -.word 50954585 // zeta^330 * 2^31 = 286215^330 * 2^31 -.word 37053313 // zeta^202 * 2^31 = 286215^202 * 2^31 -.word 51397001 // 
zeta^458 * 2^31 = 286215^458 * 2^31 -.word 4277121349 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 286215^ 74 * 71292929 * 2^31 -.word 3203163815 // zeta^330 * (q^(-1) mod 2^32) * 2^31 = 286215^330 * 71292929 * 2^31 -.word 1897317503 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 286215^202 * 71292929 * 2^31 -.word 15541879 // zeta^458 * (q^(-1) mod 2^32) * 2^31 = 286215^458 * 71292929 * 2^31 -.word 4019723 // zeta^ 37 * 2^31 = 286215^ 37 * 2^31 -.word 19765591 // zeta^165 * 2^31 = 286215^165 * 2^31 -.word 38300473 // zeta^101 * 2^31 = 286215^101 * 2^31 -.word 55444149 // zeta^229 * 2^31 = 286215^229 * 2^31 -.word 3866386933 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 286215^ 37 * 71292929 * 2^31 -.word 1829240489 // zeta^165 * (q^(-1) mod 2^32) * 2^31 = 286215^165 * 71292929 * 2^31 -.word 2620947655 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 286215^101 * 71292929 * 2^31 -.word 2883470667 // zeta^229 * (q^(-1) mod 2^32) * 2^31 = 286215^229 * 71292929 * 2^31 -.word 20041217 // zeta^293 * 2^31 = 286215^293 * 2^31 -.word 44031883 // zeta^421 * 2^31 = 286215^421 * 2^31 -.word 37036443 // zeta^357 * 2^31 = 286215^357 * 2^31 -.word 3898577 // zeta^485 * 2^31 = 286215^485 * 2^31 -.word 21989155 // zeta^ 42 * 2^31 = 286215^ 42 * 2^31 -.word 27458873 // zeta^298 * 2^31 = 286215^298 * 2^31 -.word 59599627 // zeta^170 * 2^31 = 286215^170 * 2^31 -.word 19221631 // zeta^426 * 2^31 = 286215^426 * 2^31 -.word 386789597 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 286215^ 42 * 71292929 * 2^31 -.word 1135471303 // zeta^298 * (q^(-1) mod 2^32) * 2^31 = 286215^298 * 71292929 * 2^31 -.word 644631797 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 286215^170 * 71292929 * 2^31 -.word 3071183745 // zeta^426 * (q^(-1) mod 2^32) * 2^31 = 286215^426 * 71292929 * 2^31 -.word 16292531 // zeta^ 21 * 2^31 = 286215^ 21 * 2^31 -.word 5202753 // zeta^149 * 2^31 = 286215^149 * 2^31 -.word 54527047 // zeta^ 85 * 2^31 = 286215^ 85 * 2^31 -.word 25139487 // zeta^213 * 2^31 = 286215^213 * 2^31 -.word 1584618829 // zeta^ 21 * 
(q^(-1) mod 2^32) * 2^31 = 286215^ 21 * 71292929 * 2^31 -.word 2465383615 // zeta^149 * (q^(-1) mod 2^32) * 2^31 = 286215^149 * 71292929 * 2^31 -.word 3484095417 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 286215^ 85 * 71292929 * 2^31 -.word 715996897 // zeta^213 * (q^(-1) mod 2^32) * 2^31 = 286215^213 * 71292929 * 2^31 -.word 21336239 // zeta^277 * 2^31 = 286215^277 * 2^31 -.word 2776701 // zeta^405 * 2^31 = 286215^405 * 2^31 -.word 30824587 // zeta^341 * 2^31 = 286215^341 * 2^31 -.word 917673 // zeta^469 * 2^31 = 286215^469 * 2^31 -.word 49109585 // zeta^106 * 2^31 = 286215^106 * 2^31 -.word 49552591 // zeta^362 * 2^31 = 286215^362 * 2^31 -.word 31901721 // zeta^234 * 2^31 = 286215^234 * 2^31 -.word 64086513 // zeta^490 * 2^31 = 286215^490 * 2^31 -.word 988761519 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 286215^106 * 71292929 * 2^31 -.word 2982951729 // zeta^362 * (q^(-1) mod 2^32) * 2^31 = 286215^362 * 71292929 * 2^31 -.word 2736594919 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 286215^234 * 71292929 * 2^31 -.word 2268841487 // zeta^490 * (q^(-1) mod 2^32) * 2^31 = 286215^490 * 71292929 * 2^31 -.word 26175789 // zeta^ 53 * 2^31 = 286215^ 53 * 2^31 -.word 57588867 // zeta^181 * 2^31 = 286215^181 * 2^31 -.word 53976883 // zeta^117 * 2^31 = 286215^117 * 2^31 -.word 45396353 // zeta^245 * 2^31 = 286215^245 * 2^31 -.word 1738514131 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 286215^ 53 * 71292929 * 2^31 -.word 491109245 // zeta^181 * (q^(-1) mod 2^32) * 2^31 = 286215^181 * 71292929 * 2^31 -.word 350771405 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 286215^117 * 71292929 * 2^31 -.word 3569841791 // zeta^245 * (q^(-1) mod 2^32) * 2^31 = 286215^245 * 71292929 * 2^31 -.word 8824487 // zeta^309 * 2^31 = 286215^309 * 2^31 -.word 54179811 // zeta^437 * 2^31 = 286215^437 * 2^31 -.word 4772975 // zeta^373 * 2^31 = 286215^373 * 2^31 -.word 30768061 // zeta^501 * 2^31 = 286215^501 * 2^31 -.word 65803619 // zeta^ 26 * 2^31 = 286215^ 26 * 2^31 -.word 59515337 // zeta^282 * 2^31 = 286215^282 * 
2^31 -.word 41181789 // zeta^154 * 2^31 = 286215^154 * 2^31 -.word 23737507 // zeta^410 * 2^31 = 286215^410 * 2^31 -.word 4205535901 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 286215^ 26 * 71292929 * 2^31 -.word 1266370103 // zeta^282 * (q^(-1) mod 2^32) * 2^31 = 286215^282 * 71292929 * 2^31 -.word 564798883 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 286215^154 * 71292929 * 2^31 -.word 3792651101 // zeta^410 * (q^(-1) mod 2^32) * 2^31 = 286215^410 * 71292929 * 2^31 -.word 53517469 // zeta^ 13 * 2^31 = 286215^ 13 * 2^31 -.word 50784889 // zeta^141 * 2^31 = 286215^141 * 2^31 -.word 23566331 // zeta^ 77 * 2^31 = 286215^ 77 * 2^31 -.word 48988223 // zeta^205 * 2^31 = 286215^205 * 2^31 -.word 4194823011 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 286215^ 13 * 71292929 * 2^31 -.word 2405170567 // zeta^141 * (q^(-1) mod 2^32) * 2^31 = 286215^141 * 71292929 * 2^31 -.word 1134010373 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 286215^ 77 * 71292929 * 2^31 -.word 3177076673 // zeta^205 * (q^(-1) mod 2^32) * 2^31 = 286215^205 * 71292929 * 2^31 -.word 60092815 // zeta^269 * 2^31 = 286215^269 * 2^31 -.word 61674411 // zeta^397 * 2^31 = 286215^397 * 2^31 -.word 39757667 // zeta^333 * 2^31 = 286215^333 * 2^31 -.word 8190513 // zeta^461 * 2^31 = 286215^461 * 2^31 -.word 28235729 // zeta^ 90 * 2^31 = 286215^ 90 * 2^31 -.word 38742465 // zeta^346 * 2^31 = 286215^346 * 2^31 -.word 57735669 // zeta^218 * 2^31 = 286215^218 * 2^31 -.word 7373007 // zeta^474 * 2^31 = 286215^474 * 2^31 -.word 399101999 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 286215^ 90 * 71292929 * 2^31 -.word 3891723839 // zeta^346 * (q^(-1) mod 2^32) * 2^31 = 286215^346 * 71292929 * 2^31 -.word 1381846539 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 286215^218 * 71292929 * 2^31 -.word 602920753 // zeta^474 * (q^(-1) mod 2^32) * 2^31 = 286215^474 * 71292929 * 2^31 -.word 32965743 // zeta^ 45 * 2^31 = 286215^ 45 * 2^31 -.word 52246735 // zeta^173 * 2^31 = 286215^173 * 2^31 -.word 3192263 // zeta^109 * 2^31 = 286215^109 * 2^31 -.word 
4211023 // zeta^237 * 2^31 = 286215^237 * 2^31 -.word 3204076433 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 286215^ 45 * 71292929 * 2^31 -.word 503521073 // zeta^173 * (q^(-1) mod 2^32) * 2^31 = 286215^173 * 71292929 * 2^31 -.word 242639417 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 286215^109 * 71292929 * 2^31 -.word 2050234033 // zeta^237 * (q^(-1) mod 2^32) * 2^31 = 286215^237 * 71292929 * 2^31 -.word 33869075 // zeta^301 * 2^31 = 286215^301 * 2^31 -.word 54305367 // zeta^429 * 2^31 = 286215^429 * 2^31 -.word 62210117 // zeta^365 * 2^31 = 286215^365 * 2^31 -.word 4533819 // zeta^493 * 2^31 = 286215^493 * 2^31 -.word 36649543 // zeta^ 58 * 2^31 = 286215^ 58 * 2^31 -.word 57911317 // zeta^314 * 2^31 = 286215^314 * 2^31 -.word 16801927 // zeta^186 * 2^31 = 286215^186 * 2^31 -.word 55856649 // zeta^442 * 2^31 = 286215^442 * 2^31 -.word 2972442041 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 286215^ 58 * 71292929 * 2^31 -.word 3046088683 // zeta^314 * (q^(-1) mod 2^32) * 2^31 = 286215^314 * 71292929 * 2^31 -.word 3495212921 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 286215^186 * 71292929 * 2^31 -.word 2191333879 // zeta^442 * (q^(-1) mod 2^32) * 2^31 = 286215^442 * 71292929 * 2^31 -.word 24096471 // zeta^ 29 * 2^31 = 286215^ 29 * 2^31 -.word 11285177 // zeta^157 * 2^31 = 286215^157 * 2^31 -.word 44020707 // zeta^ 93 * 2^31 = 286215^ 93 * 2^31 -.word 32962701 // zeta^221 * 2^31 = 286215^221 * 2^31 -.word 1612835113 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 286215^ 29 * 71292929 * 2^31 -.word 1426109767 // zeta^157 * (q^(-1) mod 2^32) * 2^31 = 286215^157 * 71292929 * 2^31 -.word 1824244765 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 286215^ 93 * 71292929 * 2^31 -.word 1033834355 // zeta^221 * (q^(-1) mod 2^32) * 2^31 = 286215^221 * 71292929 * 2^31 -.word 1983307 // zeta^285 * 2^31 = 286215^285 * 2^31 -.word 29279291 // zeta^413 * 2^31 = 286215^413 * 2^31 -.word 12866971 // zeta^349 * 2^31 = 286215^349 * 2^31 -.word 66909741 // zeta^477 * 2^31 = 286215^477 * 2^31 -.word 39975475 // 
zeta^122 * 2^31 = 286215^122 * 2^31 -.word 32224601 // zeta^378 * 2^31 = 286215^378 * 2^31 -.word 34708039 // zeta^250 * 2^31 = 286215^250 * 2^31 -.word 66666709 // zeta^506 * 2^31 = 286215^506 * 2^31 -.word 4092984781 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 286215^122 * 71292929 * 2^31 -.word 405418663 // zeta^378 * (q^(-1) mod 2^32) * 2^31 = 286215^378 * 71292929 * 2^31 -.word 273251769 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 286215^250 * 71292929 * 2^31 -.word 1692927787 // zeta^506 * (q^(-1) mod 2^32) * 2^31 = 286215^506 * 71292929 * 2^31 -.word 61360623 // zeta^ 61 * 2^31 = 286215^ 61 * 2^31 -.word 34886301 // zeta^189 * 2^31 = 286215^189 * 2^31 -.word 64746911 // zeta^125 * 2^31 = 286215^125 * 2^31 -.word 64451751 // zeta^253 * 2^31 = 286215^253 * 2^31 -.word 270863889 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 286215^ 61 * 71292929 * 2^31 -.word 261371235 // zeta^189 * (q^(-1) mod 2^32) * 2^31 = 286215^189 * 71292929 * 2^31 -.word 1992614497 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 286215^125 * 71292929 * 2^31 -.word 3768755033 // zeta^253 * (q^(-1) mod 2^32) * 2^31 = 286215^253 * 71292929 * 2^31 -.word 16710617 // zeta^317 * 2^31 = 286215^317 * 2^31 -.word 14951531 // zeta^445 * 2^31 = 286215^445 * 2^31 -.word 23267497 // zeta^381 * 2^31 = 286215^381 * 2^31 -.word 22075887 // zeta^509 * 2^31 = 286215^509 * 2^31 -.word 23814037 // zeta^ 6 * 2^31 = 286215^ 6 * 2^31 -.word 26618141 // zeta^262 * 2^31 = 286215^262 * 2^31 -.word 18856687 // zeta^134 * 2^31 = 286215^134 * 2^31 -.word 56282849 // zeta^390 * 2^31 = 286215^390 * 2^31 -.word 2353260651 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 286215^ 6 * 71292929 * 2^31 -.word 3113639651 // zeta^262 * (q^(-1) mod 2^32) * 2^31 = 286215^262 * 71292929 * 2^31 -.word 2085985553 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 286215^134 * 71292929 * 2^31 -.word 4038613279 // zeta^390 * (q^(-1) mod 2^32) * 2^31 = 286215^390 * 71292929 * 2^31 -.word 62729229 // zeta^ 3 * 2^31 = 286215^ 3 * 2^31 -.word 46907071 // zeta^131 * 2^31 = 
286215^131 * 2^31 -.word 40510321 // zeta^ 67 * 2^31 = 286215^ 67 * 2^31 -.word 18336723 // zeta^195 * 2^31 = 286215^195 * 2^31 -.word 1407507443 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 286215^ 3 * 71292929 * 2^31 -.word 658643265 // zeta^131 * (q^(-1) mod 2^32) * 2^31 = 286215^131 * 71292929 * 2^31 -.word 4074734735 // zeta^ 67 * (q^(-1) mod 2^32) * 2^31 = 286215^ 67 * 71292929 * 2^31 -.word 1979788333 // zeta^195 * (q^(-1) mod 2^32) * 2^31 = 286215^195 * 71292929 * 2^31 -.word 15524337 // zeta^259 * 2^31 = 286215^259 * 2^31 -.word 34995301 // zeta^387 * 2^31 = 286215^387 * 2^31 -.word 39153149 // zeta^323 * 2^31 = 286215^323 * 2^31 -.word 45363787 // zeta^451 * 2^31 = 286215^451 * 2^31 -.word 54338297 // zeta^ 70 * 2^31 = 286215^ 70 * 2^31 -.word 53722505 // zeta^326 * 2^31 = 286215^326 * 2^31 -.word 56618763 // zeta^198 * 2^31 = 286215^198 * 2^31 -.word 23316989 // zeta^454 * 2^31 = 286215^454 * 2^31 -.word 3891905799 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 286215^ 70 * 71292929 * 2^31 -.word 2351540855 // zeta^326 * (q^(-1) mod 2^32) * 2^31 = 286215^326 * 71292929 * 2^31 -.word 188336373 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 286215^198 * 71292929 * 2^31 -.word 585874947 // zeta^454 * (q^(-1) mod 2^32) * 2^31 = 286215^454 * 71292929 * 2^31 -.word 43900797 // zeta^ 35 * 2^31 = 286215^ 35 * 2^31 -.word 19099363 // zeta^163 * 2^31 = 286215^163 * 2^31 -.word 37247565 // zeta^ 99 * 2^31 = 286215^ 99 * 2^31 -.word 20393575 // zeta^227 * 2^31 = 286215^227 * 2^31 -.word 3574442115 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 286215^ 35 * 71292929 * 2^31 -.word 1131415837 // zeta^163 * (q^(-1) mod 2^32) * 2^31 = 286215^163 * 71292929 * 2^31 -.word 77835699 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 286215^ 99 * 71292929 * 2^31 -.word 1749608857 // zeta^227 * (q^(-1) mod 2^32) * 2^31 = 286215^227 * 71292929 * 2^31 -.word 40473217 // zeta^291 * 2^31 = 286215^291 * 2^31 -.word 49625347 // zeta^419 * 2^31 = 286215^419 * 2^31 -.word 61819871 // zeta^355 * 2^31 = 286215^355 * 
2^31 -.word 31056177 // zeta^483 * 2^31 = 286215^483 * 2^31 -.word 42286697 // zeta^ 38 * 2^31 = 286215^ 38 * 2^31 -.word 28566647 // zeta^294 * 2^31 = 286215^294 * 2^31 -.word 17252607 // zeta^166 * 2^31 = 286215^166 * 2^31 -.word 57621535 // zeta^422 * 2^31 = 286215^422 * 2^31 -.word 2432379287 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 286215^ 38 * 71292929 * 2^31 -.word 540980105 // zeta^294 * (q^(-1) mod 2^32) * 2^31 = 286215^294 * 71292929 * 2^31 -.word 3848312577 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 286215^166 * 71292929 * 2^31 -.word 3660946401 // zeta^422 * (q^(-1) mod 2^32) * 2^31 = 286215^422 * 71292929 * 2^31 -.word 49980433 // zeta^ 19 * 2^31 = 286215^ 19 * 2^31 -.word 38860839 // zeta^147 * 2^31 = 286215^147 * 2^31 -.word 975271 // zeta^ 83 * 2^31 = 286215^ 83 * 2^31 -.word 11332017 // zeta^211 * 2^31 = 286215^211 * 2^31 -.word 3731358703 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 286215^ 19 * 71292929 * 2^31 -.word 4273283033 // zeta^147 * (q^(-1) mod 2^32) * 2^31 = 286215^147 * 71292929 * 2^31 -.word 1299396185 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 286215^ 83 * 71292929 * 2^31 -.word 3549871695 // zeta^211 * (q^(-1) mod 2^32) * 2^31 = 286215^211 * 71292929 * 2^31 -.word 27568477 // zeta^275 * 2^31 = 286215^275 * 2^31 -.word 37636193 // zeta^403 * 2^31 = 286215^403 * 2^31 -.word 15169147 // zeta^339 * 2^31 = 286215^339 * 2^31 -.word 16295429 // zeta^467 * 2^31 = 286215^467 * 2^31 -.word 14807241 // zeta^102 * 2^31 = 286215^102 * 2^31 -.word 57635731 // zeta^358 * 2^31 = 286215^358 * 2^31 -.word 39617057 // zeta^230 * 2^31 = 286215^230 * 2^31 -.word 7226843 // zeta^486 * 2^31 = 286215^486 * 2^31 -.word 4135417655 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 286215^102 * 71292929 * 2^31 -.word 903840877 // zeta^358 * (q^(-1) mod 2^32) * 2^31 = 286215^358 * 71292929 * 2^31 -.word 1706599903 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 286215^230 * 71292929 * 2^31 -.word 1471935013 // zeta^486 * (q^(-1) mod 2^32) * 2^31 = 286215^486 * 71292929 * 2^31 -.word 
3332433 // zeta^ 51 * 2^31 = 286215^ 51 * 2^31 -.word 24408307 // zeta^179 * 2^31 = 286215^179 * 2^31 -.word 8472991 // zeta^115 * 2^31 = 286215^115 * 2^31 -.word 8888451 // zeta^243 * 2^31 = 286215^243 * 2^31 -.word 1501679279 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 286215^ 51 * 71292929 * 2^31 -.word 661751565 // zeta^179 * (q^(-1) mod 2^32) * 2^31 = 286215^179 * 71292929 * 2^31 -.word 1329565281 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 286215^115 * 71292929 * 2^31 -.word 63756157 // zeta^243 * (q^(-1) mod 2^32) * 2^31 = 286215^243 * 71292929 * 2^31 -.word 51201903 // zeta^307 * 2^31 = 286215^307 * 2^31 -.word 58877085 // zeta^435 * 2^31 = 286215^435 * 2^31 -.word 58657139 // zeta^371 * 2^31 = 286215^371 * 2^31 -.word 45219173 // zeta^499 * 2^31 = 286215^499 * 2^31 -.word 26740275 // zeta^ 22 * 2^31 = 286215^ 22 * 2^31 -.word 62169111 // zeta^278 * 2^31 = 286215^278 * 2^31 -.word 42796923 // zeta^150 * 2^31 = 286215^150 * 2^31 -.word 56278767 // zeta^406 * 2^31 = 286215^406 * 2^31 -.word 1721758157 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 286215^ 22 * 71292929 * 2^31 -.word 3549362153 // zeta^278 * (q^(-1) mod 2^32) * 2^31 = 286215^278 * 71292929 * 2^31 -.word 1004417157 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 286215^150 * 71292929 * 2^31 -.word 2998573329 // zeta^406 * (q^(-1) mod 2^32) * 2^31 = 286215^406 * 71292929 * 2^31 -.word 59973457 // zeta^ 11 * 2^31 = 286215^ 11 * 2^31 -.word 43437671 // zeta^139 * 2^31 = 286215^139 * 2^31 -.word 1060971 // zeta^ 75 * 2^31 = 286215^ 75 * 2^31 -.word 52769869 // zeta^203 * 2^31 = 286215^203 * 2^31 -.word 3776022703 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 286215^ 11 * 71292929 * 2^31 -.word 1474906521 // zeta^139 * (q^(-1) mod 2^32) * 2^31 = 286215^139 * 71292929 * 2^31 -.word 3233843093 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 286215^ 75 * 71292929 * 2^31 -.word 2244400051 // zeta^203 * (q^(-1) mod 2^32) * 2^31 = 286215^203 * 71292929 * 2^31 -.word 28602327 // zeta^267 * 2^31 = 286215^267 * 2^31 -.word 30804797 // 
zeta^395 * 2^31 = 286215^395 * 2^31 -.word 48997929 // zeta^331 * 2^31 = 286215^331 * 2^31 -.word 2017467 // zeta^459 * 2^31 = 286215^459 * 2^31 -.word 27010987 // zeta^ 86 * 2^31 = 286215^ 86 * 2^31 -.word 51999501 // zeta^342 * 2^31 = 286215^342 * 2^31 -.word 39320695 // zeta^214 * 2^31 = 286215^214 * 2^31 -.word 41776143 // zeta^470 * 2^31 = 286215^470 * 2^31 -.word 3453390933 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 286215^ 86 * 71292929 * 2^31 -.word 4288713971 // zeta^342 * (q^(-1) mod 2^32) * 2^31 = 286215^342 * 71292929 * 2^31 -.word 3277495177 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 286215^214 * 71292929 * 2^31 -.word 1474618353 // zeta^470 * (q^(-1) mod 2^32) * 2^31 = 286215^470 * 71292929 * 2^31 -.word 48422787 // zeta^ 43 * 2^31 = 286215^ 43 * 2^31 -.word 2000399 // zeta^171 * 2^31 = 286215^171 * 2^31 -.word 21758565 // zeta^107 * 2^31 = 286215^107 * 2^31 -.word 18821133 // zeta^235 * 2^31 = 286215^235 * 2^31 -.word 2202638461 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 286215^ 43 * 71292929 * 2^31 -.word 85185009 // zeta^171 * (q^(-1) mod 2^32) * 2^31 = 286215^171 * 71292929 * 2^31 -.word 2983445915 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 286215^107 * 71292929 * 2^31 -.word 2804078579 // zeta^235 * (q^(-1) mod 2^32) * 2^31 = 286215^235 * 71292929 * 2^31 -.word 40282091 // zeta^299 * 2^31 = 286215^299 * 2^31 -.word 694581 // zeta^427 * 2^31 = 286215^427 * 2^31 -.word 9386261 // zeta^363 * 2^31 = 286215^363 * 2^31 -.word 31687909 // zeta^491 * 2^31 = 286215^491 * 2^31 -.word 67046997 // zeta^ 54 * 2^31 = 286215^ 54 * 2^31 -.word 26032801 // zeta^310 * 2^31 = 286215^310 * 2^31 -.word 59339603 // zeta^182 * 2^31 = 286215^182 * 2^31 -.word 61420673 // zeta^438 * 2^31 = 286215^438 * 2^31 -.word 3976083883 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 286215^ 54 * 71292929 * 2^31 -.word 3814452575 // zeta^310 * (q^(-1) mod 2^32) * 2^31 = 286215^310 * 71292929 * 2^31 -.word 1438352557 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 286215^182 * 71292929 * 2^31 -.word 
1212871551 // zeta^438 * (q^(-1) mod 2^32) * 2^31 = 286215^438 * 71292929 * 2^31 -.word 39239633 // zeta^ 27 * 2^31 = 286215^ 27 * 2^31 -.word 6650571 // zeta^155 * 2^31 = 286215^155 * 2^31 -.word 55728179 // zeta^ 91 * 2^31 = 286215^ 91 * 2^31 -.word 53303437 // zeta^219 * 2^31 = 286215^219 * 2^31 -.word 1398925359 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 286215^ 27 * 71292929 * 2^31 -.word 4228529461 // zeta^155 * (q^(-1) mod 2^32) * 2^31 = 286215^155 * 71292929 * 2^31 -.word 28680141 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 286215^ 91 * 71292929 * 2^31 -.word 3144200051 // zeta^219 * (q^(-1) mod 2^32) * 2^31 = 286215^219 * 71292929 * 2^31 -.word 43502609 // zeta^283 * 2^31 = 286215^283 * 2^31 -.word 3716037 // zeta^411 * 2^31 = 286215^411 * 2^31 -.word 47859657 // zeta^347 * 2^31 = 286215^347 * 2^31 -.word 54206995 // zeta^475 * 2^31 = 286215^475 * 2^31 -.word 47267443 // zeta^118 * 2^31 = 286215^118 * 2^31 -.word 14848525 // zeta^374 * 2^31 = 286215^374 * 2^31 -.word 30867557 // zeta^246 * 2^31 = 286215^246 * 2^31 -.word 51582797 // zeta^502 * 2^31 = 286215^502 * 2^31 -.word 1177598349 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 286215^118 * 71292929 * 2^31 -.word 2930734579 // zeta^374 * (q^(-1) mod 2^32) * 2^31 = 286215^374 * 71292929 * 2^31 -.word 3908618139 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 286215^246 * 71292929 * 2^31 -.word 4048613555 // zeta^502 * (q^(-1) mod 2^32) * 2^31 = 286215^502 * 71292929 * 2^31 -.word 17343785 // zeta^ 59 * 2^31 = 286215^ 59 * 2^31 -.word 18576903 // zeta^187 * 2^31 = 286215^187 * 2^31 -.word 54844885 // zeta^123 * 2^31 = 286215^123 * 2^31 -.word 26502613 // zeta^251 * 2^31 = 286215^251 * 2^31 -.word 1787151063 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 286215^ 59 * 71292929 * 2^31 -.word 2878710265 // zeta^187 * (q^(-1) mod 2^32) * 2^31 = 286215^187 * 71292929 * 2^31 -.word 4129581611 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 286215^123 * 71292929 * 2^31 -.word 1695867435 // zeta^251 * (q^(-1) mod 2^32) * 2^31 = 286215^251 * 
71292929 * 2^31 -.word 46515603 // zeta^315 * 2^31 = 286215^315 * 2^31 -.word 22784943 // zeta^443 * 2^31 = 286215^443 * 2^31 -.word 61940237 // zeta^379 * 2^31 = 286215^379 * 2^31 -.word 32550703 // zeta^507 * 2^31 = 286215^507 * 2^31 -.word 62478247 // zeta^ 14 * 2^31 = 286215^ 14 * 2^31 -.word 53918527 // zeta^270 * 2^31 = 286215^270 * 2^31 -.word 13974447 // zeta^142 * 2^31 = 286215^142 * 2^31 -.word 10108243 // zeta^398 * 2^31 = 286215^398 * 2^31 -.word 1815658585 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 286215^ 14 * 71292929 * 2^31 -.word 3192593601 // zeta^270 * (q^(-1) mod 2^32) * 2^31 = 286215^270 * 71292929 * 2^31 -.word 2831031377 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 286215^142 * 71292929 * 2^31 -.word 2017114797 // zeta^398 * (q^(-1) mod 2^32) * 2^31 = 286215^398 * 71292929 * 2^31 -.word 23583191 // zeta^ 7 * 2^31 = 286215^ 7 * 2^31 -.word 1509997 // zeta^135 * 2^31 = 286215^135 * 2^31 -.word 43053267 // zeta^ 71 * 2^31 = 286215^ 71 * 2^31 -.word 47998299 // zeta^199 * 2^31 = 286215^199 * 2^31 -.word 1726070313 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 286215^ 7 * 71292929 * 2^31 -.word 1246363027 // zeta^135 * (q^(-1) mod 2^32) * 2^31 = 286215^135 * 71292929 * 2^31 -.word 575670061 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 286215^ 71 * 71292929 * 2^31 -.word 2855916197 // zeta^199 * (q^(-1) mod 2^32) * 2^31 = 286215^199 * 71292929 * 2^31 -.word 1748775 // zeta^263 * 2^31 = 286215^263 * 2^31 -.word 33596261 // zeta^391 * 2^31 = 286215^391 * 2^31 -.word 8679237 // zeta^327 * 2^31 = 286215^327 * 2^31 -.word 8074045 // zeta^455 * 2^31 = 286215^455 * 2^31 -.word 14999777 // zeta^ 78 * 2^31 = 286215^ 78 * 2^31 -.word 10961253 // zeta^334 * 2^31 = 286215^334 * 2^31 -.word 59134963 // zeta^206 * 2^31 = 286215^206 * 2^31 -.word 23786629 // zeta^462 * 2^31 = 286215^462 * 2^31 -.word 100550431 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 286215^ 78 * 71292929 * 2^31 -.word 877692571 // zeta^334 * (q^(-1) mod 2^32) * 2^31 = 286215^334 * 71292929 * 2^31 -.word 
819438605 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 286215^206 * 71292929 * 2^31 -.word 2139739003 // zeta^462 * (q^(-1) mod 2^32) * 2^31 = 286215^462 * 71292929 * 2^31 -.word 1544785 // zeta^ 39 * 2^31 = 286215^ 39 * 2^31 -.word 54479437 // zeta^167 * 2^31 = 286215^167 * 2^31 -.word 44611143 // zeta^103 * 2^31 = 286215^103 * 2^31 -.word 43877703 // zeta^231 * 2^31 = 286215^231 * 2^31 -.word 3599046063 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 286215^ 39 * 71292929 * 2^31 -.word 421313971 // zeta^167 * (q^(-1) mod 2^32) * 2^31 = 286215^167 * 71292929 * 2^31 -.word 2886885817 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 286215^103 * 71292929 * 2^31 -.word 745902777 // zeta^231 * (q^(-1) mod 2^32) * 2^31 = 286215^231 * 71292929 * 2^31 -.word 49916343 // zeta^295 * 2^31 = 286215^295 * 2^31 -.word 11302783 // zeta^423 * 2^31 = 286215^423 * 2^31 -.word 46650163 // zeta^359 * 2^31 = 286215^359 * 2^31 -.word 41460293 // zeta^487 * 2^31 = 286215^487 * 2^31 -.word 59164407 // zeta^ 46 * 2^31 = 286215^ 46 * 2^31 -.word 24091995 // zeta^302 * 2^31 = 286215^302 * 2^31 -.word 66143065 // zeta^174 * 2^31 = 286215^174 * 2^31 -.word 16101757 // zeta^430 * 2^31 = 286215^430 * 2^31 -.word 1909444873 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 286215^ 46 * 71292929 * 2^31 -.word 2892405413 // zeta^302 * (q^(-1) mod 2^32) * 2^31 = 286215^302 * 71292929 * 2^31 -.word 1951704231 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 286215^174 * 71292929 * 2^31 -.word 260429443 // zeta^430 * (q^(-1) mod 2^32) * 2^31 = 286215^430 * 71292929 * 2^31 -.word 51071665 // zeta^ 23 * 2^31 = 286215^ 23 * 2^31 -.word 29551825 // zeta^151 * 2^31 = 286215^151 * 2^31 -.word 65641461 // zeta^ 87 * 2^31 = 286215^ 87 * 2^31 -.word 4991871 // zeta^215 * 2^31 = 286215^215 * 2^31 -.word 1348492623 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 286215^ 23 * 71292929 * 2^31 -.word 4210932527 // zeta^151 * (q^(-1) mod 2^32) * 2^31 = 286215^151 * 71292929 * 2^31 -.word 2872355851 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 286215^ 87 * 
71292929 * 2^31 -.word 180333697 // zeta^215 * (q^(-1) mod 2^32) * 2^31 = 286215^215 * 71292929 * 2^31 -.word 24878029 // zeta^279 * 2^31 = 286215^279 * 2^31 -.word 6465513 // zeta^407 * 2^31 = 286215^407 * 2^31 -.word 58394439 // zeta^343 * 2^31 = 286215^343 * 2^31 -.word 13917917 // zeta^471 * 2^31 = 286215^471 * 2^31 -.word 43155485 // zeta^110 * 2^31 = 286215^110 * 2^31 -.word 13774769 // zeta^366 * 2^31 = 286215^366 * 2^31 -.word 17669861 // zeta^238 * 2^31 = 286215^238 * 2^31 -.word 36746905 // zeta^494 * 2^31 = 286215^494 * 2^31 -.word 1714554851 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 286215^110 * 71292929 * 2^31 -.word 643921999 // zeta^366 * (q^(-1) mod 2^32) * 2^31 = 286215^366 * 71292929 * 2^31 -.word 3532007707 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 286215^238 * 71292929 * 2^31 -.word 2417439079 // zeta^494 * (q^(-1) mod 2^32) * 2^31 = 286215^494 * 71292929 * 2^31 -.word 60010757 // zeta^ 55 * 2^31 = 286215^ 55 * 2^31 -.word 25675953 // zeta^183 * 2^31 = 286215^183 * 2^31 -.word 6969519 // zeta^119 * 2^31 = 286215^119 * 2^31 -.word 65987733 // zeta^247 * 2^31 = 286215^247 * 2^31 -.word 3134527227 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 286215^ 55 * 71292929 * 2^31 -.word 1167318863 // zeta^183 * (q^(-1) mod 2^32) * 2^31 = 286215^183 * 71292929 * 2^31 -.word 3048275793 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 286215^119 * 71292929 * 2^31 -.word 3599262571 // zeta^247 * (q^(-1) mod 2^32) * 2^31 = 286215^247 * 71292929 * 2^31 -.word 23508291 // zeta^311 * 2^31 = 286215^311 * 2^31 -.word 20438945 // zeta^439 * 2^31 = 286215^439 * 2^31 -.word 45946307 // zeta^375 * 2^31 = 286215^375 * 2^31 -.word 13177575 // zeta^503 * 2^31 = 286215^503 * 2^31 -.word 36697917 // zeta^ 30 * 2^31 = 286215^ 30 * 2^31 -.word 6463229 // zeta^286 * 2^31 = 286215^286 * 2^31 -.word 58452265 // zeta^158 * 2^31 = 286215^158 * 2^31 -.word 13236309 // zeta^414 * 2^31 = 286215^414 * 2^31 -.word 3107033283 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 286215^ 30 * 71292929 * 2^31 -.word 
3040143619 // zeta^286 * (q^(-1) mod 2^32) * 2^31 = 286215^286 * 71292929 * 2^31 -.word 1790082775 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 286215^158 * 71292929 * 2^31 -.word 616779691 // zeta^414 * (q^(-1) mod 2^32) * 2^31 = 286215^414 * 71292929 * 2^31 -.word 27760241 // zeta^ 15 * 2^31 = 286215^ 15 * 2^31 -.word 62784079 // zeta^143 * 2^31 = 286215^143 * 2^31 -.word 38109317 // zeta^ 79 * 2^31 = 286215^ 79 * 2^31 -.word 58557411 // zeta^207 * 2^31 = 286215^207 * 2^31 -.word 3449426319 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 286215^ 15 * 71292929 * 2^31 -.word 3705558449 // zeta^143 * (q^(-1) mod 2^32) * 2^31 = 286215^143 * 71292929 * 2^31 -.word 2760853371 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 286215^ 79 * 71292929 * 2^31 -.word 341701661 // zeta^207 * (q^(-1) mod 2^32) * 2^31 = 286215^207 * 71292929 * 2^31 -.word 60112057 // zeta^271 * 2^31 = 286215^271 * 2^31 -.word 57345683 // zeta^399 * 2^31 = 286215^399 * 2^31 -.word 52171431 // zeta^335 * 2^31 = 286215^335 * 2^31 -.word 33135953 // zeta^463 * 2^31 = 286215^463 * 2^31 -.word 13961957 // zeta^ 94 * 2^31 = 286215^ 94 * 2^31 -.word 4183205 // zeta^350 * 2^31 = 286215^350 * 2^31 -.word 61179875 // zeta^222 * 2^31 = 286215^222 * 2^31 -.word 45952127 // zeta^478 * 2^31 = 286215^478 * 2^31 -.word 4221484315 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 286215^ 94 * 71292929 * 2^31 -.word 1002042203 // zeta^350 * (q^(-1) mod 2^32) * 2^31 = 286215^350 * 71292929 * 2^31 -.word 1423306781 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 286215^222 * 71292929 * 2^31 -.word 1886825345 // zeta^478 * (q^(-1) mod 2^32) * 2^31 = 286215^478 * 71292929 * 2^31 -.word 27574275 // zeta^ 47 * 2^31 = 286215^ 47 * 2^31 -.word 57612861 // zeta^175 * 2^31 = 286215^175 * 2^31 -.word 14604621 // zeta^111 * 2^31 = 286215^111 * 2^31 -.word 55726513 // zeta^239 * 2^31 = 286215^239 * 2^31 -.word 2946217981 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 286215^ 47 * 71292929 * 2^31 -.word 3580521923 // zeta^175 * (q^(-1) mod 2^32) * 2^31 = 286215^175 * 
71292929 * 2^31 -.word 1238707891 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 286215^111 * 71292929 * 2^31 -.word 2838582863 // zeta^239 * (q^(-1) mod 2^32) * 2^31 = 286215^239 * 71292929 * 2^31 -.word 31057151 // zeta^303 * 2^31 = 286215^303 * 2^31 -.word 518163 // zeta^431 * 2^31 = 286215^431 * 2^31 -.word 39018755 // zeta^367 * 2^31 = 286215^367 * 2^31 -.word 25130025 // zeta^495 * 2^31 = 286215^495 * 2^31 -.word 18340771 // zeta^ 62 * 2^31 = 286215^ 62 * 2^31 -.word 64730193 // zeta^318 * 2^31 = 286215^318 * 2^31 -.word 29457983 // zeta^190 * 2^31 = 286215^190 * 2^31 -.word 58025703 // zeta^446 * 2^31 = 286215^446 * 2^31 -.word 1148820573 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 286215^ 62 * 71292929 * 2^31 -.word 4161860527 // zeta^318 * (q^(-1) mod 2^32) * 2^31 = 286215^318 * 71292929 * 2^31 -.word 2922928577 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 286215^190 * 71292929 * 2^31 -.word 4276007193 // zeta^446 * (q^(-1) mod 2^32) * 2^31 = 286215^446 * 71292929 * 2^31 -.word 34062919 // zeta^ 31 * 2^31 = 286215^ 31 * 2^31 -.word 6546201 // zeta^159 * 2^31 = 286215^159 * 2^31 -.word 45814067 // zeta^ 95 * 2^31 = 286215^ 95 * 2^31 -.word 42277717 // zeta^223 * 2^31 = 286215^223 * 2^31 -.word 2257802681 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 286215^ 31 * 71292929 * 2^31 -.word 1893205223 // zeta^159 * (q^(-1) mod 2^32) * 2^31 = 286215^159 * 71292929 * 2^31 -.word 523560653 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 286215^ 95 * 71292929 * 2^31 -.word 2692754603 // zeta^223 * (q^(-1) mod 2^32) * 2^31 = 286215^223 * 71292929 * 2^31 -.word 56829859 // zeta^287 * 2^31 = 286215^287 * 2^31 -.word 52668271 // zeta^415 * 2^31 = 286215^415 * 2^31 -.word 44133165 // zeta^351 * 2^31 = 286215^351 * 2^31 -.word 5172947 // zeta^479 * 2^31 = 286215^479 * 2^31 -.word 11263143 // zeta^126 * 2^31 = 286215^126 * 2^31 -.word 7013271 // zeta^382 * 2^31 = 286215^382 * 2^31 -.word 47890357 // zeta^254 * 2^31 = 286215^254 * 2^31 -.word 34564147 // zeta^510 * 2^31 = 286215^510 * 2^31 -.word 
336477017 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 286215^126 * 71292929 * 2^31
-.word 1986303081 // zeta^382 * (q^(-1) mod 2^32) * 2^31 = 286215^382 * 71292929 * 2^31
-.word 1775863883 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 286215^254 * 71292929 * 2^31
-.word 2373488589 // zeta^510 * (q^(-1) mod 2^32) * 2^31 = 286215^510 * 71292929 * 2^31
-.word 23173257 // zeta^ 63 * 2^31 = 286215^ 63 * 2^31
-.word 5005437 // zeta^191 * 2^31 = 286215^191 * 2^31
-.word 62149479 // zeta^127 * 2^31 = 286215^127 * 2^31
-.word 65886399 // zeta^255 * 2^31 = 286215^255 * 2^31
-.word 4164145015 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 286215^ 63 * 71292929 * 2^31
-.word 3683067779 // zeta^191 * (q^(-1) mod 2^32) * 2^31 = 286215^191 * 71292929 * 2^31
-.word 3012805785 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 286215^127 * 71292929 * 2^31
-.word 3861937985 // zeta^255 * (q^(-1) mod 2^32) * 2^31 = 286215^255 * 71292929 * 2^31
-.word 59633685 // zeta^319 * 2^31 = 286215^319 * 2^31
-.word 26383745 // zeta^447 * 2^31 = 286215^447 * 2^31
-.word 1655173 // zeta^383 * 2^31 = 286215^383 * 2^31
-.word 59872277 // zeta^511 * 2^31 = 286215^511 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_1024_u32_33564673_286215, %function
-.global ntt_1024_u32_33564673_286215
-ntt_1024_u32_33564673_286215:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-// Use r10 as marker for r0 + 4032
-add r10, r11, #1008
-.equ modulus, 33564673
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[768]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 12)]
-vqrdmulh.s32 Q1, Q0, r7
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vqrdmulh.s32 Q4, Q2, r7
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r9
-// input[772]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r5
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r11,#(48)]
-// Release input[768] from Q0
-vqrdmlah.s32 Q6, Q3, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[772]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vmul.u32 Q4, Q4, r6
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r7
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r9
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r9
-// input[776]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 20)]
-vqrdmulh.s32 Q6, Q3, r5
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r11,#(64)]
-// Release input[772] from Q4
-vqrdmlah.s32 Q6, Q3, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[776]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vmul.u32 Q1, Q1, r6
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(48)] -// Release input[264] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[780]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[524]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 20)] -vmul.u32 Q0, Q0, r6 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(80)] -// Release input[524] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[784]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[528]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 24)] -vmul.u32 Q2, Q2, r6 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 
* 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(96)] -// Release input[528] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(80)] -// Release input[272] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[788]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vmul.u32 Q1, Q1, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[792]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r6 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, 
Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[796]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[800]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[800]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[544]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 40)] -vmul.u32 Q1, Q1, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 
48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(176)] -// Release input[800] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(160)] -// Release input[544] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(144)] -// Release input[288] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[804]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vmul.u32 Q0, Q0, r6 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[808]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(192)] -// Release input[804] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[808]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vmul.u32 Q2, Q2, r6 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, 
Q3, Q5 -vstrw.u32 Q2, [r11,#(208)] -// Release input[808] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[812]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vmul.u32 Q1, Q1, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(208)] -// Release input[556] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[816]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[560]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 56)] -vmul.u32 Q0, Q0, r6 -// input[304]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[820]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(240)] -// Release input[816] from Q0 -vqrdmlah.s32 Q6, 
Q4, r9 -vstrw.u32 Q3, [r12,#(224)] -// Release input[560] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(208)] -// Release input[304] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[820]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[564]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 60)] -vmul.u32 Q2, Q2, r6 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[824]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(256)] -// Release input[820] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(240)] -// Release input[564] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[824]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vmul.u32 Q1, Q1, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(272)] -// Release input[824] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q2, 
Q6 -vstrw.u32 Q4, [r14,#(240)] -// Release input[312] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[572]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 68)] -vmul.u32 Q0, Q0, r6 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[832]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(272)] -// Release input[572] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[832]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[576]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 72)] -vmul.u32 Q2, Q2, r6 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[836]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(304)] -// Release input[832] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(288)] -// Release input[576] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q0, Q0, Q6 
-// input[836]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[580]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 76)] -vmul.u32 Q1, Q1, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[840]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(320)] -// Release input[836] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(304)] -// Release input[580] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[840]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r6 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(336)] -// Release input[840] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[844]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[588]: Load as 
Q3 -vldrw.u32 Q3, [r12, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[848]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(336)] -// Release input[588] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[848]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[592]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 88)] -vmul.u32 Q1, Q1, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[852]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(368)] -// Release input[848] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(352)] -// Release input[592] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[852]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q0, Q0, r6 -// input[340]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[84]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[856]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(384)] -// Release input[852] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[856]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(336)] -// Release input[84] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[860]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(400)] -// Release input[856] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[860]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, 
[r0,#(352)] -// Release input[88] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[864]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(416)] -// Release input[860] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[864]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[608]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 104)] -vmul.u32 Q0, Q0, r6 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[868]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(432)] -// Release input[864] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(416)] -// Release input[608] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[868]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vmul.u32 Q2, Q2, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 
Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[872]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(448)] -// Release input[868] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[872]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(464)] -// Release input[872] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q0, Q0, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, 
Q3, r9
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[880]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(480)]
-// Release input[876] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[880]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[624]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 120)]
-vmul.u32 Q2, Q2, r6
-// input[368]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[884]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(496)]
-// Release input[880] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(480)]
-// Release input[624] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(464)]
-// Release input[368] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[884]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[628]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[888]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-496)]
-// Release input[884] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(496)]
-// Release input[628] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[888]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q0, Q0, r6
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[892]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-480)]
-// Release input[888] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(496)]
-// Release input[376] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[892]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[636]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[380]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -124)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[896]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-464)]
-// Release input[892] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-480)]
-// Release input[636] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-496)]
-// Release input[380] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[896]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[640]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[900]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-448)]
-// Release input[896] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-464)]
-// Release input[640] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[900]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[644]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[904]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-432)]
-// Release input[900] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[644] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[904]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[648]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[136]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[908]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-416)]
-// Release input[904] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-432)]
-// Release input[648] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[908]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[652]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-464)]
-// Release input[136] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[912]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-400)]
-// Release input[908] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release input[652] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[912]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[656]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[916]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-384)]
-// Release input[912] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[656] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[916]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[660]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[148]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[920]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-368)]
-// Release input[916] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-384)]
-// Release input[660] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[920]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-416)]
-// Release input[148] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[924]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-352)]
-// Release input[920] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[924]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[668]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -88)]
-vmul.u32 Q0, Q0, r6
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[156]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[928]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-336)]
-// Release input[924] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-352)]
-// Release input[668] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[928]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-384)]
-// Release input[156] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[932]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-320)]
-// Release input[928] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[932]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-368)]
-// Release input[160] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[936]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-304)]
-// Release input[932] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[936]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q0, Q0, r6
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[940]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-288)]
-// Release input[936] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[940]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[428]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-336)]
-// Release input[168] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[172]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[944]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-272)]
-// Release input[940] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-304)]
-// Release input[428] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[944]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[688]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-320)]
-// Release input[172] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[948]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-256)]
-// Release input[944] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-272)]
-// Release input[688] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-288)]
-// Release input[432] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[948]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[692]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -64)]
-vmul.u32 Q0, Q0, r6
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[952]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-240)]
-// Release input[948] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-256)]
-// Release input[692] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[952]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[696]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -60)]
-vmul.u32 Q2, Q2, r6
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[956]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-224)]
-// Release input[952] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-240)]
-// Release input[696] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-256)]
-// Release input[440] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[956]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[700]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -56)]
-vmul.u32 Q1, Q1, r6
-// input[444]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -60)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[960]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-208)]
-// Release input[956] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-224)]
-// Release input[700] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-240)]
-// Release input[444] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[960]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[704]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -52)]
-vmul.u32 Q0, Q0, r6
-// input[448]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -56)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[964]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-192)]
-// Release input[960] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-208)]
-// Release input[704] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-224)]
-// Release input[448] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[964]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[708]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vmul.u32 Q2, Q2, r6
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[968]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-176)]
-// Release input[964] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-208)]
-// Release input[452] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[968]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[712]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vmul.u32 Q1, Q1, r6
-// input[456]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[972]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-160)]
-// Release input[968] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release input[712] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-192)]
-// Release input[456] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[972]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[716]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vmul.u32 Q0, Q0, r6
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -44)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[976]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-144)]
-// Release input[972] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-160)]
-// Release input[716] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-176)]
-// Release input[460] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[976]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[980]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-128)]
-// Release input[976] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[980]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[724]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-176)]
-// Release input[208] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[984]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-112)]
-// Release input[980] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-128)]
-// Release input[724] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[984]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-160)]
-// Release input[212] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[988]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-96)]
-// Release input[984] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[988]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[732]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -24)]
-vmul.u32 Q2, Q2, r6
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[220]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -32)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[992]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-80)]
-// Release input[988] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-96)]
-// Release input[732] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[992]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-128)]
-// Release input[220] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[996]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-64)]
-// Release input[992] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[996]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q0, Q0, r6
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1000]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-48)]
-// Release input[996] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1000]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q2, Q2, r6
-// input[488]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -16)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[232]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1004]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-32)]
-// Release input[1000] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1004]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vmul.u32 Q1, Q1, r6
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-80)]
-// Release input[232] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1008]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-16)]
-// Release input[1004] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release input[748] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-48)]
-// Release input[492] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1008]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[752]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -4)]
-vmul.u32 Q0, Q0, r6
-// input[496]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -8)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1012]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(0)]
-// Release input[1008] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[752] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-32)]
-// Release input[496] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1012]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[756]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1016]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(16)]
-// Release input[1012] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(0)]
-// Release input[756] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-16)]
-// Release input[500] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1016]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[760]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vmul.u32 Q1, Q1, r6
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1020]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(32)]
-// Release input[1016] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(16)]
-// Release input[760] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(0)]
-// Release input[504] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1020]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[764]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[508]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(48)]
-// Release input[1020] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(32)]
-// Release input[764] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(16)]
-// Release input[508] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[192]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vmul.u32 Q2, Q2, r6
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[196]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vmul.u32 Q1, Q1, r6
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q0, Q0, r6
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[204]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q2, Q2, r6
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[208]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q1, Q1, r6
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[16]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[212]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[212]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(64)]
-// Release input[16] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[20]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(96)] -// Release input[24] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[224]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 
-vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[228]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[36]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[232]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(144)] -// Release input[36] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from 
Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[236]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 
-vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 
-// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[448]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[448]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vmul.u32 Q1, Q1, r6 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[452]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[452]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// 
input[388]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[456]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-464)] -// Release input[388] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[456]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-192)] -// Release input[456] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[460]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[396]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -108)] -vmul.u32 Q1, 
Q1, r6 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[464]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-432)] -// Release input[396] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[464]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[272]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[468]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-160)] -// Release input[464] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[468]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] 
-vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(80)] -// Release input[272] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[276]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[472]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-144)] -// Release input[468] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[472]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(96)] -// Release input[276] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[476]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-128)] -// Release input[472] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(112)] -// Release 
input[280] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[284]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[480]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[480]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(128)] -// Release input[284] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[484]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-96)] -// Release input[480] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[484]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[488]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-80)] -// Release input[484] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[488]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[492]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-64)] -// Release input[488] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[492]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 
-// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[496]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-48)] -// Release input[492] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[496]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-32)] -// Release input[496] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] 
-vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[504]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[504]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[312]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[508]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(0)] -// Release input[504] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(240)] -// Release input[312] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 
-vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(16)] -// Release input[508] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[704]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[512]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-208)] -// Release input[704] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[708]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(32)] -// Release input[512] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[516]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, 
r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-192)] -// Release input[708] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[712]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(48)] -// Release input[516] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[520]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(64)] -// Release input[520] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[720]: Load 
as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[720]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[724]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-144)] -// Release input[720] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[724]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(96)] -// Release input[528] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[728]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 
-vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[724] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[728]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(112)] -// Release input[532] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[536]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[732]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-112)] -// Release input[728] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[732]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(128)] -// Release input[536] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[736]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q2, [r11,#(-96)] -// Release input[732] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[736]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[544]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[740]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[736] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[740]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(160)] -// Release input[544] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[548]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[744]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-64)] -// Release input[740] from Q0 
-vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[744]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(176)] -// Release input[548] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[552]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[748]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-48)] -// Release input[744] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[748]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(192)] -// Release input[552] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[748] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release 
input[684] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[752]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[560]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-16)] -// Release input[752] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(224)] -// Release input[560] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[564]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(496)] -// 
Release input[628] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(240)] -// Release input[564] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[568]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(256)] -// Release input[568] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], 
#+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[896]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[832]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[964]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-448)] -// Release input[896] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release input[832] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[964]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[900]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-176)] -// Release input[964] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-432)] -// Release input[900] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[968]: 
Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[840]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(64)] -// Release input[772] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[972]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(336)] -// Release input[840] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[972]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[908]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[844]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[976]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-144)] -// Release input[972] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-400)] -// Release input[908] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(352)] -// Release input[844] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[976]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[912]: Load as 
Q3 -vldrw.u32 Q3, [r10, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[980]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-128)] -// Release input[976] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-384)] -// Release input[912] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[980]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[984]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-112)] -// Release input[980] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[984]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// 
input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-96)] -// Release input[984] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[992]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-336)] -// Release input[924] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[992]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[864]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vqrdmlah.s32 
Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[800]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[996]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-64)] -// Release input[992] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release input[864] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[996]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release input[800] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1000]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-48)] -// Release input[996] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1000]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[872]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(192)] -// Release input[804] from 
Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[808]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1004]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-32)] -// Release input[1000] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(464)] -// Release input[872] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1004]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -// Release input[808] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1008]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-16)] -// Release input[1004] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(480)] -// Release input[876] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1008]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[944]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[880]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, 
r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1012]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(0)] -// Release input[1008] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-256)] -// Release input[944] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(496)] -// Release input[880] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1012]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[948]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[884]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(240)] -// Release input[816] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[820]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1016]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(16)] -// Release input[1012] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-240)] -// Release input[948] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-496)] -// Release input[884] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1016]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[888]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -// Release input[820] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[824]: 
Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1020]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(32)] -// Release input[1016] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-480)] -// Release input[888] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1020]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[956]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -52)] -vmul.u32 Q2, Q2, r6 -// input[892]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -116)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release input[824] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(48)] -// Release input[1020] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-208)] -// Release input[956] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-464)] -// Release input[892] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[48]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q1, Q1, r6 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[0]: Load 
as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q0, Q0, r6 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, 
Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q1, Q1, r6 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[112]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q0, Q0, r6 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// 
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[68]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(272)]
-// Release input[68] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q2, Q2, r6
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[140]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q1, Q1, r6
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-448)]
-// Release input[140] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vmul.u32 Q0, Q0, r6
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[260]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(32)]
-// Release input[260] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[372]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(480)]
-// Release input[372] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(288)]
-// Release input[324] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[380]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(304)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[332]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(320)]
-// Release input[332] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[384]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[436]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-480)]
-// Release input[384] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[388]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-464)]
-// Release input[388] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[444]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[444]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[396]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-240)]
-// Release input[444] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[500]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[508]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[508]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[560]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[512]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[564]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[564]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(32)]
-// Release input[512] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(240)]
-// Release input[564] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[568]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[572]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[572]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q2, Q2, r6
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(64)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[524]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[624]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(272)]
-// Release input[572] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[624]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(80)]
-// Release input[524] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[576]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[628]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(480)]
-// Release input[624] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[628]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[580]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[632]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(496)]
-// Release input[628] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[632]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(304)]
-// Release input[580] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[584]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[636]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-496)]
-// Release input[632] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[636]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q1, Q1, r6
-// input[604]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(320)]
-// Release input[584] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[588]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 84)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-480)]
-// Release input[636] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(400)]
-// Release input[604] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[688]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(336)]
-// Release input[588] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[640]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[692]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-464)]
-// Release input[640] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[644]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[696]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-256)]
-// Release input[692] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[696]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-448)]
-// Release input[644] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[648]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[700]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-240)]
-// Release input[696] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[700]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q0, Q0, r6
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-432)]
-// Release input[648] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[652]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -104)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[752]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-224)]
-// Release input[700] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[752]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-416)]
-// Release input[652] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[756]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release input[752] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[756]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-208)]
-// Release input[704] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[708]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[760]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(0)]
-// Release input[756] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[760]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-192)]
-// Release input[708] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[764]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(16)]
-// Release input[760] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-112)]
-// Release input[728] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[764]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vmul.u32 Q2, Q2, r6
-// input[732]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-176)]
-// Release input[712] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(32)] -// Release input[764] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[816]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[800]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 44)] -vmul.u32 Q1, Q1, r6 -// input[784]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[768]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[820]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(176)] -// Release input[800] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(112)] -// Release input[784] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[820]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[804]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmul.u32 Q0, Q0, r6 -// input[788]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(48)] -// Release input[768] from Q2 -vqrdmulh.s32 Q2, Q3, r7 
-vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[772]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[824]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(256)] -// Release input[820] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(192)] -// Release input[804] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(128)] -// Release input[788] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[824]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vmul.u32 Q2, Q2, r6 -// input[792]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 36)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(64)] -// Release input[772] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[776]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[828]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(272)] -// Release input[824] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(144)] -// Release input[792] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[828]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[812]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vmul.u32 Q1, Q1, r6 -// input[796]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 40)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(80)] -// Release input[776] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 
Q0, Q3, r9 -// input[780]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(288)] -// Release input[828] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release input[812] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(160)] -// Release input[796] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[880]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[864]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(96)] -// Release input[780] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[832]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release input[880] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(432)] -// Release input[864] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[884]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[868]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(304)] -// Release input[832] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 
Q1, Q3, r9 -// input[836]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[888]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[884] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(448)] -// Release input[868] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[888]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[872]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(320)] -// Release input[836] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[840]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[892]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-480)] -// Release input[888] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release input[872] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[892]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[876]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release input[840] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[844]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 
88)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-464)] -// Release input[892] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release input[876] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[944]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q2, Q2, r6 -// input[912]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(352)] -// Release input[844] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[896]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[948]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-384)] -// Release input[912] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[948]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-448)] -// Release input[896] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[900]: Load as Q2 -vldrw.u32 Q2, 
[r10, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[952]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-240)] -// Release input[948] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[952]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q0, Q0, r6 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-432)] -// Release input[900] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[904]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-224)] -// Release input[952] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-352)] -// Release input[920] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q2, Q2, r6 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-416)] -// Release input[904] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[908]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, 
Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1008]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[992]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -16)] -vmul.u32 Q1, Q1, r6 -// input[976]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-400)] -// Release input[908] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1012]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-64)] -// Release input[992] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-128)] -// Release input[976] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1012]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[996]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -12)] -vmul.u32 Q0, Q0, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[964]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 
-vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1016]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(16)] -// Release input[1012] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-48)] -// Release input[996] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1016]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q2, Q2, r6 -// input[984]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-176)] -// Release input[964] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1020]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(32)] -// Release input[1016] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-96)] -// Release input[984] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1020]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[1004]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -4)] -vmul.u32 Q1, Q1, r6 -// input[988]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[972]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -36)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 
-vqrdmlah.s32 Q5, Q1, r9 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(48)] -// Release input[1020] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-16)] -// Release input[1004] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-80)] -// Release input[988] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[12]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-144)] -// Release input[972] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q2, Q2, r6 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, 
Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[44]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r6 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r3 
-vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q2, Q2, r6 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[92]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[92]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q1, Q1, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 
* 80)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(368)] -// Release input[92] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[108]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r6 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(432)] -// Release input[108] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q2, Q2, r6 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 
-// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[140]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r6 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r7 
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-384)]
-// Release input[156] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[172]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[188]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-368)]
-// Release input[160] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q0, Q0, r6
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-176)]
-// Release input[208] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vmul.u32 Q0, Q0, r6
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[268]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vmul.u32 Q2, Q2, r6
-// input[260]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[284]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(32)]
-// Release input[260] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[284]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q1, Q1, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[300]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(128)]
-// Release input[284] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[300]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q0, Q0, r6
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(192)]
-// Release input[300] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[316]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vmul.u32 Q2, Q2, r6
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[332]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[332]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[348]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(320)]
-// Release input[332] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[348]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[336]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[364]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(384)]
-// Release input[348] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[364]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q2, Q2, r6
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(336)]
-// Release input[336] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[380]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(448)]
-// Release input[364] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[380]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(400)]
-// Release input[352] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[396]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-496)]
-// Release input[380] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[396]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[412]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[428]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-368)]
-// Release input[412] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[428]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-416)]
-// Release input[400] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[416]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[444]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-304)]
-// Release input[428] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[444]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q0, Q0, r6
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-352)]
-// Release input[416] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-240)]
-// Release input[444] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q2, Q2, r6
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-208)]
-// Release input[452] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[464]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q0, Q0, r6
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-160)]
-// Release input[464] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[480]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-96)]
-// Release input[480] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[524]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(0)]
-// Release input[504] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-16)]
-// Release input[500] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[524]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vmul.u32 Q1, Q1, r6
-// input[516]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[540]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(80)]
-// Release input[524] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(48)]
-// Release input[516] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[540]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[528]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[556]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(144)]
-// Release input[540] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[556]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q2, Q2, r6
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(96)]
-// Release input[528] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[544]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[572]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(208)]
-// Release input[556] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[572]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vmul.u32 Q1, Q1, r6
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(160)]
-// Release input[544] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[588]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(272)]
-// Release input[572] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(240)]
-// Release input[564] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[588]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q0, Q0, r6
-// input[580]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[576]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[604]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(336)]
-// Release input[588] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(304)]
-// Release input[580] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[604]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q2, Q2, r6
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(288)]
-// Release input[576] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[592]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 88)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[620]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(400)]
-// Release input[604] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[620]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(352)]
-// Release input[592] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[608]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 104)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[636]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(464)]
-// Release input[620] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[636]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q0, Q0, r6
-// input[628]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(416)]
-// Release input[608] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[624]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[652]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-480)]
-// Release input[636] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(496)]
-// Release input[628] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[652]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[648]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(480)]
-// Release input[624] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[640]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[668]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-416)]
-// Release input[652] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-432)]
-// Release input[648] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[668]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-464)]
-// Release input[640] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[656]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[684]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-352)]
-// Release input[668] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[684]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q0, Q0, r6
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-400)]
-// Release input[656] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[672]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[700]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-288)]
-// Release input[684] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[700]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[696]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -60)]
-vmul.u32 Q2, Q2, r6
-// input[692]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -64)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-336)]
-// Release input[672] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[716]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-224)]
-// Release input[700] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-240)]
-// Release input[696] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-256)]
-// Release input[692] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[716]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[712]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vmul.u32 Q1, Q1, r6
-// input[708]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[704]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[732]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-160)]
-// Release input[716] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release input[712] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release input[708] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[732]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-208)]
-// Release input[704] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[720]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[748]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-96)]
-// Release input[732] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[748]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q2, Q2, r6
-// input[740]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-144)]
-// Release input[720] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[736]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[764]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-32)]
-// Release input[748] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[764]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[760]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vmul.u32 Q1, Q1, r6
-// input[756]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 0)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-80)]
-// Release input[736] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[752]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[780]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(32)]
-// Release input[764] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(16)]
-// Release input[760] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(0)]
-// Release input[756] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[780]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[776]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vmul.u32 Q0, Q0, r6
-// input[772]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-16)]
-// Release input[752] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[768]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[796]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(96)]
-// Release input[780] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release input[776] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release input[772] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[796]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[792]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[788]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 32)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release input[768] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[784]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[812]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(160)]
-// Release input[796] from Q2
-vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(144)] -// Release input[792] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(128)] -// Release input[788] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vmul.u32 Q1, Q1, r6 -// input[804]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(112)] -// Release input[784] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release input[804] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[824]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 68)] -vmul.u32 Q0, Q0, r6 -// input[820]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, 
Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(272)] -// Release input[824] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(256)] -// Release input[820] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[840]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[832]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[860]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release input[840] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[856]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(304)] -// Release input[832] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[848]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] 
-vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(416)] -// Release input[860] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(400)] -// Release input[856] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[872]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vmul.u32 Q0, Q0, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(368)] -// Release input[848] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[864]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[892]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(480)] -// Release input[876] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release input[872] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[892]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[888]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[884]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release input[864] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 
-vqrdmlah.s32 Q5, Q2, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-464)] -// Release input[892] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-480)] -// Release input[888] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-496)] -// Release input[884] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[900]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(496)] -// Release input[880] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[896]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-432)] -// Release input[900] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q0, Q0, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-448)] -// Release input[896] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[912]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -96)] 
-vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[940]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-384)] -// Release input[912] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[928]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-272)] -// Release input[940] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-320)] -// Release input[928] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q0, Q3, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-240)] -// Release input[948] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[968]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[964]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[960]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-160)] -// Release input[968] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-176)] -// Release input[964] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[984]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -24)] -vmul.u32 Q2, Q2, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-192)] -// Release input[960] from Q1 
-vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[976]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1004]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-96)] -// Release input[984] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q1, Q1, r6 -// input[996]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -12)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-128)] -// Release input[976] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[992]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-16)] -// Release input[1004] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-48)] -// Release input[996] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1020]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[1016]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[1012]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * 4)] 
-vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-64)] -// Release input[992] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -vqrdmulh.s32 Q2, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(48)] -// Release input[1020] from Q0 -vqrdmlah.s32 Q2, Q4, r9 -vstrw.u32 Q3, [r10,#(32)] -// Release input[1016] from Q3 -vsub.s32 Q4, Q1, Q2 -vstrw.u32 Q4, [r10,#(16)] -// Release input[1012] from Q4 -vadd.s32 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -.equ modulus_inv, 4223674367 -movw r7, #:lower16:modulus_inv -movt r7, #:upper16:modulus_inv -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vldrw.s32 Q5, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q5 -vldrw.s32 Q6, [r8, #-64] -vmul.u32 Q3, Q3, Q6 -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q5, Q2, Q5 -vsub.s32 Q7, Q1, Q4 -vmul.u32 Q2, Q2, Q6 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q5 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q5 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q5, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q6, Q5, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q7, Q5 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q7, Q7, Q6 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q5, Q7, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q7, Q4, Q5 -vstrw.s32 Q7, [r0, #-80] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r8], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r8, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r9
-vldrw.s32 Q3, [r8, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r9
-vldrw.s32 Q2, [r8, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r9
-vldrw.s32 Q6, [r8, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r7
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r9
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r9 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r8], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r8, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r9 -vldrw.s32 Q3, [r8, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r9 -vldrw.s32 Q2, [r8, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r9 -vldrw.s32 Q6, [r8, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r7 -vqrdmulh.s32 Q6, Q4, Q6 -vmul.u32 Q4, Q4, Q7 -vqrdmlah.s32 Q6, Q4, r9 -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-16] -vadd.s32 Q5, Q5, Q6 -vstrw.s32 Q5, [r0, #-32] -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 13727 -// Instruction count: 11416 \ No newline at end of file diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete.s deleted file mode 100644 index b2c90be..0000000 --- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete.s +++ /dev/null @@ -1,10303 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 
286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 
16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) 
* 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 
80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 
113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 
16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 
2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 
= 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 
2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31 -.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31 -.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31 -.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 
* f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31 -.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31 -.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31 -.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31 -.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31
-.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31
-.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31
-.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31
-.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31
-.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31
-.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31
-.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31
-.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31
-.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31
-.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31
-.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31
-.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31
-.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31
-.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31
-.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31
-.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31
-.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31
-.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31
-.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31
-.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31
-.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31
-.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31
-.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31
-.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31
-.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31
-.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31
-.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31
-.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31
-.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31
-.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31 = 17791697 * 2^31
-.word 3285804833 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31
-.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31 = 29333180 * 2^31
-.word 1876750725 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 286215^260 * 71292929 * 2^31
-.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31
-.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31
-.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31 = 16027071 * 2^31
-.word 3172903227 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31
-.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31 = 27246749 * 2^31
-.word 3890743531 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 286215^388 * 71292929 * 2^31
-.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31
-.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31
-.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31 = 19153009 * 2^31
-.word 3372902219 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31
-.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31 = 14378180 * 2^31
-.word 919922753 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 286215^324 * 71292929 * 2^31
-.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31
-.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31
-.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31 = 23328838 * 2^31
-.word 1492590085 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31
-.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31 = 26950707 * 2^31
-.word 3871802623 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 286215^452 * 71292929 * 2^31
-.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31
-.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31
-.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31 = 31812506 * 2^31
-.word 2035379173 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31
-.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31 = 17437883 * 2^31
-.word 3263167647 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 286215^292 * 71292929 * 2^31
-.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31
-.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31
-.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31 = 8357758 * 2^31
-.word 534733307 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31
-.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31 = 22422281 * 2^31
-.word 3582071787 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 286215^420 * 71292929 * 2^31
-.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31
-.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31
-.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31 = 29650081 * 2^31
-.word 4044509849 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31
-.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31 = 9686916 * 2^31
-.word 619773465 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 286215^356 * 71292929 * 2^31
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31
-.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31
-.word 50433925 // zeta^228 * 2^31 = 286215^228 * 2^31 = 18399952 * 2^31
-.word 1177237627 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31
-.word 12737503 // zeta^484 * 2^31 = 286215^484 * 2^31 = 27755269 * 2^31
-.word 3923278881 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 286215^484 * 71292929 * 2^31
-.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31
-.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31
-.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31 = 22563934 * 2^31
-.word 1443651165 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31
-.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31 = 2438403 * 2^31
-.word 2303493823 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 286215^276 * 71292929 * 2^31
-.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31
-.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31
-.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31 = 31481843 * 2^31
-.word 4161706847 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31
-.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31 = 32076751 * 2^31
-.word 4199769341 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 286215^404 * 71292929 * 2^31
-.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31
-.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31
-.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31 = 18223844 * 2^31
-.word 1165970155 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31
-.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31 = 3973412 * 2^31
-.word 254220777 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 286215^340 * 71292929 * 2^31
-.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31
-.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31
-.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31 = 7405458 * 2^31
-.word 473804703 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31
-.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31 = 33156191 * 2^31
-.word 4268832423 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 286215^468 * 71292929 * 2^31
-.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31
-.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31
-.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 * 2^31 = 22859934 * 2^31
-.word 1462589385 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31
-.word 13659269 // zeta^308 * 2^31 = 286215^308 * 2^31 = 23834070 * 2^31
-.word 1524915067 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 286215^308 * 71292929 * 2^31
-.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31
-.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31
-.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31 = 25149579 * 2^31
-.word 3756565603 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31
-.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31 = 13976724 * 2^31
-.word 894237409 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 286215^436 * 71292929 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31
-.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31 = 15349951 * 2^31
-.word 3129580769 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31
-.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31 = 6932474 * 2^31
-.word 443542963 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 286215^372 * 71292929 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31
-.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31 = 12503729 * 2^31
-.word 2947478141 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31
-.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31 = 10586616 * 2^31
-.word 677336697 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 286215^500 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31
-.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31 = 15322485 * 2^31
-.word 3127823481 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31
-.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31 = 6173403 * 2^31
-.word 2542460889 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 286215^268 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31
-.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31 = 14374018 * 2^31
-.word 919656467 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31
-.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31 = 9325363 * 2^31
-.word 2744124781 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 286215^396 * 71292929 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31
-.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31 = 5605608 * 2^31
-.word 358649449 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31
-.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31 = 25200773 * 2^31
-.word 3759841019 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 286215^332 * 71292929 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31
-.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31 = 31727447 * 2^31
-.word 4177420707 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31
-.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31 = 6658688 * 2^31
-.word 426026005 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 286215^460 * 71292929 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31
-.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31 = 33297705 * 2^31
-.word 4277886557 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31
-.word 7758757 // zeta^300 * 2^31 = 286215^300 * 2^31 = 486950 * 2^31
-.word 31155291 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 286215^300 * 71292929 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31
-.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31 = 13215161 * 2^31
-.word 2992995895 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31
-.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31 = 16752026 * 2^31
-.word 1071802543 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 286215^428 * 71292929 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31
-.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31 = 14102887 * 2^31
-.word 3049793025 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31
-.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31 = 32232983 * 2^31
-.word 4209765139 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 286215^364 * 71292929 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31
-.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31 = 16009575 * 2^31
-.word 3171783825 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31
-.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31 = 5365218 * 2^31
-.word 343269183 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 286215^492 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31
-.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31 = 24042369 * 2^31
-.word 3685725783 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31
-.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 = 27221548 * 2^31
-.word 1741647511 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 286215^284 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31
-.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31 = 7233695 * 2^31
-.word 2610298873 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 286215^156 * 71292929 * 2^31
-.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 = 15385892 * 2^31
-.word 984396643 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 286215^412 * 71292929 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31
-.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31 = 15700554 * 2^31
-.word 1004528867 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31
-.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 = 17178032 * 2^31
-.word 1099058609 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 286215^348 * 71292929 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31
-.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31 = 20482112 * 2^31
-.word 1310455209 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31
-.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 = 31908284 * 2^31
-.word 2041507095 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 286215^476 * 71292929 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31
-.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 = 4869100 * 2^31
-.word 311527319 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31
-.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 = 29810009 * 2^31
-.word 4054742117 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 286215^316 * 71292929 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31
-.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31
-.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 = 11135584 * 2^31
-.word 712459929 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31
-.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 = 18505659 * 2^31
-.word 3331484459 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31
-.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 = 17484839 * 2^31
-.word 3266171913 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31
-.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 = 20168277 * 2^31
-.word 3437859545 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31
-.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31
-.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 = 31283961 * 2^31
-.word 4149046263 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31
-.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 = 17056436 * 2^31
-.word 1091278839 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_1024_u32_33564673_286215_incomplete, %function
-.global ntt_1024_u32_33564673_286215_incomplete
-ntt_1024_u32_33564673_286215_incomplete:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-// Use r10 as marker for r0 + 4032
-add r10, r11, #1008
-.equ modulus, 33564673
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[768]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 12)]
-vqrdmulh.s32 Q1, Q0, r7
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vqrdmulh.s32 Q4, Q2, r7
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r9
-// input[772]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r5
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r11,#(48)]
-// Release input[768] from Q0
-vqrdmlah.s32 Q6, Q3, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[772]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vmul.u32 Q4, Q4, r6
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r7
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r9
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r9
-// input[776]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 20)]
-vqrdmulh.s32 Q6, Q3, r5
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r11,#(64)]
-// Release input[772] from Q4
-vqrdmlah.s32 Q6, Q3, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[776]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vmul.u32 Q1, Q1, r6
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[780]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(80)]
-// Release input[776] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(48)]
-// Release input[264] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[780]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[524]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 20)]
-vmul.u32 Q0, Q0, r6
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[784]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(96)]
-// Release input[780] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(80)]
-// Release input[524] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[784]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[528]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 24)]
-vmul.u32 Q2, Q2, r6
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[788]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(112)]
-// Release input[784] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(96)]
-// Release input[528] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[788]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[532]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]
-vmul.u32 Q1, Q1, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[792]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(128)]
-// Release input[788] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(112)]
-// Release input[532] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[792]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[796]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(144)]
-// Release input[792] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[796]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[540]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[800]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(160)]
-// Release input[796] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(144)]
-// Release input[540] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[800]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q1, Q1, r6
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[804]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(176)]
-// Release input[800] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[804]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q0, Q0, r6
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[808]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(192)]
-// Release input[804] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[808]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q2, Q2, r6
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[812]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(208)]
-// Release input[808] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[812]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q1, Q1, r6
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[816]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(224)]
-// Release input[812] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[816]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vmul.u32 Q0, Q0, r6
-// input[304]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[820]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(240)]
-// Release input[816] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(208)]
-// Release input[304] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[820]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[564]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 60)]
-vmul.u32 Q2, Q2, r6
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[824]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(256)]
-// Release input[820] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(240)]
-// Release input[564] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[824]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vmul.u32 Q1, Q1, r6
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[828]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[824] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(240)]
-// Release input[312] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[828]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[572]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 68)]
-vmul.u32 Q0, Q0, r6
-// input[316]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[832]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(288)]
-// Release input[828] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(272)]
-// Release input[572] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(256)]
-// Release input[316] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[832]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[576]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 72)]
-vmul.u32 Q2, Q2, r6
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[836]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(304)]
-// Release input[832] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(288)]
-// Release input[576] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[836]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[840]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(320)]
-// Release input[836] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[840]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q0, Q0, r6
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[844]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(336)]
-// Release input[840] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[844]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[588]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[848]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(352)]
-// Release input[844] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(336)]
-// Release input[588] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[848]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[852]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(368)]
-// Release input[848] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[852]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[596]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(320)]
-// Release input[80] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[84]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[856]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(384)]
-// Release input[852] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(368)]
-// Release input[596] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[856]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q2, Q2, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(336)]
-// Release input[84] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[860]:
Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(400)] -// Release input[856] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[860]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[864]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(416)] -// Release input[860] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[864]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[608]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 104)] -vmul.u32 Q0, Q0, r6 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[868]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 
Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(432)] -// Release input[864] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(416)] -// Release input[608] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[868]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vmul.u32 Q2, Q2, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[872]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(448)] -// Release input[868] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[872]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, 
[r11,#(464)] -// Release input[872] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q0, Q0, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[880]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(480)] -// Release input[876] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(464)] -// Release input[620] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[880]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[624]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 120)] -vmul.u32 Q2, Q2, r6 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[884]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(496)] -// Release input[880] from Q2 -vqrdmlah.s32 Q6, Q4, r9 
-vstrw.u32 Q3, [r12,#(480)] -// Release input[624] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[884]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[628]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 124)] -vmul.u32 Q1, Q1, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[888]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-496)] -// Release input[884] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(496)] -// Release input[628] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[888]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[632]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vmul.u32 Q0, Q0, r6 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[892]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-480)] -// Release input[888] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 
-vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[892]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[896]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-464)] -// Release input[892] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-480)] -// Release input[636] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[896]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q1, Q1, r6 -// input[384]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[900]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-448)] -// Release input[896] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-480)] -// 
Release input[384] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[900]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q0, Q0, r6 -// input[388]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[904]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-432)] -// Release input[900] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-464)] -// Release input[388] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[904]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[392]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[136]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-416)] -// Release input[904] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-448)] -// Release input[392] from Q4 -vadd.s32 Q0, Q0, Q6 -// 
input[908]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-464)] -// Release input[136] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[912]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-432)] -// Release input[396] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[912]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q0, Q0, r6 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[916]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-384)] -// Release input[912] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-416)] -// Release input[400] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[916]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, 
r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[148]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[920]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-368)] -// Release input[916] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[920]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-416)] -// Release input[148] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-352)] -// Release input[920] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 
-88)] -vmul.u32 Q0, Q0, r6 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[928]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[928]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[416]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[932]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-320)] -// Release input[928] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-352)] -// Release input[416] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[932]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q1, Q1, r6 -// input[420]: Load as Q4 
-vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[936]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-304)] -// Release input[932] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[936]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r6 -// input[424]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[940]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-288)] -// Release input[936] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-320)] -// Release input[424] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[940]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r9 
-vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[944]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-272)] -// Release input[940] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[944]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q1, Q1, r6 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[948]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-256)] -// Release input[944] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-288)] -// Release input[432] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[948]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q0, Q0, r6 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 
-vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[952]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-240)] -// Release input[948] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[952]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[440]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-224)] -// Release input[952] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-256)] -// Release input[440] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, 
Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[960]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-240)] -// Release input[444] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[960]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[704]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -52)] -vmul.u32 Q0, Q0, r6 -// input[448]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -56)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-192)] -// Release input[960] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-208)] -// Release input[704] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-224)] -// Release input[448] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[964]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[708]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmul.u32 Q2, Q2, r6 -// input[452]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -52)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// 
input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[968]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-192)] -// Release input[708] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-208)] -// Release input[452] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[968]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q1, Q1, r6 -// input[456]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-160)] -// Release input[968] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-192)] -// Release input[456] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[716]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[460]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] 
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[976]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-144)]
-// Release input[972] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-160)]
-// Release input[716] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-176)]
-// Release input[460] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[976]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[980]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-128)]
-// Release input[976] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[980]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[724]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-176)]
-// Release input[208] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[984]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-112)]
-// Release input[980] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-128)]
-// Release input[724] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[984]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-160)]
-// Release input[212] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[988]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-96)]
-// Release input[984] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[988]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[732]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -24)]
-vmul.u32 Q2, Q2, r6
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[220]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -32)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[992]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-80)]
-// Release input[988] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-96)]
-// Release input[732] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[992]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-128)]
-// Release input[220] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[996]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-64)]
-// Release input[992] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[996]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q0, Q0, r6
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1000]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-48)]
-// Release input[996] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1000]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q2, Q2, r6
-// input[488]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -16)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[232]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1004]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-32)]
-// Release input[1000] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1004]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vmul.u32 Q1, Q1, r6
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-80)]
-// Release input[232] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1008]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-16)]
-// Release input[1004] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release input[748] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-48)]
-// Release input[492] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1008]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[752]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -4)]
-vmul.u32 Q0, Q0, r6
-// input[496]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -8)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1012]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(0)]
-// Release input[1008] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[752] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-32)]
-// Release input[496] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1012]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[756]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1016]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(16)]
-// Release input[1012] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(0)]
-// Release input[756] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-16)]
-// Release input[500] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1016]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[760]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vmul.u32 Q1, Q1, r6
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1020]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(32)]
-// Release input[1016] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(16)]
-// Release input[760] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(0)]
-// Release input[504] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1020]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[764]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[508]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(48)]
-// Release input[1020] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(32)]
-// Release input[764] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(16)]
-// Release input[508] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[192]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vmul.u32 Q2, Q2, r6
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[196]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vmul.u32 Q1, Q1, r6
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q0, Q0, r6
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[204]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q2, Q2, r6
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[208]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q1, Q1, r6
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[16]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[212]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[212]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(64)]
-// Release input[16] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[20]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-160)]
-// Release input[212] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[216]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q2, Q2, r6
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(80)]
-// Release input[20] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[24]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[220]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q1, Q1, r6
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(96)]
-// Release input[24] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[28]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[224]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q0, Q0, r6
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(112)]
-// Release input[28] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[228]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q2, Q2, r6
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[36]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[232]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q1, Q1, r6
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(144)]
-// Release input[36] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[236]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-80)]
-// Release input[232] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[236]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q0, Q0, r6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[44]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-64)]
-// Release input[236] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q2, Q2, r6
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(176)]
-// Release input[44] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q1, Q1, r6
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(192)]
-// Release input[48] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q0, Q0, r6
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(208)]
-// Release input[52] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q2, Q2, r6
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(224)]
-// Release input[56] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[60]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[448]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[384]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -120)]
-vmul.u32 Q1, Q1, r6
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(240)]
-// Release input[60] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[452]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q0, Q0, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[456]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q2, Q2, r6
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[460]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[396]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -108)]
-vmul.u32 Q1, Q1, r6
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[464]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q0, Q0, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[272]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[468]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-160)]
-// Release input[464] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[468]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[404]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -100)]
-vmul.u32 Q2, Q2, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(80)]
-// Release input[272] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[276]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 24)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[472]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-144)]
-// Release input[468] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-400)]
-// Release input[404] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[472]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q1, Q1, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(96)]
-// Release input[276] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[280]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[476]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-128)]
-// Release input[472] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[476]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vmul.u32 Q0, Q0, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(112)]
-// Release input[280] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[284]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[480]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-112)]
-// Release input[476] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-368)]
-// Release input[412] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[480]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q2, Q2, r6
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(128)]
-// Release input[284] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[484]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-96)] -// Release input[480] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[484]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[488]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-80)] -// Release input[484] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[488]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[492]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-64)] 
-// Release input[488] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[492]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[496]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-48)] -// Release input[492] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[496]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-32)] -// Release input[496] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[504]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[504]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[312]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[508]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(0)] -// Release input[504] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q0, Q6 
-vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(240)] -// Release input[312] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(16)] -// Release input[508] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[704]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[512]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-208)] -// Release input[704] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q1, 
Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[708]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(32)] -// Release input[512] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[516]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-192)] -// Release input[708] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[712]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(48)] -// Release input[516] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[520]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 
-vadd.s32 Q2, Q2, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(64)] -// Release input[520] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[720]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[720]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[724]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-144)] -// Release input[720] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[724]: Already loaded as Q1 
-vqrdmulh.s32 Q2, Q1, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(96)] -// Release input[528] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[728]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[724] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[728]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(112)] -// Release input[532] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[536]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[732]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-112)] -// Release input[728] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[732]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, 
[r11, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(128)] -// Release input[536] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[736]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-96)] -// Release input[732] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[736]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[544]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[740]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[736] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[740]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[612]: Load as Q4 
-vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(160)] -// Release input[544] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[548]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[744]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-64)] -// Release input[740] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[744]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(176)] -// Release input[548] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[552]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[748]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-48)] -// Release input[744] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[748]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 
Q0, [r12,#(192)] -// Release input[552] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[748] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[752]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[560]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-16)] -// Release input[752] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(224)] -// Release input[560] from Q1 -vqrdmulh.s32 Q1, Q3, r7 
-vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[564]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(240)] -// Release input[564] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[568]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(256)] -// Release input[568] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 
-vqrdmlah.s32 Q2, Q3, r9 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[896]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[832]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[964]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-448)] -// Release input[896] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release input[832] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[964]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[900]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, 
Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-176)] -// Release input[964] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-432)] -// Release input[900] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[968]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[840]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(64)] -// Release input[772] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[972]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(336)] -// Release input[840] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[972]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[908]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[844]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, 
[r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[976]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-144)] -// Release input[972] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-400)] -// Release input[908] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(352)] -// Release input[844] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[976]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[912]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[980]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-128)] -// Release input[976] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-384)] -// Release input[912] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[980]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 
-vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[984]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-112)] -// Release input[980] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[984]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-96)] -// Release input[984] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 
-// input[992]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-336)] -// Release input[924] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[992]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[864]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[800]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[996]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-64)] -// Release input[992] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release input[864] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[996]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release input[800] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1000]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -8)] 
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-48)]
-// Release input[996] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-304)]
-// Release input[932] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(448)]
-// Release input[868] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1000]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[936]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -72)]
-vmul.u32 Q1, Q1, r6
-// input[872]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 116)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(192)]
-// Release input[804] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[808]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1004]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-32)]
-// Release input[1000] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-288)]
-// Release input[936] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(464)]
-// Release input[872] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1004]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[940]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -68)]
-vmul.u32 Q0, Q0, r6
-// input[876]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(208)]
-// Release input[808] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[812]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1008]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-16)]
-// Release input[1004] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-272)]
-// Release input[940] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(480)]
-// Release input[876] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1008]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[944]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -64)]
-vmul.u32 Q2, Q2, r6
-// input[880]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 124)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(224)]
-// Release input[812] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[816]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 60)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1012]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(0)]
-// Release input[1008] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-256)]
-// Release input[944] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(496)]
-// Release input[880] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1012]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[948]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -60)]
-vmul.u32 Q1, Q1, r6
-// input[884]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -124)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(240)]
-// Release input[816] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[820]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1016]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(16)]
-// Release input[1012] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-240)]
-// Release input[948] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-496)]
-// Release input[884] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1016]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[952]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -56)]
-vmul.u32 Q0, Q0, r6
-// input[888]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -120)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(256)]
-// Release input[820] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[824]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1020]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(32)]
-// Release input[1016] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-224)]
-// Release input[952] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-480)]
-// Release input[888] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1020]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[956]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -52)]
-vmul.u32 Q2, Q2, r6
-// input[892]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[824] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[828]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(48)]
-// Release input[1020] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-208)]
-// Release input[956] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-464)]
-// Release input[892] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[48]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmul.u32 Q1, Q1, r6
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(288)]
-// Release input[828] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q0, Q0, r6
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[8]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(32)]
-// Release input[8] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[12]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[112]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q0, Q0, r6
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[68]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(272)]
-// Release input[68] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q2, Q2, r6
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[140]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q1, Q1, r6
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-448)]
-// Release input[140] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vmul.u32 Q0, Q0, r6
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[260]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(32)]
-// Release input[260] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[372]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(480)]
-// Release input[372] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(288)]
-// Release input[324] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[380]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(304)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[332]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(320)]
-// Release input[332] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[384]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[436]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-480)]
-// Release input[384] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[388]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-464)]
-// Release input[388] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[444]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[444]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[396]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-240)]
-// Release input[444] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[500]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[508]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[508]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[560]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[512]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[564]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[564]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(32)]
-// Release input[512] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(240)]
-// Release input[564] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[568]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1,
Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(256)] -// Release input[568] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(128)] -// Release input[536] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[572]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vmul.u32 Q2, Q2, r6 -// input[540]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 36)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(64)] -// Release input[520] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[524]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(208)] -// Release input[556] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(144)] -// Release input[540] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[624]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[608]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 104)] -vmul.u32 Q1, Q1, r6 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(80)] -// Release input[524] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[576]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 
-vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[628]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(416)] -// Release input[608] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[628]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(288)] -// Release input[576] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[580]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[632]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(496)] -// Release input[628] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[632]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(304)] -// Release input[580] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[584]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// 
input[636]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-496)] -// Release input[632] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[636]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(320)] -// Release input[584] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[588]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-480)] -// Release input[636] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(464)] -// Release input[620] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[688]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q0, Q0, r6 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(336)] -// Release input[588] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[640]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, 
r9 -// input[692]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release input[688] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[692]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q2, Q2, r6 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-464)] -// Release input[640] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[644]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[696]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-256)] -// Release input[692] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[696]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-448)] -// Release input[644] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[648]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[700]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 
-56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-240)] -// Release input[696] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[700]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q0, Q0, r6 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release input[648] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[652]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[752]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-224)] -// Release input[700] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[752]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[736]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -20)] -vmul.u32 Q2, Q2, r6 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-416)] -// Release input[652] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[756]: Load as Q1 -vldrw.u32 Q1, 
[r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-144)] -// Release input[720] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[740]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmul.u32 Q1, Q1, r6 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-208)] -// Release input[704] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[760]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-64)] -// Release input[740] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[760]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q0, Q0, r6 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-192)] -// Release input[708] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[764]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 
Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(16)] -// Release input[760] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmul.u32 Q2, Q2, r6 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(32)] -// Release input[764] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[816]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[800]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 44)] -vmul.u32 Q1, Q1, r6 -// input[784]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[768]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[820]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 
Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(176)] -// Release input[800] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(112)] -// Release input[784] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[820]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[804]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmul.u32 Q0, Q0, r6 -// input[788]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(48)] -// Release input[768] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[772]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[824]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(256)] -// Release input[820] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(192)] -// Release input[804] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(128)] -// Release input[788] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[824]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vmul.u32 Q2, Q2, r6 -// input[792]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 36)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(64)] -// Release input[772] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[776]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[828]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(272)] -// Release 
input[824] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(144)] -// Release input[792] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[828]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[812]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vmul.u32 Q1, Q1, r6 -// input[796]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 40)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(80)] -// Release input[776] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[780]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(288)] -// Release input[828] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release input[812] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(160)] -// Release input[796] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[880]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[864]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(96)] -// Release input[780] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[832]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release 
input[880] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(432)] -// Release input[864] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[884]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[868]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(304)] -// Release input[832] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[836]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[888]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[884] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(448)] -// Release input[868] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[888]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[872]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(320)] -// Release input[836] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[840]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[892]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-480)] -// Release input[888] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r11,#(464)] -// Release input[872] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[892]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[876]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release input[840] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[844]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 88)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-464)] -// Release input[892] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release input[876] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[944]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q2, Q2, r6 -// input[912]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(352)] -// Release input[844] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[896]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[948]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 
Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-384)] -// Release input[912] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[948]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-448)] -// Release input[896] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[900]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[952]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-240)] -// Release input[948] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[952]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q0, Q0, r6 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-432)] -// Release input[900] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[904]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-224)] -// Release input[952] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 
Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-352)] -// Release input[920] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q2, Q2, r6 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-416)] -// Release input[904] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[908]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1008]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[992]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -16)] -vmul.u32 Q1, Q1, r6 -// input[976]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-400)] -// Release input[908] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1012]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-64)] -// Release input[992] from Q3 
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-128)]
-// Release input[976] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1012]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[996]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[980]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r10,#(-192)]
-// Release input[960] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[964]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1016]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(16)]
-// Release input[1012] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-48)]
-// Release input[996] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-112)]
-// Release input[980] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1016]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[1000]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -8)]
-vmul.u32 Q2, Q2, r6
-// input[984]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r10,#(-176)]
-// Release input[964] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[968]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1020]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(32)]
-// Release input[1016] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-32)]
-// Release input[1000] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-96)]
-// Release input[984] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1020]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[1004]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -4)]
-vmul.u32 Q1, Q1, r6
-// input[988]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r10,#(-160)]
-// Release input[968] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[972]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -36)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(48)]
-// Release input[1020] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-16)]
-// Release input[1004] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-80)]
-// Release input[988] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[12]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r10,#(-144)]
-// Release input[972] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[28]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[28]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q2, Q2, r6
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[44]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(112)]
-// Release input[28] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[44]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q1, Q1, r6
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[60]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(176)]
-// Release input[44] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[60]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q0, Q0, r6
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(240)]
-// Release input[60] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[76]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q2, Q2, r6
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[92]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[92]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[108]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(368)]
-// Release input[92] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[108]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q0, Q0, r6
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(320)]
-// Release input[80] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(432)]
-// Release input[108] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[124]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q2, Q2, r6
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(384)]
-// Release input[96] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[140]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[156]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[156]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-384)]
-// Release input[156] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[172]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[188]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-368)]
-// Release input[160] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q0, Q0, r6
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-176)]
-// Release input[208] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vmul.u32 Q0, Q0, r6
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[268]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vmul.u32 Q2, Q2, r6
-// input[260]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[284]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(32)]
-// Release input[260] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[284]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q1, Q1, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[300]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(128)]
-// Release input[284] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[300]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q0, Q0, r6
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(192)]
-// Release input[300] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[316]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vmul.u32 Q2, Q2, r6
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[332]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[332]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[348]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(320)]
-// Release input[332] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[348]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[336]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[364]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(384)]
-// Release input[348] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[364]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q2, Q2, r6
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(336)]
-// Release input[336] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[380]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(448)]
-// Release input[364] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[380]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(400)]
-// Release input[352] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[396]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-496)]
-// Release input[380] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[396]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[412]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[428]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-368)]
-// Release input[412] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[428]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-416)]
-// Release input[400] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[416]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[444]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-304)]
-// Release input[428] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[444]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q0, Q0, r6
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-352)]
-// Release input[416] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-240)]
-// Release input[444] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q2, Q2, r6
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-208)]
-// Release input[452] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[464]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q0, Q0, r6
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-160)]
-// Release input[464] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[480]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-96)]
-// Release input[480] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[524]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(0)]
-// Release input[504] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-16)]
-// Release input[500] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[524]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vmul.u32 Q1, Q1, r6
-// input[516]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[540]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(80)]
-// Release input[524] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(48)]
-// Release input[516] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[540]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[528]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[556]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(144)]
-// Release input[540] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[556]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q2, Q2, r6
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(96)]
-// Release input[528] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[544]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[572]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(208)]
-// Release input[556] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[572]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vmul.u32 Q1, Q1, r6
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(160)]
-// Release input[544] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[588]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, 
[r12,#(272)] -// Release input[572] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[588]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[576]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[604]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[604]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(288)] -// Release input[576] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[592]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[620]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 
Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(400)] -// Release input[604] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[620]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r6 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(352)] -// Release input[592] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[608]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(464)] -// Release input[620] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[632]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vmul.u32 Q0, Q0, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(416)] -// Release input[608] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[652]: 
Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[652]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[668]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[668]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-464)] -// Release input[640] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[656]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, 
Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[684]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-352)] -// Release input[668] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[684]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r6 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-400)] -// Release input[656] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[672]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-288)] -// Release input[684] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release input[672] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[688]: Load 
as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q1, Q1, r6 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-272)] -// Release input[688] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r6 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-112)] -// Release input[728] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q2, Q2, r6 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[760]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vmul.u32 Q1, Q1, r6 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-80)] -// 
Release input[736] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[752]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[780]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[776]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vmul.u32 Q0, Q0, r6 -// input[772]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[768]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release input[776] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release input[772] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[796]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[792]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[788]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 
32)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release input[768] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[784]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(144)] -// Release input[792] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(128)] -// Release input[788] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vmul.u32 Q1, Q1, r6 -// input[804]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(112)] -// Release input[784] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release input[804] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[824]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 68)] -vmul.u32 
Q0, Q0, r6 -// input[820]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(272)] -// Release input[824] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(256)] -// Release input[820] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[840]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[832]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[860]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release input[840] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 
-// input[856]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(304)] -// Release input[832] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[848]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(416)] -// Release input[860] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(400)] -// Release input[856] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[872]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vmul.u32 Q0, Q0, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(368)] -// Release input[848] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[864]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[892]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(480)] -// Release input[876] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release input[872] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, 
r2, [r8], #+8 -// input[892]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[888]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[884]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release input[864] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-464)] -// Release input[892] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-480)] -// Release input[888] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-496)] -// Release input[884] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[900]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(496)] -// Release input[880] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[896]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-432)] -// Release input[900] 
from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q0, Q0, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-448)] -// Release input[896] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[912]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[940]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-384)] -// Release input[912] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[928]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-272)] -// Release input[940] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release 
input[936] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-320)] -// Release input[928] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-240)] -// Release input[948] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[968]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[964]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[960]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release 
input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-160)] -// Release input[968] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-176)] -// Release input[964] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[984]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -24)] -vmul.u32 Q2, Q2, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-192)] -// Release input[960] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[976]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1004]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-96)] -// Release input[984] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q1, Q1, r6 -// input[996]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -12)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-128)] -// Release input[976] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[992]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-16)]
-// Release input[1004] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-32)]
-// Release input[1000] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-48)]
-// Release input[996] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1020]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[1016]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[1012]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r10,#(-64)]
-// Release input[992] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[1008]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-vqrdmulh.s32 Q2, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(48)]
-// Release input[1020] from Q0
-vqrdmlah.s32 Q2, Q4, r9
-vstrw.u32 Q3, [r10,#(32)]
-// Release input[1016] from Q3
-vsub.s32 Q4, Q1, Q2
-vstrw.u32 Q4, [r10,#(16)]
-// Release input[1012] from Q4
-vadd.s32 Q1, Q1, Q2
-vstrw.u32 Q1, [r10,#(0)]
-// Release input[1008] from Q1
-.equ modulus_inv, 4223674367
-movw r7, #:lower16:modulus_inv
-movt r7, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 10271
-// Instruction count: 7960
\ No newline at end of file
diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev.s
deleted file mode 100644
index 2fbf285..0000000
--- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev.s
+++ /dev/null
@@ -1,10303 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
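The deleted kernels in this patch multiply each vector lane by a fixed-point constant through `vqrdmulh`/`vmul`/`vqrdmlah` triples. As a reading aid, here is a small Python model of one lane of the MVE `VQRDMULH.S32` instruction (saturating, rounding, doubling multiply returning the high half). This sketch is not part of the patch; the helper name and example inputs are mine.

```python
# Model of one lane of Armv8.1-M MVE VQRDMULH.S32:
# result = sat(round(2 * a * b / 2**32)), i.e. an approximate multiply
# by a constant that has been pre-scaled by 2**31.
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def vqrdmulh_s32(a, b):
    # Rounding constant 2**31, then arithmetic shift right by 32.
    r = (2 * a * b + (1 << 31)) >> 32
    # Saturation only triggers for a == b == INT32_MIN.
    return max(INT32_MIN, min(INT32_MAX, r))

print(vqrdmulh_s32(1 << 30, 1 << 30))      # 2**29
print(vqrdmulh_s32(INT32_MIN, INT32_MIN))  # saturates to 2**31 - 1
```

In the assembly above, the `vmul.u32`/`vqrdmlah.s32` pair that follows each `vqrdmulh.s32` folds in a modular correction so the triple computes an exact product modulo q rather than this rounded approximation.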
-/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 
-.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 
71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31 -.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31 -.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31 -.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31 -.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31 -.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31 -.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31 -.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31 -.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31 -.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 16700807 // zeta^800 * 2^31 = 
286215^800 * 2^31 = 21664476 * 2^31 -.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31 -.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31 -.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31 -.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31 -.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31 -.word 50801141 // zeta^1008 * 2^31 = 286215^1008 * 2^31 = 19477423 * 2^31 -.word 3393658379 // zeta^1008 * f(q^(-1) mod 2^32) * 2^31 = 286215^1008 * 71292929 * 2^31 -.word 34432613 // zeta^752 * 2^31 = 286215^752 * 2^31 = 27454015 * 2^31 -.word 3904004507 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 286215^752 * 71292929 * 2^31 -.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31 -.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31 -.word 19296927 // zeta^880 * 2^31 = 286215^880 * 2^31 = 24389287 * 2^31 -.word 3707921761 // zeta^880 * f(q^(-1) mod 2^32) * 2^31 = 286215^880 * 71292929 * 2^31 -.word 15646559 // zeta^624 * 2^31 = 286215^624 * 2^31 = 31786565 * 2^31 -.word 4181203105 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 286215^624 * 71292929 * 2^31 -.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31 -.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31 -.word 49831615 // zeta^944 * 2^31 = 286215^944 * 2^31 = 8413164 * 2^31 -.word 538278209 // zeta^944 * f(q^(-1) mod 2^32) * 2^31 = 286215^944 * 71292929 * 2^31 -.word 54669671 // zeta^688 * 2^31 = 286215^688 * 2^31 = 24835380 * 2^31 -.word 1588979353 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 286215^688 * 71292929 * 2^31 -.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31 -.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31 -.word 44135817 // zeta^816 * 2^31 = 286215^816 * 2^31 = 12975937 * 2^31 -.word 2977690231 // 
zeta^816 * f(q^(-1) mod 2^32) * 2^31 = 286215^816 * 71292929 * 2^31 -.word 24702057 // zeta^560 * 2^31 = 286215^560 * 2^31 = 9759120 * 2^31 -.word 624393111 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 286215^560 * 71292929 * 2^31 -.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31 -.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31 -.word 27214985 // zeta^976 * 2^31 = 286215^976 * 2^31 = 7037169 * 2^31 -.word 2597725047 // zeta^976 * f(q^(-1) mod 2^32) * 2^31 = 286215^976 * 71292929 * 2^31 -.word 40167763 // zeta^720 * 2^31 = 286215^720 * 2^31 = 8735396 * 2^31 -.word 558894765 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 286215^720 * 71292929 * 2^31 -.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31 -.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31 -.word 5600575 // zeta^848 * 2^31 = 286215^848 * 2^31 = 3732990 * 2^31 -.word 238838465 // zeta^848 * f(q^(-1) mod 2^32) * 2^31 = 286215^848 * 71292929 * 2^31 -.word 53619655 // zeta^592 * 2^31 = 286215^592 * 2^31 = 18327945 * 2^31 -.word 3320114233 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 286215^592 * 71292929 * 2^31 -.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31 -.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31 -.word 64695847 // zeta^912 * 2^31 = 286215^912 * 2^31 = 5665384 * 2^31 -.word 362473945 // zeta^912 * f(q^(-1) mod 2^32) * 2^31 = 286215^912 * 71292929 * 2^31 -.word 42330409 // zeta^656 * 2^31 = 286215^656 * 2^31 = 32426145 * 2^31 -.word 4222123735 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 286215^656 * 71292929 * 2^31 -.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31 -.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31 -.word 10468161 // zeta^784 * 2^31 = 286215^784 * 2^31 = 8491986 * 2^31 -.word 543321279 // zeta^784 * f(q^(-1) mod 2^32) * 2^31 = 286215^784 * 71292929 * 2^31 
-.word 32014745 // zeta^528 * 2^31 = 286215^528 * 2^31 = 2121761 * 2^31 -.word 2283234919 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 286215^528 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 
-.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31 -.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31 -.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31 -.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31 -.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31 -.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31 -.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31 -.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31 -.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31 -.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31 -.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31 -.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31 -.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 
286215^544 * 71292929 * 2^31 -.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31 -.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31 -.word 50801141 // zeta^1008 * 2^31 = 286215^1008 * 2^31 = 19477423 * 2^31 -.word 3393658379 // zeta^1008 * f(q^(-1) mod 2^32) * 2^31 = 286215^1008 * 71292929 * 2^31 -.word 34432613 // zeta^752 * 2^31 = 286215^752 * 2^31 = 27454015 * 2^31 -.word 3904004507 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 286215^752 * 71292929 * 2^31 -.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31 -.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31 -.word 19296927 // zeta^880 * 2^31 = 286215^880 * 2^31 = 24389287 * 2^31 -.word 3707921761 // zeta^880 * f(q^(-1) mod 2^32) * 2^31 = 286215^880 * 71292929 * 2^31 -.word 15646559 // zeta^624 * 2^31 = 286215^624 * 2^31 = 31786565 * 2^31 -.word 4181203105 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 286215^624 * 71292929 * 2^31 -.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31 -.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31 -.word 49831615 // zeta^944 * 2^31 = 286215^944 * 2^31 = 8413164 * 2^31 -.word 538278209 // zeta^944 * f(q^(-1) mod 2^32) * 2^31 = 286215^944 * 71292929 * 2^31 -.word 54669671 // zeta^688 * 2^31 = 286215^688 * 2^31 = 24835380 * 2^31 -.word 1588979353 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 286215^688 * 71292929 * 2^31 -.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31 -.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31 -.word 44135817 // zeta^816 * 2^31 = 286215^816 * 2^31 = 12975937 * 2^31 -.word 2977690231 // zeta^816 * f(q^(-1) mod 2^32) * 2^31 = 286215^816 * 71292929 * 2^31 -.word 24702057 // zeta^560 * 2^31 = 286215^560 * 2^31 = 9759120 * 2^31 -.word 624393111 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 286215^560 * 71292929 * 2^31 -.word 49624149 // zeta^928 * 
2^31 = 286215^928 * 2^31 = 26214285 * 2^31 -.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31 -.word 27214985 // zeta^976 * 2^31 = 286215^976 * 2^31 = 7037169 * 2^31 -.word 2597725047 // zeta^976 * f(q^(-1) mod 2^32) * 2^31 = 286215^976 * 71292929 * 2^31 -.word 40167763 // zeta^720 * 2^31 = 286215^720 * 2^31 = 8735396 * 2^31 -.word 558894765 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 286215^720 * 71292929 * 2^31 -.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31 -.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31 -.word 5600575 // zeta^848 * 2^31 = 286215^848 * 2^31 = 3732990 * 2^31 -.word 238838465 // zeta^848 * f(q^(-1) mod 2^32) * 2^31 = 286215^848 * 71292929 * 2^31 -.word 53619655 // zeta^592 * 2^31 = 286215^592 * 2^31 = 18327945 * 2^31 -.word 3320114233 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 286215^592 * 71292929 * 2^31 -.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31 -.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31 -.word 64695847 // zeta^912 * 2^31 = 286215^912 * 2^31 = 5665384 * 2^31 -.word 362473945 // zeta^912 * f(q^(-1) mod 2^32) * 2^31 = 286215^912 * 71292929 * 2^31 -.word 42330409 // zeta^656 * 2^31 = 286215^656 * 2^31 = 32426145 * 2^31 -.word 4222123735 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 286215^656 * 71292929 * 2^31 -.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31 -.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31 -.word 10468161 // zeta^784 * 2^31 = 286215^784 * 2^31 = 8491986 * 2^31 -.word 543321279 // zeta^784 * f(q^(-1) mod 2^32) * 2^31 = 286215^784 * 71292929 * 2^31 -.word 32014745 // zeta^528 * 2^31 = 286215^528 * 2^31 = 2121761 * 2^31 -.word 2283234919 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 286215^528 * 71292929 * 2^31 -.word 50801141 // zeta^1008 * 2^31 = 286215^1008 * 2^31 = 19477423 * 2^31 -.word 3393658379 // 
zeta^1008 * f(q^(-1) mod 2^32) * 2^31 = 286215^1008 * 71292929 * 2^31 -.word 42676385 // zeta^1016 * 2^31 = 286215^1016 * 2^31 = 4406558 * 2^31 -.word 281933663 // zeta^1016 * f(q^(-1) mod 2^32) * 2^31 = 286215^1016 * 71292929 * 2^31 -.word 28237813 // zeta^760 * 2^31 = 286215^760 * 2^31 = 15781 * 2^31 -.word 2148493323 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 286215^760 * 71292929 * 2^31 -.word 34432613 // zeta^752 * 2^31 = 286215^752 * 2^31 = 27454015 * 2^31 -.word 3904004507 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 286215^752 * 71292929 * 2^31 -.word 49494049 // zeta^888 * 2^31 = 286215^888 * 2^31 = 29780798 * 2^31 -.word 1905389535 // zeta^888 * f(q^(-1) mod 2^32) * 2^31 = 286215^888 * 71292929 * 2^31 -.word 34765151 // zeta^632 * 2^31 = 286215^632 * 2^31 = 3446166 * 2^31 -.word 220487329 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 286215^632 * 71292929 * 2^31 -.word 19296927 // zeta^880 * 2^31 = 286215^880 * 2^31 = 24389287 * 2^31 -.word 3707921761 // zeta^880 * f(q^(-1) mod 2^32) * 2^31 = 286215^880 * 71292929 * 2^31 -.word 54213995 // zeta^952 * 2^31 = 286215^952 * 2^31 = 14583628 * 2^31 -.word 933067413 // zeta^952 * f(q^(-1) mod 2^32) * 2^31 = 286215^952 * 71292929 * 2^31 -.word 8006763 // zeta^696 * 2^31 = 286215^696 * 2^31 = 20902680 * 2^31 -.word 1337363349 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 286215^696 * 71292929 * 2^31 -.word 15646559 // zeta^624 * 2^31 = 286215^624 * 2^31 = 31786565 * 2^31 -.word 4181203105 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 286215^624 * 71292929 * 2^31 -.word 15970361 // zeta^824 * 2^31 = 286215^824 * 2^31 = 8478458 * 2^31 -.word 542455751 // zeta^824 * f(q^(-1) mod 2^32) * 2^31 = 286215^824 * 71292929 * 2^31 -.word 4170189 // zeta^568 * 2^31 = 286215^568 * 2^31 = 19347624 * 2^31 -.word 1237870131 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 286215^568 * 71292929 * 2^31 -.word 49831615 // zeta^944 * 2^31 = 286215^944 * 2^31 = 8413164 * 2^31 -.word 538278209 // zeta^944 * f(q^(-1) mod 2^32) * 2^31 = 286215^944 * 71292929 * 
2^31 -.word 5906831 // zeta^984 * 2^31 = 286215^984 * 2^31 = 30717302 * 2^31 -.word 1965307505 // zeta^984 * f(q^(-1) mod 2^32) * 2^31 = 286215^984 * 71292929 * 2^31 -.word 32641999 // zeta^728 * 2^31 = 286215^728 * 2^31 = 8758145 * 2^31 -.word 2707833905 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 286215^728 * 71292929 * 2^31 -.word 54669671 // zeta^688 * 2^31 = 286215^688 * 2^31 = 24835380 * 2^31 -.word 1588979353 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 286215^688 * 71292929 * 2^31 -.word 30303273 // zeta^856 * 2^31 = 286215^856 * 2^31 = 4089071 * 2^31 -.word 2409104343 // zeta^856 * f(q^(-1) mod 2^32) * 2^31 = 286215^856 * 71292929 * 2^31 -.word 53563597 // zeta^600 * 2^31 = 286215^600 * 2^31 = 19452428 * 2^31 -.word 1244575539 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 286215^600 * 71292929 * 2^31 -.word 44135817 // zeta^816 * 2^31 = 286215^816 * 2^31 = 12975937 * 2^31 -.word 2977690231 // zeta^816 * f(q^(-1) mod 2^32) * 2^31 = 286215^816 * 71292929 * 2^31 -.word 27877887 // zeta^920 * 2^31 = 286215^920 * 2^31 = 31965169 * 2^31 -.word 4192630273 // zeta^920 * f(q^(-1) mod 2^32) * 2^31 = 286215^920 * 71292929 * 2^31 -.word 54874279 // zeta^664 * 2^31 = 286215^664 * 2^31 = 3036860 * 2^31 -.word 194299737 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 286215^664 * 71292929 * 2^31 -.word 24702057 // zeta^560 * 2^31 = 286215^560 * 2^31 = 9759120 * 2^31 -.word 624393111 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 286215^560 * 71292929 * 2^31 -.word 11789331 // zeta^792 * 2^31 = 286215^792 * 2^31 = 19452799 * 2^31 -.word 3392082925 // zeta^792 * f(q^(-1) mod 2^32) * 2^31 = 286215^792 * 71292929 * 2^31 -.word 5622871 // zeta^536 * 2^31 = 286215^536 * 2^31 = 30901251 * 2^31 -.word 4124560297 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 286215^536 * 71292929 * 2^31 -.word 27214985 // zeta^976 * 2^31 = 286215^976 * 2^31 = 7037169 * 2^31 -.word 2597725047 // zeta^976 * f(q^(-1) mod 2^32) * 2^31 = 286215^976 * 71292929 * 2^31 -.word 30990117 // zeta^1000 * 2^31 = 286215^1000 * 2^31 = 
988369 * 2^31 -.word 2210719963 // zeta^1000 * f(q^(-1) mod 2^32) * 2^31 = 286215^1000 * 71292929 * 2^31 -.word 34156189 // zeta^744 * 2^31 = 286215^744 * 2^31 = 21501702 * 2^31 -.word 1375689059 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 286215^744 * 71292929 * 2^31 -.word 40167763 // zeta^720 * 2^31 = 286215^720 * 2^31 = 8735396 * 2^31 -.word 558894765 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 286215^720 * 71292929 * 2^31 -.word 50516355 // zeta^872 * 2^31 = 286215^872 * 2^31 = 255720 * 2^31 -.word 16361085 // zeta^872 * f(q^(-1) mod 2^32) * 2^31 = 286215^872 * 71292929 * 2^31 -.word 26091531 // zeta^616 * 2^31 = 286215^616 * 2^31 = 1232856 * 2^31 -.word 78878709 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 286215^616 * 71292929 * 2^31 -.word 5600575 // zeta^848 * 2^31 = 286215^848 * 2^31 = 3732990 * 2^31 -.word 238838465 // zeta^848 * f(q^(-1) mod 2^32) * 2^31 = 286215^848 * 71292929 * 2^31 -.word 4276741 // zeta^936 * 2^31 = 286215^936 * 2^31 = 7554841 * 2^31 -.word 2630845947 // zeta^936 * f(q^(-1) mod 2^32) * 2^31 = 286215^936 * 71292929 * 2^31 -.word 39557271 // zeta^680 * 2^31 = 286215^680 * 2^31 = 19859369 * 2^31 -.word 3418095465 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 286215^680 * 71292929 * 2^31 -.word 53619655 // zeta^592 * 2^31 = 286215^592 * 2^31 = 18327945 * 2^31 -.word 3320114233 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 286215^592 * 71292929 * 2^31 -.word 64080051 // zeta^808 * 2^31 = 286215^808 * 2^31 = 19611677 * 2^31 -.word 3402248013 // zeta^808 * f(q^(-1) mod 2^32) * 2^31 = 286215^808 * 71292929 * 2^31 -.word 40363327 // zeta^552 * 2^31 = 286215^552 * 2^31 = 28756497 * 2^31 -.word 3987337921 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 286215^552 * 71292929 * 2^31 -.word 64695847 // zeta^912 * 2^31 = 286215^912 * 2^31 = 5665384 * 2^31 -.word 362473945 // zeta^912 * f(q^(-1) mod 2^32) * 2^31 = 286215^912 * 71292929 * 2^31 -.word 14946375 // zeta^968 * 2^31 = 286215^968 * 2^31 = 30392408 * 2^31 -.word 1944520633 // zeta^968 * f(q^(-1) mod 2^32) * 
2^31 = 286215^968 * 71292929 * 2^31 -.word 53251923 // zeta^712 * 2^31 = 286215^712 * 2^31 = 21625705 * 2^31 -.word 3531106477 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 286215^712 * 71292929 * 2^31 -.word 42330409 // zeta^656 * 2^31 = 286215^656 * 2^31 = 32426145 * 2^31 -.word 4222123735 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 286215^656 * 71292929 * 2^31 -.word 60760121 // zeta^840 * 2^31 = 286215^840 * 2^31 = 27051869 * 2^31 -.word 3878275015 // zeta^840 * f(q^(-1) mod 2^32) * 2^31 = 286215^840 * 71292929 * 2^31 -.word 40454739 // zeta^584 * 2^31 = 286215^584 * 2^31 = 4314075 * 2^31 -.word 2423500205 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 286215^584 * 71292929 * 2^31 -.word 10468161 // zeta^784 * 2^31 = 286215^784 * 2^31 = 8491986 * 2^31 -.word 543321279 // zeta^784 * f(q^(-1) mod 2^32) * 2^31 = 286215^784 * 71292929 * 2^31 -.word 55066963 // zeta^904 * 2^31 = 286215^904 * 2^31 = 10010313 * 2^31 -.word 2787948205 // zeta^904 * f(q^(-1) mod 2^32) * 2^31 = 286215^904 * 71292929 * 2^31 -.word 62067539 // zeta^648 * 2^31 = 286215^648 * 2^31 = 22700705 * 2^31 -.word 3599885485 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 286215^648 * 71292929 * 2^31 -.word 32014745 // zeta^528 * 2^31 = 286215^528 * 2^31 = 2121761 * 2^31 -.word 2283234919 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 286215^528 * 71292929 * 2^31 -.word 57896497 // zeta^776 * 2^31 = 286215^776 * 2^31 = 14033313 * 2^31 -.word 3045341647 // zeta^776 * f(q^(-1) mod 2^32) * 2^31 = 286215^776 * 71292929 * 2^31 -.word 59857581 // zeta^520 * 2^31 = 286215^520 * 2^31 = 21856450 * 2^31 -.word 1398386003 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 286215^520 * 71292929 * 2^31 -.word 42676385 // zeta^1016 * 2^31 = 286215^1016 * 2^31 = 4406558 * 2^31 -.word 281933663 // zeta^1016 * f(q^(-1) mod 2^32) * 2^31 = 286215^1016 * 71292929 * 2^31 -.word 46825465 // zeta^1020 * 2^31 = 286215^1020 * 2^31 = 16508237 * 2^31 -.word 3203688455 // zeta^1020 * f(q^(-1) mod 2^32) * 2^31 = 286215^1020 * 71292929 * 2^31 -.word 36459513 
// zeta^764 * 2^31 = 286215^764 * 2^31 = 2280712 * 2^31 -.word 145921031 // zeta^764 * f(q^(-1) mod 2^32) * 2^31 = 286215^764 * 71292929 * 2^31 -.word 28237813 // zeta^760 * 2^31 = 286215^760 * 2^31 = 15781 * 2^31 -.word 2148493323 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 286215^760 * 71292929 * 2^31 -.word 31181531 // zeta^892 * 2^31 = 286215^892 * 2^31 = 13396396 * 2^31 -.word 857107749 // zeta^892 * f(q^(-1) mod 2^32) * 2^31 = 286215^892 * 71292929 * 2^31 -.word 30379019 // zeta^636 * 2^31 = 286215^636 * 2^31 = 16079834 * 2^31 -.word 1028795381 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 286215^636 * 71292929 * 2^31 -.word 49494049 // zeta^888 * 2^31 = 286215^888 * 2^31 = 29780798 * 2^31 -.word 1905389535 // zeta^888 * f(q^(-1) mod 2^32) * 2^31 = 286215^888 * 71292929 * 2^31 -.word 22115117 // zeta^956 * 2^31 = 286215^956 * 2^31 = 15059014 * 2^31 -.word 963482835 // zeta^956 * f(q^(-1) mod 2^32) * 2^31 = 286215^956 * 71292929 * 2^31 -.word 58687131 // zeta^700 * 2^31 = 286215^700 * 2^31 = 22429089 * 2^31 -.word 3582507365 // zeta^700 * f(q^(-1) mod 2^32) * 2^31 = 286215^700 * 71292929 * 2^31 -.word 34765151 // zeta^632 * 2^31 = 286215^632 * 2^31 = 3446166 * 2^31 -.word 220487329 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 286215^632 * 71292929 * 2^31 -.word 31362151 // zeta^828 * 2^31 = 286215^828 * 2^31 = 3754664 * 2^31 -.word 240225177 // zeta^828 * f(q^(-1) mod 2^32) * 2^31 = 286215^828 * 71292929 * 2^31 -.word 29454233 // zeta^572 * 2^31 = 286215^572 * 2^31 = 28695573 * 2^31 -.word 3983439975 // zeta^572 * f(q^(-1) mod 2^32) * 2^31 = 286215^572 * 71292929 * 2^31 -.word 54213995 // zeta^952 * 2^31 = 286215^952 * 2^31 = 14583628 * 2^31 -.word 933067413 // zeta^952 * f(q^(-1) mod 2^32) * 2^31 = 286215^952 * 71292929 * 2^31 -.word 12244249 // zeta^988 * 2^31 = 286215^988 * 2^31 = 1656389 * 2^31 -.word 2253460199 // zeta^988 * f(q^(-1) mod 2^32) * 2^31 = 286215^988 * 71292929 * 2^31 -.word 41856427 // zeta^732 * 2^31 = 286215^732 * 2^31 = 13082561 * 2^31 -.word 
2984512085 // zeta^732 * f(q^(-1) mod 2^32) * 2^31 = 286215^732 * 71292929 * 2^31 -.word 8006763 // zeta^696 * 2^31 = 286215^696 * 2^31 = 20902680 * 2^31 -.word 1337363349 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 286215^696 * 71292929 * 2^31 -.word 61228467 // zeta^860 * 2^31 = 286215^860 * 2^31 = 16386641 * 2^31 -.word 3195908685 // zeta^860 * f(q^(-1) mod 2^32) * 2^31 = 286215^860 * 71292929 * 2^31 -.word 27503845 // zeta^604 * 2^31 = 286215^604 * 2^31 = 17864119 * 2^31 -.word 3290438427 // zeta^604 * f(q^(-1) mod 2^32) * 2^31 = 286215^604 * 71292929 * 2^31 -.word 15970361 // zeta^824 * 2^31 = 286215^824 * 2^31 = 8478458 * 2^31 -.word 542455751 // zeta^824 * f(q^(-1) mod 2^32) * 2^31 = 286215^824 * 71292929 * 2^31 -.word 11828069 // zeta^924 * 2^31 = 286215^924 * 2^31 = 18178781 * 2^31 -.word 3310570651 // zeta^924 * f(q^(-1) mod 2^32) * 2^31 = 286215^924 * 71292929 * 2^31 -.word 26556411 // zeta^668 * 2^31 = 286215^668 * 2^31 = 26330978 * 2^31 -.word 1684668421 // zeta^668 * f(q^(-1) mod 2^32) * 2^31 = 286215^668 * 71292929 * 2^31 -.word 4170189 // zeta^568 * 2^31 = 286215^568 * 2^31 = 19347624 * 2^31 -.word 1237870131 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 286215^568 * 71292929 * 2^31 -.word 51861145 // zeta^796 * 2^31 = 286215^796 * 2^31 = 6343125 * 2^31 -.word 2553319783 // zeta^796 * f(q^(-1) mod 2^32) * 2^31 = 286215^796 * 71292929 * 2^31 -.word 36544089 // zeta^540 * 2^31 = 286215^540 * 2^31 = 9522304 * 2^31 -.word 609241511 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 286215^540 * 71292929 * 2^31 -.word 5906831 // zeta^984 * 2^31 = 286215^984 * 2^31 = 30717302 * 2^31 -.word 1965307505 // zeta^984 * f(q^(-1) mod 2^32) * 2^31 = 286215^984 * 71292929 * 2^31 -.word 22546241 // zeta^1004 * 2^31 = 286215^1004 * 2^31 = 28199455 * 2^31 -.word 3951698111 // zeta^1004 * f(q^(-1) mod 2^32) * 2^31 = 286215^1004 * 71292929 * 2^31 -.word 38046867 // zeta^748 * 2^31 = 286215^748 * 2^31 = 17555098 * 2^31 -.word 1123183469 // zeta^748 * f(q^(-1) mod 2^32) * 2^31 = 
286215^748 * 71292929 * 2^31 -.word 32641999 // zeta^728 * 2^31 = 286215^728 * 2^31 = 8758145 * 2^31 -.word 2707833905 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 286215^728 * 71292929 * 2^31 -.word 27734805 // zeta^876 * 2^31 = 286215^876 * 2^31 = 1331690 * 2^31 -.word 85202155 // zeta^876 * f(q^(-1) mod 2^32) * 2^31 = 286215^876 * 71292929 * 2^31 -.word 28876291 // zeta^620 * 2^31 = 286215^620 * 2^31 = 19461786 * 2^31 -.word 1245174269 // zeta^620 * f(q^(-1) mod 2^32) * 2^31 = 286215^620 * 71292929 * 2^31 -.word 30303273 // zeta^856 * 2^31 = 286215^856 * 2^31 = 4089071 * 2^31 -.word 2409104343 // zeta^856 * f(q^(-1) mod 2^32) * 2^31 = 286215^856 * 71292929 * 2^31 -.word 37621937 // zeta^940 * 2^31 = 286215^940 * 2^31 = 16812647 * 2^31 -.word 3223164751 // zeta^940 * f(q^(-1) mod 2^32) * 2^31 = 286215^940 * 71292929 * 2^31 -.word 1992249 // zeta^684 * 2^31 = 286215^684 * 2^31 = 20349512 * 2^31 -.word 1301971399 // zeta^684 * f(q^(-1) mod 2^32) * 2^31 = 286215^684 * 71292929 * 2^31 -.word 53563597 // zeta^600 * 2^31 = 286215^600 * 2^31 = 19452428 * 2^31 -.word 1244575539 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 286215^600 * 71292929 * 2^31 -.word 59370589 // zeta^812 * 2^31 = 286215^812 * 2^31 = 33077723 * 2^31 -.word 4263812003 // zeta^812 * f(q^(-1) mod 2^32) * 2^31 = 286215^812 * 71292929 * 2^31 -.word 62535263 // zeta^556 * 2^31 = 286215^556 * 2^31 = 266968 * 2^31 -.word 17080737 // zeta^556 * f(q^(-1) mod 2^32) * 2^31 = 286215^556 * 71292929 * 2^31 -.word 27877887 // zeta^920 * 2^31 = 286215^920 * 2^31 = 31965169 * 2^31 -.word 4192630273 // zeta^920 * f(q^(-1) mod 2^32) * 2^31 = 286215^920 * 71292929 * 2^31 -.word 17316887 // zeta^972 * 2^31 = 286215^972 * 2^31 = 26905985 * 2^31 -.word 3868941289 // zeta^972 * f(q^(-1) mod 2^32) * 2^31 = 286215^972 * 71292929 * 2^31 -.word 37759397 // zeta^716 * 2^31 = 286215^716 * 2^31 = 1837226 * 2^31 -.word 117546587 // zeta^716 * f(q^(-1) mod 2^32) * 2^31 = 286215^716 * 71292929 * 2^31 -.word 54874279 // zeta^664 * 2^31 = 
286215^664 * 2^31 = 3036860 * 2^31
-.word 194299737 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 286215^664 * 71292929 * 2^31
-.word 49424125 // zeta^844 * 2^31 = 286215^844 * 2^31 = 8363900 * 2^31
-.word 535126275 // zeta^844 * f(q^(-1) mod 2^32) * 2^31 = 286215^844 * 71292929 * 2^31
-.word 27346539 // zeta^588 * 2^31 = 286215^588 * 2^31 = 27959065 * 2^31
-.word 3936317845 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 286215^588 * 71292929 * 2^31
-.word 11789331 // zeta^792 * 2^31 = 286215^792 * 2^31 = 19452799 * 2^31
-.word 3392082925 // zeta^792 * f(q^(-1) mod 2^32) * 2^31 = 286215^792 * 71292929 * 2^31
-.word 40459631 // zeta^908 * 2^31 = 286215^908 * 2^31 = 24239310 * 2^31
-.word 1550842513 // zeta^908 * f(q^(-1) mod 2^32) * 2^31 = 286215^908 * 71292929 * 2^31
-.word 43261973 // zeta^652 * 2^31 = 286215^652 * 2^31 = 19190655 * 2^31
-.word 3375310827 // zeta^652 * f(q^(-1) mod 2^32) * 2^31 = 286215^652 * 71292929 * 2^31
-.word 5622871 // zeta^536 * 2^31 = 286215^536 * 2^31 = 30901251 * 2^31
-.word 4124560297 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 286215^536 * 71292929 * 2^31
-.word 8722395 // zeta^780 * 2^31 = 286215^780 * 2^31 = 27391270 * 2^31
-.word 1752506405 // zeta^780 * f(q^(-1) mod 2^32) * 2^31 = 286215^780 * 71292929 * 2^31
-.word 6423675 // zeta^524 * 2^31 = 286215^524 * 2^31 = 18242188 * 2^31
-.word 1167143813 // zeta^524 * f(q^(-1) mod 2^32) * 2^31 = 286215^524 * 71292929 * 2^31
-.word 30990117 // zeta^1000 * 2^31 = 286215^1000 * 2^31 = 988369 * 2^31
-.word 2210719963 // zeta^1000 * f(q^(-1) mod 2^32) * 2^31 = 286215^1000 * 71292929 * 2^31
-.word 65179259 // zeta^1012 * 2^31 = 286215^1012 * 2^31 = 22978057 * 2^31
-.word 3617630597 // zeta^1012 * f(q^(-1) mod 2^32) * 2^31 = 286215^1012 * 71292929 * 2^31
-.word 59951743 // zeta^756 * 2^31 = 286215^756 * 2^31 = 21060944 * 2^31
-.word 1347489153 // zeta^756 * f(q^(-1) mod 2^32) * 2^31 = 286215^756 * 71292929 * 2^31
-.word 34156189 // zeta^744 * 2^31 = 286215^744 * 2^31 = 21501702 * 2^31
-.word 1375689059 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 286215^744 * 71292929 * 2^31
-.word 26490293 // zeta^884 * 2^31 = 286215^884 * 2^31 = 26632199 * 2^31
-.word 3851424331 // zeta^884 * f(q^(-1) mod 2^32) * 2^31 = 286215^884 * 71292929 * 2^31
-.word 17634531 // zeta^628 * 2^31 = 286215^628 * 2^31 = 18214722 * 2^31
-.word 1165386525 // zeta^628 * f(q^(-1) mod 2^32) * 2^31 = 286215^628 * 71292929 * 2^31
-.word 50516355 // zeta^872 * 2^31 = 286215^872 * 2^31 = 255720 * 2^31
-.word 16361085 // zeta^872 * f(q^(-1) mod 2^32) * 2^31 = 286215^872 * 71292929 * 2^31
-.word 41972451 // zeta^948 * 2^31 = 286215^948 * 2^31 = 19587949 * 2^31
-.word 3400729885 // zeta^948 * f(q^(-1) mod 2^32) * 2^31 = 286215^948 * 71292929 * 2^31
-.word 60320869 // zeta^692 * 2^31 = 286215^692 * 2^31 = 8415094 * 2^31
-.word 538401691 // zeta^692 * f(q^(-1) mod 2^32) * 2^31 = 286215^692 * 71292929 * 2^31
-.word 26091531 // zeta^616 * 2^31 = 286215^616 * 2^31 = 1232856 * 2^31
-.word 78878709 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 286215^616 * 71292929 * 2^31
-.word 53470077 // zeta^820 * 2^31 = 286215^820 * 2^31 = 9730603 * 2^31
-.word 2770052227 // zeta^820 * f(q^(-1) mod 2^32) * 2^31 = 286215^820 * 71292929 * 2^31
-.word 48566219 // zeta^564 * 2^31 = 286215^564 * 2^31 = 10704739 * 2^31
-.word 2832377909 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 286215^564 * 71292929 * 2^31
-.word 4276741 // zeta^936 * 2^31 = 286215^936 * 2^31 = 7554841 * 2^31
-.word 2630845947 // zeta^936 * f(q^(-1) mod 2^32) * 2^31 = 286215^936 * 71292929 * 2^31
-.word 16490153 // zeta^980 * 2^31 = 286215^980 * 2^31 = 408482 * 2^31
-.word 26134871 // zeta^980 * f(q^(-1) mod 2^32) * 2^31 = 286215^980 * 71292929 * 2^31
-.word 28235681 // zeta^724 * 2^31 = 286215^724 * 2^31 = 26159215 * 2^31
-.word 3821162591 // zeta^724 * f(q^(-1) mod 2^32) * 2^31 = 286215^724 * 71292929 * 2^31
-.word 39557271 // zeta^680 * 2^31 = 286215^680 * 2^31 = 19859369 * 2^31
-.word 3418095465 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 286215^680 * 71292929 * 2^31
-.word 20173291 // zeta^852 * 2^31 = 286215^852 * 2^31 = 29591261 * 2^31
-.word 4040746517 // zeta^852 * f(q^(-1) mod 2^32) * 2^31 = 286215^852 * 71292929 * 2^31
-.word 53760749 // zeta^596 * 2^31 = 286215^596 * 2^31 = 15340829 * 2^31
-.word 3128997139 // zeta^596 * f(q^(-1) mod 2^32) * 2^31 = 286215^596 * 71292929 * 2^31
-.word 64080051 // zeta^808 * 2^31 = 286215^808 * 2^31 = 19611677 * 2^31
-.word 3402248013 // zeta^808 * f(q^(-1) mod 2^32) * 2^31 = 286215^808 * 71292929 * 2^31
-.word 1785087 // zeta^916 * 2^31 = 286215^916 * 2^31 = 1487922 * 2^31
-.word 95197953 // zeta^916 * f(q^(-1) mod 2^32) * 2^31 = 286215^916 * 71292929 * 2^31
-.word 39175009 // zeta^660 * 2^31 = 286215^660 * 2^31 = 2082830 * 2^31
-.word 133260447 // zeta^660 * f(q^(-1) mod 2^32) * 2^31 = 286215^660 * 71292929 * 2^31
-.word 40363327 // zeta^552 * 2^31 = 286215^552 * 2^31 = 28756497 * 2^31
-.word 3987337921 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 286215^552 * 71292929 * 2^31
-.word 5942977 // zeta^788 * 2^31 = 286215^788 * 2^31 = 31126270 * 2^31
-.word 1991473471 // zeta^788 * f(q^(-1) mod 2^32) * 2^31 = 286215^788 * 71292929 * 2^31
-.word 46872159 // zeta^532 * 2^31 = 286215^532 * 2^31 = 11000739 * 2^31
-.word 2851316129 // zeta^532 * f(q^(-1) mod 2^32) * 2^31 = 286215^532 * 71292929 * 2^31
-.word 14946375 // zeta^968 * 2^31 = 286215^968 * 2^31 = 30392408 * 2^31
-.word 1944520633 // zeta^968 * f(q^(-1) mod 2^32) * 2^31 = 286215^968 * 71292929 * 2^31
-.word 54391843 // zeta^996 * 2^31 = 286215^996 * 2^31 = 5809404 * 2^31
-.word 371688413 // zeta^996 * f(q^(-1) mod 2^32) * 2^31 = 286215^996 * 71292929 * 2^31
-.word 16695421 // zeta^740 * 2^31 = 286215^740 * 2^31 = 15164721 * 2^31
-.word 3117729667 // zeta^740 * f(q^(-1) mod 2^32) * 2^31 = 286215^740 * 71292929 * 2^31
-.word 53251923 // zeta^712 * 2^31 = 286215^712 * 2^31 = 21625705 * 2^31
-.word 3531106477 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 286215^712 * 71292929 * 2^31
-.word 44381723 // zeta^868 * 2^31 = 286215^868 * 2^31 = 23877757 * 2^31
-.word 3675193829 // zeta^868 * f(q^(-1) mod 2^32) * 2^31 = 286215^868 * 71292929 * 2^31
-.word 66751131 // zeta^612 * 2^31 = 286215^612 * 2^31 = 3914592 * 2^31
-.word 250457445 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 286215^612 * 71292929 * 2^31
-.word 60760121 // zeta^840 * 2^31 = 286215^840 * 2^31 = 27051869 * 2^31
-.word 3878275015 // zeta^840 * f(q^(-1) mod 2^32) * 2^31 = 286215^840 * 71292929 * 2^31
-.word 43981805 // zeta^932 * 2^31 = 286215^932 * 2^31 = 11142392 * 2^31
-.word 712895507 // zeta^932 * f(q^(-1) mod 2^32) * 2^31 = 286215^932 * 71292929 * 2^31
-.word 19851773 // zeta^676 * 2^31 = 286215^676 * 2^31 = 25206915 * 2^31
-.word 3760233987 // zeta^676 * f(q^(-1) mod 2^32) * 2^31 = 286215^676 * 71292929 * 2^31
-.word 40454739 // zeta^584 * 2^31 = 286215^584 * 2^31 = 4314075 * 2^31
-.word 2423500205 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 286215^584 * 71292929 * 2^31
-.word 66659489 // zeta^804 * 2^31 = 286215^804 * 2^31 = 16126790 * 2^31
-.word 1031799647 // zeta^804 * f(q^(-1) mod 2^32) * 2^31 = 286215^804 * 71292929 * 2^31
-.word 2982887 // zeta^548 * 2^31 = 286215^548 * 2^31 = 1752167 * 2^31
-.word 2259588121 // zeta^548 * f(q^(-1) mod 2^32) * 2^31 = 286215^548 * 71292929 * 2^31
-.word 55066963 // zeta^904 * 2^31 = 286215^904 * 2^31 = 10010313 * 2^31
-.word 2787948205 // zeta^904 * f(q^(-1) mod 2^32) * 2^31 = 286215^904 * 71292929 * 2^31
-.word 26160385 // zeta^964 * 2^31 = 286215^964 * 2^31 = 6613966 * 2^31
-.word 423164671 // zeta^964 * f(q^(-1) mod 2^32) * 2^31 = 286215^964 * 71292929 * 2^31
-.word 61355527 // zeta^708 * 2^31 = 286215^708 * 2^31 = 10235835 * 2^31
-.word 2802377209 // zeta^708 * f(q^(-1) mod 2^32) * 2^31 = 286215^708 * 71292929 * 2^31
-.word 62067539 // zeta^648 * 2^31 = 286215^648 * 2^31 = 22700705 * 2^31
-.word 3599885485 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 286215^648 * 71292929 * 2^31
-.word 18833475 // zeta^836 * 2^31 = 286215^836 * 2^31 = 19186493 * 2^31
-.word 3375044541 // zeta^836 * f(q^(-1) mod 2^32) * 2^31 = 286215^836 * 71292929 * 2^31
-.word 53374797 // zeta^580 * 2^31 = 286215^580 * 2^31 = 14411664 * 2^31
-.word 922065075 // zeta^580 * f(q^(-1) mod 2^32) * 2^31 = 286215^580 * 71292929 * 2^31
-.word 57896497 // zeta^776 * 2^31 = 286215^776 * 2^31 = 14033313 * 2^31
-.word 3045341647 // zeta^776 * f(q^(-1) mod 2^32) * 2^31 = 286215^776 * 71292929 * 2^31
-.word 55382253 // zeta^900 * 2^31 = 286215^900 * 2^31 = 6317924 * 2^31
-.word 404223763 // zeta^900 * f(q^(-1) mod 2^32) * 2^31 = 286215^900 * 71292929 * 2^31
-.word 26227005 // zeta^644 * 2^31 = 286215^644 * 2^31 = 17537602 * 2^31
-.word 1122064067 // zeta^644 * f(q^(-1) mod 2^32) * 2^31 = 286215^644 * 71292929 * 2^31
-.word 59857581 // zeta^520 * 2^31 = 286215^520 * 2^31 = 21856450 * 2^31
-.word 1398386003 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 286215^520 * 71292929 * 2^31
-.word 60426631 // zeta^772 * 2^31 = 286215^772 * 2^31 = 4231493 * 2^31
-.word 2418216569 // zeta^772 * f(q^(-1) mod 2^32) * 2^31 = 286215^772 * 71292929 * 2^31
-.word 32956195 // zeta^516 * 2^31 = 286215^516 * 2^31 = 15772976 * 2^31
-.word 1009162461 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 286215^516 * 71292929 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_1024_u32_33564673_286215_incomplete_bitrev, %function
-.global ntt_1024_u32_33564673_286215_incomplete_bitrev
-ntt_1024_u32_33564673_286215_incomplete_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-// Use r10 as marker for r0 + 4032
-add r10, r11, #1008
-.equ modulus, 33564673
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q1, Q0, r7
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vmul.u32 Q0, Q0, r6
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r9
-vqrdmulh.s32 Q4, Q2, r7
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r9
-// input[524]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q3, r5
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmlah.s32 Q6, Q3, r9
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[524]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vmul.u32 Q4, Q4, r6
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vqrdmlah.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r7
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r9
-// input[512]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r9
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r5
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r12,#(80)]
-// Release input[524] from Q4
-vqrdmlah.s32 Q6, Q3, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[268]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vmul.u32 Q1, Q1, r6
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(32)]
-// Release input[512] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[780]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(48)]
-// Release input[264] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[780]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[772]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vmul.u32 Q0, Q0, r6
-// input[776]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[768]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(96)]
-// Release input[780] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(64)]
-// Release input[772] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(80)]
-// Release input[776] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[140]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release input[768] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[652]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[652]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[644]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmul.u32 Q1, Q1, r6
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[396]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-416)]
-// Release input[652] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[644] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[396]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q0, Q0, r6
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[908]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[908]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[900]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[904]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[896]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-400)]
-// Release input[908] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-432)]
-// Release input[900] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-416)]
-// Release input[904] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[76]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vmul.u32 Q1, Q1, r6
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r10,#(-448)]
-// Release input[896] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[588]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[588]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q0, Q0, r6
-// input[584]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[576]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(336)]
-// Release input[588] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(320)]
-// Release input[584] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[332]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[324]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 72)]
-vmul.u32 Q2, Q2, r6
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(288)]
-// Release input[576] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[844]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[844]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[836]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 80)]
-vmul.u32 Q1, Q1, r6
-// input[840]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[832]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(352)]
-// Release input[844] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(320)]
-// Release input[836] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(336)]
-// Release input[840] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[196]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -56)]
-vmul.u32 Q0, Q0, r6
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(304)]
-// Release input[832] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[716]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-224)]
-// Release input[196] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[716]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[708]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vmul.u32 Q2, Q2, r6
-// input[712]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-160)]
-// Release input[716] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-176)]
-// Release input[712] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[460]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[452]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -52)]
-vmul.u32 Q1, Q1, r6
-// input[456]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-208)]
-// Release input[704] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[972]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-208)]
-// Release input[452] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-192)]
-// Release input[456] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[972]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[964]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -44)]
-vmul.u32 Q0, Q0, r6
-// input[968]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[960]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-144)]
-// Release input[972] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-176)]
-// Release input[964] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-160)]
-// Release input[968] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[44]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r10,#(-192)]
-// Release input[960] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[556]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[556]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(128)]
-// Release input[32] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[544]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[300]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(208)]
-// Release input[556] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[300]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q0, Q0, r6
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(160)]
-// Release input[544] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[812]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(192)]
-// Release input[300] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[812]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[804]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 48)]
-vmul.u32 Q2, Q2, r6
-// input[808]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 52)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[800]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[172]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(224)]
-// Release input[812] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(192)]
-// Release input[804] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(208)]
-// Release input[808] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[172]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(176)]
-// Release input[800] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[684]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-320)]
-// Release input[172] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[684]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q0, Q0, r6
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[672]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[428]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-288)]
-// Release input[684] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[428]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-336)]
-// Release input[672] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[416]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[940]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-304)]
-// Release input[428] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[940]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[932]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[936]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-352)]
-// Release input[416] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[928]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[108]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-272)]
-// Release input[940] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-304)]
-// Release input[932] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[108]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q0, Q0, r6
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r10,#(-320)]
-// Release input[928] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[620]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(432)]
-// Release input[108] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[620]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q2, Q2, r6
-// input[616]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 112)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(384)]
-// Release input[96] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[608]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[364]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(464)]
-// Release input[620] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[364]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(416)]
-// Release input[608] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[352]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[876]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(448)]
-// Release input[364] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[876]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[868]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 112)]
-vmul.u32 Q0, Q0, r6
-// input[872]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(400)]
-// Release input[352] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[864]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(480)]
-// Release input[876] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(448)]
-// Release input[868] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(464)]
-// Release input[872] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[236]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q2, Q2, r6
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(432)]
-// Release input[864] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[748]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[748]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[736]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-32)]
-// Release input[748] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q0, Q0, r6
-// input[488]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -16)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-80)]
-// Release input[736] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[480]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1004]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1004]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[996]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -12)]
-vmul.u32 Q2, Q2, r6
-// input[1000]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -8)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-96)]
-// Release
input[480] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[992]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-16)] -// Release input[1004] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-48)] -// Release input[996] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-32)] -// Release input[1000] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q1, Q1, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-64)] -// Release input[992] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[16]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[540]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vmul.u32 Q0, Q0, r6 -// input[536]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(64)] -// Release input[16] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 
-vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(128)] -// Release input[536] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q2, Q2, r6 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[272]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[796]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[796]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[788]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmul.u32 Q1, Q1, r6 -// input[792]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(80)] -// Release input[272] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[784]: Load as Q2 
-vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(160)] -// Release input[796] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(128)] -// Release input[788] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(144)] -// Release input[792] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[668]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-384)] -// Release input[156] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[668]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[656]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q2, r3 
-vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[412]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-352)] -// Release input[668] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[412]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q1, Q1, r6 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-400)] -// Release input[656] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[400]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-368)] -// Release input[412] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(-416)] -// Release input[400] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[912]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, 
Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-352)] -// Release input[920] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-384)] -// Release input[912] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[604]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[604]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q1, Q1, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[592]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[348]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 96)] 
-vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(400)] -// Release input[604] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[348]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q0, Q0, r6 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(352)] -// Release input[592] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[336]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[860]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(384)] -// Release input[348] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[860]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[852]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(336)] -// Release input[336] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[848]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 92)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(416)] -// Release input[860] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(384)] -// Release input[852] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q1, Q1, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(368)] -// Release input[848] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[724]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -32)] -vmul.u32 Q0, Q0, r6 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[476]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release 
input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-128)] -// Release input[724] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[476]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[468]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -36)] -vmul.u32 Q2, Q2, r6 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[464]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-112)] -// Release input[476] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-144)] -// Release input[468] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[980]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -28)] -vmul.u32 Q1, Q1, r6 -// input[984]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(-160)] -// Release input[464] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[976]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r10,#(-112)] -// Release input[980] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-96)] -// Release input[984] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmul.u32 Q0, Q0, r6 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-128)] -// Release input[976] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(224)] -// Release input[56] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[572]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[564]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 60)] -vmul.u32 Q2, Q2, r6 -// input[568]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 64)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[560]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(240)] -// Release input[564] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, 
[r12,#(256)] -// Release input[568] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[316]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vmul.u32 Q1, Q1, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(224)] -// Release input[560] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(240)] -// Release input[312] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[820]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 64)] -vmul.u32 Q0, Q0, r6 -// input[824]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 68)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(256)] -// Release input[820] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(272)] -// Release input[824] from Q4 -vadd.s32 Q1, Q1, Q6 -// 
input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[184]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -68)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[700]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-272)] -// Release input[184] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[700]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q1, Q1, r6 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[688]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[444]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[700] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-240)] -// Release input[696] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[444]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// 
input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[440]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -64)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release input[688] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[432]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-240)] -// Release input[444] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-256)] -// Release input[440] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[948]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[952]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -56)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(-288)] -// Release input[432] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[944]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-240)] -// Release input[948] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-224)] -// Release input[952] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] 
-vmul.u32 Q1, Q1, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-256)] -// Release input[944] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[628]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 124)] -vmul.u32 Q0, Q0, r6 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(496)] -// Release input[628] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q2, Q2, r6 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, 
#(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[892]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[892]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[884]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmul.u32 Q1, Q1, r6 -// input[888]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(464)] -// Release input[368] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[880]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-464)] -// Release input[892] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-496)] -// Release input[884] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-480)] -// Release input[888] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmul.u32 Q0, Q0, r6 -// input[248]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -4)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(496)] 
-// Release input[880] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[764]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-32)]
-// Release input[244] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[764]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[756]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[760]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[752]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[508]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(32)]
-// Release input[764] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(0)]
-// Release input[756] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(16)]
-// Release input[760] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[508]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[500]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -4)]
-vmul.u32 Q1, Q1, r6
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-16)]
-// Release input[752] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1020]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(16)]
-// Release input[508] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-16)]
-// Release input[500] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(0)]
-// Release input[504] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1020]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[1012]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * 4)]
-vmul.u32 Q0, Q0, r6
-// input[1016]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[1008]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(48)]
-// Release input[1020] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(16)]
-// Release input[1012] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(32)]
-// Release input[1016] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q2, Q2, r6
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r10,#(0)]
-// Release input[1008] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[560]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[560]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[528]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 24)]
-vmul.u32 Q1, Q1, r6
-// input[544]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 40)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(224)]
-// Release input[560] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(96)]
-// Release input[528] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q0, Q0, r6
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[816]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[816]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[784]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vmul.u32 Q2, Q2, r6
-// input[800]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[768]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(240)]
-// Release input[816] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(112)]
-// Release input[784] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release input[800] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q1, Q1, r6
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(48)]
-// Release input[768] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-304)]
-// Release input[176] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[688]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[656]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[640]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[656] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[432]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q2, Q2, r6
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-464)]
-// Release input[640] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[944]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[944]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[912]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -96)]
-vmul.u32 Q1, Q1, r6
-// input[928]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[896]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-256)]
-// Release input[944] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-384)]
-// Release input[912] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-320)]
-// Release input[928] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r6
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r10,#(-448)]
-// Release input[896] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[624]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[624]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q2, Q2, r6
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[576]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[368]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(480)]
-// Release input[624] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[368]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(288)]
-// Release input[576] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[880]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(464)]
-// Release input[368] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[880]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[848]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[864]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[832]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release input[880] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(368)]
-// Release input[848] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(432)]
-// Release input[864] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q2, Q2, r6
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(304)]
-// Release input[832] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[752]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[752]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vmul.u32 Q1, Q1, r6
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[704]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-16)]
-// Release input[752] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[464]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -40)]
-vmul.u32 Q0, Q0, r6
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-208)]
-// Release input[704] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1008]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-160)]
-// Release input[464] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1008]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[976]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -32)]
-vmul.u32 Q2, Q2, r6
-// input[992]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -16)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[960]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(0)]
-// Release input[1008] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-128)]
-// Release input[976] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-64)]
-// Release input[992] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[56]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r6
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r10,#(-192)]
-// Release input[960] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(224)]
-// Release input[56] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[568]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[312]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[312]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q2, Q2, r6
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(64)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[824]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(240)]
-// Release input[312] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[824]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[792]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[808]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 52)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[776]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[824] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(144)]
-// Release input[792] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(208)]
-// Release input[808] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(80)]
-// Release input[776] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[696]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[696]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vmul.u32 Q2, Q2, r6
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[648]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-240)]
-// Release input[696] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[440]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q1, Q1, r6
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-432)]
-// Release input[648] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[392]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[952]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-256)]
-// Release input[440] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[952]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[920]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -88)]
-vmul.u32 Q0, Q0, r6
-// input[936]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -72)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-448)]
-// Release input[392] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[904]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -104)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-224)]
-// Release input[952] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-352)]
-// Release input[920] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q2, Q2, r6
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r10,#(-416)]
-// Release input[904] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[632]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[632]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q1, Q1, r6
-// input[616]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 112)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[584]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-496)]
-// Release input[632] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(320)]
-// Release input[584] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[888]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[888]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[856]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[872]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(304)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[840]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 84)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-480)]
-// Release input[888] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(400)]
-// Release input[856] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(464)]
-// Release input[872] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q1, Q1, r6
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(336)]
-// Release input[840] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[760]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[760]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[504]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(16)]
-// Release input[760] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[504]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q2, Q2, r6
-// input[488]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -16)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-176)]
-// Release input[712] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1016]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(0)]
-// Release input[504] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1016]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[984]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -24)]
-vmul.u32 Q1, Q1, r6
-// input[1000]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -8)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[968]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(32)]
-// Release input[1016] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-96)]
-// Release input[984] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-32)]
-// Release input[1000] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q0, Q0, r6
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r10,#(-160)]
-// Release input[968] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[564]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[564]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[532]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]
-vmul.u32 Q2, Q2, r6
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[516]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[308]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(240)]
-// Release input[564] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(112)]
-// Release input[532] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q1, Q1, r6
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(48)]
-// Release input[516] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[820]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(224)]
-// Release input[308] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[820]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[788]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[804]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[772]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(256)]
-// Release input[820] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(128)]
-// Release input[788] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release input[804] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q2, Q2, r6
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(64)]
-// Release input[772] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[692]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[692]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[660]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vmul.u32 Q1, Q1, r6
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[644]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32
Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-256)] -// Release input[692] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[436]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q0, Q0, r6 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-448)] -// Release input[644] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[388]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[948]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[948]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q2, Q2, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(-464)] -// Release input[388] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[900]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -108)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 
-// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-240)] -// Release input[948] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[116]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q1, Q1, r6 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-432)] -// Release input[900] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[628]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(464)] -// Release input[116] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[628]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q0, Q0, r6 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[580]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, 
r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(496)] -// Release input[628] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[372]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q2, Q2, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(304)] -// Release input[580] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[324]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[884]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[884]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[852]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmul.u32 Q1, Q1, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(288)] -// Release input[324] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[836]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q1, [r10,#(-496)] -// Release input[884] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(384)] -// Release input[852] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(320)] -// Release input[836] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[724]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -32)] -vmul.u32 Q2, Q2, r6 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[708]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 
-vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-128)] -// Release input[724] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[468]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -36)] -vmul.u32 Q1, Q1, r6 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-192)] -// Release input[708] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[452]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1012]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-144)] -// Release input[468] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1012]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[980]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -28)] -vmul.u32 Q0, Q0, r6 -// input[996]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -12)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(-208)] -// Release input[452] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[964]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(16)] -// Release input[1012] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-112)] -// Release 
input[980] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-48)] -// Release input[996] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r6 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-176)] -// Release input[964] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[572]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 
-vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q0, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[828]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[828]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[796]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmul.u32 Q2, Q2, r6 -// input[812]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(288)] -// Release input[828] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(160)] -// Release input[796] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(224)] -// Release input[812] from 
Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[700]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[700]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q0, Q0, r6 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[652]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[444]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-224)] -// Release input[700] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[444]: Already loaded as Q2 
-vqrdmulh.s32 Q0, Q2, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q2, Q2, r6 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-416)] -// Release input[652] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[396]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-240)] -// Release input[444] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(-432)] -// Release input[396] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[908]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-336)] -// Release input[924] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-272)] -// Release input[940] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[92]: Load as Q3 
-vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-400)] -// Release input[908] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[636]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[636]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q2, Q2, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[588]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[380]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-480)] -// Release input[636] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[380]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q1, Q1, r6 -// input[364]: Load as 
Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[892]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-496)] -// Release input[380] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[892]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[860]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 104)] -vmul.u32 Q0, Q0, r6 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[844]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 88)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-464)] -// Release input[892] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(416)] -// Release input[860] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(480)] -// Release input[876] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmul.u32 Q2, Q2, r6 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r9 
-vstrw.u32 Q1, [r11,#(352)] -// Release input[844] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vmul.u32 Q1, Q1, r6 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[716]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[508]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-96)] -// Release input[732] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[508]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[476]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -28)] -vmul.u32 Q0, Q0, r6 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-160)] -// Release input[716] from Q2 -vqrdmulh.s32 
Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1020]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(16)] -// Release input[508] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-112)] -// Release input[476] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-48)] -// Release input[492] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1020]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[988]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -20)] -vmul.u32 Q2, Q2, r6 -// input[1004]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -4)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(48)] -// Release input[1020] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-80)] -// Release input[988] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-16)] -// Release input[1004] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[192]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmul.u32 Q1, Q1, r6 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[704]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[576]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 72)]
-vmul.u32 Q0, Q0, r6
-// input[640]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[512]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-208)]
-// Release input[704] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(288)]
-// Release input[576] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release input[640] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vmul.u32 Q2, Q2, r6
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(32)]
-// Release input[512] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[960]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[960]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[832]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[896]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -112)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[768]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-192)]
-// Release input[960] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(304)]
-// Release input[832] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-448)]
-// Release input[896] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[224]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q0, Q0, r6
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(48)]
-// Release input[768] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[736]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[736]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q2, Q2, r6
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[544]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[480]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-80)]
-// Release input[736] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[480]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q1, Q1, r6
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(160)]
-// Release input[544] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[288]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[992]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-96)]
-// Release input[480] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[992]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[864]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[928]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(144)]
-// Release input[288] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[800]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-64)]
-// Release input[992] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(432)]
-// Release input[864] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-320)]
-// Release input[928] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q2, Q2, r6
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(176)]
-// Release input[800] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[720]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[720]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[528]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release input[720] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[464]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q0, Q0, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(96)]
-// Release input[528] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[272]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[976]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-160)]
-// Release input[464] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[976]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[848]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 92)]
-vmul.u32 Q2, Q2, r6
-// input[912]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(80)]
-// Release input[272] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[784]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-128)]
-// Release input[976] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(368)]
-// Release input[848] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-384)]
-// Release input[912] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(112)]
-// Release input[784] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[752]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[752]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[624]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 120)]
-vmul.u32 Q0, Q0, r6
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[560]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-16)]
-// Release input[752] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(480)]
-// Release input[624] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q2, Q2, r6
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(224)]
-// Release input[560] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1008]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-288)]
-// Release input[432] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1008]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[880]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[944]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -64)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[816]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 60)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(0)]
-// Release input[1008] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(496)]
-// Release input[880] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-256)]
-// Release input[944] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q0, Q0, r6
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(240)]
-// Release input[816] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[712]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[712]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q2, Q2, r6
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[520]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-176)]
-// Release input[712] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[456]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(64)]
-// Release input[520] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[968]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-192)]
-// Release input[456] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[968]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[840]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vmul.u32 Q0, Q0, r6
-// input[904]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[776]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-160)]
-// Release input[968] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release input[840] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-416)]
-// Release input[904] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q2, Q2, r6
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(80)]
-// Release input[776] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[744]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[744]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[552]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[488]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-48)]
-// Release input[744] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[488]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(192)]
-// Release input[552] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1000]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-64)]
-// Release input[488] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1000]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[872]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vmul.u32 Q2, Q2, r6
-// input[936]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -72)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(176)]
-// Release input[296] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[808]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-32)]
-// Release input[1000] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release input[872] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(208)]
-// Release input[808] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[728]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[728]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q0, Q0, r6
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[536]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-112)]
-// Release input[728] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q2, Q2, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(128)]
-// Release input[536] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[984]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[984]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[856]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 100)]
-vmul.u32 Q1, Q1, r6
-// input[920]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[792]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-96)]
-// Release input[984] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(400)]
-// Release input[856] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-352)]
-// Release input[920] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q0, Q0, r6
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(144)]
-// Release input[792] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[760]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[760]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q2, Q2, r6
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(224)]
-// Release input[56] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(16)]
-// Release input[760] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[312]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 60)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1016]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-256)]
-// Release input[440] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1016]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[888]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -120)]
-vmul.u32 Q0, Q0, r6
-// input[952]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -56)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(240)]
-// Release input[312] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[824]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(32)]
-// Release input[1016] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-480)]
-// Release input[888] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-224)]
-// Release input[952] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[196]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vmul.u32 Q2, Q2, r6
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[824] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[708]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[708]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-192)]
-// Release input[708] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[452]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[324]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 72)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[964]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[964]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[836]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 80)]
-vmul.u32 Q2, Q2, r6
-// input[900]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[772]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-176)]
-// Release input[964] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(320)]
-// Release input[836] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-432)]
-// Release input[900] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[228]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q1, Q1, r6
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(64)]
-// Release input[772] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[740]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[740]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[548]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-64)]
-// Release input[740] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q2, Q2, r6
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(176)]
-// Release input[548] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[996]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[996]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[868]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[932]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -76)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[804]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[212]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-48)]
-// Release input[996] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(448)]
-// Release input[868] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-304)]
-// Release input[932] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[212]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q0, Q0, r6
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(192)]
-// Release input[804] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[20]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-160)]
-// Release input[212] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[724]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q2, Q2, r6 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(80)] -// Release input[20] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[532]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[468]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-128)] -// Release input[724] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[468]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r6 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(112)] -// Release input[532] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[980]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-144)] -// Release input[468] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[980]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[852]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmul.u32 
Q0, Q0, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-112)] -// Release input[980] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(384)] -// Release input[852] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q2, Q2, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[756]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[628]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 124)] -vmul.u32 Q1, 
Q1, r6 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[564]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(496)] -// Release input[628] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(240)] -// Release input[564] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1012]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1012]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[884]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmul.u32 Q2, Q2, r6 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] 
-vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[820]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(16)] -// Release input[1012] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-496)] -// Release input[884] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-240)] -// Release input[948] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmul.u32 Q1, Q1, r6 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(256)] -// Release input[820] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[588]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 84)] -vmul.u32 Q0, Q0, r6 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] 
-vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[460]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(336)] -// Release input[588] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[460]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmul.u32 Q2, Q2, r6 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[268]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[972]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-432)] -// Release input[396] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[972]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[844]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmul.u32 Q1, Q1, r6 -// input[908]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(64)] -// Release 
input[268] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[780]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-144)] -// Release input[972] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release input[844] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-400)] -// Release input[908] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[236]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(96)] -// Release input[780] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q2, Q2, r6 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(176)] -// Release 
input[44] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[556]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[492]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(464)] -// Release input[620] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[492]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q1, Q1, r6 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(208)] -// Release input[556] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[300]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1004]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-48)] -// Release input[492] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1004]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[876]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(192)] -// Release input[300] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-16)] -// Release input[1004] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release input[876] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-272)] -// Release input[940] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q2, Q2, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[732]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[732]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[540]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[476]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-96)] -// Release input[732] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q0, Q0, r6 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(144)] -// Release input[540] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[284]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[860]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 104)] -vmul.u32 Q2, Q2, r6 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(128)] -// Release input[284] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 
-// input[796]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(416)] -// Release input[860] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vmul.u32 Q1, Q1, r6 -// input[188]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release input[796] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-256)] -// Release input[188] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vmul.u32 Q0, Q0, r6 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// 
input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-480)] -// Release input[636] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmul.u32 Q2, Q2, r6 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1020]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-240)] -// Release input[444] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1020]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[892]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -116)] -vmul.u32 Q1, Q1, r6 -// input[956]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -52)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[828]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vqrdmulh.s32 
Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(48)] -// Release input[1020] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-464)] -// Release input[892] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-208)] -// Release input[956] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[768]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vmul.u32 Q0, Q0, r6 -// input[512]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(288)] -// Release input[828] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[896]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(32)] -// Release input[512] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[896]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[640]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// 
input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[832]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-448)] -// Release input[896] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release input[640] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[832]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[320]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 68)] -vmul.u32 Q1, Q1, r6 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[960]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(304)] -// Release input[832] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(272)] -// Release input[320] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[448]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[704]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-192)] -// Release input[960] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-224)] -// Release input[448] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release input[704] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[800]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[544]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 40)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[928]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(160)] -// Release input[544] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[928]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q1, Q1, r6 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(128)] -// 
Release input[32] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[864]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-320)]
-// Release input[928] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[864]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q0, Q0, r6
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[992]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(432)]
-// Release input[864] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[992]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q2, Q2, r6
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(384)]
-// Release input[96] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[784]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-64)]
-// Release input[992] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[784]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q1, Q1, r6
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[16]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[912]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(112)]
-// Release input[784] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[912]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q0, Q0, r6
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(64)]
-// Release input[16] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[848]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-384)]
-// Release input[912] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[848]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[976]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(368)]
-// Release input[848] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[976]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[464]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -40)]
-vmul.u32 Q1, Q1, r6
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(320)]
-// Release input[80] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[816]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-128)]
-// Release input[976] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-160)]
-// Release input[464] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[816]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q0, Q0, r6
-// input[560]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[944]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(240)]
-// Release input[816] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(224)]
-// Release input[560] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[944]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[432]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[880]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-256)]
-// Release input[944] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-288)]
-// Release input[432] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[880]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q1, Q1, r6
-// input[624]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[112]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[1008]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(496)]
-// Release input[880] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(480)]
-// Release input[624] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1008]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[496]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -8)]
-vmul.u32 Q0, Q0, r6
-// input[752]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(448)]
-// Release input[112] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[776]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(0)]
-// Release input[1008] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-32)]
-// Release input[496] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-16)]
-// Release input[752] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[776]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vmul.u32 Q2, Q2, r6
-// input[520]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 16)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[8]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[904]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(80)]
-// Release input[776] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(64)]
-// Release input[520] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[904]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q1, Q1, r6
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(32)]
-// Release input[8] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[840]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-416)]
-// Release input[904] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[840]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q0, Q0, r6
-// input[584]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[968]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(336)]
-// Release input[840] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(320)]
-// Release input[584] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[968]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q2, Q2, r6
-// input[712]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[808]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-160)]
-// Release input[968] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-176)]
-// Release input[712] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[808]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[936]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(208)]
-// Release input[808] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[936]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q0, Q0, r6
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[872]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-288)]
-// Release input[936] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[872]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q2, Q2, r6
-// input[616]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 112)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-336)]
-// Release input[168] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[104]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[1000]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(464)]
-// Release input[872] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1000]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(416)]
-// Release input[104] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[792]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-32)]
-// Release input[1000] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[792]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q0, Q0, r6
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[920]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(144)]
-// Release input[792] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[920]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[152]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -100)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[856]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-352)]
-// Release input[920] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[856]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q1, Q1, r6
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-400)]
-// Release input[152] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[984]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(400)]
-// Release input[856] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[984]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q0, Q0, r6
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[824]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-96)]
-// Release input[984] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-112)]
-// Release input[728] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[824]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vmul.u32 Q2, Q2, r6
-// input[568]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 64)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[952]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(272)]
-// Release input[824] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(256)]
-// Release input[568] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[952]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q1, Q1, r6
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[888]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-224)]
-// Release input[952] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[888]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q0, Q0, r6
-// input[632]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -124)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[1016]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-480)]
-// Release input[888] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-496)]
-// Release input[632] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1016]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[760]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[772]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(32)]
-// Release input[1016] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(0)]
-// Release input[504] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(16)]
-// Release input[760] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[772]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vmul.u32 Q1, Q1, r6
-// input[516]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[900]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(64)]
-// Release input[772] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(48)]
-// Release input[516] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[900]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q0, Q0, r6
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[836]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-432)]
-// Release input[900] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[836]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[324]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 72)]
-vmul.u32 Q2, Q2, r6
-// input[580]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 76)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[68]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[964]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(320)]
-// Release input[836] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(304)]
-// Release input[580] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[964]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[452]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -52)]
-vmul.u32 Q1, Q1, r6
-// input[708]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(272)]
-// Release input[68] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[804]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-176)]
-// Release input[964] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-208)]
-// Release input[452] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release input[708] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[804]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q0, Q0, r6
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[932]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(192)]
-// Release input[804] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[932]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[164]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -88)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[868]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-304)]
-// Release input[932] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[868]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-352)]
-// Release input[164] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[996]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(448)]
-// Release input[868] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[996]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q0, Q0, r6
-// input[740]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[788]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-48)]
-// Release input[996] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[788]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q2, Q2, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[20]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[916]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(128)]
-// Release input[788] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[916]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[404]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -100)]
-vmul.u32 Q1, Q1, r6
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(80)]
-// Release input[20] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[852]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-368)]
-// Release input[916] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-400)]
-// Release input[404] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[852]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q0, Q0, r6
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-416)]
-// Release input[148] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[84]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[980]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(384)]
-// Release input[852] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[980]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[468]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(336)]
-// Release input[84] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[212]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[820]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-112)]
-// Release input[980] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-144)]
-// Release input[468]
from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[820]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vmul.u32 Q1, Q1, r6 -// input[564]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-160)] -// Release input[212] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[948]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// Release input[820] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[948]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-240)] -// Release input[948] from Q0 
-vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[884]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q2, Q2, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1012]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[884] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1012]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[500]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -4)] -vmul.u32 Q1, Q1, r6 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(16)] -// Release input[1012] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-16)] -// Release input[500] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[780]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[268]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 16)] -vmul.u32 Q0, Q0, r6 -// input[524]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[908]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(64)] -// Release input[268] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(80)] -// Release input[524] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[396]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[844]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 88)] 
-vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-400)] -// Release input[908] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-432)] -// Release input[396] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmul.u32 Q1, Q1, r6 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(352)] -// Release input[844] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[460]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -44)] -vmul.u32 Q0, Q0, r6 -// input[716]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 
Q5, Q0, r9 -// input[812]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-176)] -// Release input[460] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-160)] -// Release input[716] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmul.u32 Q2, Q2, r6 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[940]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(224)] -// Release input[812] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q5, Q1, r3 
-vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-272)] -// Release input[940] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q0, Q0, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1004]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(480)] -// Release input[876] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q2, Q2, r6 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// 
input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[796]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-16)] -// Release input[1004] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[796]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q1, Q1, r6 -// input[540]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(160)] -// Release input[796] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(144)] -// Release input[540] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[860]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(416)] -// Release input[860] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[476]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -28)] -vmul.u32 Q1, Q1, r6 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(368)] -// 
Release input[92] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-112)] -// Release input[476] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[316]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 64)] -vmul.u32 Q0, Q0, r6 -// input[572]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(256)] -// Release input[316] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(272)] -// Release input[572] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, 
#(4 * -56)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[892]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[892]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmul.u32 Q1, Q1, r6 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-464)] -// Release input[892] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1020]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[508]: Load as Q3 -vldrw.u32 Q3, 
[r12, #(4 * 4)] -vmul.u32 Q0, Q0, r6 -// input[764]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -vqrdmulh.s32 Q2, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(48)] -// Release input[1020] from Q0 -vqrdmlah.s32 Q2, Q4, r9 -vstrw.u32 Q3, [r12,#(16)] -// Release input[508] from Q3 -vsub.s32 Q4, Q1, Q2 -vstrw.u32 Q4, [r11,#(32)] -// Release input[764] from Q4 -vadd.s32 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -.equ modulus_inv, 4223674367 -movw r7, #:lower16:modulus_inv -movt r7, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 10271 -// Instruction count: 7960 \ No newline at end of file diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst.s deleted file mode 100644 index cf21d2e..0000000 --- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst.s +++ /dev/null @@ -1,9471 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: 
-/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 
71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31 -.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 
17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31 -.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31 -.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31 -.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31 -.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31 -.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31 -.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31 -.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31 -.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31 -.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31 -.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31 -.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31 -.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31 -.word 496268375 // zeta^608 * f(q^(-1) 
mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31 -.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31 -.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31 -.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31 -.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31 -.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31 -.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31 -.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31 -.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31 -.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31 -.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31 -.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31 -.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31 -.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31 -.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31 -.word 50801141 // zeta^1008 * 2^31 = 286215^1008 * 2^31 = 19477423 * 2^31 -.word 3393658379 // zeta^1008 * f(q^(-1) mod 2^32) * 2^31 = 286215^1008 * 71292929 * 2^31 -.word 34432613 // zeta^752 * 2^31 = 286215^752 * 2^31 = 27454015 * 2^31 -.word 3904004507 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 286215^752 * 71292929 * 2^31 -.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31 -.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31 -.word 19296927 // zeta^880 * 2^31 = 286215^880 * 2^31 = 24389287 * 2^31 -.word 3707921761 // zeta^880 * f(q^(-1) mod 2^32) * 2^31 = 286215^880 * 71292929 * 2^31 -.word 15646559 // zeta^624 * 2^31 = 286215^624 * 2^31 = 31786565 * 2^31 -.word 4181203105 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 286215^624 * 71292929 * 2^31 
-.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31
-.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31
-.word 49831615 // zeta^944 * 2^31 = 286215^944 * 2^31 = 8413164 * 2^31
-.word 538278209 // zeta^944 * f(q^(-1) mod 2^32) * 2^31 = 286215^944 * 71292929 * 2^31
-.word 54669671 // zeta^688 * 2^31 = 286215^688 * 2^31 = 24835380 * 2^31
-.word 1588979353 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 286215^688 * 71292929 * 2^31
-.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31
-.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31
-.word 44135817 // zeta^816 * 2^31 = 286215^816 * 2^31 = 12975937 * 2^31
-.word 2977690231 // zeta^816 * f(q^(-1) mod 2^32) * 2^31 = 286215^816 * 71292929 * 2^31
-.word 24702057 // zeta^560 * 2^31 = 286215^560 * 2^31 = 9759120 * 2^31
-.word 624393111 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 286215^560 * 71292929 * 2^31
-.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31
-.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31
-.word 27214985 // zeta^976 * 2^31 = 286215^976 * 2^31 = 7037169 * 2^31
-.word 2597725047 // zeta^976 * f(q^(-1) mod 2^32) * 2^31 = 286215^976 * 71292929 * 2^31
-.word 40167763 // zeta^720 * 2^31 = 286215^720 * 2^31 = 8735396 * 2^31
-.word 558894765 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 286215^720 * 71292929 * 2^31
-.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31
-.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31
-.word 5600575 // zeta^848 * 2^31 = 286215^848 * 2^31 = 3732990 * 2^31
-.word 238838465 // zeta^848 * f(q^(-1) mod 2^32) * 2^31 = 286215^848 * 71292929 * 2^31
-.word 53619655 // zeta^592 * 2^31 = 286215^592 * 2^31 = 18327945 * 2^31
-.word 3320114233 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 286215^592 * 71292929 * 2^31
-.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31
-.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31
-.word 64695847 // zeta^912 * 2^31 = 286215^912 * 2^31 = 5665384 * 2^31
-.word 362473945 // zeta^912 * f(q^(-1) mod 2^32) * 2^31 = 286215^912 * 71292929 * 2^31
-.word 42330409 // zeta^656 * 2^31 = 286215^656 * 2^31 = 32426145 * 2^31
-.word 4222123735 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 286215^656 * 71292929 * 2^31
-.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31
-.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31
-.word 10468161 // zeta^784 * 2^31 = 286215^784 * 2^31 = 8491986 * 2^31
-.word 543321279 // zeta^784 * f(q^(-1) mod 2^32) * 2^31 = 286215^784 * 71292929 * 2^31
-.word 32014745 // zeta^528 * 2^31 = 286215^528 * 2^31 = 2121761 * 2^31
-.word 2283234919 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 286215^528 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31
-.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31
-.word 31671151 // zeta^768 * 2^31 = 286215^768 * 2^31 = 27506971 * 2^31
-.word 3907392657 // zeta^768 * f(q^(-1) mod 2^32) * 2^31 = 286215^768 * 71292929 * 2^31
-.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31
-.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31
-.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31
-.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31
-.word 63583873 // zeta^896 * 2^31 = 286215^896 * 2^31 = 29493997 * 2^31
-.word 4034523519 // zeta^896 * f(q^(-1) mod 2^32) * 2^31 = 286215^896 * 71292929 * 2^31
-.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31
-.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31
-.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31
-.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31
-.word 22359133 // zeta^640 * 2^31 = 286215^640 * 2^31 = 17398315 * 2^31
-.word 3260636067 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 286215^640 * 71292929 * 2^31
-.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31
-.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31
-.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31
-.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31
-.word 23685711 // zeta^960 * 2^31 = 286215^960 * 2^31 = 10344459 * 2^31
-.word 2809327025 // zeta^960 * f(q^(-1) mod 2^32) * 2^31 = 286215^960 * 71292929 * 2^31
-.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31
-.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31
-.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31
-.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31
-.word 43351377 // zeta^704 * 2^31 = 286215^704 * 2^31 = 16978151 * 2^31
-.word 3233753775 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 286215^704 * 71292929 * 2^31
-.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31
-.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31
-.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31
-.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31
-.word 50973647 // zeta^832 * 2^31 = 286215^832 * 2^31 = 22303942 * 2^31
-.word 1427016753 // zeta^832 * f(q^(-1) mod 2^32) * 2^31 = 286215^832 * 71292929 * 2^31
-.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31
-.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31
-.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31
-.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31
-.word 47020583 // zeta^576 * 2^31 = 286215^576 * 2^31 = 30033475 * 2^31
-.word 4069039577 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 286215^576 * 71292929 * 2^31
-.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31
-.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31
-.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31
-.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31
-.word 29829771 // zeta^992 * 2^31 = 286215^992 * 2^31 = 20484768 * 2^31
-.word 1310625141 // zeta^992 * f(q^(-1) mod 2^32) * 2^31 = 286215^992 * 71292929 * 2^31
-.word 50801141 // zeta^1008 * 2^31 = 286215^1008 * 2^31 = 19477423 * 2^31
-.word 3393658379 // zeta^1008 * f(q^(-1) mod 2^32) * 2^31 = 286215^1008 * 71292929 * 2^31
-.word 34432613 // zeta^752 * 2^31 = 286215^752 * 2^31 = 27454015 * 2^31
-.word 3904004507 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 286215^752 * 71292929 * 2^31
-.word 17750767 // zeta^736 * 2^31 = 286215^736 * 2^31 = 23442917 * 2^31
-.word 3647372561 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 286215^736 * 71292929 * 2^31
-.word 19296927 // zeta^880 * 2^31 = 286215^880 * 2^31 = 24389287 * 2^31
-.word 3707921761 // zeta^880 * f(q^(-1) mod 2^32) * 2^31 = 286215^880 * 71292929 * 2^31
-.word 15646559 // zeta^624 * 2^31 = 286215^624 * 2^31 = 31786565 * 2^31
-.word 4181203105 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 286215^624 * 71292929 * 2^31
-.word 20957653 // zeta^864 * 2^31 = 286215^864 * 2^31 = 11809804 * 2^31
-.word 755596843 // zeta^864 * f(q^(-1) mod 2^32) * 2^31 = 286215^864 * 71292929 * 2^31
-.word 49831615 // zeta^944 * 2^31 = 286215^944 * 2^31 = 8413164 * 2^31
-.word 538278209 // zeta^944 * f(q^(-1) mod 2^32) * 2^31 = 286215^944 * 71292929 * 2^31
-.word 54669671 // zeta^688 * 2^31 = 286215^688 * 2^31 = 24835380 * 2^31
-.word 1588979353 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 286215^688 * 71292929 * 2^31
-.word 37614505 // zeta^608 * 2^31 = 286215^608 * 2^31 = 7756560 * 2^31
-.word 496268375 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 286215^608 * 71292929 * 2^31
-.word 44135817 // zeta^816 * 2^31 = 286215^816 * 2^31 = 12975937 * 2^31
-.word 2977690231 // zeta^816 * f(q^(-1) mod 2^32) * 2^31 = 286215^816 * 71292929 * 2^31
-.word 24702057 // zeta^560 * 2^31 = 286215^560 * 2^31 = 9759120 * 2^31
-.word 624393111 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 286215^560 * 71292929 * 2^31
-.word 49624149 // zeta^928 * 2^31 = 286215^928 * 2^31 = 26214285 * 2^31
-.word 3824685995 // zeta^928 * f(q^(-1) mod 2^32) * 2^31 = 286215^928 * 71292929 * 2^31
-.word 27214985 // zeta^976 * 2^31 = 286215^976 * 2^31 = 7037169 * 2^31
-.word 2597725047 // zeta^976 * f(q^(-1) mod 2^32) * 2^31 = 286215^976 * 71292929 * 2^31
-.word 40167763 // zeta^720 * 2^31 = 286215^720 * 2^31 = 8735396 * 2^31
-.word 558894765 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 286215^720 * 71292929 * 2^31
-.word 26256991 // zeta^672 * 2^31 = 286215^672 * 2^31 = 1227325 * 2^31
-.word 2226008481 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 286215^672 * 71292929 * 2^31
-.word 5600575 // zeta^848 * 2^31 = 286215^848 * 2^31 = 3732990 * 2^31
-.word 238838465 // zeta^848 * f(q^(-1) mod 2^32) * 2^31 = 286215^848 * 71292929 * 2^31
-.word 53619655 // zeta^592 * 2^31 = 286215^592 * 2^31 = 18327945 * 2^31
-.word 3320114233 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 286215^592 * 71292929 * 2^31
-.word 16700807 // zeta^800 * 2^31 = 286215^800 * 2^31 = 21664476 * 2^31
-.word 1386103417 // zeta^800 * f(q^(-1) mod 2^32) * 2^31 = 286215^800 * 71292929 * 2^31
-.word 64695847 // zeta^912 * 2^31 = 286215^912 * 2^31 = 5665384 * 2^31
-.word 362473945 // zeta^912 * f(q^(-1) mod 2^32) * 2^31 = 286215^912 * 71292929 * 2^31
-.word 42330409 // zeta^656 * 2^31 = 286215^656 * 2^31 = 32426145 * 2^31
-.word 4222123735 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 286215^656 * 71292929 * 2^31
-.word 10816687 // zeta^544 * 2^31 = 286215^544 * 2^31 = 25589677 * 2^31
-.word 3784723281 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 286215^544 * 71292929 * 2^31
-.word 10468161 // zeta^784 * 2^31 = 286215^784 * 2^31 = 8491986 * 2^31
-.word 543321279 // zeta^784 * f(q^(-1) mod 2^32) * 2^31 = 286215^784 * 71292929 * 2^31
-.word 32014745 // zeta^528 * 2^31 = 286215^528 * 2^31 = 2121761 * 2^31
-.word 2283234919 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 286215^528 * 71292929 * 2^31
-.word 50801141 // zeta^1008 * 2^31 = 286215^1008 * 2^31 = 19477423 * 2^31
-.word 3393658379 // zeta^1008 * f(q^(-1) mod 2^32) * 2^31 = 286215^1008 * 71292929 * 2^31
-.word 42676385 // zeta^1016 * 2^31 = 286215^1016 * 2^31 = 4406558 * 2^31
-.word 281933663 // zeta^1016 * f(q^(-1) mod 2^32) * 2^31 = 286215^1016 * 71292929 * 2^31
-.word 28237813 // zeta^760 * 2^31 = 286215^760 * 2^31 = 15781 * 2^31
-.word 2148493323 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 286215^760 * 71292929 * 2^31
-.word 34432613 // zeta^752 * 2^31 = 286215^752 * 2^31 = 27454015 * 2^31
-.word 3904004507 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 286215^752 * 71292929 * 2^31
-.word 49494049 // zeta^888 * 2^31 = 286215^888 * 2^31 = 29780798 * 2^31
-.word 1905389535 // zeta^888 * f(q^(-1) mod 2^32) * 2^31 = 286215^888 * 71292929 * 2^31
-.word 34765151 // zeta^632 * 2^31 = 286215^632 * 2^31 = 3446166 * 2^31
-.word 220487329 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 286215^632 * 71292929 * 2^31
-.word 19296927 // zeta^880 * 2^31 = 286215^880 * 2^31 = 24389287 * 2^31
-.word 3707921761 // zeta^880 * f(q^(-1) mod 2^32) * 2^31 = 286215^880 * 71292929 * 2^31
-.word 54213995 // zeta^952 * 2^31 = 286215^952 * 2^31 = 14583628 * 2^31
-.word 933067413 // zeta^952 * f(q^(-1) mod 2^32) * 2^31 = 286215^952 * 71292929 * 2^31
-.word 8006763 // zeta^696 * 2^31 = 286215^696 * 2^31 = 20902680 * 2^31
-.word 1337363349 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 286215^696 * 71292929 * 2^31
-.word 15646559 // zeta^624 * 2^31 = 286215^624 * 2^31 = 31786565 * 2^31
-.word 4181203105 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 286215^624 * 71292929 * 2^31
-.word 15970361 // zeta^824 * 2^31 = 286215^824 * 2^31 = 8478458 * 2^31
-.word 542455751 // zeta^824 * f(q^(-1) mod 2^32) * 2^31 = 286215^824 * 71292929 * 2^31
-.word 4170189 // zeta^568 * 2^31 = 286215^568 * 2^31 = 19347624 * 2^31
-.word 1237870131 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 286215^568 * 71292929 * 2^31
-.word 49831615 // zeta^944 * 2^31 = 286215^944 * 2^31 = 8413164 * 2^31
-.word 538278209 // zeta^944 * f(q^(-1) mod 2^32) * 2^31 = 286215^944 * 71292929 * 2^31
-.word 5906831 // zeta^984 * 2^31 = 286215^984 * 2^31 = 30717302 * 2^31
-.word 1965307505 // zeta^984 * f(q^(-1) mod 2^32) * 2^31 = 286215^984 * 71292929 * 2^31
-.word 32641999 // zeta^728 * 2^31 = 286215^728 * 2^31 = 8758145 * 2^31
-.word 2707833905 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 286215^728 * 71292929 * 2^31
-.word 54669671 // zeta^688 * 2^31 = 286215^688 * 2^31 = 24835380 * 2^31
-.word 1588979353 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 286215^688 * 71292929 * 2^31
-.word 30303273 // zeta^856 * 2^31 = 286215^856 * 2^31 = 4089071 * 2^31
-.word 2409104343 // zeta^856 * f(q^(-1) mod 2^32) * 2^31 = 286215^856 * 71292929 * 2^31
-.word 53563597 // zeta^600 * 2^31 = 286215^600 * 2^31 = 19452428 * 2^31
-.word 1244575539 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 286215^600 * 71292929 * 2^31
-.word 44135817 // zeta^816 * 2^31 = 286215^816 * 2^31 = 12975937 * 2^31
-.word 2977690231 // zeta^816 * f(q^(-1) mod 2^32) * 2^31 = 286215^816 * 71292929 * 2^31
-.word 27877887 // zeta^920 * 2^31 = 286215^920 * 2^31 = 31965169 * 2^31
-.word 4192630273 // zeta^920 * f(q^(-1) mod 2^32) * 2^31 = 286215^920 * 71292929 * 2^31
-.word 54874279 // zeta^664 * 2^31 = 286215^664 * 2^31 = 3036860 * 2^31
-.word 194299737 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 286215^664 * 71292929 * 2^31
-.word 24702057 // zeta^560 * 2^31 = 286215^560 * 2^31 = 9759120 * 2^31
-.word 624393111 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 286215^560 * 71292929 * 2^31
-.word 11789331 // zeta^792 * 2^31 = 286215^792 * 2^31 = 19452799 * 2^31
-.word 3392082925 // zeta^792 * f(q^(-1) mod 2^32) * 2^31 = 286215^792 * 71292929 * 2^31
-.word 5622871 // zeta^536 * 2^31 = 286215^536 * 2^31 = 30901251 * 2^31
-.word 4124560297 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 286215^536 * 71292929 * 2^31
-.word 27214985 // zeta^976 * 2^31 = 286215^976 * 2^31 = 7037169 * 2^31
-.word 2597725047 // zeta^976 * f(q^(-1) mod 2^32) * 2^31 = 286215^976 * 71292929 * 2^31
-.word 30990117 // zeta^1000 * 2^31 = 286215^1000 * 2^31 = 988369 * 2^31
-.word 2210719963 // zeta^1000 * f(q^(-1) mod 2^32) * 2^31 = 286215^1000 * 71292929 * 2^31
-.word 34156189 // zeta^744 * 2^31 = 286215^744 * 2^31 = 21501702 * 2^31
-.word 1375689059 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 286215^744 * 71292929 * 2^31
-.word 40167763 // zeta^720 * 2^31 = 286215^720 * 2^31 = 8735396 * 2^31
-.word 558894765 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 286215^720 * 71292929 * 2^31
-.word 50516355 // zeta^872 * 2^31 = 286215^872 * 2^31 = 255720 * 2^31
-.word 16361085 // zeta^872 * f(q^(-1) mod 2^32) * 2^31 = 286215^872 * 71292929 * 2^31
-.word 26091531 // zeta^616 * 2^31 = 286215^616 * 2^31 = 1232856 * 2^31
-.word 78878709 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 286215^616 * 71292929 * 2^31
-.word 5600575 // zeta^848 * 2^31 = 286215^848 * 2^31 = 3732990 * 2^31
-.word 238838465 // zeta^848 * f(q^(-1) mod 2^32) * 2^31 = 286215^848 * 71292929 * 2^31
-.word 4276741 // zeta^936 * 2^31 = 286215^936 * 2^31 = 7554841 * 2^31
-.word 2630845947 // zeta^936 * f(q^(-1) mod 2^32) * 2^31 = 286215^936 * 71292929 * 2^31
-.word 39557271 // zeta^680 * 2^31 = 286215^680 * 2^31 = 19859369 * 2^31
-.word 3418095465 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 286215^680 * 71292929 * 2^31
-.word 53619655 // zeta^592 * 2^31 = 286215^592 * 2^31 = 18327945 * 2^31
-.word 3320114233 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 286215^592 * 71292929 * 2^31
-.word 64080051 // zeta^808 * 2^31 = 286215^808 * 2^31 = 19611677 * 2^31
-.word 3402248013 // zeta^808 * f(q^(-1) mod 2^32) * 2^31 = 286215^808 * 71292929 * 2^31
-.word 40363327 // zeta^552 * 2^31 = 286215^552 * 2^31 = 28756497 * 2^31
-.word 3987337921 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 286215^552 * 71292929 * 2^31
-.word 64695847 // zeta^912 * 2^31 = 286215^912 * 2^31 = 5665384 * 2^31
-.word 362473945 // zeta^912 * f(q^(-1) mod 2^32) * 2^31 = 286215^912 * 71292929 * 2^31
-.word 14946375 // zeta^968 * 2^31 = 286215^968 * 2^31 = 30392408 * 2^31
-.word 1944520633 // zeta^968 * f(q^(-1) mod 2^32) * 2^31 = 286215^968 * 71292929 * 2^31
-.word 53251923 // zeta^712 * 2^31 = 286215^712 * 2^31 = 21625705 * 2^31
-.word 3531106477 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 286215^712 * 71292929 * 2^31
-.word 42330409 // zeta^656 * 2^31 = 286215^656 * 2^31 = 32426145 * 2^31
-.word 4222123735 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 286215^656 * 71292929 * 2^31
-.word 60760121 // zeta^840 * 2^31 = 286215^840 * 2^31 = 27051869 * 2^31
-.word 3878275015 // zeta^840 * f(q^(-1) mod 2^32) * 2^31 = 286215^840 * 71292929 * 2^31
-.word 40454739 // zeta^584 * 2^31 = 286215^584 * 2^31 = 4314075 * 2^31
-.word 2423500205 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 286215^584 * 71292929 * 2^31
-.word 10468161 // zeta^784 * 2^31 = 286215^784 * 2^31 = 8491986 * 2^31
-.word 543321279 // zeta^784 * f(q^(-1) mod 2^32) * 2^31 = 286215^784 * 71292929 * 2^31
-.word 55066963 // zeta^904 * 2^31 = 286215^904 * 2^31 = 10010313 * 2^31
-.word 2787948205 // zeta^904 * f(q^(-1) mod 2^32) * 2^31 = 286215^904 * 71292929 * 2^31
-.word 62067539 // zeta^648 * 2^31 = 286215^648 * 2^31 = 22700705 * 2^31
-.word 3599885485 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 286215^648 * 71292929 * 2^31
-.word 32014745 // zeta^528 * 2^31 = 286215^528 * 2^31 = 2121761 * 2^31
-.word 2283234919 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 286215^528 * 71292929 * 2^31
-.word 57896497 // zeta^776 * 2^31 = 286215^776 * 2^31 = 14033313 * 2^31
-.word 3045341647 // zeta^776 * f(q^(-1) mod 2^32) * 2^31 = 286215^776 * 71292929 * 2^31
-.word 59857581 // zeta^520 * 2^31 = 286215^520 * 2^31 = 21856450 * 2^31
-.word 1398386003 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 286215^520 * 71292929 * 2^31
-.word 42676385 // zeta^1016 * 2^31 = 286215^1016 * 2^31 = 4406558 * 2^31
-.word 281933663 // zeta^1016 * f(q^(-1) mod 2^32) * 2^31 = 286215^1016 * 71292929 * 2^31
-.word 46825465 // zeta^1020 * 2^31 = 286215^1020 * 2^31 = 16508237 * 2^31
-.word 3203688455 // zeta^1020 * f(q^(-1) mod 2^32) * 2^31 = 286215^1020 * 71292929 * 2^31
-.word 36459513 // zeta^764 * 2^31 = 286215^764 * 2^31 = 2280712 * 2^31
-.word 145921031 // zeta^764 * f(q^(-1) mod 2^32) * 2^31 = 286215^764 * 71292929 * 2^31
-.word 28237813 // zeta^760 * 2^31 = 286215^760 * 2^31 = 15781 * 2^31
-.word 2148493323 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 286215^760 * 71292929 * 2^31
-.word 31181531 // zeta^892 * 2^31 = 286215^892 * 2^31 = 13396396 * 2^31
-.word 857107749 // zeta^892 * f(q^(-1) mod 2^32) * 2^31 = 286215^892 * 71292929 * 2^31
-.word 30379019 // zeta^636 * 2^31 = 286215^636 * 2^31 = 16079834 * 2^31
-.word 1028795381 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 286215^636 * 71292929 * 2^31
-.word 49494049 // zeta^888 * 2^31 = 286215^888 * 2^31 = 29780798 * 2^31
-.word 1905389535 // zeta^888 * f(q^(-1) mod 2^32) * 2^31 = 286215^888 * 71292929 * 2^31
-.word 22115117 // zeta^956 * 2^31 = 286215^956 * 2^31 = 15059014 * 2^31
-.word 963482835 // zeta^956 * f(q^(-1) mod 2^32) * 2^31 = 286215^956 * 71292929 * 2^31
-.word 58687131 // zeta^700 * 2^31 = 286215^700 * 2^31 = 22429089 * 2^31
-.word 3582507365 // zeta^700 * f(q^(-1) mod 2^32) * 2^31 = 286215^700 * 71292929 * 2^31
-.word 34765151 // zeta^632 * 2^31 = 286215^632 * 2^31 = 3446166 * 2^31
-.word 220487329 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 286215^632 * 71292929 * 2^31
-.word 31362151 // zeta^828 * 2^31 = 286215^828 * 2^31 = 3754664 * 2^31
-.word 240225177 // zeta^828 * f(q^(-1) mod 2^32) * 2^31 = 286215^828 * 71292929 * 2^31
-.word 29454233 // zeta^572 * 2^31 = 286215^572 * 2^31 = 28695573 * 2^31
-.word 3983439975 // zeta^572 * f(q^(-1) mod 2^32) * 2^31 = 286215^572 * 71292929 * 2^31
-.word 54213995 // zeta^952 * 2^31 = 286215^952 * 2^31 = 14583628 * 2^31
-.word 933067413 // zeta^952 * f(q^(-1) mod 2^32) * 2^31 = 286215^952 * 71292929 * 2^31
-.word 12244249 // zeta^988 * 2^31 = 286215^988 * 2^31 = 1656389 * 2^31
-.word 2253460199 // zeta^988 * f(q^(-1) mod 2^32) * 2^31 = 286215^988 * 71292929 * 2^31
-.word 41856427 // zeta^732 * 2^31 = 286215^732 * 2^31 = 13082561 * 2^31
-.word 2984512085 // zeta^732 * f(q^(-1) mod 2^32) * 2^31 = 286215^732 * 71292929 * 2^31
-.word 8006763 // zeta^696 * 2^31 = 286215^696 * 2^31 = 20902680 * 2^31
-.word 1337363349 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 286215^696 * 71292929 * 2^31
-.word 61228467 // zeta^860 * 2^31 = 286215^860 * 2^31 = 16386641 * 2^31
-.word 3195908685 // zeta^860 * f(q^(-1) mod 2^32) * 2^31 = 286215^860 * 71292929 * 2^31
-.word 27503845 // zeta^604 * 2^31 = 286215^604 * 2^31 = 17864119 * 2^31
-.word 3290438427 // zeta^604 * f(q^(-1) mod 2^32) * 2^31 = 286215^604 * 71292929 * 2^31
-.word 15970361 // zeta^824 * 2^31 = 286215^824 * 2^31 = 8478458 * 2^31
-.word 542455751 // zeta^824 * f(q^(-1) mod 2^32) * 2^31 = 286215^824 * 71292929 * 2^31
-.word 11828069 // zeta^924 * 2^31 = 286215^924 * 2^31 = 18178781 * 2^31
-.word 3310570651 // zeta^924 * f(q^(-1) mod 2^32) * 2^31 = 286215^924 * 71292929 * 2^31
-.word 26556411 // zeta^668 * 2^31 = 286215^668 * 2^31 = 26330978 * 2^31
-.word 1684668421 // zeta^668 * f(q^(-1) mod 2^32) * 2^31 = 286215^668 * 71292929 * 2^31
-.word 4170189 // zeta^568 * 2^31 = 286215^568 * 2^31 = 19347624 * 2^31
-.word 1237870131 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 286215^568 * 71292929 * 2^31
-.word 51861145 // zeta^796 * 2^31 = 286215^796 * 2^31 = 6343125 * 2^31
-.word 2553319783 // zeta^796 * f(q^(-1) mod 2^32) * 2^31 = 286215^796 * 71292929 * 2^31
-.word 36544089 // zeta^540 * 2^31 = 286215^540 * 2^31 = 9522304 * 2^31
-.word 609241511 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 286215^540 * 71292929 * 2^31
-.word 5906831 // zeta^984 * 2^31 = 286215^984 * 2^31 = 30717302 * 2^31
-.word 1965307505 // zeta^984 * f(q^(-1) mod 2^32) * 2^31 = 286215^984 * 71292929 * 2^31
-.word 22546241 // zeta^1004 * 2^31 = 286215^1004 * 2^31 = 28199455 * 2^31
-.word 3951698111 // zeta^1004 * f(q^(-1) mod 2^32) * 2^31 = 286215^1004 * 71292929 * 2^31
-.word 38046867 // zeta^748 * 2^31 = 286215^748 * 2^31 = 17555098 * 2^31
-.word 1123183469 // zeta^748 * f(q^(-1) mod 2^32) * 2^31 = 286215^748 * 71292929 * 2^31
-.word 32641999 // zeta^728 * 2^31 = 286215^728 * 2^31 = 8758145 * 2^31
-.word 2707833905 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 286215^728 * 71292929 * 2^31
-.word 27734805 // zeta^876 * 2^31 = 286215^876 * 2^31 = 1331690 * 2^31
-.word 85202155 // zeta^876 * f(q^(-1) mod 2^32) * 2^31 = 286215^876 * 71292929 * 2^31
-.word 28876291 // zeta^620 * 2^31 = 286215^620 * 2^31 = 19461786 * 2^31
-.word 1245174269 // zeta^620 * f(q^(-1) mod 2^32) * 2^31 = 286215^620 * 71292929 * 2^31
-.word 30303273 // zeta^856 * 2^31 = 286215^856 * 2^31 = 4089071 * 2^31
-.word 2409104343 // zeta^856 * f(q^(-1) mod 2^32) * 2^31 = 286215^856 * 71292929 * 2^31
-.word 37621937 // zeta^940 * 2^31 = 286215^940 * 2^31 = 16812647 * 2^31
-.word 3223164751 // zeta^940 * f(q^(-1) mod 2^32) * 2^31 = 286215^940 * 71292929 * 2^31
-.word 1992249 // zeta^684 * 2^31 = 286215^684 * 2^31 = 20349512 * 2^31
-.word 1301971399 // zeta^684 * f(q^(-1) mod 2^32) * 2^31 = 286215^684 * 71292929 * 2^31
-.word 53563597 // zeta^600 * 2^31 = 286215^600 * 2^31 = 19452428 * 2^31
-.word 1244575539 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 286215^600 * 71292929 * 2^31
-.word 59370589 // zeta^812 * 2^31 = 286215^812 * 2^31 = 33077723 * 2^31
-.word 4263812003 // zeta^812 * f(q^(-1) mod 2^32) * 2^31 = 286215^812 * 71292929 * 2^31
-.word 62535263 // zeta^556 * 2^31 = 286215^556 * 2^31 = 266968 * 2^31
-.word 17080737 // zeta^556 * f(q^(-1) mod 2^32) * 2^31 = 286215^556 * 71292929 * 2^31
-.word 27877887 // zeta^920 * 2^31 = 286215^920 * 2^31 = 31965169 * 2^31
-.word 4192630273 // zeta^920 * f(q^(-1) mod 2^32) * 2^31 = 286215^920 * 71292929 * 2^31
-.word 17316887 // zeta^972 * 2^31 = 286215^972 * 2^31 = 26905985 * 2^31
-.word 3868941289 // zeta^972 * f(q^(-1) mod 2^32) * 2^31 = 286215^972 * 71292929 * 2^31
-.word 37759397 // zeta^716 * 2^31 = 286215^716 * 2^31 = 1837226 * 2^31
-.word 117546587 // zeta^716 * f(q^(-1) mod 2^32) * 2^31 = 286215^716 * 71292929 * 2^31
-.word 54874279 // zeta^664 * 2^31 = 286215^664 * 2^31 = 3036860 * 2^31
-.word 194299737 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 286215^664 * 71292929 * 2^31
-.word 49424125 // zeta^844 * 2^31 = 286215^844 * 2^31 = 8363900 * 2^31
-.word 535126275 // zeta^844 * f(q^(-1) mod 2^32) * 2^31 = 286215^844 * 71292929 * 2^31
-.word 27346539 // zeta^588 * 2^31 = 286215^588 * 2^31 = 27959065 * 2^31
-.word 3936317845 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 286215^588 * 71292929 * 2^31
-.word 11789331 // zeta^792 * 2^31 = 286215^792 * 2^31 = 19452799 * 2^31
-.word 3392082925 // zeta^792 * f(q^(-1) mod 2^32) * 2^31 = 286215^792 * 71292929 * 2^31
-.word 40459631 // zeta^908 * 2^31 = 286215^908 * 2^31 = 24239310 * 2^31
-.word 1550842513 // zeta^908 * f(q^(-1) mod 2^32) * 2^31 = 286215^908 * 71292929 * 2^31
-.word 43261973 // zeta^652 * 2^31 = 286215^652 * 2^31 = 19190655 * 2^31
-.word 3375310827 // zeta^652 * f(q^(-1) mod 2^32) * 2^31 = 286215^652 * 71292929 * 2^31
-.word 5622871 // zeta^536 * 2^31 = 286215^536 * 2^31 = 30901251 * 2^31
-.word 4124560297 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 286215^536 * 71292929 * 2^31
-.word 8722395 // zeta^780 * 2^31 = 286215^780 * 2^31 = 27391270 * 2^31
-.word 1752506405 // zeta^780 * f(q^(-1) mod 2^32) * 2^31 = 286215^780 * 71292929 * 2^31
-.word 6423675 // zeta^524 * 2^31 = 286215^524 * 2^31 = 18242188 * 2^31
-.word 1167143813 // zeta^524 * f(q^(-1) mod 2^32) * 2^31 = 286215^524 * 71292929 * 2^31
-.word 30990117 // zeta^1000 * 2^31 = 286215^1000 * 2^31 = 988369 * 2^31
-.word 2210719963 // zeta^1000 * f(q^(-1) mod 2^32) * 2^31 = 286215^1000 * 71292929 * 2^31
-.word 65179259 // zeta^1012 * 2^31 = 286215^1012 * 2^31 = 22978057 * 2^31
-.word 3617630597 // zeta^1012 * f(q^(-1) mod 2^32) * 2^31 = 286215^1012 * 71292929 * 2^31
-.word 59951743 // zeta^756 * 2^31 = 286215^756 * 2^31 = 21060944 * 2^31
-.word 1347489153 // zeta^756 * f(q^(-1) mod 2^32) * 2^31 = 286215^756 * 71292929 * 2^31
-.word 34156189 // zeta^744 * 2^31 = 286215^744 * 2^31 = 21501702 * 2^31
-.word 1375689059 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 286215^744 * 71292929 * 2^31
-.word 26490293 // zeta^884 * 2^31 = 286215^884 * 2^31 = 26632199 * 2^31
-.word 3851424331 // zeta^884 * f(q^(-1) mod 2^32) * 2^31 = 286215^884 * 71292929 * 2^31
-.word 17634531 // zeta^628 * 2^31 = 286215^628 * 2^31 = 18214722 * 2^31
-.word 1165386525 // zeta^628 * f(q^(-1) mod 2^32) * 2^31 = 286215^628 * 71292929 * 2^31
-.word 50516355 // zeta^872 * 2^31 = 286215^872 * 2^31 = 255720 * 2^31
-.word 16361085 // zeta^872 * f(q^(-1) mod 2^32) * 2^31 = 286215^872 * 71292929 * 2^31
-.word 41972451 // zeta^948 * 2^31 = 286215^948 * 2^31 = 19587949 * 2^31
-.word 3400729885 // zeta^948 * f(q^(-1) mod 2^32) * 2^31 = 286215^948 * 71292929 * 2^31
-.word 60320869 // zeta^692 * 2^31 = 286215^692 * 2^31 = 8415094 * 2^31
-.word 538401691 // zeta^692 * f(q^(-1) mod 2^32) * 2^31 = 286215^692 * 71292929 * 2^31
-.word 26091531 // zeta^616 * 2^31 = 286215^616 * 2^31 = 1232856 * 2^31
-.word 78878709 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 286215^616 * 71292929 * 2^31
-.word 53470077 // zeta^820 * 2^31 = 286215^820 * 2^31 = 9730603 * 2^31
-.word 2770052227 // zeta^820 * f(q^(-1) mod 2^32) * 2^31 = 286215^820 * 71292929 * 2^31
-.word 48566219 // zeta^564 * 2^31 = 286215^564 * 2^31 = 10704739 * 2^31
-.word 2832377909 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 286215^564 * 71292929 * 2^31
-.word 4276741 // zeta^936 * 2^31 = 286215^936 * 2^31 = 7554841 * 2^31
-.word 2630845947 // zeta^936 * f(q^(-1) mod 2^32) * 2^31 = 286215^936 * 71292929 * 2^31
-.word 16490153 // zeta^980 * 2^31 = 286215^980 * 2^31 = 408482 * 2^31
-.word 26134871 // zeta^980 * f(q^(-1) mod 2^32) * 2^31 = 286215^980 * 71292929 * 2^31
-.word 28235681 // zeta^724 * 2^31 = 286215^724 * 2^31 = 26159215 * 2^31
-.word 3821162591 // zeta^724 * f(q^(-1) mod 2^32) * 2^31 = 286215^724 * 71292929 * 2^31
-.word 39557271 // zeta^680 * 2^31 = 286215^680 * 2^31 = 19859369 * 2^31
-.word 3418095465 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 286215^680 * 71292929 * 2^31
-.word 20173291 // zeta^852 * 2^31 = 286215^852 * 2^31 = 29591261 * 2^31
-.word 4040746517 // zeta^852 * f(q^(-1) mod 2^32) * 2^31 = 286215^852 * 71292929 * 2^31
-.word 53760749 // zeta^596 * 2^31 = 286215^596 * 2^31 = 15340829 * 2^31
-.word 3128997139 // zeta^596 * f(q^(-1) mod 2^32) * 2^31 = 286215^596 * 71292929 * 2^31
-.word 64080051 // zeta^808 * 2^31 = 286215^808 * 2^31 = 19611677 * 2^31
-.word 3402248013 // zeta^808 * f(q^(-1) mod 2^32) * 2^31 = 286215^808 * 71292929 * 2^31
-.word 1785087 // zeta^916 * 2^31 = 286215^916 * 2^31 = 1487922 * 2^31
-.word 95197953 // zeta^916 * f(q^(-1) mod 2^32) * 2^31 = 286215^916 * 71292929 * 2^31
-.word 39175009 // zeta^660 * 2^31 = 286215^660 * 2^31 = 2082830 * 2^31
-.word 133260447 // zeta^660 * f(q^(-1) mod 2^32) * 2^31 = 286215^660 * 71292929 * 2^31
-.word 40363327 // zeta^552 * 2^31 = 286215^552 * 2^31 = 28756497 * 2^31
-.word 3987337921 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 286215^552 * 71292929 * 2^31
-.word 5942977 // zeta^788 * 2^31 = 286215^788 * 2^31 = 31126270 * 2^31
-.word 1991473471 // zeta^788 * f(q^(-1) mod 2^32) * 2^31 = 286215^788 * 71292929 * 2^31
-.word 46872159 // zeta^532 * 2^31 = 286215^532 * 2^31 = 11000739 * 2^31
-.word 2851316129 // zeta^532 * f(q^(-1) mod 2^32) * 2^31 = 286215^532 * 71292929 * 2^31
-.word 14946375 // zeta^968 * 2^31 = 286215^968 * 2^31 = 30392408 * 2^31
-.word 1944520633 // zeta^968 * f(q^(-1) mod 2^32) * 2^31 = 286215^968 * 71292929 * 2^31
-.word 54391843 // zeta^996 * 2^31 = 286215^996 * 2^31 = 5809404 * 2^31
-.word 371688413 // zeta^996 * f(q^(-1) mod 2^32) * 2^31 = 286215^996 * 71292929 * 2^31
-.word 16695421 // zeta^740 * 2^31 = 286215^740 * 2^31 = 15164721 * 2^31
-.word 3117729667 // zeta^740 * f(q^(-1) mod 2^32) * 2^31 = 286215^740 * 71292929 * 2^31
-.word 53251923 // zeta^712 * 2^31 = 286215^712 * 2^31 = 21625705 * 2^31
-.word 3531106477 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 286215^712 * 71292929 * 2^31
-.word 44381723 // zeta^868 * 2^31 = 286215^868 * 2^31 = 23877757 * 2^31
-.word 3675193829 // zeta^868 * f(q^(-1) mod 2^32) * 2^31 = 286215^868 * 71292929 * 2^31
-.word 66751131 // zeta^612 * 2^31 = 286215^612 * 2^31 = 3914592 * 2^31
-.word 250457445 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 286215^612 * 71292929 * 2^31
-.word 60760121 // zeta^840 * 2^31 = 286215^840 * 2^31 = 27051869 * 2^31
-.word 3878275015 // zeta^840 * f(q^(-1) mod 2^32) * 2^31 = 286215^840 * 71292929 * 2^31
-.word 43981805 // zeta^932 * 2^31 = 286215^932 * 2^31 = 11142392 * 2^31
-.word 712895507 // zeta^932 * f(q^(-1) mod 2^32) * 2^31 = 286215^932 * 71292929 * 2^31
-.word 19851773 // zeta^676 * 2^31 = 286215^676 * 2^31 = 25206915 * 2^31
-.word 3760233987 // zeta^676 * f(q^(-1) mod 2^32) * 2^31 = 286215^676 * 71292929 * 2^31
-.word 40454739 // zeta^584 * 2^31 = 286215^584 * 2^31 = 4314075 * 2^31
-.word 2423500205 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 286215^584 * 71292929 * 2^31
-.word 66659489 // zeta^804 * 2^31 = 286215^804 * 2^31 = 16126790 * 2^31
-.word 1031799647 // zeta^804 * f(q^(-1) mod 2^32) * 2^31 = 286215^804 * 71292929 * 2^31
-.word 2982887 // zeta^548 * 2^31 = 286215^548 * 2^31 = 1752167 * 2^31
-.word 2259588121 // zeta^548 * f(q^(-1) mod 2^32) * 2^31 = 286215^548 * 71292929 * 2^31
-.word 55066963 // zeta^904 * 2^31 = 286215^904 * 2^31 = 10010313 * 2^31
-.word 2787948205 // zeta^904 * f(q^(-1) mod 2^32) * 2^31 = 286215^904 * 71292929 * 2^31
-.word 26160385 // zeta^964 * 2^31 = 286215^964 * 2^31 = 6613966 * 2^31
-.word 423164671 // zeta^964 * f(q^(-1) mod 2^32) * 2^31 = 286215^964 * 71292929 * 2^31
-.word 61355527 // zeta^708 * 2^31 = 286215^708 * 2^31 = 10235835 * 2^31
-.word 2802377209 // zeta^708 * f(q^(-1) mod 2^32) * 2^31 = 286215^708 * 71292929 * 2^31
-.word 62067539 // zeta^648 * 2^31 = 286215^648 * 2^31 = 22700705 * 2^31
-.word 3599885485 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 286215^648 * 71292929 * 2^31
-.word 18833475 // zeta^836 * 2^31 = 286215^836 * 2^31 = 19186493 * 2^31
-.word 3375044541 // zeta^836 * f(q^(-1) mod 2^32) * 2^31 = 286215^836 * 71292929 * 2^31
-.word 53374797 // zeta^580 * 2^31 = 286215^580 * 2^31 = 14411664 * 2^31
-.word 922065075 // zeta^580 * f(q^(-1) mod 2^32) * 2^31 = 286215^580 * 71292929 * 2^31
-.word 57896497 // zeta^776 * 2^31 = 286215^776 * 2^31 = 14033313 * 2^31
-.word 3045341647 // zeta^776 * f(q^(-1) mod 2^32) * 2^31 = 286215^776 * 71292929 * 2^31
-.word 55382253 // zeta^900 * 2^31 = 286215^900 * 2^31 = 6317924 * 2^31
-.word 404223763 // zeta^900 * f(q^(-1) mod 2^32) * 2^31 = 286215^900 * 71292929 * 2^31
-.word 26227005 // zeta^644 * 2^31 = 286215^644 * 2^31 = 17537602 * 2^31
-.word 1122064067 // zeta^644 * f(q^(-1) mod 2^32) * 2^31 = 286215^644 * 71292929 * 2^31
-.word 59857581 // zeta^520 * 2^31 = 286215^520 * 2^31 = 21856450 * 2^31
-.word 1398386003 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 286215^520 * 71292929 * 2^31
-.word 60426631 // zeta^772 * 2^31 = 286215^772 * 2^31 = 4231493 * 2^31
-.word 2418216569 // zeta^772 * f(q^(-1) mod 2^32) * 2^31 = 286215^772 * 71292929 * 2^31
-.word 32956195 // zeta^516 * 2^31 = 286215^516 * 2^31 = 15772976 * 2^31
-.word 1009162461 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 286215^516 * 71292929 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst, %function
-.global ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst
-ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-// Use r10 as marker for r0 + 4032
-add r10, r11, #1008
-.equ modulus, 33564673
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[12]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 12)]
-vqrdmulh.s32 Q1, Q2, r3
-// input[0]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 0)]
-vmul.u32 Q2, Q2, r2
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vsub.s32 Q0, Q3, Q4
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vqrdmlah.s32 Q1, Q2, r9
-// input[524]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 20)]
-vsub.s32 Q2, Q5, Q1
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r0,#(16)]
-// Release input[4] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r0,#(32)]
-// Release input[8] from Q4
-// input[524]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vmul.u32 Q6, Q6, r2
-// input[520]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 16)]
-vsub.s32 Q0, Q2, Q4
-// input[516]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 12)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[268]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 16)]
-vstrw.u32 Q3, [r0,#(0)]
-// Release input[0] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r12,#(80)]
-// Release input[524] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(48)]
-// Release input[516] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(64)]
-// Release input[520] from Q4
-// input[268]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vmul.u32 Q7, Q7, r2
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vsub.s32 Q0, Q3, Q4
-// input[260]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 8)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[780]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 24)]
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r14,#(64)]
-// Release input[268] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r14,#(32)]
-// Release input[260] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q4
-// input[780]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[768]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 12)]
-vmul.u32 Q6, Q6, r2
-// input[776]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 20)]
-vsub.s32 Q0, Q2, Q4
-// input[772]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 16)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(96)]
-// Release input[780] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(64)]
-// Release input[772] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r11,#(80)]
-// Release input[776] from Q4
-// input[140]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vsub.s32 Q0, Q3, Q4
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -120)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[652]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -104)]
-vstrw.u32 Q2, [r11,#(48)]
-// Release input[768] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r14,#(-448)]
-// Release input[140] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(-464)]
-// Release input[136] from Q4
-// input[652]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vmul.u32 Q6, Q6, r2
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vsub.s32 Q0, Q2, Q4
-// input[644]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -112)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[396]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -108)]
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(-416)]
-// Release input[652] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-448)]
-// Release input[644] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r11,#(-432)]
-// Release input[648] from Q4
-// input[396]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[384]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -120)]
-vmul.u32 Q7, Q7, r2
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vsub.s32 Q0, Q3, Q4
-// input[388]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[908]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -100)]
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r12,#(-432)]
-// Release input[396] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(-464)]
-// Release input[388] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q4
-// input[908]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[896]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -112)]
-vmul.u32 Q6, Q6, r2
-// input[904]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -104)]
-vsub.s32 Q0, Q2, Q4
-// input[900]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -108)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-400)]
-// Release input[908] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r10,#(-432)]
-// Release input[900] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r10,#(-416)]
-// Release input[904] from Q4
-// input[76]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmul.u32 Q7, Q7, r2 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vsub.s32 Q0, Q3, Q4 -// input[68]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 68)] -vqrdmlah.s32 Q1, Q7, r9 -// input[588]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 84)] -vstrw.u32 Q2, [r10,#(-448)] -// Release input[896] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r0,#(304)] -// Release input[76] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r0,#(272)] -// Release input[68] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q4 -// input[588]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[576]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 72)] -vmul.u32 Q6, Q6, r2 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vsub.s32 Q0, Q2, Q4 -// input[580]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q6, r9 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r12,#(336)] -// Release input[588] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(304)] -// Release input[580] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r12,#(320)] -// Release input[584] from Q4 -// input[332]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[320]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 68)] -vmul.u32 Q7, Q7, r2 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vsub.s32 Q0, Q3, Q4 -// input[324]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q7, r9 -// input[844]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 88)] -vstrw.u32 Q2, [r12,#(288)] -// Release input[576] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(320)] -// Release input[332] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(288)] -// Release input[324] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q4 -// input[844]: Already loaded as Q6 -vqrdmulh.s32 Q1, 
Q6, r3 -// input[832]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vmul.u32 Q6, Q6, r2 -// input[840]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 84)] -vsub.s32 Q0, Q2, Q4 -// input[836]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 80)] -vqrdmlah.s32 Q1, Q6, r9 -// input[204]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -48)] -vstrw.u32 Q3, [r14,#(272)] -// Release input[320] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(352)] -// Release input[844] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(320)] -// Release input[836] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(336)] -// Release input[840] from Q4 -// input[204]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vmul.u32 Q7, Q7, r2 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vsub.s32 Q0, Q3, Q4 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vqrdmlah.s32 Q1, Q7, r9 -// input[716]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -40)] -vstrw.u32 Q2, [r11,#(304)] -// Release input[832] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(-192)] -// Release input[204] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(-224)] -// Release input[196] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(-208)] -// Release input[200] from Q4 -// input[716]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vmul.u32 Q6, Q6, r2 -// input[712]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -44)] -vsub.s32 Q0, Q2, Q4 -// input[708]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -48)] -vqrdmlah.s32 Q1, Q6, r9 -// input[460]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -44)] -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(-160)] -// Release input[716] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(-192)] -// Release input[708] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(-176)] -// Release input[712] from Q4 -// input[460]: Already loaded as Q7 
-vqrdmulh.s32 Q1, Q7, r3 -// input[448]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -56)] -vmul.u32 Q7, Q7, r2 -// input[456]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -48)] -vsub.s32 Q0, Q3, Q4 -// input[452]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -52)] -vqrdmlah.s32 Q1, Q7, r9 -// input[972]: Load as Q6 -vldrw.u32 Q6, [r10, #(4 * -36)] -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r12,#(-176)] -// Release input[460] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(-208)] -// Release input[452] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r12,#(-192)] -// Release input[456] from Q4 -// input[972]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vmul.u32 Q6, Q6, r2 -// input[968]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -40)] -vsub.s32 Q0, Q2, Q4 -// input[964]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -44)] -vqrdmlah.s32 Q1, Q6, r9 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vstrw.u32 Q3, [r12,#(-224)] -// Release input[448] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r10,#(-144)] -// Release input[972] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r10,#(-176)] -// Release input[964] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q4 -// input[44]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q7, Q7, r2 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vsub.s32 Q0, Q3, Q4 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q7, r9 -// input[556]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 52)] -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r0,#(176)] -// Release input[44] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r0,#(144)] -// Release input[36] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q4 -// input[556]: Already loaded as Q6 
-vqrdmulh.s32 Q1, Q6, r3 -// input[544]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 40)] -vmul.u32 Q6, Q6, r2 -// input[552]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 48)] -vsub.s32 Q0, Q2, Q4 -// input[548]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 44)] -vqrdmlah.s32 Q1, Q6, r9 -// input[300]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r12,#(208)] -// Release input[556] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(176)] -// Release input[548] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r12,#(192)] -// Release input[552] from Q4 -// input[300]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vmul.u32 Q7, Q7, r2 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vsub.s32 Q0, Q3, Q4 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q7, r9 -// input[812]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 56)] -vstrw.u32 Q2, [r12,#(160)] -// Release input[544] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(192)] -// Release input[300] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(160)] -// Release input[292] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(176)] -// Release input[296] from Q4 -// input[812]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vmul.u32 Q6, Q6, r2 -// input[808]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 52)] -vsub.s32 Q0, Q2, Q4 -// input[804]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 48)] -vqrdmlah.s32 Q1, Q6, r9 -// input[172]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -80)] -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(224)] -// Release input[812] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(192)] -// Release input[804] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(208)] -// Release input[808] from Q4 -// input[172]: Already loaded as Q7 
-vqrdmulh.s32 Q1, Q7, r3 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vsub.s32 Q0, Q3, Q4 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q7, r9 -// input[684]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -72)] -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(-320)] -// Release input[172] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(-336)] -// Release input[168] from Q4 -// input[684]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[672]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] -vmul.u32 Q6, Q6, r2 -// input[680]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -76)] -vsub.s32 Q0, Q2, Q4 -// input[676]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q6, r9 -// input[428]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -76)] -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(-288)] -// Release input[684] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(-320)] -// Release input[676] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(-304)] -// Release input[680] from Q4 -// input[428]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q7, Q7, r2 -// input[424]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -80)] -vsub.s32 Q0, Q3, Q4 -// input[420]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -84)] -vqrdmlah.s32 Q1, Q7, r9 -// input[940]: Load as Q6 -vldrw.u32 Q6, [r10, #(4 * -68)] -vstrw.u32 Q2, [r11,#(-336)] -// Release input[672] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r12,#(-304)] -// Release input[428] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(-336)] -// Release input[420] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r12,#(-320)] -// Release input[424] from Q4 -// input[940]: 
Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[928]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -80)] -vmul.u32 Q6, Q6, r2 -// input[936]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -72)] -vsub.s32 Q0, Q2, Q4 -// input[932]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -76)] -vqrdmlah.s32 Q1, Q6, r9 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 108)] -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r10,#(-272)] -// Release input[940] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r10,#(-304)] -// Release input[932] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r10,#(-288)] -// Release input[936] from Q4 -// input[108]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q7, Q7, r2 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vsub.s32 Q0, Q3, Q4 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q7, r9 -// input[620]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 116)] -vstrw.u32 Q2, [r10,#(-320)] -// Release input[928] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r0,#(432)] -// Release input[108] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r0,#(400)] -// Release input[100] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r0,#(416)] -// Release input[104] from Q4 -// input[620]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[608]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 104)] -vmul.u32 Q6, Q6, r2 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vsub.s32 Q0, Q2, Q4 -// input[612]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q6, r9 -// input[364]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 112)] -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r12,#(464)] -// Release input[620] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(432)] -// Release input[612] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r12,#(448)] -// Release input[616] from Q4 -// input[364]: 
Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q7, Q7, r2 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vsub.s32 Q0, Q3, Q4 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vqrdmlah.s32 Q1, Q7, r9 -// input[876]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 120)] -vstrw.u32 Q2, [r12,#(416)] -// Release input[608] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(448)] -// Release input[364] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(416)] -// Release input[356] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(432)] -// Release input[360] from Q4 -// input[876]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[864]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 108)] -vmul.u32 Q6, Q6, r2 -// input[872]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 116)] -vsub.s32 Q0, Q2, Q4 -// input[868]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vqrdmlah.s32 Q1, Q6, r9 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(480)] -// Release input[876] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(448)] -// Release input[868] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(464)] -// Release input[872] from Q4 -// input[236]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q7, Q7, r2 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vsub.s32 Q0, Q3, Q4 -// input[228]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q7, r9 -// input[748]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -8)] -vstrw.u32 Q2, [r11,#(432)] -// Release input[864] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(-64)] -// Release input[236] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(-96)] -// Release input[228] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(-80)] -// Release input[232] from Q4 -// 
input[748]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[736]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vmul.u32 Q6, Q6, r2 -// input[744]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -12)] -vsub.s32 Q0, Q2, Q4 -// input[740]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vqrdmlah.s32 Q1, Q6, r9 -// input[492]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -12)] -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(-32)] -// Release input[748] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(-64)] -// Release input[740] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(-48)] -// Release input[744] from Q4 -// input[492]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vmul.u32 Q7, Q7, r2 -// input[488]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -16)] -vsub.s32 Q0, Q3, Q4 -// input[484]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q7, r9 -// input[1004]: Load as Q6 -vldrw.u32 Q6, [r10, #(4 * -4)] -vstrw.u32 Q2, [r11,#(-80)] -// Release input[736] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r12,#(-48)] -// Release input[492] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(-80)] -// Release input[484] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r12,#(-64)] -// Release input[488] from Q4 -// input[1004]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[992]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -16)] -vmul.u32 Q6, Q6, r2 -// input[1000]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -8)] -vsub.s32 Q0, Q2, Q4 -// input[996]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -12)] -vqrdmlah.s32 Q1, Q6, r9 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r10,#(-16)] -// Release input[1004] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r10,#(-48)] -// Release input[996] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r10,#(-32)] -// Release input[1000] from 
Q4 -// input[28]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q7, Q7, r2 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vsub.s32 Q0, Q3, Q4 -// input[20]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q7, r9 -// input[540]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 36)] -vstrw.u32 Q2, [r10,#(-64)] -// Release input[992] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r0,#(112)] -// Release input[28] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r0,#(80)] -// Release input[20] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r0,#(96)] -// Release input[24] from Q4 -// input[540]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[528]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 24)] -vmul.u32 Q6, Q6, r2 -// input[536]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vsub.s32 Q0, Q2, Q4 -// input[532]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 28)] -vqrdmlah.s32 Q1, Q6, r9 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r12,#(144)] -// Release input[540] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(112)] -// Release input[532] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r12,#(128)] -// Release input[536] from Q4 -// input[284]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vmul.u32 Q7, Q7, r2 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vsub.s32 Q0, Q3, Q4 -// input[276]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q7, r9 -// input[796]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 40)] -vstrw.u32 Q2, [r12,#(96)] -// Release input[528] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(128)] -// Release input[284] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(96)] -// Release input[276] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q4 -// input[796]: Already 
loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vmul.u32 Q6, Q6, r2 -// input[792]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 36)] -vsub.s32 Q0, Q2, Q4 -// input[788]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 32)] -vqrdmlah.s32 Q1, Q6, r9 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(160)] -// Release input[796] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(128)] -// Release input[788] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q4 -// input[156]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vsub.s32 Q0, Q3, Q4 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q7, r9 -// input[668]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -88)] -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(-384)] -// Release input[156] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(-400)] -// Release input[152] from Q4 -// input[668]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[656]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q6, r2 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vsub.s32 Q0, Q2, Q4 -// input[660]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -96)] -vqrdmlah.s32 Q1, Q6, r9 -// input[412]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -92)] -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(-352)] -// Release input[668] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(-384)] -// Release input[660] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(-368)] -// Release input[664] from Q4 -// 
input[412]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vmul.u32 Q7, Q7, r2 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vsub.s32 Q0, Q3, Q4 -// input[404]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -100)] -vqrdmlah.s32 Q1, Q7, r9 -// input[924]: Load as Q6 -vldrw.u32 Q6, [r10, #(4 * -84)] -vstrw.u32 Q2, [r11,#(-400)] -// Release input[656] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r12,#(-368)] -// Release input[412] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(-400)] -// Release input[404] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r12,#(-384)] -// Release input[408] from Q4 -// input[924]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[912]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -96)] -vmul.u32 Q6, Q6, r2 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vsub.s32 Q0, Q2, Q4 -// input[916]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -92)] -vqrdmlah.s32 Q1, Q6, r9 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r10,#(-336)] -// Release input[924] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r10,#(-368)] -// Release input[916] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r10,#(-352)] -// Release input[920] from Q4 -// input[92]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q7, Q7, r2 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vsub.s32 Q0, Q3, Q4 -// input[84]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q7, r9 -// input[604]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 100)] -vstrw.u32 Q2, [r10,#(-384)] -// Release input[912] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r0,#(368)] -// Release input[92] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r0,#(336)] -// Release input[84] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q4 -// 
input[604]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[592]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 88)] -vmul.u32 Q6, Q6, r2 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vsub.s32 Q0, Q2, Q4 -// input[596]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 92)] -vqrdmlah.s32 Q1, Q6, r9 -// input[348]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r12,#(400)] -// Release input[604] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(368)] -// Release input[596] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r12,#(384)] -// Release input[600] from Q4 -// input[348]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vmul.u32 Q7, Q7, r2 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vsub.s32 Q0, Q3, Q4 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q7, r9 -// input[860]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 104)] -vstrw.u32 Q2, [r12,#(352)] -// Release input[592] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(384)] -// Release input[348] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(352)] -// Release input[340] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(368)] -// Release input[344] from Q4 -// input[860]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[848]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q6, r2 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vsub.s32 Q0, Q2, Q4 -// input[852]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 96)] -vqrdmlah.s32 Q1, Q6, r9 -// input[220]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -32)] -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(416)] -// Release input[860] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(384)] -// Release input[852] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(400)] -// Release input[856] from Q4 -// 
input[220]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmul.u32 Q7, Q7, r2 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vsub.s32 Q0, Q3, Q4 -// input[212]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -40)] -vqrdmlah.s32 Q1, Q7, r9 -// input[732]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vstrw.u32 Q2, [r11,#(368)] -// Release input[848] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r14,#(-128)] -// Release input[220] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r14,#(-160)] -// Release input[212] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q4 -// input[732]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[720]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vmul.u32 Q6, Q6, r2 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vsub.s32 Q0, Q2, Q4 -// input[724]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q6, r9 -// input[476]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -28)] -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(-96)] -// Release input[732] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r11,#(-128)] -// Release input[724] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r11,#(-112)] -// Release input[728] from Q4 -// input[476]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[464]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -40)] -vmul.u32 Q7, Q7, r2 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vsub.s32 Q0, Q3, Q4 -// input[468]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -36)] -vqrdmlah.s32 Q1, Q7, r9 -// input[988]: Load as Q6 -vldrw.u32 Q6, [r10, #(4 * -20)] -vstrw.u32 Q2, [r11,#(-144)] -// Release input[720] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r12,#(-112)] -// Release input[476] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(-144)] -// Release input[468] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r12,#(-128)] -// Release 
input[472] from Q4 -// input[988]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[976]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -32)] -vmul.u32 Q6, Q6, r2 -// input[984]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -24)] -vsub.s32 Q0, Q2, Q4 -// input[980]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -28)] -vqrdmlah.s32 Q1, Q6, r9 -// input[60]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 60)] -vstrw.u32 Q3, [r12,#(-160)] -// Release input[464] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r10,#(-80)] -// Release input[988] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r10,#(-112)] -// Release input[980] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r10,#(-96)] -// Release input[984] from Q4 -// input[60]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vmul.u32 Q7, Q7, r2 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vsub.s32 Q0, Q3, Q4 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q7, r9 -// input[572]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 68)] -vstrw.u32 Q2, [r10,#(-128)] -// Release input[976] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r0,#(240)] -// Release input[60] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r0,#(208)] -// Release input[52] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q4 -// input[572]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vmul.u32 Q6, Q6, r2 -// input[568]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 64)] -vsub.s32 Q0, Q2, Q4 -// input[564]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vqrdmlah.s32 Q1, Q6, r9 -// input[316]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 64)] -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r12,#(272)] -// Release input[572] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(240)] -// Release input[564] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r12,#(256)] -// Release input[568] from Q4 
-// input[316]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q7, Q7, r2
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vsub.s32 Q0, Q3, Q4
-// input[308]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[828]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 72)]
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r14,#(256)]
-// Release input[316] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r14,#(224)]
-// Release input[308] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(240)]
-// Release input[312] from Q4
-// input[828]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[816]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 60)]
-vmul.u32 Q6, Q6, r2
-// input[824]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 68)]
-vsub.s32 Q0, Q2, Q4
-// input[820]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 64)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(288)]
-// Release input[828] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(256)]
-// Release input[820] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r11,#(272)]
-// Release input[824] from Q4
-// input[188]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q7, Q7, r2
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vsub.s32 Q0, Q3, Q4
-// input[180]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[700]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -56)]
-vstrw.u32 Q2, [r11,#(240)]
-// Release input[816] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r14,#(-256)]
-// Release input[188] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r14,#(-288)]
-// Release input[180] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q4
-// input[700]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[688]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vmul.u32 Q6, Q6, r2
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vsub.s32 Q0, Q2, Q4
-// input[692]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -64)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[444]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -60)]
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(-224)]
-// Release input[700] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-256)]
-// Release input[692] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r11,#(-240)]
-// Release input[696] from Q4
-// input[444]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[432]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -72)]
-vmul.u32 Q7, Q7, r2
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vsub.s32 Q0, Q3, Q4
-// input[436]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -68)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[956]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -52)]
-vstrw.u32 Q2, [r11,#(-272)]
-// Release input[688] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r12,#(-240)]
-// Release input[444] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(-272)]
-// Release input[436] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-256)]
-// Release input[440] from Q4
-// input[956]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[944]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -64)]
-vmul.u32 Q6, Q6, r2
-// input[952]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -56)]
-vsub.s32 Q0, Q2, Q4
-// input[948]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -60)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[124]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 124)]
-vstrw.u32 Q3, [r12,#(-288)]
-// Release input[432] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-208)]
-// Release input[956] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r10,#(-240)]
-// Release input[948] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r10,#(-224)]
-// Release input[952] from Q4
-// input[124]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q7, Q7, r2
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vsub.s32 Q0, Q3, Q4
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[636]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -120)]
-vstrw.u32 Q2, [r10,#(-256)]
-// Release input[944] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r0,#(496)]
-// Release input[124] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r0,#(464)]
-// Release input[116] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r0,#(480)]
-// Release input[120] from Q4
-// input[636]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[624]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 120)]
-vmul.u32 Q6, Q6, r2
-// input[632]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -124)]
-vsub.s32 Q0, Q2, Q4
-// input[628]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 124)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(-480)]
-// Release input[636] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(496)]
-// Release input[628] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r11,#(-496)]
-// Release input[632] from Q4
-// input[380]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q7, Q7, r2
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vsub.s32 Q0, Q3, Q4
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[892]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -116)]
-vstrw.u32 Q2, [r12,#(480)]
-// Release input[624] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r12,#(-496)]
-// Release input[380] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r14,#(480)]
-// Release input[372] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q4
-// input[892]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[880]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 124)]
-vmul.u32 Q6, Q6, r2
-// input[888]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -120)]
-vsub.s32 Q0, Q2, Q4
-// input[884]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[252]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 0)]
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-464)]
-// Release input[892] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r10,#(-496)]
-// Release input[884] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r10,#(-480)]
-// Release input[888] from Q4
-// input[252]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmul.u32 Q7, Q7, r2
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vsub.s32 Q0, Q3, Q4
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[764]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 8)]
-vstrw.u32 Q2, [r11,#(496)]
-// Release input[880] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r14,#(0)]
-// Release input[252] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q4
-// input[764]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[752]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vmul.u32 Q6, Q6, r2
-// input[760]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 4)]
-vsub.s32 Q0, Q2, Q4
-// input[756]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[508]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 4)]
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(32)]
-// Release input[764] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(0)]
-// Release input[756] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r11,#(16)]
-// Release input[760] from Q4
-// input[508]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[496]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -8)]
-vmul.u32 Q7, Q7, r2
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vsub.s32 Q0, Q3, Q4
-// input[500]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -4)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[1020]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * 12)]
-vstrw.u32 Q2, [r11,#(-16)]
-// Release input[752] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r12,#(16)]
-// Release input[508] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(-16)]
-// Release input[500] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(0)]
-// Release input[504] from Q4
-// input[1020]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[1008]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 0)]
-vmul.u32 Q6, Q6, r2
-// input[1016]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * 8)]
-vsub.s32 Q0, Q2, Q4
-// input[1012]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * 4)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[48]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 48)]
-vstrw.u32 Q3, [r12,#(-32)]
-// Release input[496] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(48)]
-// Release input[1020] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r10,#(16)]
-// Release input[1012] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r10,#(32)]
-// Release input[1016] from Q4
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[48]: Already loaded as Q7
-vqrdmulh.s32 Q0, Q7, r7
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vmul.u32 Q7, Q7, r6
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vqrdmlah.s32 Q0, Q7, r9
-vstrw.u32 Q2, [r10,#(0)]
-// Release input[1008] from Q2
-vqrdmulh.s32 Q2, Q1, r7
-vsub.s32 Q7, Q3, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q2, Q1, r9
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q4, Q7, r3
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q7, Q7, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q4, Q7, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q7, Q1, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q1, Q1, Q4
-vstrw.u32 Q7, [r0,#(192)]
-// Release input[48] from Q7
-vqrdmlah.s32 Q5, Q3, r9
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vsub.s32 Q3, Q0, Q5
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vadd.s32 Q0, Q0, Q5
-// input[560]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[528]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 24)]
-vmul.u32 Q2, Q2, r6
-// input[544]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 40)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[512]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(96)]
-// Release input[528] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q0, Q0, r6
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(32)]
-// Release input[512] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[816]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[816]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[784]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vmul.u32 Q1, Q1, r6
-// input[800]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[768]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release input[816] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(112)]
-// Release input[784] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release input[800] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(48)]
-// Release input[768] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[688]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[656]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[656] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[944]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[944]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[912]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[928]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -80)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[896]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-256)]
-// Release input[944] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-384)]
-// Release input[912] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-320)]
-// Release input[928] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[112]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r6
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r10,#(-448)]
-// Release input[896] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[624]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[624]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[576]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(480)]
-// Release input[624] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(288)]
-// Release input[576] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[880]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[880]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[848]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[864]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 108)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[832]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release input[880] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(368)]
-// Release input[848] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(432)]
-// Release input[864] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q1, Q1, r6
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(304)]
-// Release input[832] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[752]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[752]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[704]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release input[752] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[464]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -40)]
-vmul.u32 Q0, Q0, r6
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-208)]
-// Release input[704] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[1008]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-160)]
-// Release input[464] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1008]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[976]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[992]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -16)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[960]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(0)]
-// Release input[1008] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-128)]
-// Release input[976] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-64)]
-// Release input[992] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q2, Q2, r6
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r10,#(-192)]
-// Release input[960] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[568]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[520]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q1, Q1, r6
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(64)]
-// Release input[520] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[824]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[824]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[792]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[808]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 52)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[776]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(272)]
-// Release input[824] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(144)]
-// Release input[792] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(208)]
-// Release input[808] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(80)]
-// Release input[776] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[696]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[696]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[648]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-240)]
-// Release input[696] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-432)]
-// Release input[648] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[392]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[952]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[952]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[920]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -88)]
-vmul.u32 Q0, Q0, r6
-// input[936]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -72)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(-448)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[904]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -104)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-224)]
-// Release input[952] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-352)]
-// Release input[920] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r10,#(-416)]
-// Release input[904] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[632]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[632]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q2, Q2, r6
-// input[616]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 112)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[584]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-496)]
-// Release input[632] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(320)]
-// Release input[584] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[888]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[888]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[856]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 100)]
-vmul.u32 Q1, Q1, r6
-// input[872]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 116)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[840]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 84)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-480)]
-// Release input[888] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(400)]
-// Release input[856] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(464)]
-// Release input[872] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(336)]
-// Release input[840] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[760]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[760]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[712]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(16)]
-// Release input[760] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[488]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -16)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-176)]
-// Release input[712] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[1016]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1016]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[984]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -24)]
-vmul.u32 Q2, Q2, r6
-// input[1000]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -8)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[968]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(32)] -// Release input[1016] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-96)] -// Release input[984] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-32)] -// Release input[1000] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-160)] -// Release input[968] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[564]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[564]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vmul.u32 Q1, Q1, r6 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[516]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, 
r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(240)] -// Release input[564] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q2, Q2, r6 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(48)] -// Release input[516] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[820]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[820]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[788]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmul.u32 Q0, Q0, r6 -// input[804]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r11,#(256)] -// Release input[820] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(128)] -// Release input[788] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release input[804] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(64)] -// Release input[772] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[692]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[692]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[644]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-256)] -// Release input[692] from Q2 -vqrdmlah.s32 Q6, 
Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[436]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q0, Q0, r6 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-448)] -// Release input[644] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[948]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[948]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(-464)] -// Release input[388] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[900]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-240)] -// Release input[948] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] 
from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[116]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r10,#(-432)] -// Release input[900] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[628]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[628]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q0, Q0, r6 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[372]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(496)] -// Release input[628] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] 
from Q4 -vadd.s32 Q2, Q2, Q6 -// input[372]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(304)] -// Release input[580] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[324]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(480)] -// Release input[372] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[884]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[852]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(288)] -// Release input[324] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[836]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[884] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(384)] -// Release input[852] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q0 
-vqrdmulh.s32 Q2, Q0, r7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(320)] -// Release input[836] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[756]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[724]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -32)] -vmul.u32 Q1, Q1, r6 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[708]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[500]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-128)] -// Release input[724] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[468]: Load as Q3 -vldrw.u32 Q3, 
[r12, #(4 * -36)] -vmul.u32 Q2, Q2, r6 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-192)] -// Release input[708] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[452]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1012]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-16)] -// Release input[500] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-144)] -// Release input[468] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1012]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[980]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -28)] -vmul.u32 Q0, Q0, r6 -// input[996]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -12)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(-208)] -// Release input[452] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(16)] -// Release input[1012] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-112)] -// Release input[980] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-48)] -// Release input[996] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[28]: Load as Q3 -vldrw.u32 
Q3, [r0, #(4 * 28)] -vmul.u32 Q1, Q1, r6 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[572]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q0, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, 
#(4 * 48)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[828]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[828]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[796]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmul.u32 Q1, Q1, r6 -// input[812]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(288)] -// Release input[828] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(160)] -// Release input[796] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(224)] -// Release input[812] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(96)] -// Release 
input[780] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[700]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[700]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q0, Q0, r6 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[652]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[444]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-224)] -// Release input[700] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[444]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, 
Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[396]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-240)] -// Release input[444] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(-432)] -// Release input[396] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-336)] -// Release input[924] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-272)] -// Release input[940] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 
Q1, Q3, r9 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[636]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[636]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[588]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-480)] -// Release input[636] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[380]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[332]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 80)] 
-vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[892]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[892]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[860]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 104)] -vmul.u32 Q0, Q0, r6 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(320)] -// Release input[332] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-464)] -// Release input[892] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(416)] -// Release input[860] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(480)] -// Release input[876] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmul.u32 Q1, Q1, r6 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[764]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[764]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[732]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -24)]
-vmul.u32 Q2, Q2, r6
-// input[748]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -8)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[716]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[508]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(32)]
-// Release input[764] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-96)]
-// Release input[732] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-32)]
-// Release input[748] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[508]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[476]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-160)]
-// Release input[716] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[1020]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-112)]
-// Release input[476] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-48)]
-// Release input[492] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1020]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[988]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[1004]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -4)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[972]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -36)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(48)]
-// Release input[1020] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-80)]
-// Release input[988] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-16)]
-// Release input[1004] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[192]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vmul.u32 Q2, Q2, r6
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r10,#(-144)]
-// Release input[972] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[704]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[576]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 72)]
-vmul.u32 Q0, Q0, r6
-// input[640]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-208)]
-// Release input[704] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(288)]
-// Release input[576] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release input[640] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[448]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vmul.u32 Q1, Q1, r6
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[960]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[960]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[832]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 76)]
-vmul.u32 Q2, Q2, r6
-// input[896]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -112)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[768]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-192)]
-// Release input[960] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(304)]
-// Release input[832] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-448)]
-// Release input[896] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[224]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q0, Q0, r6
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release input[768] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[736]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[736]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[544]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[480]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-80)]
-// Release input[736] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[480]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(160)]
-// Release input[544] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[992]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-96)]
-// Release input[480] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[992]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[864]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[928]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[800]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-64)]
-// Release input[992] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(432)]
-// Release input[864] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-320)]
-// Release input[928] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[208]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q1, Q1, r6
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(176)]
-// Release input[800] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[720]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[720]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q2, Q2, r6
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[528]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-144)]
-// Release input[720] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[464]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q0, Q0, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(96)]
-// Release input[528] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[976]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-160)]
-// Release input[464] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[976]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[848]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 92)]
-vmul.u32 Q1, Q1, r6
-// input[912]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[784]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-128)]
-// Release input[976] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(368)]
-// Release input[848] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-384)]
-// Release input[912] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(112)]
-// Release input[784] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[752]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[752]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[624]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 120)]
-vmul.u32 Q0, Q0, r6
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[496]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-16)]
-// Release input[752] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(480)]
-// Release input[624] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[496]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q1, Q1, r6
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[1008]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-32)]
-// Release input[496] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-288)]
-// Release input[432] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[1008]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[880]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 124)]
-vmul.u32 Q2, Q2, r6
-// input[944]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -64)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[816]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(0)]
-// Release input[1008] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(496)]
-// Release input[880] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-256)]
-// Release input[944] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q0, Q0, r6
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(240)]
-// Release input[816] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[712]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q1, Q1, r6
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[520]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-176)]
-// Release input[712] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[456]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q2, Q2, r6
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(64)]
-// Release input[520] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[968]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[968]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[840]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vmul.u32 Q0, Q0, r6
-// input[904]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -104)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[776]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-160)]
-// Release input[968] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release input[840] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-416)]
-// Release input[904] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[232]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(80)]
-// Release input[776] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[744]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-80)]
-// Release input[232] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[744]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[552]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[488]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-48)]
-// Release input[744] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[488]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(192)]
-// Release input[552] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[1000]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-64)]
-// Release input[488] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[1000]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[872]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vmul.u32 Q1, Q1, r6
-// input[936]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -72)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(176)]
-// Release input[296] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[808]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-32)]
-// Release input[1000] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release input[872] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[216]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q2, Q2, r6
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(208)]
-// Release input[808] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[728]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[728]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q0, Q0, r6
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[536]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[472]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-112)]
-// Release input[728] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[472]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q1, Q1, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(128)]
-// Release input[536] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[984]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-128)]
-// Release input[472] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[984]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[856]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[920]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[792]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-96)]
-// Release input[984] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(400)]
-// Release input[856] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-352)]
-// Release input[920] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q0, Q0, r6
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(144)]
-// Release input[792] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[760]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[760]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q1, Q1, r6
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[504]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[760] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q2, Q2, r6
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[1016]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(0)]
-// Release input[504] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-256)]
-// Release input[440] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[1016]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[888]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -120)]
-vmul.u32 Q0, Q0, r6
-// input[952]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -56)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[824]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(32)]
-// Release input[1016] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-480)]
-// Release input[888] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r10,#(-224)]
-// Release input[952] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[196]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vmul.u32 Q1, Q1, r6
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(272)]
-// Release input[824] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[708]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[708]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q2, Q2, r6
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-192)]
-// Release input[708] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[452]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[324]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 72)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(48)]
-// Release input[516] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[964]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[964]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[836]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 80)]
-vmul.u32 Q1, Q1, r6
-// input[900]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[772]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-176)]
-// Release input[964] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(320)]
-// Release input[836] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-432)]
-// Release input[900] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[228]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(64)]
-// Release input[772] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[740]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[740]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(144)]
-// Release
input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[484]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-64)] -// Release input[740] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[484]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmul.u32 Q1, Q1, r6 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(176)] -// Release input[548] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[996]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-80)] -// Release input[484] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[996]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[868]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[804]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[212]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-48)] -// Release input[996] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(448)] -// Release input[868] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[212]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q0, Q0, r6 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(192)] -// Release input[804] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[724]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-160)] -// Release input[212] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[724]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q1, Q1, r6 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 
-vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[532]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[468]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[724] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[468]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q2, Q2, r6 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(112)] -// Release input[532] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[276]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[980]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-144)] -// Release input[468] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[980]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[852]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmul.u32 Q0, Q0, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(96)] -// Release input[276] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 
-// input[788]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-112)] -// Release input[980] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(384)] -// Release input[852] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(128)] -// Release input[788] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[628]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 124)] -vmul.u32 Q2, Q2, r6 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// 
input[564]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(496)] -// Release input[628] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(240)] -// Release input[564] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[1012]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1012]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[884]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmul.u32 Q1, Q1, r6 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[820]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 64)] -vqrdmulh.s32 
Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(16)] -// Release input[1012] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-496)] -// Release input[884] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-240)] -// Release input[948] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[204]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmul.u32 Q2, Q2, r6 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(256)] -// Release input[820] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[588]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 84)] -vmul.u32 Q0, Q0, r6 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, 
Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(336)] -// Release input[588] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[460]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmul.u32 Q1, Q1, r6 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[268]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-432)] -// Release input[396] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[972]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[844]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmul.u32 Q2, Q2, r6 -// input[908]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -100)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(64)] -// Release input[268] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[780]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 
-vqrdmlah.s32 Q5, Q2, r9 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-144)] -// Release input[972] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release input[844] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-400)] -// Release input[908] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[236]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(96)] -// Release input[780] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[748]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[748]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[556]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 
-vqrdmlah.s32 Q5, Q1, r9 -// input[492]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[748] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(464)] -// Release input[620] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[492]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(208)] -// Release input[556] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1004]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-48)] -// Release input[492] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1004]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[876]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[812]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[220]: Load as Q1 -vldrw.u32 Q1, 
[r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-16)] -// Release input[1004] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release input[876] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-272)] -// Release input[940] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q1, Q1, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(224)] -// Release input[812] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[732]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[732]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q2, Q2, r6 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[540]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[476]: Load as Q0 -vldrw.u32 Q0, 
[r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-96)] -// Release input[732] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q0, Q0, r6 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(144)] -// Release input[540] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[860]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 104)] -vmul.u32 Q1, Q1, r6 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[796]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(416)] -// Release input[860] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vmul.u32 Q2, Q2, r6 -// input[188]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release input[796] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-256)] -// Release input[188] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vmul.u32 Q0, Q0, r6 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[508]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-480)] -// Release input[636] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[508]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmul.u32 Q1, Q1, r6 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(16)] -// Release input[508] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-240)] -// Release input[444] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1020]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[892]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -116)] -vmul.u32 Q2, Q2, r6 -// input[956]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -52)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[828]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, 
[r10,#(48)] -// Release input[1020] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-464)] -// Release input[892] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-208)] -// Release input[956] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[768]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vmul.u32 Q0, Q0, r6 -// input[512]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release input[828] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[896]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(32)] -// Release input[512] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[896]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vmul.u32 Q1, Q1, r6 -// input[640]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[832]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, 
Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-448)] -// Release input[896] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release input[640] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[832]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[320]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 68)] -vmul.u32 Q2, Q2, r6 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[960]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(304)] -// Release input[832] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(272)] -// Release input[320] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[448]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[704]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[800]: Load as Q1 
-vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-192)] -// Release input[960] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-224)] -// Release input[448] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release input[704] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[800]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[544]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 40)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[928]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(176)] -// Release input[800] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(160)] -// Release input[544] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[928]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 
-vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[864]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-320)] -// Release input[928] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release input[672] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[864]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q0, Q0, r6 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[992]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(432)] -// Release input[864] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[992]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vmul.u32 Q1, Q1, r6 -// input[736]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 
-28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-64)] -// Release input[992] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release input[736] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[784]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vmul.u32 Q2, Q2, r6 -// input[528]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[912]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(96)] -// Release input[528] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[912]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q1, Q3, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[848]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-384)] -// Release input[912] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[848]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vmul.u32 Q1, Q1, r6 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[976]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(368)] -// Release input[848] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[976]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[464]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -40)] -vmul.u32 Q2, Q2, r6 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 
Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-128)] -// Release input[976] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-160)] -// Release input[464] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-144)] -// Release input[720] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[816]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vmul.u32 Q0, Q0, r6 -// input[560]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[944]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(240)] -// Release input[816] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(224)] -// Release input[560] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[944]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[688]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -68)] -vqrdmlah.s32 Q0, Q1, r9 
-vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[880]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-256)] -// Release input[944] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-272)] -// Release input[688] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[880]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmul.u32 Q2, Q2, r6 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1008]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(496)] -// Release input[880] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1008]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[496]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -8)] -vmul.u32 Q0, Q0, r6 -// 
input[752]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -4)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(0)] -// Release input[1008] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-32)] -// Release input[496] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-16)] -// Release input[752] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[776]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vmul.u32 Q1, Q1, r6 -// input[520]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 16)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[904]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(64)] -// Release input[520] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[904]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[392]: Load as 
Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[840]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-416)] -// Release input[904] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-432)] -// Release input[648] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[840]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmul.u32 Q0, Q0, r6 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[968]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(336)] -// Release input[840] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// 
input[968]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[456]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -48)] -vmul.u32 Q1, Q1, r6 -// input[712]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -44)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[200]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[808]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-160)] -// Release input[968] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-176)] -// Release input[712] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[808]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q2, Q2, r6 -// input[552]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-208)] -// Release input[200] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[936]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(208)] -// Release input[808] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(192)] -// Release input[552] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd 
r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[936]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[680]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -76)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(160)] -// Release input[40] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[872]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-288)] -// Release input[936] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[872]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q1, Q1, r6 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[104]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1000]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(464)] -// Release input[872] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 
Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1000]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q2, Q2, r6 -// input[744]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(416)] -// Release input[104] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-32)] -// Release input[1000] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-48)] -// Release input[744] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[792]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q0, Q0, r6 -// input[536]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[920]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(128)] -// Release input[536] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[920]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[152]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[856]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-352)] -// Release input[920] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[856]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q2, Q2, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-400)] -// Release input[152] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[984]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(400)] 
-// Release input[856] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[984]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[472]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -32)] -vmul.u32 Q0, Q0, r6 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[824]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-96)] -// Release input[984] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[824]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[312]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 60)] -vmul.u32 Q1, Q1, r6 -// input[568]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 64)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[952]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(272)] -// Release input[824] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(240)] -// Release input[312] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(256)] -// Release input[568] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[952]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[888]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-224)] -// Release input[952] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-240)] -// Release input[696] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[888]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[376]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 124)] -vmul.u32 Q0, Q0, r6 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[1016]: Load as Q1 
-vldrw.u32 Q1, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-480)] -// Release input[888] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(496)] -// Release input[376] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1016]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vmul.u32 Q1, Q1, r6 -// input[760]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 4)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(32)] -// Release input[1016] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(16)] -// Release input[760] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[772]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[260]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 8)] -vmul.u32 Q2, Q2, r6 -// input[516]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, 
Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[900]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(64)] -// Release input[772] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(32)] -// Release input[260] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(48)] -// Release input[516] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[900]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[388]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[836]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-432)] -// Release input[900] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-464)] -// Release input[388] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[836]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[324]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 72)] -vmul.u32 Q1, Q1, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[68]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 68)] 
-vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(320)] -// Release input[836] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(288)] -// Release input[324] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[964]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[452]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -52)] -vmul.u32 Q2, Q2, r6 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(272)] -// Release input[68] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-208)] -// Release input[452] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[804]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmul.u32 Q0, Q0, r6 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q1, Q3, r9 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[932]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(192)] -// Release input[804] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[932]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[164]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -88)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[868]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-304)] -// Release input[932] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[868]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmul.u32 Q2, Q2, r6 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-352)] -// Release input[164] from Q0 -vqrdmulh.s32 
Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[100]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[996]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(448)] -// Release input[868] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[996]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q0, Q0, r6 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-48)] -// Release input[996] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[788]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q1, Q1, r6 -// input[532]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)] -vqrdmlah.s32 Q0, Q1, r9 
-vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[20]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[916]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[916]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(80)] -// Release input[20] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[852]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-368)] -// Release input[916] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[852]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q0, Q0, r6 -// 
input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[980]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(384)] -// Release input[852] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[980]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[468]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -36)] -vmul.u32 Q1, Q1, r6 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[212]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[820]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-112)] -// Release input[980] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-144)] -// Release input[468] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[820]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// 
input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vmul.u32 Q2, Q2, r6 -// input[564]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-160)] -// Release input[212] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[948]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(256)] -// Release input[820] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[948]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[884]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-240)] -// Release input[948] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, 
[r8], #+8 -// input[884]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q1, Q1, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1012]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-496)] -// Release input[884] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1012]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[500]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -4)] -vmul.u32 Q2, Q2, r6 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(16)] -// Release input[1012] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-16)] -// Release input[500] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] from Q4 -vadd.s32 Q1, 
Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[780]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[268]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 16)] -vmul.u32 Q0, Q0, r6 -// input[524]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(64)] -// Release input[268] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(80)] -// Release input[524] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[396]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-432)] -// Release input[396] from Q3 -vsub.s32 Q4, Q0, Q6 
-vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmul.u32 Q2, Q2, r6 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[460]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -44)] -vmul.u32 Q0, Q0, r6 -// input[716]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 
Q3, [r12,#(-176)] -// Release input[460] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-160)] -// Release input[716] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmul.u32 Q1, Q1, r6 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[940]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[172]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, 
[r10,#(-272)] -// Release input[940] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q0, Q0, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[1004]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(480)] -// Release input[876] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q1, Q1, r6 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 
-vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-16)] -// Release input[1004] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[796]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q2, Q2, r6 -// input[540]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(144)] -// Release input[540] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[860]: 
Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q1, Q1, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(416)] -// Release input[860] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[476]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -28)] -vmul.u32 Q2, Q2, r6 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 
Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-112)] -// Release input[476] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[316]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 64)] -vmul.u32 Q0, Q0, r6 -// input[572]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(256)] -// Release input[316] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(272)] -// Release input[572] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, 
#(4 * -64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[892]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[892]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmul.u32 Q2, Q2, r6 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-464)] -// Release input[892] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1020]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[508]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 4)] -vmul.u32 Q0, Q0, r6 -// input[764]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, 
Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -vqrdmulh.s32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(48)] -// Release input[1020] from Q0 -vqrdmlah.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r12,#(16)] -// Release input[508] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(32)] -// Release input[764] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -.equ modulus_inv, 4223674367 -movw r7, #:lower16:modulus_inv -movt r7, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 9439 -// Instruction count: 7128 \ No newline at end of file diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double.s deleted file mode 100644 index 91c6b3a..0000000 --- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double.s +++ /dev/null @@ -1,11458 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
[... remainder of the deleted auto-generated twiddle table: `.word` pairs giving zeta^k * 2^31 and zeta^k * f(q^(-1) mod 2^32) * 2^31 (base zeta = 286215, Montgomery factor 71292929) for the successive NTT layers ...]
* 2^31 -.word 311527319 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31 -.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 = 29810009 * 2^31 -.word 4054742117 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 286215^316 * 71292929 * 2^31 -.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31 -.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 = 11135584 * 2^31 -.word 712459929 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31 -.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 = 18505659 * 2^31 -.word 3331484459 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31 -.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31 -.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 = 17484839 * 2^31 -.word 3266171913 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31 -.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 = 20168277 * 2^31 -.word 3437859545 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31 -.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31 -.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 = 31283961 * 2^31 -.word 4149046263 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31 -.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 = 17056436 * 2^31 -.word 1091278839 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_1024_u32_33564673_286215_incomplete_double, %function -.global ntt_1024_u32_33564673_286215_incomplete_double -ntt_1024_u32_33564673_286215_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector 
registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -// Use r11 as marker for r0 + 3024 -add r11, r12, #1008 -// Use r10 as marker for r0 + 4032 -add r10, r11, #1008 -.equ modulus, 33564673 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q1, Q0, r7 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r9 -vqrdmulh.s32 Q4, Q2, r7 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r6 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r9 -// input[772]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vqrdmulh.s32 Q6, Q3, r5 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmlah.s32 Q6, Q3, r9 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[772]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[516]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 12)] -vmul.u32 Q4, Q4, r6 -// input[260]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r7 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r6 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r9 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r9 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vqrdmulh.s32 Q6, Q3, r5 -vsub.s32 Q4, 
Q2, Q5 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r11,#(64)] -// Release input[772] from Q4 -vqrdmlah.s32 Q6, Q3, r9 -vstrw.u32 Q2, [r12,#(48)] -// Release input[516] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r14,#(32)] -// Release input[260] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[776]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[520]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 16)] -vmul.u32 Q1, Q1, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(48)] -// Release input[264] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[780]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[524]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 20)] -vmul.u32 Q0, Q0, r6 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(96)] -// Release 
input[780] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(80)] -// Release input[524] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[784]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[528]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 24)] -vmul.u32 Q2, Q2, r6 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(96)] -// Release input[528] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(80)] -// Release input[272] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[788]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vmul.u32 Q1, Q1, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(112)] -// Release 
input[532] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[792]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r6 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[796]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[800]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release 
input[284] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[800]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[544]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 40)] -vmul.u32 Q1, Q1, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(176)] -// Release input[800] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(160)] -// Release input[544] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(144)] -// Release input[288] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[804]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vmul.u32 Q0, Q0, r6 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[808]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(192)] -// Release input[804] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[808]: Already loaded as Q2 
-vqrdmulh.s32 Q0, Q2, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q2, Q2, r6
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[812]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(208)]
-// Release input[808] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[812]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q1, Q1, r6
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[816]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(224)]
-// Release input[812] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[816]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vmul.u32 Q0, Q0, r6
-// input[304]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[820]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(240)]
-// Release input[816] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(208)]
-// Release input[304] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[820]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[564]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 60)]
-vmul.u32 Q2, Q2, r6
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[824]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(256)]
-// Release input[820] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(240)]
-// Release input[564] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[824]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vmul.u32 Q1, Q1, r6
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[828]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[824] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(240)]
-// Release input[312] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[828]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[572]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 68)]
-vmul.u32 Q0, Q0, r6
-// input[316]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[832]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(288)]
-// Release input[828] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(272)]
-// Release input[572] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(256)]
-// Release input[316] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[832]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[576]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 72)]
-vmul.u32 Q2, Q2, r6
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[836]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(304)]
-// Release input[832] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(288)]
-// Release input[576] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[836]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[840]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(320)]
-// Release input[836] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[840]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q0, Q0, r6
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[844]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(336)]
-// Release input[840] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[844]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[588]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[848]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(352)]
-// Release input[844] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(336)]
-// Release input[588] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[848]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[852]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(368)]
-// Release input[848] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[852]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[596]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(320)]
-// Release input[80] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[84]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[856]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(384)]
-// Release input[852] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(368)]
-// Release input[596] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[856]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q2, Q2, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(336)]
-// Release input[84] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[860]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(400)]
-// Release input[856] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[860]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[604]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 100)]
-vmul.u32 Q1, Q1, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(352)]
-// Release input[88] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[864]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(416)]
-// Release input[860] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(400)]
-// Release input[604] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[864]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q0, Q0, r6
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(368)]
-// Release input[92] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[868]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(432)]
-// Release input[864] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[868]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q2, Q2, r6
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(384)]
-// Release input[96] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[100]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 100)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[872]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(448)]
-// Release input[868] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[872]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(400)]
-// Release input[100] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 104)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[876]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(464)]
-// Release input[872] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[876]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q0, Q0, r6
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[880]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(480)]
-// Release input[876] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[880]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[624]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 120)]
-vmul.u32 Q2, Q2, r6
-// input[368]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[884]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(496)]
-// Release input[880] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(480)]
-// Release input[624] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(464)]
-// Release input[368] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[884]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[628]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[888]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-496)]
-// Release input[884] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(496)]
-// Release input[628] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[888]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q0, Q0, r6
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[892]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-480)]
-// Release input[888] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(496)]
-// Release input[376] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[892]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[636]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[380]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -124)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[896]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-464)]
-// Release input[892] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-480)]
-// Release input[636] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-496)]
-// Release input[380] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[896]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[640]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[900]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-448)]
-// Release input[896] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-464)]
-// Release input[640] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[900]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[644]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[904]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-432)]
-// Release input[900] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[644] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[904]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[648]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[136]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[908]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-416)]
-// Release input[904] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-432)]
-// Release input[648] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vadd.s32 Q0, Q0, Q6 -// input[908]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-464)] -// Release input[136] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[912]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-432)] -// Release input[396] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[912]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q0, Q0, r6 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[916]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-384)] -// Release input[912] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-416)] -// Release input[400] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[916]: Already loaded as 
Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[148]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[920]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-368)] -// Release input[916] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[920]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-416)] -// Release input[148] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-352)] -// Release input[920] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[668]: Load as Q3 
-vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q0, Q0, r6 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[928]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[928]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[416]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[932]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-320)] -// Release input[928] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-352)] -// Release input[416] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[932]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q1, Q1, r6 -// 
input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[936]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-304)] -// Release input[932] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[936]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r6 -// input[424]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[940]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-288)] -// Release input[936] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-320)] -// Release input[424] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[940]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] 
-vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[944]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-272)] -// Release input[940] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[944]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q1, Q1, r6 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[948]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-256)] -// Release input[944] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-288)] -// Release input[432] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[948]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q0, Q0, r6 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-304)] -// 
Release input[176] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[952]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-240)] -// Release input[948] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[952]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[440]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-224)] -// Release input[952] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-256)] -// Release input[440] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 
Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[960]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-240)] -// Release input[444] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[960]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[704]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -52)] -vmul.u32 Q0, Q0, r6 -// input[448]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -56)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-192)] -// Release input[960] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-208)] -// Release input[704] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-224)] -// Release input[448] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[964]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[708]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmul.u32 Q2, Q2, r6 -// input[452]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -52)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 
-vqrdmlah.s32 Q1, Q3, r9 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[968]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-192)] -// Release input[708] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-208)] -// Release input[452] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[968]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q1, Q1, r6 -// input[456]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-160)] -// Release input[968] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-192)] -// Release input[456] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[716]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[460]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[204]: Load as Q1 
-vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[976]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release input[716] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-176)] -// Release input[460] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[976]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[720]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -36)] -vmul.u32 Q2, Q2, r6 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[980]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-128)] -// Release input[976] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-144)] -// Release input[720] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-160)] -// Release input[464] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[980]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[724]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -32)] -vmul.u32 Q1, Q1, r6 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-176)] -// Release input[208] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r3 
-vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[984]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-112)] -// Release input[980] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-128)] -// Release input[724] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[984]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r6 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-160)] -// Release input[212] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-96)] -// Release input[984] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-112)] -// Release input[728] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vmul.u32 Q2, Q2, r6 -// input[476]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -28)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 
-vqrdmlah.s32 Q5, Q2, r9 -// input[992]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-96)] -// Release input[732] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-112)] -// Release input[476] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[992]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[736]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -20)] -vmul.u32 Q1, Q1, r6 -// input[480]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[996]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-64)] -// Release input[992] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-96)] -// Release input[480] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[996]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[740]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmul.u32 Q0, Q0, r6 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1000]: Load as Q2 -vldrw.u32 
Q2, [r10, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-48)] -// Release input[996] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-64)] -// Release input[740] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1000]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q2, Q2, r6 -// input[488]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[232]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1004]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-32)] -// Release input[1000] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-64)] -// Release input[488] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1004]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmul.u32 Q1, Q1, r6 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-80)] -// Release input[232] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1008]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-16)] -// Release input[1004] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-48)] -// Release input[492] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1008]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[752]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -4)] -vmul.u32 Q0, Q0, r6 -// input[496]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -8)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[240]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1012]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(0)] -// Release input[1008] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-16)] -// Release input[752] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-32)] -// Release input[496] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1012]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[756]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmul.u32 Q2, Q2, r6 -// input[500]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -4)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(-48)] -// Release input[240] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1016]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(16)] -// 
Release input[1012] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(0)] -// Release input[756] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-16)] -// Release input[500] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1016]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[760]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vmul.u32 Q1, Q1, r6 -// input[504]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 0)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(32)] -// Release input[1016] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(0)] -// Release input[504] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1020]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[764]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[508]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(48)] -// Release input[1020] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(32)] -// 
Release input[764] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(16)] -// Release input[508] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[192]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vmul.u32 Q2, Q2, r6 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 64)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(256)] -// Release input[64] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[196]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vmul.u32 Q1, Q1, r6 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[200]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-480)] -// Release 
input[132] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[200]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-208)] -// Release input[200] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[204]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] 
from Q4 -vadd.s32 Q0, Q0, Q6 -// input[208]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[16]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[212]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[212]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(64)] -// Release input[16] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[20]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-160)] -// Release input[212] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[216]: Already loaded as Q2 -vqrdmulh.s32 Q0, 
Q2, r7 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(80)] -// Release input[20] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[24]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(96)] -// Release input[24] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[224]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q0, Q0, 
r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[228]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[36]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[232]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, 
r9 -vstrw.u32 Q0, [r0,#(144)] -// Release input[36] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[236]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmulh.s32 Q1, Q3, 
r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, 
Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[448]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[448]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vmul.u32 Q1, Q1, r6 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// 
input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[452]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[452]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[388]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[456]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-464)] -// Release input[388] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[456]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vqrdmulh.s32 
Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-192)] -// Release input[456] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[460]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[396]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[464]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-432)] -// Release input[396] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[464]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[272]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, 
Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[468]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-160)] -// Release input[464] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[468]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(80)] -// Release input[272] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[276]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[472]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-144)] -// Release input[468] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[472]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(96)] -// Release input[276] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[476]: Load as Q0 -vldrw.u32 
Q0, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-128)] -// Release input[472] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[284]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[480]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[480]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(128)] -// Release input[284] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[484]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-96)] -// Release input[480] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[484]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[488]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-80)] -// Release input[484] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[488]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[492]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-64)] 
-// Release input[488] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[492]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[496]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-48)] -// Release input[492] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[496]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-32)] -// Release input[496] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[504]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[504]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[312]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[508]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(0)] -// Release input[504] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q0, Q6 
-vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r14,#(240)] -// Release input[312] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(16)] -// Release input[508] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[704]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[512]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-208)] -// Release input[704] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q1, 
Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[708]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(32)] -// Release input[512] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[516]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-192)] -// Release input[708] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[712]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(48)] -// Release input[516] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[520]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 
-vadd.s32 Q2, Q2, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(64)] -// Release input[520] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[720]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[720]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[724]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-144)] -// Release input[720] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[724]: Already loaded as Q1 
-vqrdmulh.s32 Q2, Q1, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(96)] -// Release input[528] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[728]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[724] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[728]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(112)] -// Release input[532] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[536]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[732]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-112)] -// Release input[728] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[732]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, 
[r11, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(128)] -// Release input[536] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[736]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-96)] -// Release input[732] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[736]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[544]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[740]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[736] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[740]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[612]: Load as Q4 
-vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(160)] -// Release input[544] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[548]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[744]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-64)] -// Release input[740] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[744]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(176)] -// Release input[548] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[552]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[748]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-48)] -// Release input[744] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[748]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 
Q0, [r12,#(192)] -// Release input[552] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[748] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[752]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[560]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-16)] -// Release input[752] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(224)] -// Release input[560] from Q1 -vqrdmulh.s32 Q1, Q3, r7 
-vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[564]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(240)] -// Release input[564] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[568]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(256)] -// Release input[568] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 
-vqrdmlah.s32 Q2, Q3, r9 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[896]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -112)] -vmul.u32 Q2, Q2, r6 -// input[832]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[964]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-448)] -// Release input[896] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release input[832] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[964]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[900]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -108)] -vmul.u32 Q1, Q1, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, 
Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-176)] -// Release input[964] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-432)] -// Release input[900] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[968]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[840]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(64)] -// Release input[772] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[972]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(336)] -// Release input[840] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[972]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[908]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -100)] -vmul.u32 Q2, Q2, r6 -// input[844]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, 
[r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[976]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-144)] -// Release input[972] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-400)] -// Release input[908] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(352)] -// Release input[844] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[976]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[912]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -96)] -vmul.u32 Q1, Q1, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[784]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[980]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-128)] -// Release input[976] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-384)] -// Release input[912] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[980]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release input[784] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[788]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 
-vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[984]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-112)] -// Release input[980] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[984]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(128)] -// Release input[788] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-96)] -// Release input[984] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 
-// input[992]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-336)] -// Release input[924] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[992]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[864]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[800]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[996]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-64)] -// Release input[992] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release input[864] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[996]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q2, Q2, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release input[800] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1000]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -8)] 
-vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-48)] -// Release input[996] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1000]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[872]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(192)] -// Release input[804] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[808]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1004]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-32)] -// Release input[1000] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(464)] -// Release input[872] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1004]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -// Release input[808] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1008]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-16)] -// Release input[1004] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(480)] -// Release input[876] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1008]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[944]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -64)] -vmul.u32 Q2, Q2, r6 -// input[880]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1012]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(0)] -// Release input[1008] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-256)] -// Release input[944] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(496)] -// Release input[880] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1012]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[948]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -60)] -vmul.u32 Q1, Q1, r6 -// input[884]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(240)] -// Release input[816] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[820]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1016]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(16)] -// Release 
input[1012] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-240)] -// Release input[948] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-496)] -// Release input[884] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1016]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[888]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -// Release input[820] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[824]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1020]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(32)] -// Release input[1016] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-480)] -// Release input[888] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1020]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[956]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -52)] -vmul.u32 Q2, Q2, r6 -// input[892]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -116)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release input[824] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(48)] -// Release input[1020] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r10,#(-208)] -// Release input[956] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-464)] -// Release input[892] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[48]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q1, Q1, r6 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q0, Q0, r6 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from 
Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q1, Q1, r6 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, 
[r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[112]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q0, Q0, r6 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q2, Q2, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[68]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q1 
-vqrdmulh.s32 Q2, Q1, r7 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q1, Q1, r6 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r0,#(272)] -// Release input[68] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[176]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q2, Q2, r6
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[140]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q1, Q1, r6
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-448)]
-// Release input[140] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vmul.u32 Q0, Q0, r6
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[260]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(32)]
-// Release input[260] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[372]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(480)]
-// Release input[372] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(288)]
-// Release input[324] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[380]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(304)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[332]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(320)]
-// Release input[332] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[384]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[436]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-480)]
-// Release input[384] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[388]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-464)]
-// Release input[388] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[444]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[444]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[396]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-240)]
-// Release input[444] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[500]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[508]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[508]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[560]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[512]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[564]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[564]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(32)]
-// Release input[512] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(240)]
-// Release input[564] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[568]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[572]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[572]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q2, Q2, r6
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(64)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[524]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 20)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[624]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(272)]
-// Release input[572] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[624]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(80)]
-// Release input[524] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[576]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[628]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(480)]
-// Release input[624] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[628]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[580]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[632]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(496)]
-// Release input[628] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[632]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(304)]
-// Release input[580] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[584]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[636]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-496)]
-// Release input[632] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[636]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q1, Q1, r6
-// input[604]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(320)]
-// Release input[584] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[588]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 84)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-480)]
-// Release input[636] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(400)]
-// Release input[604] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[688]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(336)]
-// Release input[588] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[640]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[692]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-464)]
-// Release input[640] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[644]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[696]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-256)]
-// Release input[692] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[696]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-448)]
-// Release input[644] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[648]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[700]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-240)]
-// Release input[696] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[700]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q0, Q0, r6
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-432)]
-// Release input[648] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[652]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -104)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[752]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-224)]
-// Release input[700] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[752]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r11,#(-416)]
-// Release input[652] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[756]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release input[752] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[756]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r11,#(-208)]
-// Release input[704] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[708]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[760]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(0)]
-// Release input[756] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[760]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-192)]
-// Release input[708] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[764]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 
Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(16)] -// Release input[760] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmul.u32 Q2, Q2, r6 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(32)] -// Release input[764] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[816]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[800]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 44)] -vmul.u32 Q1, Q1, r6 -// input[784]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[768]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[820]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 
Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(176)] -// Release input[800] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(112)] -// Release input[784] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[820]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[804]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmul.u32 Q0, Q0, r6 -// input[788]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(48)] -// Release input[768] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[772]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[824]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(256)] -// Release input[820] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(192)] -// Release input[804] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(128)] -// Release input[788] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[824]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vmul.u32 Q2, Q2, r6 -// input[792]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 36)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(64)] -// Release input[772] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[776]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[828]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(272)] -// Release 
input[824] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(144)] -// Release input[792] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[828]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[812]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vmul.u32 Q1, Q1, r6 -// input[796]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 40)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(80)] -// Release input[776] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[780]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(288)] -// Release input[828] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release input[812] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(160)] -// Release input[796] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[880]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[864]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(96)] -// Release input[780] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[832]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release 
input[880] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(432)] -// Release input[864] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[884]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[868]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(304)] -// Release input[832] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[836]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[888]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[884] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(448)] -// Release input[868] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[888]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[872]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(320)] -// Release input[836] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[840]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[892]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-480)] -// Release input[888] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, 
[r11,#(464)] -// Release input[872] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[892]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[876]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vmul.u32 Q0, Q0, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release input[840] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[844]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 88)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-464)] -// Release input[892] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release input[876] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[944]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q2, Q2, r6 -// input[912]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(352)] -// Release input[844] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[896]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[948]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 
Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-384)] -// Release input[912] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[948]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-448)] -// Release input[896] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[900]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[952]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-240)] -// Release input[948] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[952]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q0, Q0, r6 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-432)] -// Release input[900] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[904]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-224)] -// Release input[952] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 
Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-352)] -// Release input[920] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q2, Q2, r6 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-416)] -// Release input[904] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[908]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -100)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1008]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[992]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -16)] -vmul.u32 Q1, Q1, r6 -// input[976]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-400)] -// Release input[908] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1012]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-64)] -// Release input[992] from Q3 
-vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-128)] -// Release input[976] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1012]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[996]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -12)] -vmul.u32 Q0, Q0, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[964]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[1016]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(16)] -// Release input[1012] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-48)] -// Release input[996] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1016]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q2, Q2, r6 -// input[984]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-176)] -// Release input[964] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1020]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(32)] -// Release input[1016] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-96)] -// Release 
input[984] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1020]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[1004]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -4)] -vmul.u32 Q1, Q1, r6 -// input[988]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[972]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -36)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(48)] -// Release input[1020] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-16)] -// Release input[1004] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-80)] -// Release input[988] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[12]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-144)] -// Release input[972] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q0, [r1,#(96)] -vqrdmulh.s32 Q7, Q0, r3 -vadd.s32 Q3, Q3, Q5 -vmul.u32 Q0, Q0, r2 -vstrw.u32 Q3, [r1,#(64)] -vqrdmlah.s32 Q7, Q0, r9 -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, 
[r1,#(112)] -vqrdmulh.s32 Q7, Q3, r3 -vsub.s32 Q4, Q1, Q6 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q4, [r1,#(32)] -vqrdmlah.s32 Q7, Q3, r9 -vstrw.u32 Q7, [r1,#(80)] -// Release input[8] from Q3 -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q6 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(0)]! -vqrdmlah.s32 Q7, Q4, r9 -vneg.s32 Q7, Q7 -// Release input[4] from Q4 -vqrdmulh.s32 Q0, Q1, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q1, Q1, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release input[0] from Q1 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q2, Q2, r6 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q2, Q3, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q2, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q3, r5 -vsub.s32 Q2, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q2, [r1,#(224)] -vqrdmulh.s32 Q7, Q2, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q2, r9 -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q3, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q3, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q3, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[24] from Q1 -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q3, r9 -vneg.s32 Q7, Q7 -// Release input[20] from Q3 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[16] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[44]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[44] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[40] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[32] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[60] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[48] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q4, Q4, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[64] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vmul.u32 Q3, Q3, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[92] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[88] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[84] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[80] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[108]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 104)]
-vmul.u32 Q4, Q4, r6
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[108] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[104] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[100] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[96] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[124]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vmul.u32 Q3, Q3, r6
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[124] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[116] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[112] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[140]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vmul.u32 Q4, Q4, r6
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[140] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[132] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[128] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[156]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -100)]
-vmul.u32 Q3, Q3, r6
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[156] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[152] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[148] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[144] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[172]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vmul.u32 Q4, Q4, r6
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[172] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[168] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[164] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[160] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[188]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-vmul.u32 Q3, Q3, r6
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[188] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[184] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[180] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[176] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[204]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vmul.u32 Q4, Q4, r6
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[204] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[196] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[192] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[220]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vmul.u32 Q3, Q3, r6
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[220] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[212] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[208] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[236]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vmul.u32 Q4, Q4, r6
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[236] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[232] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[228] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[224] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[252]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vmul.u32 Q3, Q3, r6
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[252] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[248] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[244] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[240] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[268]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vmul.u32 Q4, Q4, r6
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[268] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[260] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[256] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[284]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[280]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 28)]
-vmul.u32 Q3, Q3, r6
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[272]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[284] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[280] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[276] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[272] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[300]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vmul.u32 Q4, Q4, r6
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[316]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[300] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[296] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[292] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[288] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[316]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q3, Q3, r6
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[316] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[312] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[308] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[304] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[332]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vmul.u32 Q4, Q4, r6
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[332] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[324] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[320] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[348]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[344]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 92)]
-vmul.u32 Q3, Q3, r6
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[336]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[348] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[344] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[340] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[336] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[364]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[360]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 108)]
-vmul.u32 Q4, Q4, r6
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[364] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[360] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[356] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[352] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[380]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vmul.u32 Q3, Q3, r6
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[380] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[376] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[372] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[368] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[396]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[392]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -112)]
-vmul.u32 Q4, Q4, r6
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -116)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[396] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[388] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[384] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[412]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[408]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -96)]
-vmul.u32 Q3, Q3, r6
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[428]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[412] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[408] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[404] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[400] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[428]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[424]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -80)]
-vmul.u32 Q4, Q4, r6
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[416]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[444]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[428] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[424] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[420] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[416] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[444]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[444] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[440] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[436] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[432] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[460]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vmul.u32 Q4, Q4, r6
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[476]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[460] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[456] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[452] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[448] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[476]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[472]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -32)]
-vmul.u32 Q3, Q3, r6
-// input[468]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[476] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[472] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[468] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[464] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[492]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vmul.u32 Q4, Q4, r6
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[508]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[492] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[488] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[484] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[480] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[508]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vmul.u32 Q3, Q3, r6
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[524]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[508] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[504] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[500] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[496] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[524]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[520]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)] -vmul.u32 Q4, Q4, r6 -// input[516]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[512]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[524] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[520] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[516] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[512] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[540]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[536]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 32)] -vmul.u32 Q3, Q3, r6 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[540] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[536] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[532] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[528] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[556]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 48)] -vmul.u32 Q4, Q4, r6 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[572]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[556] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[552] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[548] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[544] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[572]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[568]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 64)] -vmul.u32 Q3, Q3, r6 -// input[564]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 60)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[560]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[572] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[568] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[564] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[560] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[588]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[584]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 80)] -vmul.u32 Q4, Q4, r6 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[588] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[584] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[580] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[576] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[604]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[600]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 96)] -vmul.u32 Q3, Q3, r6 -// input[596]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[592]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[604] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[600] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[596] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[592] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[620]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[616]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 112)] -vmul.u32 Q4, Q4, r6 -// input[612]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[608]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[620] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[616] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[612] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[608] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[636]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[632]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q3, Q3, r6 -// input[628]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[624]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[636] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[632] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[628] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[624] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[652]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[648]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vmul.u32 Q4, Q4, r6 -// input[644]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[652] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[648] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[644] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[640] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[668]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[664]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q3, Q3, r6 -// input[660]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[656]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[668] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[664] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[660] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[656] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[684]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[680]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -76)] -vmul.u32 Q4, Q4, r6 -// input[676]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[672]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[684] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[680] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[676] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[672] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[700]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[696]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -60)] -vmul.u32 Q3, Q3, r6 -// input[692]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[716]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[700] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[696] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[692] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[688] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[716]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vmul.u32 Q4, Q4, r6 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[716] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[712] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[708] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[704] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[732]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q3, Q3, r6 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[720]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[732] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[728] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[724] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[720] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[748]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -12)] -vmul.u32 Q4, Q4, r6 -// input[740]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[764]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[748] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[744] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[740] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[736] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[764]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vmul.u32 Q3, Q3, r6 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[780]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[764] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[760] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[756] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[752] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[780]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vmul.u32 Q4, Q4, r6 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[796]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[780] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[776] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[772] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[768] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[796]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[792]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[788]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 32)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[784]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[812]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[796] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[792] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[788] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[784] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[808]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vmul.u32 Q4, Q4, r6 -// input[804]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[800]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[828]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[812] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[808] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[804] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[800] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[824]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q3, Q3, r6 -// input[820]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[844]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[828] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[824] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[820] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[816] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[840]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 84)] -vmul.u32 Q4, Q4, r6 -// input[836]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[832]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[860]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[844] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[840] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[836] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[832] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[856]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vmul.u32 Q3, Q3, r6 -// input[852]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[848]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 92)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[860] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[856] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[852] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[848] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[872]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 116)] -vmul.u32 Q4, Q4, r6 -// input[868]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 112)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[864]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 108)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[892]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[876] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[872] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[868] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[864] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[892]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[888]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -120)] -vmul.u32 Q3, Q3, r6 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[908]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[892] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[888] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[884] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[880] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[904]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -104)] -vmul.u32 Q4, Q4, r6 -// input[900]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -108)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[896]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[908] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[904] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[900] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[896] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[920]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -88)] -vmul.u32 Q3, Q3, r6 -// input[916]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -92)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[912]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -96)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[924] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[920] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[916] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[912] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[936]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -72)] -vmul.u32 Q4, Q4, r6 -// input[932]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[928]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -80)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[956]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[940] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[936] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[932] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[928] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[952]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -56)] -vmul.u32 Q3, Q3, r6 -// input[948]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -60)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[944]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -64)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[972]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[956] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[952] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[948] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[944] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[968]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -40)] -vmul.u32 Q4, Q4, r6 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[960]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[988]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[972] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[968] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[964] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[960] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[984]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -24)] -vmul.u32 Q3, Q3, r6 -// input[980]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -28)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[976]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -32)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[1004]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[988] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[984] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[980] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[976] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[1000]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -8)] -vmul.u32 Q4, Q4, r6 -// input[996]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -12)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[992]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -16)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[1020]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[1004] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[1000] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[996] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[992] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1020]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[1016]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 8)] -vmul.u32 Q3, Q3, r6 -// input[1012]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 4)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[1008]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 0)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -vqrdmulh.s32 Q4, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q6, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q6, Q3, r9 -// Release input[1020] from Q3 -vqrdmlah.s32 Q4, Q2, r9 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r1,#(240)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q6, Q1, r9 -vstrw.u32 Q6, [r1,#(208)] -// Release input[1016] from Q1 -vqrdmulh.s32 Q6, Q2, r5 -vadd.s32 Q0, Q0, Q4 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q6, Q6 -// Release input[1012] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q6, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[1008] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -.equ modulus_inv, 4223674367 -movw r7, #:lower16:modulus_inv -movt r7, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 11426 -// Instruction count: 9115 \ No newline at end of file diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double_rev4.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double_rev4.s deleted file mode 100644 index b3ebeaa..0000000 --- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_double_rev4.s +++ /dev/null @@ -1,11463 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 
= 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 
2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 
* 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 
// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 
* 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 
260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 
286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 
286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // 
zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31 -.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31 -.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31 -.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31 
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31 -.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31 -.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31 -.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31 -.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 
32576304 * 2^31 -.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31 -.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31 -.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31 -.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31 -.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^472 * f(q^(-1) mod 
2^32) * 2^31 = 286215^472 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31 -.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31 -.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31 -.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31 -.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31 -.word 7271765 // 
zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31 -.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31 = 17791697 * 2^31 -.word 3285804833 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31 -.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31 = 29333180 * 2^31 -.word 1876750725 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 286215^260 * 71292929 * 2^31 -.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31 -.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31 = 16027071 * 2^31 -.word 3172903227 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31 -.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31 = 27246749 * 2^31 -.word 3890743531 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 286215^388 * 71292929 * 2^31 -.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31 -.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31 = 19153009 * 2^31 -.word 3372902219 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31 -.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31 = 14378180 * 2^31 -.word 919922753 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 286215^324 * 71292929 * 2^31 -.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31 -.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31 = 23328838 * 2^31 -.word 1492590085 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31 -.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31 = 26950707 * 2^31 -.word 3871802623 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 286215^452 * 71292929 * 2^31 -.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31 -.word 
1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31 -.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31 = 31812506 * 2^31 -.word 2035379173 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31 -.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31 = 17437883 * 2^31 -.word 3263167647 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 286215^292 * 71292929 * 2^31 -.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31 -.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31 = 8357758 * 2^31 -.word 534733307 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31 -.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31 = 22422281 * 2^31 -.word 3582071787 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 286215^420 * 71292929 * 2^31 -.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31 -.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31 = 29650081 * 2^31 -.word 4044509849 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31 -.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31 = 9686916 * 2^31 -.word 619773465 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 286215^356 * 71292929 * 2^31 -.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31 -.word 50433925 // zeta^228 * 2^31 = 286215^228 * 2^31 = 18399952 * 2^31 -.word 1177237627 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31 -.word 12737503 // zeta^484 * 2^31 = 286215^484 * 2^31 = 27755269 * 2^31 -.word 3923278881 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 286215^484 * 71292929 * 2^31 -.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 
71292929 * 2^31 -.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31 = 22563934 * 2^31 -.word 1443651165 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31 -.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31 = 2438403 * 2^31 -.word 2303493823 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 286215^276 * 71292929 * 2^31 -.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31 -.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31 = 31481843 * 2^31 -.word 4161706847 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31 -.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31 = 32076751 * 2^31 -.word 4199769341 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 286215^404 * 71292929 * 2^31 -.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31 -.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31 = 18223844 * 2^31 -.word 1165970155 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31 -.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31 = 3973412 * 2^31 -.word 254220777 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 286215^340 * 71292929 * 2^31 -.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31 -.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31 = 7405458 * 2^31 -.word 473804703 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31 -.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31 = 33156191 * 2^31 -.word 4268832423 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 286215^468 * 71292929 * 2^31 -.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31 -.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 
* 2^31 = 22859934 * 2^31 -.word 1462589385 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31 -.word 13659269 // zeta^308 * 2^31 = 286215^308 * 2^31 = 23834070 * 2^31 -.word 1524915067 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 286215^308 * 71292929 * 2^31 -.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31 -.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31 = 25149579 * 2^31 -.word 3756565603 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31 -.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31 = 13976724 * 2^31 -.word 894237409 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 286215^436 * 71292929 * 2^31 -.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31 -.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31 = 15349951 * 2^31 -.word 3129580769 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31 -.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31 = 6932474 * 2^31 -.word 443542963 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 286215^372 * 71292929 * 2^31 -.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31 -.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31 -.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31 = 12503729 * 2^31 -.word 2947478141 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31 -.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31 = 10586616 * 2^31 -.word 677336697 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 286215^500 * 71292929 * 2^31 -.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31 -.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31 = 15322485 * 2^31 -.word 3127823481 // zeta^ 12 * 
f(q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31 -.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31 = 6173403 * 2^31 -.word 2542460889 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 286215^268 * 71292929 * 2^31 -.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31 -.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31 = 14374018 * 2^31 -.word 919656467 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31 -.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31 = 9325363 * 2^31 -.word 2744124781 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 286215^396 * 71292929 * 2^31 -.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31 -.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31 = 5605608 * 2^31 -.word 358649449 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31 -.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31 = 25200773 * 2^31 -.word 3759841019 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 286215^332 * 71292929 * 2^31 -.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31 -.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31 = 31727447 * 2^31 -.word 4177420707 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31 -.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31 = 6658688 * 2^31 -.word 426026005 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 286215^460 * 71292929 * 2^31 -.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31 -.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31 = 33297705 * 2^31 -.word 4277886557 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31 -.word 
7758757 // zeta^300 * 2^31 = 286215^300 * 2^31 = 486950 * 2^31 -.word 31155291 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 286215^300 * 71292929 * 2^31 -.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31 -.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31 = 13215161 * 2^31 -.word 2992995895 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31 -.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31 = 16752026 * 2^31 -.word 1071802543 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 286215^428 * 71292929 * 2^31 -.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31 -.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31 = 14102887 * 2^31 -.word 3049793025 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31 -.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31 = 32232983 * 2^31 -.word 4209765139 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 286215^364 * 71292929 * 2^31 -.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31 -.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31 = 16009575 * 2^31 -.word 3171783825 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31 -.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31 = 5365218 * 2^31 -.word 343269183 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 286215^492 * 71292929 * 2^31 -.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31 -.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31 = 24042369 * 2^31 -.word 3685725783 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31 -.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 = 27221548 * 
2^31 -.word 1741647511 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 286215^284 * 71292929 * 2^31 -.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31 -.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31 = 7233695 * 2^31 -.word 2610298873 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 286215^156 * 71292929 * 2^31 -.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 = 15385892 * 2^31 -.word 984396643 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 286215^412 * 71292929 * 2^31 -.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31 -.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31 = 15700554 * 2^31 -.word 1004528867 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31 -.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 = 17178032 * 2^31 -.word 1099058609 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 286215^348 * 71292929 * 2^31 -.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31 -.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31 = 20482112 * 2^31 -.word 1310455209 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31 -.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 = 31908284 * 2^31 -.word 2041507095 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 286215^476 * 71292929 * 2^31 -.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31 -.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 = 4869100 * 2^31 -.word 311527319 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31 -.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 = 29810009 * 2^31 -.word 4054742117 // zeta^316 * f(q^(-1) mod 2^32) * 
2^31 = 286215^316 * 71292929 * 2^31 -.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31 -.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 = 11135584 * 2^31 -.word 712459929 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31 -.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 = 18505659 * 2^31 -.word 3331484459 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31 -.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31 -.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 = 17484839 * 2^31 -.word 3266171913 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31 -.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 = 20168277 * 2^31 -.word 3437859545 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31 -.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31 -.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 = 31283961 * 2^31 -.word 4149046263 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31 -.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 = 17056436 * 2^31 -.word 1091278839 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31 -.text -rev4: .byte 3*4 - .byte 2*4 - .byte 1*4 - .byte 0*4 -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_1024_u32_33564673_286215_incomplete_double_rev4, %function -.global ntt_1024_u32_33564673_286215_incomplete_double_rev4 -ntt_1024_u32_33564673_286215_incomplete_double_rev4: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -adr r4, rev4 -vldrb.u32 Q0, [r4] -vadd.u32 Q0, Q0, r0 -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker 
for r0 + 2016 -add r12, r14, #1008 -// Use r11 as marker for r0 + 3024 -add r11, r12, #1008 -// Use r10 as marker for r0 + 4032 -add r10, r11, #1008 -.equ modulus, 33564673 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[768]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q2, Q1, r7 -// input[512]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 8)] -vmul.u32 Q1, Q1, r6 -// input[256]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 4)] -vqrdmlah.s32 Q2, Q1, r9 -vqrdmulh.s32 Q5, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q3, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q3, Q2, Q5 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q1, r9 -// input[772]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 16)] -vqrdmulh.s32 Q7, Q4, r5 -vsub.s32 Q1, Q3, Q6 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q6 -vstrw.u32 Q1, [r11,#(48)] -// Release input[768] from Q1 -vqrdmlah.s32 Q7, Q4, r9 -vstrw.u32 Q3, [r12,#(32)] -// Release input[512] from Q3 -vsub.s32 Q4, Q2, Q7 -vstrw.u32 Q4, [r14,#(16)] -// Release input[256] from Q4 -vadd.s32 Q2, Q2, Q7 -// input[772]: Already loaded as Q5 -vqrdmulh.s32 Q1, Q5, r7 -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vmul.u32 Q5, Q5, r6 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q1, Q5, r9 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q5, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q6, Q5, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q5, Q5, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q5, r9 -// input[776]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q7, Q4, r5 -vsub.s32 Q5, Q3, Q6 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q6 -vstrw.u32 Q5, [r11,#(64)] -// Release input[772] from 
Q5 -vqrdmlah.s32 Q7, Q4, r9 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -vsub.s32 Q4, Q1, Q7 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q1, Q1, Q7 -// input[776]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[520]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 16)] -vmul.u32 Q2, Q2, r6 -// input[264]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 12)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[780]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 24)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(80)] -// Release input[776] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(64)] -// Release input[520] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(48)] -// Release input[264] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[780]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[524]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmul.u32 Q1, Q1, r6 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[784]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(96)] -// Release input[780] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(80)] -// Release input[524] from Q4 -vsub.s32 
Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(64)]
-// Release input[268] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[784]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vmul.u32 Q3, Q3, r6
-// input[272]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[788]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release input[784] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(80)]
-// Release input[272] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[788]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vmul.u32 Q2, Q2, r6
-// input[276]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 24)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[792]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(128)]
-// Release input[788] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(96)]
-// Release input[276] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[792]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vmul.u32 Q1, Q1, r6
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[796]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(144)]
-// Release input[792] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(112)]
-// Release input[280] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[796]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[800]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(160)]
-// Release input[796] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(128)]
-// Release input[284] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[800]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[544]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[288]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 36)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[804]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(176)]
-// Release input[800] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(144)]
-// Release input[288] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[804]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[808]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(192)]
-// Release input[804] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(160)]
-// Release input[292] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[808]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vmul.u32 Q3, Q3, r6
-// input[296]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[812]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 56)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(208)]
-// Release input[808] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(176)]
-// Release input[296] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[812]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[556]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 52)]
-vmul.u32 Q2, Q2, r6
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(160)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[816]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(224)]
-// Release input[812] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(208)]
-// Release input[556] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(192)]
-// Release input[300] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[816]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[560]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 56)]
-vmul.u32 Q1, Q1, r6
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[820]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 64)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(240)]
-// Release input[816] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(224)]
-// Release input[560] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(208)]
-// Release input[304] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[820]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vmul.u32 Q3, Q3, r6
-// input[308]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[824]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(256)]
-// Release input[820] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(240)]
-// Release input[564] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(224)]
-// Release input[308] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[824]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[568]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 64)]
-vmul.u32 Q2, Q2, r6
-// input[312]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 60)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[828]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(272)]
-// Release input[824] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(256)]
-// Release input[568] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(240)]
-// Release input[312] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[828]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[572]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 68)]
-vmul.u32 Q1, Q1, r6
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[832]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 76)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(288)]
-// Release input[828] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(272)]
-// Release input[572] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(256)]
-// Release input[316] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[832]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[576]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 72)]
-vmul.u32 Q3, Q3, r6
-// input[320]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[836]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(304)]
-// Release input[832] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(288)]
-// Release input[576] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(272)]
-// Release input[320] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[836]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[580]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 76)]
-vmul.u32 Q2, Q2, r6
-// input[324]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 72)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[840]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 84)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(320)]
-// Release input[836] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(304)]
-// Release input[580] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(288)]
-// Release input[324] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[840]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[584]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 80)]
-vmul.u32 Q1, Q1, r6
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[844]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(336)]
-// Release input[840] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(320)]
-// Release input[584] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(304)]
-// Release input[328] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[844]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[588]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[848]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(352)]
-// Release input[844] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(336)]
-// Release input[588] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(320)]
-// Release input[332] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[848]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vmul.u32 Q2, Q2, r6
-// input[336]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 84)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[852]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 96)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(368)]
-// Release input[848] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[852]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vmul.u32 Q1, Q1, r6
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[856]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(384)]
-// Release input[852] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(352)]
-// Release input[340] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[856]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vmul.u32 Q3, Q3, r6
-// input[344]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(336)]
-// Release input[84] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[88]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 88)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[860]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 104)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(400)]
-// Release input[856] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(368)]
-// Release input[344] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[860]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[604]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(352)]
-// Release input[88] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[864]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 108)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(416)]
-// Release input[860] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(400)]
-// Release input[604] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(384)]
-// Release input[348] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[864]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vmul.u32 Q1, Q1, r6
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[868]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 112)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(432)]
-// Release input[864] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(400)]
-// Release input[352] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[868]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vmul.u32 Q3, Q3, r6
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[100]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 100)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[872]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 116)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(448)]
-// Release input[868] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(416)]
-// Release input[356] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[872]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[616]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(400)]
-// Release input[100] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[876]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 120)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(464)]
-// Release input[872] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(432)]
-// Release input[360] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[876]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[620]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 116)]
-vmul.u32 Q1, Q1, r6
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[880]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 124)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(480)]
-// Release input[876] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(464)]
-// Release input[620] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(448)]
-// Release input[364] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[880]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[624]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 120)]
-vmul.u32 Q3, Q3, r6
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[884]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -124)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(496)]
-// Release input[880] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(480)]
-// Release input[624] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(464)]
-// Release input[368] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[884]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[628]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 124)]
-vmul.u32 Q2, Q2, r6
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[116]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[888]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -120)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-496)]
-// Release input[884] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(496)]
-// Release input[628] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(480)]
-// Release input[372] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[888]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[632]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -124)]
-vmul.u32 Q1, Q1, r6
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[892]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -116)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-480)]
-// Release input[888] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-496)]
-// Release input[632] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(496)]
-// Release input[376] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[892]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[636]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -120)]
-vmul.u32 Q3, Q3, r6
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[896]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -112)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-464)]
-// Release input[892] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-480)]
-// Release input[636] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-496)]
-// Release input[380] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[896]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[640]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmul.u32 Q2, Q2, r6
-// input[384]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -120)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[900]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -108)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-448)]
-// Release input[896] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-464)]
-// Release input[640] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-480)]
-// Release input[384] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[900]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmul.u32 Q1, Q1, r6
-// input[388]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -116)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[904]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -104)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-432)]
-// Release input[900] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-464)]
-// Release input[388] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[904]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vmul.u32 Q3, Q3, r6
-// input[392]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -112)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[908]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-416)]
-// Release input[904] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-448)]
-// Release input[392] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[908]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[652]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vmul.u32 Q2, Q2, r6
-// input[396]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -108)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[912]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -96)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-400)]
-// Release input[908] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[652] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-432)]
-// Release input[396] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[912]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vmul.u32 Q1, Q1, r6
-// input[400]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[916]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -92)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-384)]
-// Release input[912] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-416)]
-// Release input[400] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[916]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vmul.u32 Q3, Q3, r6
-// input[404]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -100)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[148]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -104)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[920]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -88)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-368)]
-// Release input[916] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-400)]
-// Release input[404] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[920]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vmul.u32 Q2, Q2, r6
-// input[408]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -96)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-416)]
-// Release input[148] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[924]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -84)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-352)]
-// Release input[920] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-384)]
-// Release input[408] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[924]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[412]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[928]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -80)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-336)]
-// Release input[924] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-368)]
-// Release input[412] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[928]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmul.u32 Q3, Q3, r6
-// input[416]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -88)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -92)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[932]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -76)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-320)]
-// Release input[928] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-352)]
-// Release input[416] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[932]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[420]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -84)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[936]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -72)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-304)]
-// Release input[932] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-336)]
-// Release input[420] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[936]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[424]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -80)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[940]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -68)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-288)]
-// Release input[936] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-320)]
-// Release input[424] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[940]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[684]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vmul.u32 Q3, Q3, r6
-// input[428]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -76)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[172]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[944]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -64)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-272)]
-// Release input[940] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-288)]
-// Release input[684] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-304)]
-// Release input[428] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[944]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vmul.u32 Q2, Q2, r6
-// input[432]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -72)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-320)]
-// Release input[172] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[948]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -60)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-256)]
-// Release input[944] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-288)]
-// Release input[432] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[948]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[692]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -64)]
-vmul.u32 Q1, Q1, r6
-// input[436]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5,
Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-240)] -// Release input[948] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-272)] -// Release input[436] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[952]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vmul.u32 Q3, Q3, r6 -// input[440]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -64)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-240)] -// Release input[696] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-256)] -// Release input[440] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vmul.u32 Q2, Q2, r6 -// input[444]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -60)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[188]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[960]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -48)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-240)] -// Release input[444] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[960]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[704]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmul.u32 Q1, Q1, r6 -// input[448]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -56)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[964]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -44)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-192)] -// Release input[960] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-208)] -// Release input[704] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-224)] -// Release input[448] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[964]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmul.u32 Q3, Q3, r6 -// input[452]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -52)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r3 
-vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[968]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -40)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-176)] -// Release input[964] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-208)] -// Release input[452] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[968]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[712]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -44)] -vmul.u32 Q2, Q2, r6 -// input[456]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -48)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[972]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -36)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-160)] -// Release input[968] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-176)] -// Release input[712] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-192)] -// Release input[456] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[972]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[716]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmul.u32 Q1, Q1, r6 -// input[460]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -44)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, 
Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[976]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -32)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-144)] -// Release input[972] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-160)] -// Release input[716] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-176)] -// Release input[460] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[976]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vmul.u32 Q3, Q3, r6 -// input[464]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -40)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[980]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -28)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-128)] -// Release input[976] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-144)] -// Release input[720] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-160)] -// Release input[464] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[980]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmul.u32 Q2, Q2, r6 -// input[468]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -36)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[984]: Load as Q1 
-vldrw.u32 Q1, [r10, #(4 * -24)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-112)] -// Release input[980] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-144)] -// Release input[468] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[984]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vmul.u32 Q1, Q1, r6 -// input[472]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[988]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -20)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-96)] -// Release input[984] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-128)] -// Release input[472] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[988]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmul.u32 Q3, Q3, r6 -// input[476]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -28)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[992]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -16)] -vqrdmulh.s32 Q7, Q5, r5 
-vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-80)] -// Release input[988] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-112)] -// Release input[476] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[992]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[736]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmul.u32 Q2, Q2, r6 -// input[480]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -24)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[996]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -12)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-64)] -// Release input[992] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-80)] -// Release input[736] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-96)] -// Release input[480] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[996]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmul.u32 Q1, Q1, r6 -// input[484]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 
-vstrw.u32 Q1, [r10,#(-48)] -// Release input[996] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-80)] -// Release input[484] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[1000]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[744]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -12)] -vmul.u32 Q3, Q3, r6 -// input[488]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -16)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[1004]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-48)] -// Release input[744] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-64)] -// Release input[488] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[1004]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vmul.u32 Q2, Q2, r6 -// input[492]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -12)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-16)] -// Release input[1004] from Q2 
-vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-48)] -// Release input[492] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[1008]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[752]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -4)] -vmul.u32 Q1, Q1, r6 -// input[496]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[1012]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-16)] -// Release input[752] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-32)] -// Release input[496] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[1012]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vmul.u32 Q3, Q3, r6 -// input[500]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -4)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[1016]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(16)] -// Release input[1012] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] 
from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-16)] -// Release input[500] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[1016]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[760]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 4)] -vmul.u32 Q2, Q2, r6 -// input[504]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 0)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[1020]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 12)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(32)] -// Release input[1016] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(16)] -// Release input[760] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(0)] -// Release input[504] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[1020]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[764]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmul.u32 Q1, Q1, r6 -// input[508]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 4)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(48)] -// Release input[1020] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(32)] -// Release input[764] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(16)] -// Release input[508] 
from Q5 -vadd.s32 Q2, Q2, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[192]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -124)] -vmul.u32 Q3, Q3, r6 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-496)] -// Release input[128] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r0,#(256)] -// Release input[64] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[196]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[68]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 68)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r0,#(272)] -// Release input[68] from Q5 -vadd.s32 
Q3, Q3, Q7 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r6 -// input[72]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r0,#(288)] -// Release input[72] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vmul.u32 Q3, Q3, r6 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r0,#(304)] -// Release input[76] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[144]: 
Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[80]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 80)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -40)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r0,#(320)] -// Release input[80] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[84]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r0,#(336)] -// Release input[84] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[216]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r6 -// input[88]: Load 
as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r0,#(352)] -// Release input[88] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[96]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, 
[r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(384)]
-// Release input[96] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[228]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmul.u32 Q3, Q3, r6
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(400)]
-// Release input[100] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(416)]
-// Release input[104] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(432)]
-// Release input[108] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[240]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vmul.u32 Q3, Q3, r6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(464)]
-// Release input[116] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(480)]
-// Release input[120] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[252]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[320]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[452]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(272)]
-// Release input[320] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[452]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(288)]
-// Release input[324] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[456]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vmul.u32 Q3, Q3, r6
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(304)]
-// Release input[328] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(320)]
-// Release input[332] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[464]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[336]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[468]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[468]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vmul.u32 Q3, Q3, r6
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[276]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 24)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-144)]
-// Release input[468] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(352)]
-// Release input[340] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[344]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 92)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(96)]
-// Release input[276] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(368)]
-// Release input[344] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(384)]
-// Release input[348] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[480]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vmul.u32 Q3, Q3, r6
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(400)]
-// Release input[352] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(416)]
-// Release input[356] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[488]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-64)]
-// Release input[488] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(432)]
-// Release input[360] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[492]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[428]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -76)]
-vmul.u32 Q3, Q3, r6
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(176)]
-// Release input[296] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[300]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-304)]
-// Release input[428] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(448)]
-// Release input[364] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(192)]
-// Release input[300] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[500]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-288)]
-// Release input[432] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(464)]
-// Release input[368] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[500]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-16)]
-// Release input[500] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(480)]
-// Release input[372] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[504]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(0)]
-// Release input[504] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-256)]
-// Release input[440] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(496)]
-// Release input[376] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[444]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -60)]
-vmul.u32 Q2, Q2, r6
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[316]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[704]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-240)]
-// Release input[444] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-496)]
-// Release input[380] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[704]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[640]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[576]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(256)]
-// Release input[316] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[708]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-208)]
-// Release input[704] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-464)]
-// Release input[640] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(288)]
-// Release input[576] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[708]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmul.u32 Q3, Q3, r6
-// input[580]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 76)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[712]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(304)]
-// Release input[580] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[712]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[584]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 80)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(48)]
-// Release input[516] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[716]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-176)]
-// Release input[712] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(320)]
-// Release input[584] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[716]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[652]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[588]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[524]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-160)]
-// Release input[716] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[652] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(336)]
-// Release input[588] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[720]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vmul.u32 Q3, Q3, r6
-// input[592]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 88)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(80)]
-// Release input[524] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[528]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 24)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(352)]
-// Release input[592] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[724]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[596]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 92)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(96)]
-// Release input[528] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[532]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[728]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-128)]
-// Release input[724] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(368)]
-// Release input[596] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[728]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[600]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(112)]
-// Release input[532] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[536]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 32)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[732]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-112)]
-// Release input[728] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(384)]
-// Release input[600] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[732]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vmul.u32 Q3, Q3, r6
-// input[604]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 100)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(128)]
-// Release input[536] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[540]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 36)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[736]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-96)]
-// Release input[732] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(400)]
-// Release input[604] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[736]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[608]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 104)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(144)]
-// Release input[540] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[740]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-80)]
-// Release input[736] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(416)]
-// Release input[608] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[740]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[612]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[548]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 44)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-64)]
-// Release input[740] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(432)]
-// Release input[612] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[744]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vmul.u32 Q3, Q3, r6
-// input[616]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 112)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(176)]
-// Release input[548] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[552]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 48)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[748]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(448)]
-// Release input[616] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[748]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[684]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[620]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 116)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(192)]
-// Release input[552] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[752]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-32)]
-// Release input[748] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-288)]
-// Release input[684] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(464)]
-// Release input[620] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[752]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[624]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[756]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-16)]
-// Release input[752] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(480)]
-// Release input[624] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[756]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[692]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[628]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 124)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[564]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[760]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(0)]
-// Release input[756] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-256)]
-// Release input[692] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(496)]
-// Release input[628] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[760]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vmul.u32 Q2, Q2, r6
-// input[632]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -124)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(240)]
-// Release input[564] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[764]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(16)]
-// Release input[760] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(-496)]
-// Release input[632] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[764]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[700]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -56)]
-vmul.u32 Q1, Q1, r6
-// input[636]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[572]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[960]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(32)]
-// Release input[764] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-224)]
-// Release input[700] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(-480)]
-// Release input[636] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[960]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[896]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -112)]
-vmul.u32 Q3, Q3, r6
-// input[832]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 76)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(272)]
-// Release input[572] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[768]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[964]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-192)]
-// Release input[960] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-448)]
-// Release input[896] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(304)]
-// Release input[832] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[964]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[900]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[836]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 80)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release input[768] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[772]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[968]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-176)]
-// Release input[964] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-432)]
-// Release input[900] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(320)]
-// Release input[836] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[968]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[904]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[840]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(64)]
-// Release input[772] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[776]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[972]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-160)]
-// Release input[968] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-416)]
-// Release input[904] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(336)]
-// Release input[840] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[972]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[908]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -100)]
-vmul.u32 Q3, Q3, r6
-// input[844]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 88)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(80)]
-// Release input[776] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[780]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 24)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[976]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -32)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-144)] -// Release input[972] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-400)] -// Release input[908] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(352)] -// Release input[844] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[976]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[912]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[848]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 92)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(96)] -// Release input[780] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[784]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[980]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -28)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-128)] -// Release input[976] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-384)] -// Release input[912] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(368)] -// Release input[848] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[980]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[852]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(112)] -// Release input[784] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[788]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 32)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[984]: Load as 
Q3 -vldrw.u32 Q3, [r10, #(4 * -24)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-112)] -// Release input[980] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(384)] -// Release input[852] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[984]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vmul.u32 Q3, Q3, r6 -// input[856]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(128)] -// Release input[788] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[792]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 36)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-96)] -// Release input[984] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-352)] -// Release input[920] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(400)] -// Release input[856] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[860]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(144)] -// Release input[792] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[796]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[992]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -16)] -vqrdmulh.s32 Q7, Q5, r5 
-vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(416)] -// Release input[860] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[992]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[928]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -80)] -vmul.u32 Q1, Q1, r6 -// input[864]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(160)] -// Release input[796] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[996]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -12)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-64)] -// Release input[992] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-320)] -// Release input[928] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(432)] -// Release input[864] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[996]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vmul.u32 Q3, Q3, r6 -// input[868]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[804]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 48)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[1000]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 
-vstrw.u32 Q3, [r10,#(-48)] -// Release input[996] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(448)] -// Release input[868] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[1000]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[936]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[872]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(192)] -// Release input[804] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[1004]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-32)] -// Release input[1000] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-288)] -// Release input[936] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(464)] -// Release input[872] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[1004]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vmul.u32 Q1, Q1, r6 -// input[876]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[812]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[1008]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 0)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-16)] -// Release input[1004] from Q1 
-vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-272)] -// Release input[940] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(480)] -// Release input[876] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[1008]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[944]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -64)] -vmul.u32 Q3, Q3, r6 -// input[880]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(224)] -// Release input[812] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[1012]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(0)] -// Release input[1008] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-256)] -// Release input[944] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(496)] -// Release input[880] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[1012]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[884]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[820]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 64)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[1016]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(16)] -// Release input[1012] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-240)] -// Release 
input[948] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r10,#(-496)] -// Release input[884] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[1016]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[952]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[888]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(256)] -// Release input[820] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[824]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[1020]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 12)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(32)] -// Release input[1016] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-224)] -// Release input[952] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r10,#(-480)] -// Release input[888] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[1020]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[956]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -52)] -vmul.u32 Q3, Q3, r6 -// input[892]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -116)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release input[824] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[828]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -vqrdmulh.s32 Q2, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(48)] -// Release input[1020] from Q3 -vqrdmlah.s32 Q2, Q5, r9 -vstrw.u32 Q4, [r10,#(-208)] -// Release input[956] from Q4 -vsub.s32 Q5, Q1, Q2 -vstrw.u32 Q5, [r10,#(-464)] -// Release input[892] from Q5 -vadd.s32 Q1, Q1, Q2 
-ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[48]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 48)] -vqrdmulh.s32 Q3, Q2, r7 -// input[32]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 32)] -vmul.u32 Q2, Q2, r6 -// input[16]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 16)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release input[828] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[0]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 0)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[52]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r0,#(64)] -// Release input[16] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[20]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(0)] -// Release input[0] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[4]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[56]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r0,#(80)] -// Release input[20] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[56]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[40]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[24]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[60]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r0,#(96)] -// Release input[24] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[44]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[28]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[12]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r0,#(112)] -// Release input[28] from Q5 -vadd.s32 Q3, Q3, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[96]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q1, Q1, r6 -// input[80]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[64]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r0,#(320)] -// Release input[80] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[100]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[84]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[68]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[120]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r0,#(336)] -// Release input[84] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[104]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[88]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[72]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[124]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r0,#(352)] -// Release input[88] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[108]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[92]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[76]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vadd.s32 Q2, Q2, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[176]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q3, Q3, r6 -// input[144]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[128]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[164]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[148]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[132]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[184]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[168]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[152]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[188]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[156]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[140]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vadd.s32 Q1, Q1, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[224]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[208]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[192]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[244]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(-176)] -// Release input[208] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[212]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[248]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(-160)]
-// Release input[212] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[248]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[232]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[216]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[200]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[252]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(-144)]
-// Release input[216] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[236]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[220]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[204]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[304]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-128)]
-// Release input[220] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[304]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[272]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-192)]
-// Release input[204] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[256]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(208)]
-// Release input[304] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(80)]
-// Release input[272] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[308]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[292]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[276]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[260]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[312]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(96)]
-// Release input[276] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[312]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[296]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[280]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[264]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[316]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(240)]
-// Release input[312] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(112)]
-// Release input[280] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[300]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[284]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[368]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(128)]
-// Release input[284] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[368]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[336]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[320]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[372]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[356]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[340]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[324]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[376]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(352)]
-// Release input[340] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[344]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[328]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[380]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(368)]
-// Release input[344] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[380]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[364]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[348]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[332]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[432]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(384)]
-// Release input[348] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[432]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[416]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[400]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(320)]
-// Release input[332] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[384]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[436]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-416)]
-// Release input[400] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[436]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[420]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[404]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[388]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[440]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-272)]
-// Release input[436] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-400)]
-// Release input[404] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[440]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[424]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[408]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(-464)]
-// Release input[388] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[392]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[444]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-384)]
-// Release input[408] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[444]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[428]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[412]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(-448)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[396]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[496]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-240)]
-// Release input[444] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-304)]
-// Release input[428] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-368)]
-// Release input[412] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[496]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[480]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[464]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[500]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-32)]
-// Release input[496] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-160)]
-// Release input[464] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[500]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[484]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[468]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[452]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[504]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-16)]
-// Release input[500] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-144)]
-// Release input[468] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[504]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[488]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[472]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[456]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[508]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(0)]
-// Release input[504] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-128)]
-// Release input[472] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[508]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[492]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[476]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[560]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(16)]
-// Release input[508] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-48)]
-// Release input[492] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-112)]
-// Release input[476] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[560]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[544]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[528]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[512]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[564]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(96)]
-// Release input[528] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[564]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[548]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[532]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(32)]
-// Release input[512] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[516]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[568]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(240)]
-// Release input[564] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(112)]
-// Release input[532] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[568]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[536]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(48)]
-// Release input[516] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[520]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[572]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(256)]
-// Release input[568] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(128)]
-// Release input[536] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[572]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[556]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[540]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(64)]
-// Release input[520] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[524]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[624]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(272)]
-// Release input[572] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(208)]
-// Release input[556] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(144)]
-// Release input[540] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[624]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[608]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[592]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(80)]
-// Release input[524] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[576]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[628]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(480)]
-// Release input[624] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(352)]
-// Release input[592] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[628]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[612]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[596]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(288)]
-// Release input[576] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[580]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[632]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(496)]
-// Release input[628] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(368)]
-// Release input[596] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[632]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[616]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[600]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(304)]
-// Release input[580] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[584]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[636]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(384)]
-// Release input[600] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[636]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[620]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[604]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(320)]
-// Release input[584] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[588]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[688]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-480)]
-// Release input[636] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(464)]
-// Release input[620] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(400)]
-// Release input[604] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[688]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[656]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(336)]
-// Release input[588] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[640]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[692]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-272)]
-// Release input[688] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(-400)]
-// Release input[656] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[692]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[676]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[660]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[644]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[696]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-256)]
-// Release input[692] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(-384)]
-// Release input[660] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[696]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[680]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[664]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(-448)]
-// Release input[644] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[648]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[700]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-240)]
-// Release input[696] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(-368)]
-// Release input[664] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[700]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[684]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[668]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(-432)]
-// Release input[648] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[652]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[752]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-224)]
-// Release input[700] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-288)]
-// Release input[684] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(-352)]
-// Release input[668] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[752]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[736]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[720]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(-416)]
-// Release input[652] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[704]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[756]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[752] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(-144)]
-// Release input[720] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[756]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[740]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[724]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(-208)]
-// Release input[704] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[708]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[760]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(0)]
-// Release input[756] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(-128)]
-// Release input[724] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[760]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[744]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[728]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[712]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[764]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[760] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(-112)]
-// Release input[728] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[764]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[748]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[732]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(-176)]
-// Release input[712] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[716]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[816]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(32)]
-// Release input[764] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-32)]
-// Release input[748] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(-96)]
-// Release input[732] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[816]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[800]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[784]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(-160)]
-// Release input[716] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[768]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[820]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(240)]
-// Release input[816] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(176)]
-// Release input[800] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(112)]
-// Release input[784] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[820]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[804]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[788]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(48)]
-// Release input[768] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[772]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[824]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(256)]
-// Release input[820] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(192)]
-// Release input[804] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(128)]
-// Release input[788] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[824]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[808]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[792]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(64)]
-// Release input[772] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[776]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[828]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(272)]
-// Release input[824] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(208)]
-// Release input[808] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(144)]
-// Release input[792] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[828]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[812]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[796]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(80)]
-// Release input[776] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[780]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[880]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(288)]
-// Release input[828] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(224)]
-// Release input[812] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(160)]
-// Release input[796] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[880]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[864]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[848]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release input[780] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[832]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[884]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(496)]
-// Release input[880] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(432)]
-// Release input[864] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(368)]
-// Release input[848] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[884]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[868]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[852]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(304)]
-// Release input[832] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[836]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[888]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-496)]
-// Release input[884] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(448)]
-// Release input[868] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(384)]
-// Release input[852] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[888]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[872]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[856]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(320)]
-// Release input[836] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[840]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[892]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-480)]
-// Release input[888] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(464)]
-// Release input[872] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(400)]
-// Release input[856] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[892]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[876]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[860]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release input[840] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[844]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[944]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-464)] -// Release input[892] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(480)] -// Release input[876] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(416)] -// Release input[860] from Q5 -vadd.s32 Q2, Q2, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[944]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[928]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q3, Q3, r6 -// input[912]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[896]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[948]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-256)] -// Release input[944] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-320)] -// Release input[928] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r10,#(-384)] -// Release input[912] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[948]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[932]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[916]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r10,#(-448)] -// Release input[896] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[900]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[952]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-240)] -// Release input[948] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r10,#(-368)] -// Release input[916] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[952]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[936]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[920]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r10,#(-432)] -// Release input[900] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[904]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[956]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-224)] -// Release input[952] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-288)] -// Release input[936] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r10,#(-352)] -// Release input[920] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[956]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[940]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[924]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r10,#(-416)] -// Release input[904] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[1008]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-208)] -// Release input[956] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-272)] -// Release input[940] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r10,#(-336)] -// Release input[924] from Q5 -vadd.s32 Q1, Q1, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1008]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[992]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[976]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[960]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[1012]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(0)] -// Release input[1008] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-64)] -// Release input[992] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r10,#(-128)] -// Release input[976] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[1012]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[996]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[980]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r10,#(-192)] -// Release input[960] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[1016]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(16)] -// Release input[1012] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-48)] -// Release input[996] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r10,#(-112)] -// Release input[980] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[1016]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[1000]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[984]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[968]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[1020]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(32)] -// Release input[1016] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-32)] -// Release input[1000] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r10,#(-96)] -// Release input[984] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[1020]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[1004]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[988]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r10,#(-160)] -// Release input[968] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[972]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -vqrdmulh.s32 Q1, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(48)] -// Release input[1020] from Q2 -vqrdmlah.s32 Q1, Q5, r9 -vstrw.u32 Q4, [r10,#(-16)] -// Release input[1004] from Q4 -vsub.s32 Q5, Q3, Q1 -vstrw.u32 Q5, [r10,#(-80)] -// Release input[988] from Q5 -vadd.s32 Q3, Q3, Q1 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q1, Q0, r7 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q3, [r10,#(-144)] -// Release input[972] from Q3 -vqrdmulh.s32 Q3, Q2, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q2, Q2, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q3, Q2, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q2, Q1, Q3 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q3 -vqrdmlah.s32 Q5, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q0, [r1,#(96)] -vqrdmulh.s32 Q7, Q0, r3 -vadd.s32 Q2, Q2, Q5 -vmul.u32 Q0, Q0, r2 -vstrw.u32 Q2, [r1,#(64)] -vqrdmlah.s32 Q7, Q0, r9 -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(112)] -vqrdmulh.s32 Q7, Q2, r3 -vsub.s32 Q4, Q1, Q6 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q4, [r1,#(32)] -vqrdmlah.s32 Q7, Q2, r9 -vstrw.u32 Q7, [r1,#(80)] -// Release input[8] from Q2 -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q6 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(0)]! 
-vqrdmlah.s32 Q7, Q4, r9 -vneg.s32 Q7, Q7 -// Release input[4] from Q4 -vqrdmulh.s32 Q0, Q1, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q1, Q1, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release input[0] from Q1 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[28]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[28] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[24] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[20] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[16] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[44]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[44] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[40] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[32] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[60] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[48] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q4, Q4, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[64] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vmul.u32 Q3, Q3, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[92] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[88] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[84] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[80] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[108]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q4, Q4, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[108] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[104] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[100] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[96] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[124]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[124] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[120] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[116] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[112] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[140]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r6 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[140] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[136] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[132] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[128] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[156] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[152] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[148] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[144] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q4, Q4, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[172] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[168] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[164] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[160] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vmul.u32 Q3, Q3, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[180] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[176] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[204]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vmul.u32 Q4, Q4, r6
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[204] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[196] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[192] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[220]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vmul.u32 Q3, Q3, r6
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[220] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[212] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[208] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[236]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vmul.u32 Q4, Q4, r6
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[236] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[232] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[228] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[224] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[252]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vmul.u32 Q3, Q3, r6
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[252] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[248] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[244] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[240] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[268]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vmul.u32 Q4, Q4, r6
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[268] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[260] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[256] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[284]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[280]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 28)]
-vmul.u32 Q3, Q3, r6
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[272]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[284] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[280] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[276] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[272] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[300]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vmul.u32 Q4, Q4, r6
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[316]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[300] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[296] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[292] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[288] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[316]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q3, Q3, r6
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[316] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[312] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[308] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[304] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[332]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vmul.u32 Q4, Q4, r6
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[332] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[324] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[320] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[348]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[344]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 92)]
-vmul.u32 Q3, Q3, r6
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[336]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[348] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[344] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[340] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[336] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[364]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[360]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 108)]
-vmul.u32 Q4, Q4, r6
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[364] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[360] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[356] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[352] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[380]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vmul.u32 Q3, Q3, r6
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[380] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[376] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[372] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[368] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[396]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[392]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -112)]
-vmul.u32 Q4, Q4, r6
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -116)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[396] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[388] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[384] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[412]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[408]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -96)]
-vmul.u32 Q3, Q3, r6
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[428]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[412] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[408] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[404] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[400] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[428]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[424]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -80)]
-vmul.u32 Q4, Q4, r6
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[416]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[444]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[428] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[424] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[420] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[416] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[444]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[444] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[440] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[436] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[432] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[460]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vmul.u32 Q4, Q4, r6
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[476]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[460] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[456] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[452] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[448] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[476]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[472]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -32)]
-vmul.u32 Q3, Q3, r6
-// input[468]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[476] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[472] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[468] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[464] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[492]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vmul.u32 Q4, Q4, r6
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[508]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[492] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[488] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[484] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[480] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[508]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vmul.u32 Q3, Q3, r6
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[524]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[508] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[504] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[500] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[496] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[524]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vmul.u32 Q4, Q4, r6
-// input[516]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 12)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[512]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[540]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 36)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[524] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[516] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[512] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[540]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[536]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 32)]
-vmul.u32 Q3, Q3, r6
-// input[532]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 28)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[528]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[556]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[540] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[536] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[532] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[528] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[556]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[552]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 48)]
-vmul.u32 Q4, Q4, r6
-// input[548]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 44)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[544]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[572]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[556] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[552] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[548] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[544] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[572]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[568]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 64)]
-vmul.u32 Q3, Q3, r6
-// input[564]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 60)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[560]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[588]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 84)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[572] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[568] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[564] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[560] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[588]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[584]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 80)]
-vmul.u32 Q4, Q4, r6
-// input[580]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 76)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[576]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[604]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 100)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[588] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[584] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[580] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[576] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[604]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[600]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 96)]
-vmul.u32 Q3, Q3, r6
-// input[596]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 92)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[592]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 88)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[620]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 116)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[604] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[600] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[596] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[592] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[620]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[616]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 112)] -vmul.u32 Q4, Q4, r6 -// input[612]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[608]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[620] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[616] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[612] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[608] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[636]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[632]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q3, Q3, r6 -// input[628]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[624]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[636] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[632] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[628] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[624] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[652]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[648]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vmul.u32 Q4, Q4, r6 -// input[644]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[652] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[648] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[644] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[640] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[668]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[664]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q3, Q3, r6 -// input[660]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[656]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[668] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[664] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[660] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[656] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[684]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[680]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -76)] -vmul.u32 Q4, Q4, r6 -// input[676]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[672]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[684] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[680] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[676] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[672] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[700]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[696]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -60)] -vmul.u32 Q3, Q3, r6 -// input[692]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[716]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[700] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[696] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[692] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[688] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[716]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vmul.u32 Q4, Q4, r6 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[716] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[712] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[708] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[704] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[732]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q3, Q3, r6 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[720]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[732] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[728] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[724] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[720] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[748]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -12)] -vmul.u32 Q4, Q4, r6 -// input[740]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[764]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[748] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[744] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[740] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[736] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[764]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vmul.u32 Q3, Q3, r6 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[780]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[764] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[760] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[756] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[752] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[780]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[776]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 20)] -vmul.u32 Q4, Q4, r6 -// input[772]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 16)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[796]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[780] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[776] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[772] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[768] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[796]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[792]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[788]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 32)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[784]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[812]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[796] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[792] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[788] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[784] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[808]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vmul.u32 Q4, Q4, r6 -// input[804]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[800]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[828]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[812] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[808] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[804] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[800] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[824]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q3, Q3, r6 -// input[820]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[844]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[828] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[824] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[820] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[816] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[840]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 84)] -vmul.u32 Q4, Q4, r6 -// input[836]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[832]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[860]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[844] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[840] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[836] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[832] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[856]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vmul.u32 Q3, Q3, r6 -// input[852]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[848]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 92)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[860] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[856] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[852] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[848] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[872]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 116)] -vmul.u32 Q4, Q4, r6 -// input[868]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 112)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[864]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 108)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[892]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[876] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[872] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[868] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[864] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[892]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[888]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -120)] -vmul.u32 Q3, Q3, r6 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[908]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[892] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[888] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[884] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[880] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r7 -// input[904]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -104)] -vmul.u32 Q4, Q4, r6 -// input[900]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -108)] -vqrdmlah.s32 Q0, Q4, r9 -vqrdmulh.s32 Q3, Q1, r7 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r9 -// input[896]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q4, r3 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r2 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r9 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r9 -// Release input[908] from Q4 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[904] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9 -vneg.s32 Q7, Q7 -// Release input[900] from Q2 -vqrdmulh.s32 Q1, Q0, r5 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r4 -ldrd r7, r6, [r8], #+8 -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q1, [r1,#(16)] -// Release input[896] from Q0 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r7 -// input[920]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -88)] -vmul.u32 Q3, Q3, r6 -// input[916]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -92)] -vqrdmlah.s32 Q0, Q3, r9 -vqrdmulh.s32 Q4, Q1, r7 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r6 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r9 -// input[912]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -96)] -vqrdmulh.s32 Q5, Q3, r3 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r9 -// input[940]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q2, r5 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r3 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r9 -// Release input[924] from Q3 -vqrdmlah.s32 Q6, Q2, r9 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r3 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r9 -vstrw.u32 Q7, [r1,#(208)] -// Release input[920] from Q1 -vqrdmulh.s32 Q7, Q2, r5 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r4 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[916] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[912] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[940]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[936]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -72)]
-vmul.u32 Q4, Q4, r6
-// input[932]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -76)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[928]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -80)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[956]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -52)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[940] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[936] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[932] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[928] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[956]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[952]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -56)]
-vmul.u32 Q3, Q3, r6
-// input[948]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -60)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[944]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -64)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[972]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -36)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[956] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[952] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[948] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[944] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[972]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[968]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -40)]
-vmul.u32 Q4, Q4, r6
-// input[964]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -44)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[960]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -48)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[988]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -20)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[972] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[968] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[964] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[960] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[988]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[984]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -24)]
-vmul.u32 Q3, Q3, r6
-// input[980]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -28)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[976]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -32)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-// input[1004]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -4)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r9
-// Release input[988] from Q3
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[984] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[980] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[976] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1004]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r7
-// input[1000]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -8)]
-vmul.u32 Q4, Q4, r6
-// input[996]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -12)]
-vqrdmlah.s32 Q0, Q4, r9
-vqrdmulh.s32 Q3, Q1, r7
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r9
-// input[992]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -16)]
-vqrdmulh.s32 Q5, Q4, r3
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r2
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r9
-// input[1020]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * 12)]
-vqrdmulh.s32 Q6, Q2, r5
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r9
-// Release input[1004] from Q4
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r3
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r9
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[1000] from Q1
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r9
-vneg.s32 Q7, Q7
-// Release input[996] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[992] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1020]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[1016]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 8)]
-vmul.u32 Q3, Q3, r6
-// input[1012]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 4)]
-vqrdmlah.s32 Q0, Q3, r9
-vqrdmulh.s32 Q4, Q1, r7
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r9
-// input[1008]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * 0)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r9
-vqrdmulh.s32 Q4, Q2, r5
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q6, Q3, r3
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q6, Q3, r9
-// Release input[1020] from Q3
-vqrdmlah.s32 Q4, Q2, r9
-vneg.s32 Q6, Q6
-vstrw.u32 Q6, [r1,#(240)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q2, Q0, Q4
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q6, Q1, r9
-vstrw.u32 Q6, [r1,#(208)]
-// Release input[1016] from Q1
-vqrdmulh.s32 Q6, Q2, r5
-vadd.s32 Q0, Q0, Q4
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q6, Q2, r9
-vneg.s32 Q6, Q6
-// Release input[1012] from Q2
-vqrdmulh.s32 Q1, Q0, r5
-vstrw.u32 Q6, [r1,#(48)]
-vmul.u32 Q0, Q0, r4
-ldrd r7, r6, [r8], #+8
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[1008] from Q0
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-.equ modulus_inv, 4223674367
-movw r7, #:lower16:modulus_inv
-movt r7, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 11431
-// Instruction count: 9122
\ No newline at end of file
diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_rev4.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_rev4.s
deleted file mode 100644
index bf173c5..0000000
--- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_rev4.s
+++ /dev/null
@@ -1,10308 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31
-.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31
-.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31
-.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31
-.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31
-.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31
-.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31
-.word 20108763 // zeta^ 64 * 2^31
= 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 
2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 
* 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 
// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 
* 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 
260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 
286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 
286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // 
zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31
-.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31
-.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31
-.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31
-.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31
-.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31
-.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31
-.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31
-.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31
-.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31
-.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31
-.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31
-.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31
-.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31
-.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31
-.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31
-.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31
-.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31
-.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31
-.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31
-.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31
-.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31
-.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31
-.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31
-.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31
-.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31
-.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31
-.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31
-.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31
-.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31
-.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31
-.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31
-.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31
-.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31
-.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31
-.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31
-.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31
-.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31
-.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31
-.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31
-.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31
-.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31
-.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31
-.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31
-.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31
-.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31
-.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31
-.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31
-.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31
-.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31
-.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31
-.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31
-.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31
-.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31
-.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31
-.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31
-.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31
-.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31
-.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31
-.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31
-.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31
-.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31
-.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31
-.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31
-.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31
-.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31
-.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31
-.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31
-.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31
-.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31
-.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31
-.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31
-.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31
-.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31
-.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31
-.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31 = 17791697 * 2^31
-.word 3285804833 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31
-.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31 = 29333180 * 2^31
-.word 1876750725 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 286215^260 * 71292929 * 2^31
-.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31
-.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31
-.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31 = 16027071 * 2^31
-.word 3172903227 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31
-.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31 = 27246749 * 2^31
-.word 3890743531 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 286215^388 * 71292929 * 2^31
-.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31
-.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31
-.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31 = 19153009 * 2^31
-.word 3372902219 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31
-.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31 = 14378180 * 2^31
-.word 919922753 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 286215^324 * 71292929 * 2^31
-.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31
-.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31
-.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31 = 23328838 * 2^31
-.word 1492590085 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31
-.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31 = 26950707 * 2^31
-.word 3871802623 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 286215^452 * 71292929 * 2^31
-.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31
-.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31
-.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31 = 31812506 * 2^31
-.word 2035379173 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31
-.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31 = 17437883 * 2^31
-.word 3263167647 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 286215^292 * 71292929 * 2^31
-.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31
-.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31
-.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31 = 8357758 * 2^31
-.word 534733307 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31
-.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31 = 22422281 * 2^31
-.word 3582071787 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 286215^420 * 71292929 * 2^31
-.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31
-.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31
-.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31 = 29650081 * 2^31
-.word 4044509849 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31
-.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31 = 9686916 * 2^31
-.word 619773465 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 286215^356 * 71292929 * 2^31
-.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31
-.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31
-.word 50433925 // zeta^228 * 2^31 = 286215^228 * 2^31 = 18399952 * 2^31
-.word 1177237627 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31
-.word 12737503 // zeta^484 * 2^31 = 286215^484 * 2^31 = 27755269 * 2^31
-.word 3923278881 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 286215^484 * 71292929 * 2^31
-.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31
-.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31
-.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31 = 22563934 * 2^31
-.word 1443651165 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31
-.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31 = 2438403 * 2^31
-.word 2303493823 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 286215^276 * 71292929 * 2^31
-.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31
-.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31
-.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31 = 31481843 * 2^31
-.word 4161706847 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31
-.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31 = 32076751 * 2^31
-.word 4199769341 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 286215^404 * 71292929 * 2^31
-.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31
-.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31
-.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31 = 18223844 * 2^31
-.word 1165970155 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31
-.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31 = 3973412 * 2^31
-.word 254220777 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 286215^340 * 71292929 * 2^31
-.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31
-.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31
-.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31 = 7405458 * 2^31
-.word 473804703 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31
-.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31 = 33156191 * 2^31
-.word 4268832423 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 286215^468 * 71292929 * 2^31
-.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31
-.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31
-.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 * 2^31 = 22859934 * 2^31
-.word 1462589385 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31
-.word 13659269 // zeta^308 * 2^31 = 286215^308 * 2^31 = 23834070 * 2^31
-.word 1524915067 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 286215^308 * 71292929 * 2^31
-.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31
-.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31
-.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31 = 25149579 * 2^31
-.word 3756565603 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31
-.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31 = 13976724 * 2^31
-.word 894237409 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 286215^436 * 71292929 * 2^31
-.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31
-.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31
-.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31 = 15349951 * 2^31
-.word 3129580769 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31
-.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31 = 6932474 * 2^31
-.word 443542963 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 286215^372 * 71292929 * 2^31
-.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31
-.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31
-.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31 = 12503729 * 2^31
-.word 2947478141 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31
-.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31 = 10586616 * 2^31
-.word 677336697 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 286215^500 * 71292929 * 2^31
-.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31
-.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31
-.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31 = 15322485 * 2^31
-.word 3127823481 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31
-.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31 = 6173403 * 2^31
-.word 2542460889 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 286215^268 * 71292929 * 2^31
-.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31
-.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31
-.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31 = 14374018 * 2^31
-.word 919656467 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31
-.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31 = 9325363 * 2^31
-.word 2744124781 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 286215^396 * 71292929 * 2^31
-.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31
-.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31
-.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31 = 5605608 * 2^31
-.word 358649449 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31
-.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31 = 25200773 * 2^31
-.word 3759841019 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 286215^332 * 71292929 * 2^31
-.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31
-.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31
-.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31 = 31727447 * 2^31
-.word 4177420707 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31
-.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31 = 6658688 * 2^31
-.word 426026005 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 286215^460 * 71292929 * 2^31
-.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31
-.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31
-.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31 = 33297705 * 2^31
-.word 4277886557 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31
-.word 7758757 // zeta^300 * 2^31 = 286215^300 * 2^31 = 486950 * 2^31
-.word 31155291 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 286215^300 * 71292929 * 2^31
-.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31
-.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31
-.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31 = 13215161 * 2^31
-.word 2992995895 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31
-.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31 = 16752026 * 2^31
-.word 1071802543 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 286215^428 * 71292929 * 2^31
-.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31
-.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31
-.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31 = 14102887 * 2^31
-.word 3049793025 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31
-.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31 = 32232983 * 2^31
-.word 4209765139 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 286215^364 * 71292929 * 2^31
-.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31
-.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31
-.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31 = 16009575 * 2^31
-.word 3171783825 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31
-.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31 = 5365218 * 2^31
-.word 343269183 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 286215^492 * 71292929 * 2^31
-.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31
-.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31
-.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31 = 24042369 * 2^31
-.word 3685725783 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31
-.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 = 27221548 * 2^31
-.word 1741647511 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 286215^284 * 71292929 * 2^31
-.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31
-.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31
-.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31 = 7233695 * 2^31
-.word 2610298873 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 286215^156 * 71292929 * 2^31
-.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 = 15385892 * 2^31
-.word 984396643 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 286215^412 * 71292929 * 2^31
-.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31
-.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31
-.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31 = 15700554 * 2^31
-.word 1004528867 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31
-.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 = 17178032 * 2^31
-.word 1099058609 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 286215^348 * 71292929 * 2^31
-.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31
-.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31
-.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31 = 20482112 * 2^31
-.word 1310455209 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31
-.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 = 31908284 * 2^31
-.word 2041507095 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 286215^476 * 71292929 * 2^31
-.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31
-.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31
-.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 = 4869100 * 2^31
-.word 311527319 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31
-.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 = 29810009 * 2^31
-.word 4054742117 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 286215^316 * 71292929 * 2^31
-.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31
-.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31
-.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 = 11135584 * 2^31
-.word 712459929 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31
-.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 = 18505659 * 2^31
-.word 3331484459 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31
-.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31
-.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31
-.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 = 17484839 * 2^31
-.word 3266171913 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31
-.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 = 20168277 * 2^31
-.word 3437859545 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31
-.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31
-.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31
-.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 = 31283961 * 2^31
-.word 4149046263 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31
-.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 = 17056436 * 2^31
-.word 1091278839 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31
-.text
-rev4: .byte 3*4
- .byte 2*4
- .byte 1*4
- .byte 0*4
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_1024_u32_33564673_286215_incomplete_rev4, %function
-.global ntt_1024_u32_33564673_286215_incomplete_rev4
-ntt_1024_u32_33564673_286215_incomplete_rev4:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-adr r4, rev4
-vldrb.u32 Q0, [r4]
-vadd.u32 Q0, Q0, r0
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-// Use r10 as marker for r0 + 4032
-add r10, r11, #1008
-.equ modulus, 33564673
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[768]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q2, Q1, r7
-// input[512]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 8)]
-vmul.u32 Q1, Q1, r6
-// input[256]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 4)]
-vqrdmlah.s32 Q2, Q1, r9
-vqrdmulh.s32 Q5, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q3, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q3, Q2, Q5
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q1, r9
-// input[772]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 16)]
-vqrdmulh.s32 Q7, Q4, r5
-vsub.s32 Q1, Q3, Q6
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q6
-vstrw.u32 Q1, [r11,#(48)]
-// Release input[768] from Q1
-vqrdmlah.s32 Q7, Q4, r9
-vstrw.u32 Q3, [r12,#(32)]
-// Release input[512] from Q3
-vsub.s32 Q4, Q2, Q7
-vstrw.u32 Q4, [r14,#(16)]
-// Release input[256] from Q4
-vadd.s32 Q2, Q2, Q7
-// input[772]: Already loaded as Q5
-vqrdmulh.s32 Q1, Q5, r7
-// input[516]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 12)]
-vmul.u32 Q5, Q5, r6
-// input[260]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmlah.s32 Q1, Q5, r9
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q5, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q6, Q5, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q5, Q5, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q5, r9
-// input[776]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q7, Q4, r5
-vsub.s32 Q5, Q3, Q6
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q6
-vstrw.u32 Q5, [r11,#(64)]
-// Release input[772] from Q5
-vqrdmlah.s32 Q7, Q4, r9
-vstrw.u32 Q3, [r12,#(48)]
-// Release input[516] from Q3
-vsub.s32 Q4, Q1, Q7
-vstrw.u32 Q4, [r14,#(32)]
-// Release input[260] from Q4
-vadd.s32 Q1, Q1, Q7
-// input[776]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[520]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 16)]
-vmul.u32 Q2, Q2, r6
-// input[264]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 12)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[780]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(80)]
-// Release input[776] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(64)]
-// Release input[520] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(48)]
-// Release input[264] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[780]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[524]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 20)]
-vmul.u32 Q1, Q1, r6
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[12]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[784]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(96)]
-// Release input[780] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(80)]
-// Release input[524] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(64)]
-// Release input[268] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[784]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vmul.u32 Q3, Q3, r6
-// input[272]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[788]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release input[784] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(80)]
-// Release input[272] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[788]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vmul.u32 Q2, Q2, r6
-// input[276]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 24)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[792]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(128)]
-// Release input[788] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(96)]
-// Release input[276] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[792]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vmul.u32 Q1, Q1, r6
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[796]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(144)]
-// Release input[792] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(112)]
-// Release input[280] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[796]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[800]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(160)]
-// Release input[796] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(128)]
-// Release input[284] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[800]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[544]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r6
-// input[288]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 36)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[804]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(176)]
-// Release input[800] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(144)]
-// Release input[288] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[804]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r6
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[808]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(192)]
-// Release input[804] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(160)]
-// Release input[292] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[808]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vmul.u32 Q3, Q3, r6
-// input[296]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[812]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 56)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(208)]
-// Release input[808] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(176)]
-// Release input[296] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[812]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[556]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 52)]
-vmul.u32 Q2, Q2, r6
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(160)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[816]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(224)]
-// Release input[812] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(208)]
-// Release input[556] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(192)]
-// Release input[300] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[816]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[560]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 56)]
-vmul.u32 Q1, Q1, r6
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[820]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 64)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(240)]
-// Release input[816] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(224)]
-// Release input[560] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(208)]
-// Release input[304] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[820]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vmul.u32 Q3, Q3, r6
-// input[308]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[824]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(256)]
-// Release input[820] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(240)]
-// Release input[564] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(224)]
-// Release input[308] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[824]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[568]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 64)]
-vmul.u32 Q2, Q2, r6
-// input[312]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 60)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[828]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(272)]
-// Release input[824] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(256)]
-// Release input[568] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(240)]
-// Release input[312] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[828]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[572]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 68)]
-vmul.u32 Q1, Q1, r6
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[832]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 76)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(288)]
-// Release input[828] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(272)]
-// Release input[572] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(256)]
-// Release input[316] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[832]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[576]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 72)]
-vmul.u32 Q3, Q3, r6
-// input[320]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4
* 64)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[836]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(304)] -// Release input[832] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(272)] -// Release input[320] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[836]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vmul.u32 Q2, Q2, r6 -// input[324]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[840]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 84)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(320)] -// Release input[836] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(288)] -// Release input[324] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[840]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vmul.u32 Q1, Q1, r6 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 
Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[844]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(336)] -// Release input[840] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(304)] -// Release input[328] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[844]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vmul.u32 Q3, Q3, r6 -// input[332]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[848]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(352)] -// Release input[844] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(320)] -// Release input[332] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[848]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vmul.u32 Q2, Q2, r6 -// input[336]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[852]: Load as Q1 -vldrw.u32 Q1, [r11, 
#(4 * 96)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(368)] -// Release input[848] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(336)] -// Release input[336] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[852]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vmul.u32 Q1, Q1, r6 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[856]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(384)] -// Release input[852] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(352)] -// Release input[340] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[856]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vmul.u32 Q3, Q3, r6 -// input[344]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[860]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 
-vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(400)] -// Release input[856] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(368)] -// Release input[344] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[860]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vmul.u32 Q2, Q2, r6 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[864]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(416)] -// Release input[860] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(384)] -// Release input[348] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[864]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vmul.u32 Q1, Q1, r6 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[868]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 112)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(432)] -// Release input[864] from Q1 
-vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(400)] -// Release input[352] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[868]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vmul.u32 Q3, Q3, r6 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[100]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 100)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[872]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(448)] -// Release input[868] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(416)] -// Release input[356] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[872]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vmul.u32 Q2, Q2, r6 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[876]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 120)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(464)] -// Release input[872] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(448)] -// Release 
input[616] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(432)] -// Release input[360] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[876]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vmul.u32 Q1, Q1, r6 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[880]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 124)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(480)] -// Release input[876] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(448)] -// Release input[364] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[880]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vmul.u32 Q3, Q3, r6 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[884]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -124)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(496)] -// Release input[880] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(464)] -// 
Release input[368] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[884]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vmul.u32 Q2, Q2, r6 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[888]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -120)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[884] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(480)] -// Release input[372] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[888]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vmul.u32 Q1, Q1, r6 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[892]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -116)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-480)] -// Release input[888] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(496)] -// Release input[376] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[892]: 
Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vmul.u32 Q3, Q3, r6 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[896]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -112)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-464)] -// Release input[892] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-496)] -// Release input[380] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[896]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[640]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmul.u32 Q2, Q2, r6 -// input[384]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -120)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[900]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -108)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-448)] -// Release input[896] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-464)] -// Release input[640] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-480)] -// Release input[384] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[900]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// 
input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmul.u32 Q1, Q1, r6 -// input[388]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -116)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-432)] -// Release input[900] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-464)] -// Release input[388] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[904]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vmul.u32 Q3, Q3, r6 -// input[392]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -112)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[908]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-432)] -// Release input[648] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-448)] -// Release input[392] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[908]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 
-104)] -vmul.u32 Q2, Q2, r6 -// input[396]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -108)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[912]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -96)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-400)] -// Release input[908] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-432)] -// Release input[396] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[912]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vmul.u32 Q1, Q1, r6 -// input[400]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-384)] -// Release input[912] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-416)] -// Release input[400] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[916]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vmul.u32 Q3, Q3, r6 -// input[404]: Load as Q5 
-vldrw.u32 Q5, [r12, #(4 * -100)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[920]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -88)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-400)] -// Release input[404] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[920]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vmul.u32 Q2, Q2, r6 -// input[408]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -96)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[924]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -84)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-352)] -// Release input[920] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-384)] -// Release input[408] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[924]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmul.u32 Q1, Q1, r6 -// input[412]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r9 
-vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-336)] -// Release input[924] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-368)] -// Release input[412] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[928]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmul.u32 Q3, Q3, r6 -// input[416]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -88)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[932]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -76)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-320)] -// Release input[928] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-336)] -// Release input[672] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-352)] -// Release input[416] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[932]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmul.u32 Q2, Q2, r6 -// input[420]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -84)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 
-vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[936]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -72)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-304)] -// Release input[932] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-336)] -// Release input[420] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[936]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[680]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[424]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -80)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-288)] -// Release input[936] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-320)] -// Release input[424] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[940]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmul.u32 Q3, Q3, r6 -// input[428]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -76)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, 
Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[172]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-304)] -// Release input[428] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[944]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[688]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -68)] -vmul.u32 Q2, Q2, r6 -// input[432]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -72)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[948]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -60)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-272)] -// Release input[688] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-288)] -// Release input[432] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[948]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vmul.u32 Q1, Q1, r6 -// input[436]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// 
input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[952]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -56)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-240)]
-// Release input[948] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-256)]
-// Release input[692] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-272)]
-// Release input[436] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[952]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vmul.u32 Q3, Q3, r6
-// input[440]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -64)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[956]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-224)]
-// Release input[952] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-256)]
-// Release input[440] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[956]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[700]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -56)]
-vmul.u32 Q2, Q2, r6
-// input[444]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -60)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[960]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-208)]
-// Release input[956] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-224)]
-// Release input[700] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-240)]
-// Release input[444] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[960]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[704]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -52)]
-vmul.u32 Q1, Q1, r6
-// input[448]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -56)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[964]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-192)]
-// Release input[960] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-208)]
-// Release input[704] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-224)]
-// Release input[448] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[964]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[708]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmul.u32 Q3, Q3, r6
-// input[452]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -52)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[968]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-176)]
-// Release input[964] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-192)]
-// Release input[708] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-208)]
-// Release input[452] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[968]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[712]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -44)]
-vmul.u32 Q2, Q2, r6
-// input[456]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -48)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[972]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-160)]
-// Release input[968] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-176)]
-// Release input[712] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-192)]
-// Release input[456] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[972]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[716]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -40)]
-vmul.u32 Q1, Q1, r6
-// input[460]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -44)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[976]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-144)]
-// Release input[972] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-160)]
-// Release input[716] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-176)]
-// Release input[460] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[976]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vmul.u32 Q3, Q3, r6
-// input[464]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -40)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[980]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-128)]
-// Release input[976] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-160)]
-// Release input[464] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[980]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vmul.u32 Q2, Q2, r6
-// input[468]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -36)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[984]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-112)]
-// Release input[980] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-144)]
-// Release input[468] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[984]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vmul.u32 Q1, Q1, r6
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[988]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-96)]
-// Release input[984] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-112)]
-// Release input[728] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-128)]
-// Release input[472] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[988]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[732]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -24)]
-vmul.u32 Q3, Q3, r6
-// input[476]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -28)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[992]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-80)]
-// Release input[988] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-96)]
-// Release input[732] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-112)]
-// Release input[476] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[992]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vmul.u32 Q2, Q2, r6
-// input[480]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -24)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[996]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-64)]
-// Release input[992] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-96)]
-// Release input[480] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[996]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[740]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vmul.u32 Q1, Q1, r6
-// input[484]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[1000]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-48)]
-// Release input[996] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-80)]
-// Release input[484] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[1000]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vmul.u32 Q3, Q3, r6
-// input[488]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -16)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[1004]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-32)]
-// Release input[1000] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-64)]
-// Release input[488] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[1004]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[748]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -8)]
-vmul.u32 Q2, Q2, r6
-// input[492]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -12)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-80)]
-// Release input[232] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[1008]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-16)]
-// Release input[1004] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-32)]
-// Release input[748] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-48)]
-// Release input[492] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[1008]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[752]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -4)]
-vmul.u32 Q1, Q1, r6
-// input[496]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -8)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[1012]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * 4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(0)]
-// Release input[1008] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(-16)]
-// Release input[752] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-32)]
-// Release input[496] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[1012]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[756]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 0)]
-vmul.u32 Q3, Q3, r6
-// input[500]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -4)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[1016]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(16)]
-// Release input[1012] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(0)]
-// Release input[756] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-16)]
-// Release input[500] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[1016]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[760]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 4)]
-vmul.u32 Q2, Q2, r6
-// input[504]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 0)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[1020]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(32)]
-// Release input[1016] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(16)]
-// Release input[760] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(0)]
-// Release input[504] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[1020]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[764]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 8)]
-vmul.u32 Q1, Q1, r6
-// input[508]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 4)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(48)]
-// Release input[1020] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(32)]
-// Release input[764] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(16)]
-// Release input[508] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[192]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vmul.u32 Q3, Q3, r6
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(256)]
-// Release input[64] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[196]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[68]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(272)]
-// Release input[68] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[204]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[204]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vmul.u32 Q3, Q3, r6
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-192)]
-// Release input[204] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(304)]
-// Release input[76] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[80]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 80)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(320)]
-// Release input[80] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[84]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(336)]
-// Release input[84] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[216]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vmul.u32 Q3, Q3, r6
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(352)]
-// Release input[88] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[96]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(384)]
-// Release input[96] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[228]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmul.u32 Q3, Q3, r6
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(400)]
-// Release input[100] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(416)]
-// Release input[104] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(432)]
-// Release input[108] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[240]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vmul.u32 Q3, Q3, r6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(464)]
-// Release input[116] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(480)]
-// Release input[120] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[252]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[320]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[452]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(272)]
-// Release input[320] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[452]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(288)]
-// Release input[324] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[456]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vmul.u32 Q3, Q3, r6
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(304)]
-// Release input[328] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(320)]
-// Release input[332] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[464]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r6
-// input[336]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[468]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[468]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vmul.u32 Q3, Q3, r6
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[276]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 24)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-144)]
-// Release input[468] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(352)]
-// Release input[340] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[344]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 92)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(96)]
-// Release input[276] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(368)] -// Release input[344] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[476]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(-112)] -// Release input[476] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(384)] -// Release input[348] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[480]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[416]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -88)] -vmul.u32 Q3, Q3, r6 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[484]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -20)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 
Q4, [r12,#(-352)] -// Release input[416] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(400)] -// Release input[352] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[484]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[488]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -16)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(-80)] -// Release input[484] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(416)] -// Release input[356] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[488]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[424]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r6 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[296]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 44)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(-64)] -// Release input[488] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-320)] -// Release input[424] from Q4 -vsub.s32 Q5, Q2, Q7 
-vstrw.u32 Q5, [r14,#(432)] -// Release input[360] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[492]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vmul.u32 Q3, Q3, r6 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(448)] -// Release input[364] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-288)] -// Release input[432] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(464)] -// Release input[368] from Q5 -vadd.s32 
Q3, Q3, Q7 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vmul.u32 Q1, Q1, r6 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(480)] -// Release input[372] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[504]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[440]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -64)] -vmul.u32 Q3, Q3, r6 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-256)] -// Release input[440] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(496)] -// Release input[376] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 
-// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[316]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[704]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-240)] -// Release input[444] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-496)] -// Release input[380] from Q5 -vadd.s32 Q3, Q3, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[704]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[640]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmul.u32 Q1, Q1, r6 -// input[576]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r14,#(256)] -// Release input[316] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[708]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-208)] -// Release input[704] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-464)] -// Release input[640] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(288)] -// Release input[576] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[708]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, 
r7 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmul.u32 Q3, Q3, r6 -// input[580]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[516]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 12)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[712]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-192)] -// Release input[708] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(304)] -// Release input[580] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[712]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[584]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 80)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(48)] -// Release input[516] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[520]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 16)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-176)] -// Release input[712] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-432)] -// Release input[648] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(320)] -// Release input[584] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] 
-vmul.u32 Q1, Q1, r6 -// input[588]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[720]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -36)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(336)] -// Release input[588] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[720]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vmul.u32 Q3, Q3, r6 -// input[592]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-144)] -// Release input[720] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(352)] -// Release input[592] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[724]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[596]: Load as Q5 -vldrw.u32 Q5, [r12, 
#(4 * 92)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-128)] -// Release input[724] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(368)] -// Release input[596] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[728]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[600]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[536]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 32)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-112)] -// Release input[728] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(384)] -// Release input[600] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[732]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmul.u32 Q3, Q3, r6 -// input[604]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 100)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(128)] -// 
Release input[536] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[540]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[736]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-96)] -// Release input[732] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(400)] -// Release input[604] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[736]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[608]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 104)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(144)] -// Release input[540] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[544]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 40)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[740]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -16)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-80)] -// Release input[736] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-336)] -// Release input[672] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(416)] -// Release input[608] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[740]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmul.u32 Q1, Q1, r6 -// input[612]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(160)] -// Release input[544] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, 
Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-64)] -// Release input[740] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(432)] -// Release input[612] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[744]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[680]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -76)] -vmul.u32 Q3, Q3, r6 -// input[616]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 112)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(176)] -// Release input[548] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 48)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(448)] -// Release input[616] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[620]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 116)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(192)] -// Release input[552] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 
-// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[752]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(464)] -// Release input[620] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[752]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[688]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -68)] -vmul.u32 Q1, Q1, r6 -// input[624]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(208)] -// Release input[556] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[756]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-16)] -// Release input[752] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-272)] -// Release input[688] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(480)] -// Release input[624] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[756]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vmul.u32 Q3, Q3, r6 -// input[628]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[564]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 60)] -vqrdmulh.s32 
Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[760]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(0)] -// Release input[756] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(496)] -// Release input[628] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[760]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[632]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -124)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(240)] -// Release input[564] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(16)] -// Release input[760] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-240)] -// Release input[696] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(-496)] -// Release input[632] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[636]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 
-vqrdmlah.s32 Q6, Q1, r9 -// input[960]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -48)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(-480)] -// Release input[636] from Q5 -vadd.s32 Q2, Q2, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[896]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -112)] -vmul.u32 Q3, Q3, r6 -// input[832]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 76)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[768]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-192)] -// Release input[960] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-448)] -// Release input[896] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(304)] -// Release input[832] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[964]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[900]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[836]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 80)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release input[768] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[772]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, 
Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[968]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -40)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-432)] -// Release input[900] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(320)] -// Release input[836] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[968]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[904]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[840]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(64)] -// Release input[772] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[776]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[972]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -36)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r10,#(-160)] -// Release input[968] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r10,#(-416)] -// Release input[904] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(336)] -// Release input[840] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[972]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[908]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -100)] -vmul.u32 Q3, Q3, r6 -// input[844]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 88)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release input[776] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[780]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[976]: Load as Q2 -vldrw.u32 Q2, 
[r10, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-144)]
-// Release input[972] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-400)]
-// Release input[908] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(352)]
-// Release input[844] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[976]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[912]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[848]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 92)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(96)]
-// Release input[780] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[784]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[980]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-128)]
-// Release input[976] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-384)]
-// Release input[912] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(368)]
-// Release input[848] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[980]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[916]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -92)]
-vmul.u32 Q1, Q1, r6
-// input[852]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(112)]
-// Release input[784] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[788]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 32)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[984]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-112)]
-// Release input[980] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-368)]
-// Release input[916] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(384)]
-// Release input[852] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[984]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[920]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -88)]
-vmul.u32 Q3, Q3, r6
-// input[856]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 100)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(128)]
-// Release input[788] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[792]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 36)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[988]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-96)]
-// Release input[984] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-352)]
-// Release input[920] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(400)]
-// Release input[856] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[988]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[924]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[860]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 104)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(144)]
-// Release input[792] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[796]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 40)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[992]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-80)]
-// Release input[988] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-336)]
-// Release input[924] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(416)]
-// Release input[860] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[992]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[928]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[864]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(160)]
-// Release input[796] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[800]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 44)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[996]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-64)]
-// Release input[992] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-320)]
-// Release input[928] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(432)]
-// Release input[864] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[996]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[932]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -76)]
-vmul.u32 Q3, Q3, r6
-// input[868]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 112)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(176)]
-// Release input[800] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[804]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 48)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[1000]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-48)]
-// Release input[996] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-304)]
-// Release input[932] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(448)]
-// Release input[868] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[1000]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[936]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[872]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 116)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(192)]
-// Release input[804] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[808]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[1004]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-32)]
-// Release input[1000] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(464)]
-// Release input[872] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[1004]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[940]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[876]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release input[808] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[812]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 56)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[1008]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-16)]
-// Release input[1004] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-272)]
-// Release input[940] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(480)]
-// Release input[876] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[1008]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[944]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -64)]
-vmul.u32 Q3, Q3, r6
-// input[880]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 124)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(224)]
-// Release input[812] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[816]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[1012]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * 4)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(0)]
-// Release input[1008] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-256)]
-// Release input[944] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(496)]
-// Release input[880] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[1012]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[948]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -60)]
-vmul.u32 Q2, Q2, r6
-// input[884]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(240)]
-// Release input[816] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[820]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 64)]
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[1016]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * 8)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(16)]
-// Release input[1012] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-240)]
-// Release input[948] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r10,#(-496)]
-// Release input[884] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[1016]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[952]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -56)]
-vmul.u32 Q1, Q1, r6
-// input[888]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r11,#(256)]
-// Release input[820] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[824]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[1020]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * 12)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(32)]
-// Release input[1016] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-224)]
-// Release input[952] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r10,#(-480)]
-// Release input[888] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[1020]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[956]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -52)]
-vmul.u32 Q3, Q3, r6
-// input[892]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -116)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(272)]
-// Release input[824] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[828]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-vqrdmulh.s32 Q2, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(48)]
-// Release input[1020] from Q3
-vqrdmlah.s32 Q2, Q5, r9
-vstrw.u32 Q4, [r10,#(-208)]
-// Release input[956] from Q4
-vsub.s32 Q5, Q1, Q2
-vstrw.u32 Q5, [r10,#(-464)]
-// Release input[892] from Q5
-vadd.s32 Q1, Q1, Q2
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[48]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 48)]
-vqrdmulh.s32 Q3, Q2, r7
-// input[32]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 32)]
-vmul.u32 Q2, Q2, r6
-// input[16]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 16)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r11,#(288)]
-// Release input[828] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[0]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 0)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[52]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(64)]
-// Release input[16] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[36]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[20]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(0)]
-// Release input[0] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[4]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[56]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(80)]
-// Release input[20] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[56]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[40]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[24]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[60]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[44]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[28]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[12]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[112]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(112)]
-// Release input[28] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[80]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[64]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[116]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(320)]
-// Release input[80] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[116]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[100]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[84]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[68]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[120]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(336)]
-// Release input[84] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[104]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[88]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r0,#(272)]
-// Release input[68] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[72]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[124]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(352)]
-// Release input[88] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[108]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[92]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[76]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[176]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[160]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[144]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[128]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[180]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(-432)]
-// Release input[144] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[164]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[148]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[132]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[184]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-416)]
-// Release input[148] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[152]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[136]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[188]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(-400)]
-// Release input[152] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[188]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[156]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[140]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[240]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(-384)]
-// Release input[156] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[208]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[192]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[244]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[228]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[212]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[196]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[248]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(-160)]
-// Release input[212] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[248]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[232]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[216]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[200]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[252]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(-144)]
-// Release input[216] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[236]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[220]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[204]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[304]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-128)]
-// Release input[220] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[304]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[272]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(-192)]
-// Release input[204] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[256]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(208)]
-// Release input[304] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(80)]
-// Release input[272] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[308]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[292]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[276]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[260]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[312]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(96)]
-// Release input[276] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[312]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[296]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[280]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[264]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[316]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(240)]
-// Release input[312] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(112)]
-// Release input[280] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[300]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[284]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[368]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(128)]
-// Release input[284] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[368]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[336]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[320]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[372]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[356]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[340]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[324]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[376]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(352)]
-// Release input[340] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[344]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[328]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[380]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(368)]
-// Release input[344] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[380]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[364]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[348]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[332]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[432]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(384)]
-// Release input[348] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[432]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[416]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[400]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r14,#(320)]
-// Release input[332] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[384]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[436]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-416)]
-// Release input[400] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[436]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[420]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[404]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[388]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[440]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-272)]
-// Release input[436] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-400)]
-// Release input[404] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[440]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[424]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[408]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(-464)]
-// Release input[388] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[392]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[444]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-384)]
-// Release input[408] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[444]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[428]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[412]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r12,#(-448)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[396]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[496]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-240)]
-// Release input[444] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-304)]
-// Release input[428] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-368)]
-// Release input[412] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[496]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[480]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[464]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[500]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-32)]
-// Release input[496] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-160)]
-// Release input[464] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[500]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[484]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[468]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[452]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[504]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(-16)] -// Release input[500] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-144)] -// Release input[468] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[504]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[488]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[472]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(-208)] -// Release input[452] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[456]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[508]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(0)] -// Release input[504] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-64)] -// Release input[488] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(-128)] -// Release input[472] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[508]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[492]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[476]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[460]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[560]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(16)] -// Release input[508] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(-48)] -// Release input[492] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(-112)] -// Release input[476] from Q5 -vadd.s32 Q2, Q2, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[560]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[544]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q3, Q3, r6 -// input[528]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[512]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[564]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(224)] -// Release input[560] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(160)] -// Release input[544] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(96)] -// Release input[528] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[564]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[548]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[532]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(32)] -// Release input[512] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[516]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[568]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(240)] -// Release input[564] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(112)] -// Release input[532] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[568]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[552]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[536]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[520]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[572]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(256)] -// Release input[568] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(192)] -// Release input[552] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(128)] -// Release input[536] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[572]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[556]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[540]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(64)] -// Release input[520] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[524]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[624]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(272)] -// Release input[572] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(144)] -// Release input[540] from Q5 -vadd.s32 Q1, Q1, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[624]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[608]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[592]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[576]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[628]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(480)] -// Release input[624] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(352)] -// Release input[592] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[628]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[596]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(288)] -// Release input[576] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[580]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[632]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(496)] -// Release input[628] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(368)] -// Release input[596] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[632]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[616]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[600]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r12,#(304)] -// Release input[580] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[584]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[636]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(384)] -// Release input[600] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[636]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[620]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[604]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r12,#(320)] -// Release input[584] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[588]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[688]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-480)] -// Release input[636] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(400)] -// Release input[604] from Q5 -vadd.s32 Q3, Q3, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[688]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[672]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q1, Q1, r6 -// input[656]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r12,#(336)] -// Release input[588] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[640]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[692]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-272)] -// Release input[688] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-336)] -// Release input[672] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(-400)] -// Release input[656] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[692]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[660]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release input[640] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[644]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[696]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(-384)] -// Release input[660] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[696]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[680]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[664]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(-448)] -// Release input[644] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[648]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[700]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-240)] -// Release input[696] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(-368)] -// Release input[664] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[700]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[668]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[652]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[752]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[700] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(-352)] -// Release input[668] from Q5 -vadd.s32 Q2, Q2, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[752]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[736]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q3, Q3, r6 -// input[720]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[704]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[756]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-16)] -// Release input[752] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-80)] -// Release input[736] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(-144)] -// Release input[720] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[740]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[724]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release input[704] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[708]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[760]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(-128)] -// Release input[724] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[744]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[728]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(-192)] -// Release input[708] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[712]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[764]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-48)] -// Release input[744] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(-112)] -// Release input[728] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[764]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[748]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[732]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(-176)] -// Release input[712] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[716]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[816]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(32)] -// Release input[764] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(-96)] -// Release input[732] from Q5 -vadd.s32 Q1, Q1, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[816]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[800]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[784]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[768]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[820]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(240)] -// Release input[816] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(176)] -// Release input[800] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(112)] -// Release input[784] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[820]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[804]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[788]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(48)] -// Release input[768] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[772]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[824]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(256)] -// Release input[820] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(192)] -// Release input[804] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(128)] -// Release input[788] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[824]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[808]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[792]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(64)] -// Release input[772] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[776]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[828]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(272)] -// Release input[824] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(208)] -// Release input[808] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(144)] -// Release input[792] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[828]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[812]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[796]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(80)] -// Release input[776] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[780]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[880]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(288)] -// Release input[828] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(224)] -// Release input[812] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(160)] -// Release input[796] from Q5 -vadd.s32 Q3, Q3, Q7 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[880]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[864]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q1, Q1, r6 -// input[848]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release input[780] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[832]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)]! -vqrdmulh.s32 Q6, Q1, r3 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r9 -// input[884]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(496)] -// Release input[880] from Q1 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(432)] -// Release input[864] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(368)] -// Release input[848] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[884]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r7 -// input[868]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r6 -// input[852]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release input[832] from Q2 -vqrdmulh.s32 Q2, Q4, r7 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r9 -// input[836]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q3, r3 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r9 -// input[888]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r10,#(-496)] -// Release input[884] from Q3 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(384)] -// Release input[852] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[888]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r7 -// input[872]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[856]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r9 -vstrw.u32 Q1, [r11,#(320)] -// Release input[836] from Q1 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r9 -// input[840]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q2, r3 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r9 -// input[892]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r5 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r10,#(-480)] -// Release input[888] from Q2 -vqrdmlah.s32 Q7, Q5, r9 -vstrw.u32 Q4, [r11,#(464)] -// Release input[872] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(400)] -// Release input[856] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[892]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[876]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r6 -// input[860]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release input[840] from Q3 -vqrdmulh.s32 Q3, Q4, r7 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[944]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-464)]
-// Release input[892] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r11,#(480)]
-// Release input[876] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r11,#(416)]
-// Release input[860] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[944]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[928]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r6
-// input[912]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r11,#(352)]
-// Release input[844] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[896]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[948]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-256)]
-// Release input[944] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-320)]
-// Release input[928] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r10,#(-384)]
-// Release input[912] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[948]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[932]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[916]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r10,#(-448)]
-// Release input[896] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[900]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[952]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(-240)]
-// Release input[948] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-304)]
-// Release input[932] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r10,#(-368)]
-// Release input[916] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[952]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[936]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[920]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r10,#(-432)]
-// Release input[900] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[904]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[956]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(-224)]
-// Release input[952] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-288)]
-// Release input[936] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r10,#(-352)]
-// Release input[920] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[956]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[940]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[924]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r10,#(-416)]
-// Release input[904] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[908]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[1008]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(-208)]
-// Release input[956] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-272)]
-// Release input[940] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r10,#(-336)]
-// Release input[924] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[1008]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[992]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r6
-// input[976]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r10,#(-400)]
-// Release input[908] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[960]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-// input[1012]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(0)]
-// Release input[1008] from Q2
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-64)]
-// Release input[992] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r10,#(-128)]
-// Release input[976] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[1012]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[996]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[980]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q3, [r10,#(-192)]
-// Release input[960] from Q3
-vqrdmulh.s32 Q3, Q4, r7
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r9
-// input[964]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r3
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r9
-// input[1016]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r10,#(16)]
-// Release input[1012] from Q1
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-48)]
-// Release input[996] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r10,#(-112)]
-// Release input[980] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[1016]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r7
-// input[1000]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r6
-// input[984]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r9
-vstrw.u32 Q2, [r10,#(-176)]
-// Release input[964] from Q2
-vqrdmulh.s32 Q2, Q4, r7
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r9
-// input[968]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r3
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r9
-// input[1020]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r5
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r10,#(32)]
-// Release input[1016] from Q3
-vqrdmlah.s32 Q7, Q5, r9
-vstrw.u32 Q4, [r10,#(-32)]
-// Release input[1000] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r10,#(-96)]
-// Release input[984] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[1020]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r7
-// input[1004]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r6
-// input[988]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r9
-vstrw.u32 Q1, [r10,#(-160)]
-// Release input[968] from Q1
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r9
-// input[972]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r3
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r9
-vqrdmulh.s32 Q1, Q5, r5
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r10,#(48)]
-// Release input[1020] from Q2
-vqrdmlah.s32 Q1, Q5, r9
-vstrw.u32 Q4, [r10,#(-16)]
-// Release input[1004] from Q4
-vsub.s32 Q5, Q3, Q1
-vstrw.u32 Q5, [r10,#(-80)]
-// Release input[988] from Q5
-vadd.s32 Q3, Q3, Q1
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q1, Q0, r7
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vmul.u32 Q0, Q0, r6
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q3, [r10,#(-144)]
-// Release input[972] from Q3
-vqrdmulh.s32 Q3, Q2, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q3, Q2, r9
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q2, Q1, Q3
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q3
-vqrdmlah.s32 Q5, Q0, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[28]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r7
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vmul.u32 Q3, Q3, r6
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q0, Q3, r9
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r7
-vsub.s32 Q3, Q4, Q0
-vmul.u32 Q2, Q2, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q2, r9
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q3, r3
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q3, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q3, r9
-// input[44]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q3, Q2, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[44]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q1, Q1, r6
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[60]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(176)]
-// Release input[44] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[60]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q0, Q0, r6
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(240)]
-// Release input[60] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[76]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q2, Q2, r6
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[92]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[92]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[108]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(368)]
-// Release input[92] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[108]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q0, Q0, r6
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r0,#(320)]
-// Release input[80] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(432)]
-// Release input[108] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[124]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q2, Q2, r6
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r0,#(384)]
-// Release input[96] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[140]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r6
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[156]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[156]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r6
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-384)]
-// Release input[156] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[172]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[188]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r6
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-368)]
-// Release input[160] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q0, Q0, r6
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r6
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(-176)]
-// Release input[208] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vmul.u32 Q0, Q0, r6
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[268]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vmul.u32 Q2, Q2, r6
-// input[260]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[284]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(32)]
-// Release input[260] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[284]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q1, Q1, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[300]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(128)]
-// Release input[284] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[300]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q0, Q0, r6
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(192)]
-// Release input[300] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[316]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vmul.u32 Q2, Q2, r6
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[332]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[332]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q1, Q1, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[348]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(320)]
-// Release input[332] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[348]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[336]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[364]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(384)]
-// Release input[348] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[364]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q2, Q2, r6
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r14,#(336)]
-// Release input[336] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[380]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(448)]
-// Release input[364] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[380]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q1, Q1, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r14,#(400)]
-// Release input[352] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[396]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-496)]
-// Release input[380] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[396]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[412]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[428]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-368)]
-// Release input[412] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[428]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-416)]
-// Release input[400] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[416]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[444]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-304)]
-// Release input[428] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[444]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q0, Q0, r6
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-352)]
-// Release input[416] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-240)]
-// Release input[444] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q2, Q2, r6
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r9
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r9
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-208)]
-// Release input[452] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q1, Q1, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r9
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r9
-// input[464]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r9
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q0, Q0, r6
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r12,#(-160)]
-// Release input[464] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r9
-// input[480]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r9
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r7
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vmul.u32 Q2, Q2, r6
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q2, r9
-vstrw.u32 Q1, [r12,#(-96)]
-// Release input[480] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32
Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(48)] -// Release input[516] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[540]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r6 -// input[532]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[556]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vmul.u32 Q2, Q2, r6 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 
Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[572]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vmul.u32 Q1, Q1, r6 -// input[564]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(160)] -// Release input[544] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[588]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[588]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r9 
-vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[576]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[604]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[604]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vmul.u32 Q2, Q2, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(288)] -// Release input[576] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[592]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[620]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(400)] -// Release input[604] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[620]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r6 -// 
input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r12,#(352)] -// Release input[592] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[608]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(464)] -// Release input[620] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[632]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vmul.u32 Q0, Q0, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r12,#(416)] -// Release input[608] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[652]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[652]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// 
input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[668]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[668]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-464)] -// Release input[640] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[656]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[684]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-352)] -// Release input[668] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 
-ldrd r3, r2, [r8], #+8 -// input[684]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r6 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-400)] -// Release input[656] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[672]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-288)] -// Release input[684] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release input[672] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release 
input[692] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q1, Q1, r6 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-272)] -// Release input[688] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r6 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-112)] -// Release 
input[728] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q2, Q2, r6 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[760]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vmul.u32 Q1, Q1, r6 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(-80)] -// Release input[736] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[752]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from 
Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[780]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[776]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vmul.u32 Q0, Q0, r6 -// input[772]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[768]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[796]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release input[776] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release input[772] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[796]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[792]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 36)] -vmul.u32 Q2, Q2, r6 -// input[788]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 32)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release input[768] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[784]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[812]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, 
Q5 -vstrw.u32 Q2, [r11,#(160)] -// Release input[796] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(144)] -// Release input[792] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(128)] -// Release input[788] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[812]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[808]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vmul.u32 Q1, Q1, r6 -// input[804]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(112)] -// Release input[784] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(224)] -// Release input[812] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release input[808] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release input[804] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[828]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[824]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 68)] -vmul.u32 Q0, Q0, r6 -// input[820]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[816]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[844]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vqrdmulh.s32 Q6, 
Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(272)] -// Release input[824] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(256)] -// Release input[820] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[844]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[840]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vmul.u32 Q2, Q2, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release input[816] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[832]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[860]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(352)] -// Release input[844] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release input[840] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[860]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[856]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(304)] -// Release input[832] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[848]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// 
input[876]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(416)] -// Release input[860] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(400)] -// Release input[856] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[876]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[872]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vmul.u32 Q0, Q0, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(368)] -// Release input[848] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[864]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[892]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(480)] -// Release input[876] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release input[872] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[892]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[888]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[884]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release input[864] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[880]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, 
Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[908]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-464)] -// Release input[892] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-480)] -// Release input[888] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-496)] -// Release input[884] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[908]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q1, Q1, r6 -// input[900]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -108)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r11,#(496)] -// Release input[880] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[896]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[924]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-400)] -// Release input[908] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-432)] -// Release input[900] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[924]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q0, Q0, r6 -// input[916]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-448)] -// Release input[896] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// 
input[912]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[940]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-336)] -// Release input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 Q1, [r10,#(-384)] -// Release input[912] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[928]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-272)] -// Release input[940] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q1, Q1, r6 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-320)] -// Release input[928] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 
Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[944]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-240)] -// Release input[948] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[968]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[964]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-256)] -// Release input[944] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[960]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-160)] -// Release input[968] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-176)] -// Release input[964] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r7 -// input[984]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -24)] -vmul.u32 Q2, Q2, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q0, Q2, r9 -vstrw.u32 
Q1, [r10,#(-192)] -// Release input[960] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r9 -// input[976]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r9 -// input[1004]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-96)] -// Release input[984] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q1, Q1, r6 -// input[996]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -12)] -vqrdmlah.s32 Q2, Q1, r9 -vstrw.u32 Q0, [r10,#(-128)] -// Release input[976] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r9 -// input[992]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-16)] -// Release input[1004] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-48)] -// Release input[996] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1020]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r7 -// input[1016]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// 
input[1012]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r10,#(-64)] -// Release input[992] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r9 -vqrdmulh.s32 Q2, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(48)] -// Release input[1020] from Q0 -vqrdmlah.s32 Q2, Q4, r9 -vstrw.u32 Q3, [r10,#(32)] -// Release input[1016] from Q3 -vsub.s32 Q4, Q1, Q2 -vstrw.u32 Q4, [r10,#(16)] -// Release input[1012] from Q4 -vadd.s32 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -.equ modulus_inv, 4223674367 -movw r7, #:lower16:modulus_inv -movt r7, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 10276 -// Instruction count: 7967 \ No newline at end of file diff --git a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_skipfirst.s b/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_skipfirst.s deleted file mode 100644 index e11639a..0000000 --- a/tests/ntt_1024/auto/ntt_1024_u32_33564673_286215_incomplete_skipfirst.s +++ /dev/null @@ -1,9471 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission 
notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 
2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 
286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 
286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // 
zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 
* 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 286215^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 35458195 // zeta^256 * 2^31 = 286215^256 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 286215^256 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 44770213 // zeta^128 * 2^31 = 286215^128 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 286215^128 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 
225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 3545473 // zeta^384 * 2^31 = 286215^384 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 286215^384 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 20108763 // zeta^ 64 * 2^31 = 286215^ 64 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 64 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 16155699 // zeta^320 * 2^31 = 286215^320 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 286215^320 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 23777969 // zeta^192 * 2^31 = 286215^192 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 286215^192 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 
71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 43443635 // zeta^448 * 2^31 = 286215^448 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 286215^448 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 56312659 // zeta^ 32 * 2^31 = 286215^ 32 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 32 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 50428539 // zeta^288 * 2^31 = 286215^288 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 286215^288 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 2433499 // zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 40872355 // zeta^160 * 2^31 = 286215^160 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 286215^160 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 
* 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 17505197 // zeta^416 * 2^31 = 286215^416 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 286215^416 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 29514841 // zeta^ 96 * 2^31 = 286215^ 96 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 96 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 46171693 // zeta^352 * 2^31 = 286215^352 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 286215^352 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 286215^432 * 71292929 * 2^31 -.word 49378579 // zeta^224 * 2^31 = 286215^224 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 286215^224 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * 
f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 37299575 // zeta^480 * 2^31 = 286215^480 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 286215^480 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 35114601 // zeta^ 16 * 2^31 = 286215^ 16 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 16 * 71292929 * 2^31 -.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31 -.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31 -.word 56661185 // zeta^272 * 2^31 = 286215^272 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 286215^272 * 71292929 * 2^31 -.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31 -.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31 -.word 24798937 // zeta^144 * 2^31 = 286215^144 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 286215^144 * 71292929 * 2^31 -.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31 -.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31 -.word 2433499 
// zeta^400 * 2^31 = 286215^400 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 286215^400 * 71292929 * 2^31 -.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31 -.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31 -.word 13509691 // zeta^ 80 * 2^31 = 286215^ 80 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 80 * 71292929 * 2^31 -.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31 -.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31 -.word 61528771 // zeta^336 * 2^31 = 286215^336 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 286215^336 * 71292929 * 2^31 -.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31 -.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31 -.word 26961583 // zeta^208 * 2^31 = 286215^208 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 286215^208 * 71292929 * 2^31 -.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31 -.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31 -.word 39914361 // zeta^464 * 2^31 = 286215^464 * 2^31 = 26527504 * 2^31 -.word 
1697242247 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 286215^464 * 71292929 * 2^31 -.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31 -.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31 -.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31 -.word 42427289 // zeta^ 48 * 2^31 = 286215^ 48 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 48 * 71292929 * 2^31 -.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31 -.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31 -.word 22993529 // zeta^304 * 2^31 = 286215^304 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 286215^304 * 71292929 * 2^31 -.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31 -.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31 -.word 12459675 // zeta^176 * 2^31 = 286215^176 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 286215^176 * 71292929 * 2^31 -.word 13565749 // zeta^ 88 * 2^31 = 286215^ 88 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31 -.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31 -.word 17297731 // zeta^432 * 2^31 = 286215^432 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 
286215^432 * 71292929 * 2^31 -.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31 -.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31 -.word 51482787 // zeta^112 * 2^31 = 286215^112 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 286215^112 * 71292929 * 2^31 -.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31 -.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31 -.word 47832419 // zeta^368 * 2^31 = 286215^368 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 286215^368 * 71292929 * 2^31 -.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31 -.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31 -.word 32696733 // zeta^240 * 2^31 = 286215^240 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 286215^240 * 71292929 * 2^31 -.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 2^31 -.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31 -.word 16328205 // zeta^496 * 2^31 = 286215^496 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 286215^496 * 71292929 * 2^31 -.word 38891533 // zeta^248 * 2^31 
= 286215^248 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31 -.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31 -.word 7271765 // zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 8 * 71292929 * 2^31 -.word 34173151 // zeta^ 4 * 2^31 = 286215^ 4 * 2^31 = 17791697 * 2^31 -.word 3285804833 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 4 * 71292929 * 2^31 -.word 6702715 // zeta^260 * 2^31 = 286215^260 * 2^31 = 29333180 * 2^31 -.word 1876750725 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 286215^260 * 71292929 * 2^31 -.word 9232849 // zeta^264 * 2^31 = 286215^264 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 286215^264 * 71292929 * 2^31 -.word 40902341 // zeta^132 * 2^31 = 286215^132 * 2^31 = 16027071 * 2^31 -.word 3172903227 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 286215^132 * 71292929 * 2^31 -.word 11747093 // zeta^388 * 2^31 = 286215^388 * 2^31 = 27246749 * 2^31 -.word 3890743531 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 286215^388 * 71292929 * 2^31 -.word 5061807 // zeta^136 * 2^31 = 286215^136 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 286215^136 * 71292929 * 2^31 -.word 13754549 // zeta^ 68 * 2^31 = 286215^ 68 * 2^31 = 19153009 * 2^31 -.word 3372902219 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 68 * 71292929 * 2^31 -.word 48295871 // zeta^324 * 2^31 = 286215^324 * 2^31 = 14378180 * 2^31 -.word 919922753 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 286215^324 * 71292929 * 2^31 -.word 12062383 // zeta^392 * 2^31 = 286215^392 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 286215^392 * 71292929 * 2^31 -.word 5773819 // zeta^196 * 2^31 = 286215^196 * 2^31 = 23328838 * 2^31 -.word 1492590085 // zeta^196 * 
f(q^(-1) mod 2^32) * 2^31 = 286215^196 * 71292929 * 2^31 -.word 40968961 // zeta^452 * 2^31 = 286215^452 * 2^31 = 26950707 * 2^31 -.word 3871802623 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 286215^452 * 71292929 * 2^31 -.word 26674607 // zeta^ 72 * 2^31 = 286215^ 72 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 72 * 71292929 * 2^31 -.word 64146459 // zeta^ 36 * 2^31 = 286215^ 36 * 2^31 = 31812506 * 2^31 -.word 2035379173 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 36 * 71292929 * 2^31 -.word 469857 // zeta^292 * 2^31 = 286215^292 * 2^31 = 17437883 * 2^31 -.word 3263167647 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 286215^292 * 71292929 * 2^31 -.word 6369225 // zeta^328 * 2^31 = 286215^328 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 286215^328 * 71292929 * 2^31 -.word 47277573 // zeta^164 * 2^31 = 286215^164 * 2^31 = 8357758 * 2^31 -.word 534733307 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 286215^164 * 71292929 * 2^31 -.word 23147541 // zeta^420 * 2^31 = 286215^420 * 2^31 = 22422281 * 2^31 -.word 3582071787 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 286215^420 * 71292929 * 2^31 -.word 13877423 // zeta^200 * 2^31 = 286215^200 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 286215^200 * 71292929 * 2^31 -.word 378215 // zeta^100 * 2^31 = 286215^100 * 2^31 = 29650081 * 2^31 -.word 4044509849 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 286215^100 * 71292929 * 2^31 -.word 22747623 // zeta^356 * 2^31 = 286215^356 * 2^31 = 9686916 * 2^31 -.word 619773465 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 286215^356 * 71292929 * 2^31 -.word 52182971 // zeta^456 * 2^31 = 286215^456 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 286215^456 * 71292929 * 2^31 -.word 50433925 // zeta^228 * 2^31 = 286215^228 * 2^31 = 18399952 * 2^31 -.word 1177237627 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 286215^228 * 71292929 * 2^31 -.word 
12737503 // zeta^484 * 2^31 = 286215^484 * 2^31 = 27755269 * 2^31 -.word 3923278881 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 286215^484 * 71292929 * 2^31 -.word 26766019 // zeta^ 40 * 2^31 = 286215^ 40 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 40 * 71292929 * 2^31 -.word 20257187 // zeta^ 20 * 2^31 = 286215^ 20 * 2^31 = 22563934 * 2^31 -.word 1443651165 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 20 * 71292929 * 2^31 -.word 61186369 // zeta^276 * 2^31 = 286215^276 * 2^31 = 2438403 * 2^31 -.word 2303493823 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 286215^276 * 71292929 * 2^31 -.word 3049295 // zeta^296 * 2^31 = 286215^296 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 286215^296 * 71292929 * 2^31 -.word 27954337 // zeta^148 * 2^31 = 286215^148 * 2^31 = 31481843 * 2^31 -.word 4161706847 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 286215^148 * 71292929 * 2^31 -.word 65344259 // zeta^404 * 2^31 = 286215^404 * 2^31 = 32076751 * 2^31 -.word 4199769341 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 286215^404 * 71292929 * 2^31 -.word 27572075 // zeta^168 * 2^31 = 286215^168 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 286215^168 * 71292929 * 2^31 -.word 13368597 // zeta^ 84 * 2^31 = 286215^ 84 * 2^31 = 18223844 * 2^31 -.word 1165970155 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 84 * 71292929 * 2^31 -.word 46956055 // zeta^340 * 2^31 = 286215^340 * 2^31 = 3973412 * 2^31 -.word 254220777 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 286215^340 * 71292929 * 2^31 -.word 62852605 // zeta^424 * 2^31 = 286215^424 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 286215^424 * 71292929 * 2^31 -.word 38893665 // zeta^212 * 2^31 = 286215^212 * 2^31 = 7405458 * 2^31 -.word 473804703 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 286215^212 * 71292929 * 2^31 -.word 50639193 // zeta^468 * 2^31 = 286215^468 * 2^31 = 33156191 * 2^31 
-.word 4268832423 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 286215^468 * 71292929 * 2^31 -.word 41037815 // zeta^104 * 2^31 = 286215^104 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 286215^104 * 71292929 * 2^31 -.word 18563127 // zeta^ 52 * 2^31 = 286215^ 52 * 2^31 = 22859934 * 2^31 -.word 1462589385 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 52 * 71292929 * 2^31 -.word 13659269 // zeta^308 * 2^31 = 286215^308 * 2^31 = 23834070 * 2^31 -.word 1524915067 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 286215^308 * 71292929 * 2^31 -.word 16612991 // zeta^360 * 2^31 = 286215^360 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 286215^360 * 71292929 * 2^31 -.word 6808477 // zeta^180 * 2^31 = 286215^180 * 2^31 = 25149579 * 2^31 -.word 3756565603 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 286215^180 * 71292929 * 2^31 -.word 25156895 // zeta^436 * 2^31 = 286215^436 * 2^31 = 13976724 * 2^31 -.word 894237409 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 286215^436 * 71292929 * 2^31 -.word 32973157 // zeta^232 * 2^31 = 286215^232 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 286215^232 * 71292929 * 2^31 -.word 49494815 // zeta^116 * 2^31 = 286215^116 * 2^31 = 15349951 * 2^31 -.word 3129580769 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 286215^116 * 71292929 * 2^31 -.word 40639053 // zeta^372 * 2^31 = 286215^372 * 2^31 = 6932474 * 2^31 -.word 443542963 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 286215^372 * 71292929 * 2^31 -.word 36139229 // zeta^488 * 2^31 = 286215^488 * 2^31 = 32576304 * 2^31 -.word 2084247331 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 286215^488 * 71292929 * 2^31 -.word 7177603 // zeta^244 * 2^31 = 286215^244 * 2^31 = 12503729 * 2^31 -.word 2947478141 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 286215^244 * 71292929 * 2^31 -.word 1950087 // zeta^500 * 2^31 = 286215^500 * 2^31 = 10586616 * 2^31 -.word 677336697 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 
286215^500 * 71292929 * 2^31 -.word 61506475 // zeta^ 24 * 2^31 = 286215^ 24 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 24 * 71292929 * 2^31 -.word 60705671 // zeta^ 12 * 2^31 = 286215^ 12 * 2^31 = 15322485 * 2^31 -.word 3127823481 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 12 * 71292929 * 2^31 -.word 58406951 // zeta^268 * 2^31 = 286215^268 * 2^31 = 6173403 * 2^31 -.word 2542460889 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 286215^268 * 71292929 * 2^31 -.word 55340015 // zeta^280 * 2^31 = 286215^280 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 286215^280 * 71292929 * 2^31 -.word 23867373 // zeta^140 * 2^31 = 286215^140 * 2^31 = 14374018 * 2^31 -.word 919656467 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 286215^140 * 71292929 * 2^31 -.word 26669715 // zeta^396 * 2^31 = 286215^396 * 2^31 = 9325363 * 2^31 -.word 2744124781 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 286215^396 * 71292929 * 2^31 -.word 12255067 // zeta^152 * 2^31 = 286215^152 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 286215^152 * 71292929 * 2^31 -.word 39782807 // zeta^ 76 * 2^31 = 286215^ 76 * 2^31 = 5605608 * 2^31 -.word 358649449 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 76 * 71292929 * 2^31 -.word 17705221 // zeta^332 * 2^31 = 286215^332 * 2^31 = 25200773 * 2^31 -.word 3759841019 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 286215^332 * 71292929 * 2^31 -.word 39251459 // zeta^408 * 2^31 = 286215^408 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 286215^408 * 71292929 * 2^31 -.word 29369949 // zeta^204 * 2^31 = 286215^204 * 2^31 = 31727447 * 2^31 -.word 4177420707 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 286215^204 * 71292929 * 2^31 -.word 49812459 // zeta^460 * 2^31 = 286215^460 * 2^31 = 6658688 * 2^31 -.word 426026005 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 286215^460 * 71292929 * 2^31 -.word 13565749 // zeta^ 88 * 2^31 = 
286215^ 88 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 88 * 71292929 * 2^31 -.word 4594083 // zeta^ 44 * 2^31 = 286215^ 44 * 2^31 = 33297705 * 2^31 -.word 4277886557 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 44 * 71292929 * 2^31 -.word 7758757 // zeta^300 * 2^31 = 286215^300 * 2^31 = 486950 * 2^31 -.word 31155291 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 286215^300 * 71292929 * 2^31 -.word 36826073 // zeta^344 * 2^31 = 286215^344 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 286215^344 * 71292929 * 2^31 -.word 65137097 // zeta^172 * 2^31 = 286215^172 * 2^31 = 13215161 * 2^31 -.word 2992995895 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 286215^172 * 71292929 * 2^31 -.word 29507409 // zeta^428 * 2^31 = 286215^428 * 2^31 = 16752026 * 2^31 -.word 1071802543 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 286215^428 * 71292929 * 2^31 -.word 34487347 // zeta^216 * 2^31 = 286215^216 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 286215^216 * 71292929 * 2^31 -.word 38253055 // zeta^108 * 2^31 = 286215^108 * 2^31 = 14102887 * 2^31 -.word 3049793025 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 286215^108 * 71292929 * 2^31 -.word 39394541 // zeta^364 * 2^31 = 286215^364 * 2^31 = 32232983 * 2^31 -.word 4209765139 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 286215^364 * 71292929 * 2^31 -.word 61222515 // zeta^472 * 2^31 = 286215^472 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 286215^472 * 71292929 * 2^31 -.word 29082479 // zeta^236 * 2^31 = 286215^236 * 2^31 = 16009575 * 2^31 -.word 3171783825 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 286215^236 * 71292929 * 2^31 -.word 44583105 // zeta^492 * 2^31 = 286215^492 * 2^31 = 5365218 * 2^31 -.word 343269183 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 286215^492 * 71292929 * 2^31 -.word 62959157 // zeta^ 56 * 2^31 = 286215^ 56 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 
56 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 56 * 71292929 * 2^31 -.word 30585257 // zeta^ 28 * 2^31 = 286215^ 28 * 2^31 = 24042369 * 2^31 -.word 3685725783 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 28 * 71292929 * 2^31 -.word 15268201 // zeta^284 * 2^31 = 286215^284 * 2^31 = 27221548 * 2^31 -.word 1741647511 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 286215^284 * 71292929 * 2^31 -.word 51158985 // zeta^312 * 2^31 = 286215^312 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 286215^312 * 71292929 * 2^31 -.word 40572935 // zeta^156 * 2^31 = 286215^156 * 2^31 = 7233695 * 2^31 -.word 2610298873 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 286215^156 * 71292929 * 2^31 -.word 55301277 // zeta^412 * 2^31 = 286215^412 * 2^31 = 15385892 * 2^31 -.word 984396643 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 286215^412 * 71292929 * 2^31 -.word 59122583 // zeta^184 * 2^31 = 286215^184 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 286215^184 * 71292929 * 2^31 -.word 39625501 // zeta^ 92 * 2^31 = 286215^ 92 * 2^31 = 15700554 * 2^31 -.word 1004528867 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 92 * 71292929 * 2^31 -.word 5900879 // zeta^348 * 2^31 = 286215^348 * 2^31 = 17178032 * 2^31 -.word 1099058609 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 286215^348 * 71292929 * 2^31 -.word 12915351 // zeta^440 * 2^31 = 286215^440 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 286215^440 * 71292929 * 2^31 -.word 25272919 // zeta^220 * 2^31 = 286215^220 * 2^31 = 20482112 * 2^31 -.word 1310455209 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 286215^220 * 71292929 * 2^31 -.word 54885097 // zeta^476 * 2^31 = 286215^476 * 2^31 = 31908284 * 2^31 -.word 2041507095 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 286215^476 * 71292929 * 2^31 -.word 32364195 // zeta^120 * 2^31 = 286215^120 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 286215^120 * 71292929 * 
2^31 -.word 37675113 // zeta^ 60 * 2^31 = 286215^ 60 * 2^31 = 4869100 * 2^31 -.word 311527319 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 286215^ 60 * 71292929 * 2^31 -.word 35767195 // zeta^316 * 2^31 = 286215^316 * 2^31 = 29810009 * 2^31 -.word 4054742117 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 286215^316 * 71292929 * 2^31 -.word 17635297 // zeta^376 * 2^31 = 286215^376 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 286215^376 * 71292929 * 2^31 -.word 8442215 // zeta^188 * 2^31 = 286215^188 * 2^31 = 11135584 * 2^31 -.word 712459929 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 286215^188 * 71292929 * 2^31 -.word 45014229 // zeta^444 * 2^31 = 286215^444 * 2^31 = 18505659 * 2^31 -.word 3331484459 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 286215^444 * 71292929 * 2^31 -.word 38891533 // zeta^248 * 2^31 = 286215^248 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 286215^248 * 71292929 * 2^31 -.word 36750327 // zeta^124 * 2^31 = 286215^124 * 2^31 = 17484839 * 2^31 -.word 3266171913 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 286215^124 * 71292929 * 2^31 -.word 35947815 // zeta^380 * 2^31 = 286215^380 * 2^31 = 20168277 * 2^31 -.word 3437859545 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 286215^380 * 71292929 * 2^31 -.word 24452961 // zeta^504 * 2^31 = 286215^504 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 286215^504 * 71292929 * 2^31 -.word 30669833 // zeta^252 * 2^31 = 286215^252 * 2^31 = 31283961 * 2^31 -.word 4149046263 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 286215^252 * 71292929 * 2^31 -.word 20303881 // zeta^508 * 2^31 = 286215^508 * 2^31 = 17056436 * 2^31 -.word 1091278839 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 286215^508 * 71292929 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_1024_u32_33564673_286215_incomplete_skipfirst, %function -.global ntt_1024_u32_33564673_286215_incomplete_skipfirst 
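The twiddle table above stores each root as a pair of words: `zeta^k` (scaled by `2^31`) followed by a "twisted" copy premultiplied by `f(q^(-1) mod 2^32) = 71292929`, as the inline comments state. A minimal Python sketch of how those comment constants can be recomputed and cross-checked (an illustration only, not the generator script used by this repository):

```python
# Sanity-check the constants named in the twiddle-table comments above.
# q and zeta are taken from the comments / the movw-movt sequence below;
# this is an illustrative sketch, not the repository's actual generator.
q = 33564673        # modulus loaded into r9 via movw/movt
zeta = 286215       # root of unity for the 1024-point NTT

# "f(q^(-1) mod 2^32)" in the comments: the inverse of q modulo 2^32,
# which makes q * q_inv wrap around to 1 in 32-bit arithmetic.
q_inv = pow(q, -1, 2**32)
assert q_inv == 71292929
assert (q * q_inv) % 2**32 == 1

# Each .word pair corresponds to one power of zeta, e.g. the comment
# "zeta^ 8 * 2^31 = 286215^ 8 * 2^31 = 11708223 * 2^31":
assert pow(zeta, 8, q) == 11708223
assert pow(zeta, 248, q) == 33548892
assert pow(zeta, 504, q) == 29158115
```

The paired layout exists so the `ldrd`/`vqrdmulh`/`vmul`/`vqrdmlah` sequence in the function body can consume both forms of each twiddle without recomputing the twist at run time.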
-ntt_1024_u32_33564673_286215_incomplete_skipfirst: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -// Use r11 as marker for r0 + 3024 -add r11, r12, #1008 -// Use r10 as marker for r0 + 4032 -add r10, r11, #1008 -.equ modulus, 33564673 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[768]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 12)] -vqrdmulh.s32 Q1, Q2, r3 -// input[0]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 0)] -vmul.u32 Q2, Q2, r2 -// input[256]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 4)] -vsub.s32 Q0, Q3, Q4 -// input[512]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 8)] -vqrdmlah.s32 Q1, Q2, r9 -// input[772]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 16)] -vsub.s32 Q2, Q5, Q1 -vstrw.u32 Q2, [r11,#(48)] -// Release input[768] from Q2 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(32)] -// Release input[512] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q4 -// input[772]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vmul.u32 Q6, Q6, r2 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vsub.s32 Q0, Q2, Q4 -// input[516]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vqrdmlah.s32 Q1, Q6, r9 -// input[776]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 20)] -vstrw.u32 Q3, [r0,#(0)] -// Release input[0] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(64)] -// Release input[772] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(48)] -// Release input[516] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(32)] -// Release input[260] from Q4 -// input[776]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q7, Q7, r2 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] 
-vsub.s32 Q0, Q3, Q4 -// input[520]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 16)] -vqrdmlah.s32 Q1, Q7, r9 -// input[780]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 24)] -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(80)] -// Release input[776] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(64)] -// Release input[520] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q4 -// input[780]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q6, Q6, r2 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vsub.s32 Q0, Q2, Q4 -// input[524]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 20)] -vqrdmlah.s32 Q1, Q6, r9 -// input[784]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 28)] -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(96)] -// Release input[780] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(80)] -// Release input[524] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(64)] -// Release input[268] from Q4 -// input[784]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q7, Q7, r2 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vsub.s32 Q0, Q3, Q4 -// input[528]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vqrdmlah.s32 Q1, Q7, r9 -// input[788]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 32)] -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(112)] -// Release input[784] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(96)] -// Release input[528] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(80)] -// Release input[272] from Q4 -// input[788]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q6, Q6, r2 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vsub.s32 Q0, Q2, Q4 -// 
input[532]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 28)] -vqrdmlah.s32 Q1, Q6, r9 -// input[792]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 36)] -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(128)] -// Release input[788] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(112)] -// Release input[532] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(96)] -// Release input[276] from Q4 -// input[792]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q7, Q7, r2 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vsub.s32 Q0, Q3, Q4 -// input[536]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 32)] -vqrdmlah.s32 Q1, Q7, r9 -// input[796]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 40)] -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(144)] -// Release input[792] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(128)] -// Release input[536] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q4 -// input[796]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q6, r2 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vsub.s32 Q0, Q2, Q4 -// input[540]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vqrdmlah.s32 Q1, Q6, r9 -// input[800]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 44)] -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(160)] -// Release input[796] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(144)] -// Release input[540] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q4 -// input[800]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q7, Q7, r2 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vsub.s32 Q0, Q3, Q4 -// input[544]: Load as Q5 
-vldrw.u32 Q5, [r12, #(4 * 40)] -vqrdmlah.s32 Q1, Q7, r9 -// input[804]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 48)] -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(176)] -// Release input[800] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(160)] -// Release input[544] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q4 -// input[804]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vmul.u32 Q6, Q6, r2 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vsub.s32 Q0, Q2, Q4 -// input[548]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 44)] -vqrdmlah.s32 Q1, Q6, r9 -// input[808]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 52)] -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(192)] -// Release input[804] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(176)] -// Release input[548] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q4 -// input[808]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q7, Q7, r2 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vsub.s32 Q0, Q3, Q4 -// input[552]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 48)] -vqrdmlah.s32 Q1, Q7, r9 -// input[812]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 56)] -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(208)] -// Release input[808] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(192)] -// Release input[552] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(176)] -// Release input[296] from Q4 -// input[812]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vmul.u32 Q6, Q6, r2 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vsub.s32 Q0, Q2, Q4 -// input[556]: Load as Q5 -vldrw.u32 Q5, 
[r12, #(4 * 52)] -vqrdmlah.s32 Q1, Q6, r9 -// input[816]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 60)] -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(224)] -// Release input[812] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(208)] -// Release input[556] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q4 -// input[816]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vmul.u32 Q7, Q7, r2 -// input[304]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vsub.s32 Q0, Q3, Q4 -// input[560]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 56)] -vqrdmlah.s32 Q1, Q7, r9 -// input[820]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 64)] -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(240)] -// Release input[816] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(224)] -// Release input[560] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q4 -// input[820]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q6, r2 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vsub.s32 Q0, Q2, Q4 -// input[564]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vqrdmlah.s32 Q1, Q6, r9 -// input[824]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 68)] -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(256)] -// Release input[820] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(240)] -// Release input[564] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(224)] -// Release input[308] from Q4 -// input[824]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q7, Q7, r2 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vsub.s32 Q0, Q3, Q4 -// input[568]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 64)] 
-vqrdmlah.s32 Q1, Q7, r9 -// input[828]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 72)] -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(272)] -// Release input[824] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(256)] -// Release input[568] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(240)] -// Release input[312] from Q4 -// input[828]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vmul.u32 Q6, Q6, r2 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vsub.s32 Q0, Q2, Q4 -// input[572]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 68)] -vqrdmlah.s32 Q1, Q6, r9 -// input[832]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 76)] -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(288)] -// Release input[828] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(272)] -// Release input[572] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q4 -// input[832]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmul.u32 Q7, Q7, r2 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vsub.s32 Q0, Q3, Q4 -// input[576]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vqrdmlah.s32 Q1, Q7, r9 -// input[836]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 80)] -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(304)] -// Release input[832] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(288)] -// Release input[576] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(272)] -// Release input[320] from Q4 -// input[836]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q6, Q6, r2 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vsub.s32 Q0, Q2, Q4 -// input[580]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, 
Q6, r9 -// input[840]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 84)] -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(320)] -// Release input[836] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(304)] -// Release input[580] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(288)] -// Release input[324] from Q4 -// input[840]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q7, Q7, r2 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vsub.s32 Q0, Q3, Q4 -// input[584]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 80)] -vqrdmlah.s32 Q1, Q7, r9 -// input[844]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 88)] -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(336)] -// Release input[840] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(320)] -// Release input[584] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q4 -// input[844]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vmul.u32 Q6, Q6, r2 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vsub.s32 Q0, Q2, Q4 -// input[588]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q6, r9 -// input[848]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 92)] -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(352)] -// Release input[844] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(336)] -// Release input[588] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(320)] -// Release input[332] from Q4 -// input[848]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q7, Q7, r2 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vsub.s32 Q0, Q3, Q4 -// input[592]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q7, r9 -// 
input[852]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 96)] -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(368)] -// Release input[848] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(352)] -// Release input[592] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(336)] -// Release input[336] from Q4 -// input[852]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vmul.u32 Q6, Q6, r2 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vsub.s32 Q0, Q2, Q4 -// input[596]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 92)] -vqrdmlah.s32 Q1, Q6, r9 -// input[856]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 100)] -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(384)] -// Release input[852] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(368)] -// Release input[596] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(352)] -// Release input[340] from Q4 -// input[856]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q7, Q7, r2 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vsub.s32 Q0, Q3, Q4 -// input[600]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q7, r9 -// input[860]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 104)] -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(400)] -// Release input[856] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(384)] -// Release input[600] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(368)] -// Release input[344] from Q4 -// input[860]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vmul.u32 Q6, Q6, r2 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vsub.s32 Q0, Q2, Q4 -// input[604]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 100)] -vqrdmlah.s32 Q1, Q6, r9 -// input[864]: Load 
as Q7 -vldrw.u32 Q7, [r11, #(4 * 108)] -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(416)] -// Release input[860] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(400)] -// Release input[604] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(384)] -// Release input[348] from Q4 -// input[864]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q7, Q7, r2 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vsub.s32 Q0, Q3, Q4 -// input[608]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 104)] -vqrdmlah.s32 Q1, Q7, r9 -// input[868]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 112)] -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vsub.s32 Q7, Q5, Q1 -vstrw.u32 Q7, [r11,#(432)] -// Release input[864] from Q7 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(416)] -// Release input[608] from Q5 -vadd.s32 Q3, Q3, Q4 -vstrw.u32 Q0, [r14,#(400)] -// Release input[352] from Q4 -// input[868]: Already loaded as Q6 -vqrdmulh.s32 Q1, Q6, r3 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vmul.u32 Q6, Q6, r2 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vsub.s32 Q0, Q2, Q4 -// input[612]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vqrdmlah.s32 Q1, Q6, r9 -// input[872]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 116)] -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q6, Q5, Q1 -vstrw.u32 Q6, [r11,#(448)] -// Release input[868] from Q6 -vadd.s32 Q5, Q5, Q1 -vstrw.u32 Q5, [r12,#(432)] -// Release input[612] from Q5 -vadd.s32 Q2, Q2, Q4 -vstrw.u32 Q0, [r14,#(416)] -// Release input[356] from Q4 -// input[872]: Already loaded as Q7 -vqrdmulh.s32 Q1, Q7, r3 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q7, Q7, r2 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vsub.s32 Q0, Q3, Q4 -// input[616]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 112)] -vqrdmlah.s32 Q1, Q7, r9 -// input[876]: Load as Q6 
-vldrw.u32 Q6, [r11, #(4 * 120)]
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r11,#(464)]
-// Release input[872] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(448)]
-// Release input[616] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(432)]
-// Release input[360] from Q4
-// input[876]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 108)]
-vmul.u32 Q6, Q6, r2
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vsub.s32 Q0, Q2, Q4
-// input[620]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 116)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[880]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 124)]
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r11,#(480)]
-// Release input[876] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(464)]
-// Release input[620] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r14,#(448)]
-// Release input[364] from Q4
-// input[880]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q7, Q7, r2
-// input[368]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 116)]
-vsub.s32 Q0, Q3, Q4
-// input[624]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 120)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[884]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -124)]
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r11,#(496)]
-// Release input[880] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(480)]
-// Release input[624] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q4
-// input[884]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vmul.u32 Q6, Q6, r2
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vsub.s32 Q0, Q2, Q4
-// input[628]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 124)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[888]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -120)]
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-496)]
-// Release input[884] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r12,#(496)]
-// Release input[628] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r14,#(480)]
-// Release input[372] from Q4
-// input[888]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q7, Q7, r2
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vsub.s32 Q0, Q3, Q4
-// input[632]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -124)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[892]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -116)]
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-480)]
-// Release input[888] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-496)]
-// Release input[632] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q4
-// input[892]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vmul.u32 Q6, Q6, r2
-// input[380]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -124)]
-vsub.s32 Q0, Q2, Q4
-// input[636]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -120)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[896]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -112)]
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-464)]
-// Release input[892] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-480)]
-// Release input[636] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q4
-// input[896]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vmul.u32 Q7, Q7, r2
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vsub.s32 Q0, Q3, Q4
-// input[640]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -116)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[900]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -108)]
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-448)]
-// Release input[896] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-464)]
-// Release input[640] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q4
-// input[900]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vmul.u32 Q6, Q6, r2
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vsub.s32 Q0, Q2, Q4
-// input[644]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -112)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[904]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -104)]
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-432)]
-// Release input[900] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-448)]
-// Release input[644] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-464)]
-// Release input[388] from Q4
-// input[904]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q7, Q7, r2
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vsub.s32 Q0, Q3, Q4
-// input[648]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -108)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[908]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -100)]
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-416)]
-// Release input[904] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-432)]
-// Release input[648] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q4
-// input[908]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vmul.u32 Q6, Q6, r2
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vsub.s32 Q0, Q2, Q4
-// input[652]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -104)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[912]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -96)]
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-400)]
-// Release input[908] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-416)]
-// Release input[652] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q4
-// input[912]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q7, Q7, r2
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vsub.s32 Q0, Q3, Q4
-// input[656]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -100)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[916]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -92)]
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-384)]
-// Release input[912] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-400)]
-// Release input[656] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-416)]
-// Release input[400] from Q4
-// input[916]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vmul.u32 Q6, Q6, r2
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vsub.s32 Q0, Q2, Q4
-// input[660]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -96)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[920]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -88)]
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-368)]
-// Release input[916] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-384)]
-// Release input[660] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-400)]
-// Release input[404] from Q4
-// input[920]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q7, Q7, r2
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vsub.s32 Q0, Q3, Q4
-// input[664]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -92)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[924]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -84)]
-vstrw.u32 Q2, [r14,#(-416)]
-// Release input[148] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-352)]
-// Release input[920] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-368)]
-// Release input[664] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-384)]
-// Release input[408] from Q4
-// input[924]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vmul.u32 Q6, Q6, r2
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vsub.s32 Q0, Q2, Q4
-// input[668]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -88)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[928]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -80)]
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-336)]
-// Release input[924] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-352)]
-// Release input[668] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-368)]
-// Release input[412] from Q4
-// input[928]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q7, Q7, r2
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vsub.s32 Q0, Q3, Q4
-// input[672]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -84)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[932]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -76)]
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-320)]
-// Release input[928] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-336)]
-// Release input[672] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-352)]
-// Release input[416] from Q4
-// input[932]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vmul.u32 Q6, Q6, r2
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vsub.s32 Q0, Q2, Q4
-// input[676]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -80)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[936]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -72)]
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-304)]
-// Release input[932] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-320)]
-// Release input[676] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-336)]
-// Release input[420] from Q4
-// input[936]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q7, Q7, r2
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vsub.s32 Q0, Q3, Q4
-// input[680]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -76)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[940]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -68)]
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-288)]
-// Release input[936] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-304)]
-// Release input[680] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-320)]
-// Release input[424] from Q4
-// input[940]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vmul.u32 Q6, Q6, r2
-// input[428]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -76)]
-vsub.s32 Q0, Q2, Q4
-// input[684]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -72)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[944]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -64)]
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-272)]
-// Release input[940] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-288)]
-// Release input[684] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-304)]
-// Release input[428] from Q4
-// input[944]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q7, Q7, r2
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vsub.s32 Q0, Q3, Q4
-// input[688]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -68)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[948]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -60)]
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-256)]
-// Release input[944] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-272)]
-// Release input[688] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-288)]
-// Release input[432] from Q4
-// input[948]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vmul.u32 Q6, Q6, r2
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vsub.s32 Q0, Q2, Q4
-// input[692]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -64)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[952]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -56)]
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-240)]
-// Release input[948] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-256)]
-// Release input[692] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q4
-// input[952]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q7, Q7, r2
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vsub.s32 Q0, Q3, Q4
-// input[696]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -60)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[956]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -52)]
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-224)]
-// Release input[952] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-240)]
-// Release input[696] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-256)]
-// Release input[440] from Q4
-// input[956]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vmul.u32 Q6, Q6, r2
-// input[444]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -60)]
-vsub.s32 Q0, Q2, Q4
-// input[700]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -56)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[960]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -48)]
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-208)]
-// Release input[956] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-224)]
-// Release input[700] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-240)]
-// Release input[444] from Q4
-// input[960]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vmul.u32 Q7, Q7, r2
-// input[448]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -56)]
-vsub.s32 Q0, Q3, Q4
-// input[704]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -52)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[964]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -44)]
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-192)]
-// Release input[960] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-208)]
-// Release input[704] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q4
-// input[964]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vmul.u32 Q6, Q6, r2
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vsub.s32 Q0, Q2, Q4
-// input[708]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -48)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[968]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -40)]
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-176)]
-// Release input[964] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-192)]
-// Release input[708] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q4
-// input[968]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q7, Q7, r2
-// input[456]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -48)]
-vsub.s32 Q0, Q3, Q4
-// input[712]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -44)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[972]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -36)]
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-160)]
-// Release input[968] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-176)]
-// Release input[712] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q4
-// input[972]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vmul.u32 Q6, Q6, r2
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -44)]
-vsub.s32 Q0, Q2, Q4
-// input[716]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -40)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[976]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -32)]
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-144)]
-// Release input[972] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-160)]
-// Release input[716] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-176)]
-// Release input[460] from Q4
-// input[976]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q7, Q7, r2
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vsub.s32 Q0, Q3, Q4
-// input[720]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -36)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[980]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -28)]
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-128)]
-// Release input[976] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-144)]
-// Release input[720] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-160)]
-// Release input[464] from Q4
-// input[980]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vmul.u32 Q6, Q6, r2
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vsub.s32 Q0, Q2, Q4
-// input[724]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -32)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[984]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -24)]
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-112)]
-// Release input[980] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-128)]
-// Release input[724] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q4
-// input[984]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q7, Q7, r2
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vsub.s32 Q0, Q3, Q4
-// input[728]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[988]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -20)]
-vstrw.u32 Q2, [r14,#(-160)]
-// Release input[212] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-96)]
-// Release input[984] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-112)]
-// Release input[728] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-128)]
-// Release input[472] from Q4
-// input[988]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vmul.u32 Q6, Q6, r2
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vsub.s32 Q0, Q2, Q4
-// input[732]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -24)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[992]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -16)]
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-80)]
-// Release input[988] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-96)]
-// Release input[732] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-112)]
-// Release input[476] from Q4
-// input[992]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q7, Q7, r2
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vsub.s32 Q0, Q3, Q4
-// input[736]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[996]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -12)]
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-64)]
-// Release input[992] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-80)]
-// Release input[736] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-96)]
-// Release input[480] from Q4
-// input[996]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vmul.u32 Q6, Q6, r2
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vsub.s32 Q0, Q2, Q4
-// input[740]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[1000]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * -8)]
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-48)]
-// Release input[996] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-64)]
-// Release input[740] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-80)]
-// Release input[484] from Q4
-// input[1000]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q7, Q7, r2
-// input[488]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -16)]
-vsub.s32 Q0, Q3, Q4
-// input[744]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[1004]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * -4)]
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(-32)]
-// Release input[1000] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-48)]
-// Release input[744] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-64)]
-// Release input[488] from Q4
-// input[1004]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vmul.u32 Q6, Q6, r2
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vsub.s32 Q0, Q2, Q4
-// input[748]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[1008]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * 0)]
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(-16)]
-// Release input[1004] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-32)]
-// Release input[748] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q4
-// input[1008]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmul.u32 Q7, Q7, r2
-// input[496]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -8)]
-vsub.s32 Q0, Q3, Q4
-// input[752]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[1012]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * 4)]
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(0)]
-// Release input[1008] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(-16)]
-// Release input[752] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q4
-// input[1012]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vmul.u32 Q6, Q6, r2
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vsub.s32 Q0, Q2, Q4
-// input[756]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[1016]: Load as Q7
-vldrw.u32 Q7, [r10, #(4 * 8)]
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(16)]
-// Release input[1012] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(0)]
-// Release input[756] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(-16)]
-// Release input[500] from Q4
-// input[1016]: Already loaded as Q7
-vqrdmulh.s32 Q1, Q7, r3
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vmul.u32 Q7, Q7, r2
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vsub.s32 Q0, Q3, Q4
-// input[760]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 4)]
-vqrdmlah.s32 Q1, Q7, r9
-// input[1020]: Load as Q6
-vldrw.u32 Q6, [r10, #(4 * 12)]
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vsub.s32 Q7, Q5, Q1
-vstrw.u32 Q7, [r10,#(32)]
-// Release input[1016] from Q7
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(16)]
-// Release input[760] from Q5
-vadd.s32 Q3, Q3, Q4
-vstrw.u32 Q0, [r12,#(0)]
-// Release input[504] from Q4
-// input[1020]: Already loaded as Q6
-vqrdmulh.s32 Q1, Q6, r3
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vmul.u32 Q6, Q6, r2
-// input[508]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 4)]
-vsub.s32 Q0, Q2, Q4
-// input[764]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 8)]
-vqrdmlah.s32 Q1, Q6, r9
-// input[192]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -60)]
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q6, Q5, Q1
-vstrw.u32 Q6, [r10,#(48)]
-// Release input[1020] from Q6
-vadd.s32 Q5, Q5, Q1
-vstrw.u32 Q5, [r11,#(32)]
-// Release input[764] from Q5
-vadd.s32 Q2, Q2, Q4
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q4
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[192]: Already loaded as Q7
-vqrdmulh.s32 Q0, Q7, r7
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q7, Q7, r6
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q0, Q7, r9
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmulh.s32 Q2, Q1, r7
-vsub.s32 Q7, Q3, Q0
-vmul.u32 Q1, Q1, r6
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q2, Q1, r9
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q4, Q7, r3
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q7, Q7, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q4, Q7, r9
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q7, Q1, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q1, Q1, Q4
-vstrw.u32 Q7, [r14,#(-240)]
-// Release input[192] from Q7
-vqrdmlah.s32 Q5, Q3, r9
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vsub.s32 Q3, Q0, Q5
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q0, Q0, Q5
-// input[196]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vmul.u32 Q2, Q2, r6
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q0, Q0, r6
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q1, Q1, r6
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r6
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[212]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[212]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-160)]
-// Release input[212] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q1, Q1, r6
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[24]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r6
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(96)]
-// Release input[24] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[224]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q0, Q0, r6
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r6
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[36]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r6
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(144)]
-// Release input[36] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[236]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[236]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q0, Q0, r6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(160)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-64)]
-// Release input[236] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q1, Q1, r6
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r6
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r0,#(192)]
-// Release input[48] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q0, Q0, r6
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q1, Q1, r6
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4,
Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[448]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[448]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vmul.u32 Q2, Q2, r6 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[452]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-224)] -// Release input[448] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[452]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[388]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q1, Q3, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[456]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-464)] -// Release input[388] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[456]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q1, Q1, r6 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[460]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-192)] -// Release input[456] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[460]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[396]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, 
[r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[464]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-432)] -// Release input[396] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[464]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[468]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-160)] -// Release input[464] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[468]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q1, Q1, r6 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[276]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 
-vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[472]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-144)] -// Release input[468] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[472]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(96)] -// Release input[276] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[476]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-128)] -// Release input[472] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// 
input[480]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[480]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q1, Q1, r6 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[484]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-96)] -// Release input[480] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[484]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[292]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[488]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -16)] 
-vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-80)] -// Release input[484] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[488]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(160)] -// Release input[292] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[296]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[492]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-64)] -// Release input[488] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[492]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-48)] -// Release input[492] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[504]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] 
from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[504]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q1, Q1, r6 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[312]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(0)] -// Release input[504] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(240)] -// Release input[312] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r12,#(-240)] -// Release 
input[444] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[704]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q0, Q0, r6 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[708]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-208)] -// Release input[704] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[708]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q1, Q1, r6 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[516]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[712]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-192)] -// Release input[708] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-448)] -// 
Release input[644] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[712]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(48)] -// Release input[516] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[520]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-176)] -// Release input[712] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(64)] -// Release input[520] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, 
[r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[720]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q1, Q1, r6 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[724]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(96)] -// Release input[528] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[532]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[728]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-128)] -// Release input[724] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q1, Q1, Q6 -// 
input[728]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(112)] -// Release input[532] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[536]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[732]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-112)] -// Release input[728] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[732]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q1, Q1, r6 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(128)] -// Release input[536] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[736]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-96)] -// Release input[732] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[736]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// 
input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[544]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[740]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-80)] -// Release input[736] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[740]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(160)] -// Release input[544] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-64)] -// Release input[740] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[744]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q1, 
Q1, r6 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(176)] -// Release input[548] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[552]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-48)] -// Release input[744] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(192)] -// Release input[552] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[556]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[752]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] 
-vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(208)] -// Release input[556] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[756]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-16)] -// Release input[752] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q1, Q1, r6 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[564]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[760]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[760]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r12,#(240)] -// Release 
input[564] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[568]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(16)] -// Release input[760] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r12,#(256)] -// Release input[568] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[960]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[960]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[896]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -112)] -vmul.u32 Q1, Q1, r6 -// input[832]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r12,#(272)] -// 
Release input[572] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[768]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-192)] -// Release input[960] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-448)] -// Release input[896] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release input[832] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[964]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[900]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -108)] -vmul.u32 Q2, Q2, r6 -// input[836]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(48)] -// Release input[768] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[772]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-432)] -// Release input[900] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release input[836] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[968]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[904]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -104)] -vmul.u32 Q0, Q0, r6 -// input[840]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 84)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(64)] -// Release input[772] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, 
Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[776]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[972]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-416)] -// Release input[904] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(336)] -// Release input[840] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[972]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[908]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -100)] -vmul.u32 Q1, Q1, r6 -// input[844]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 88)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release input[776] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[780]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[976]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-144)] -// Release input[972] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-400)] -// Release input[908] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(352)] -// Release input[844] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[976]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[912]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -96)] -vmul.u32 Q2, Q2, r6 -// input[848]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 92)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(96)] -// Release input[780] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 
-// input[784]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[980]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-128)] -// Release input[976] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-384)] -// Release input[912] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(368)] -// Release input[848] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[980]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[916]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -92)] -vmul.u32 Q0, Q0, r6 -// input[852]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 96)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(112)] -// Release input[784] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[788]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[984]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-112)] -// Release input[980] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-368)] -// Release input[916] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(384)] -// Release input[852] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[984]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[920]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -88)] -vmul.u32 Q1, Q1, r6 -// input[856]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(128)] -// Release input[788] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[792]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] 
-vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[988]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-96)] -// Release input[984] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(400)] -// Release input[856] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[988]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[924]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -84)] -vmul.u32 Q2, Q2, r6 -// input[860]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release input[792] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[796]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[992]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-80)] -// Release input[988] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-336)] -// Release input[924] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(416)] -// Release input[860] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[992]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[928]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -80)] -vmul.u32 Q0, Q0, r6 -// input[864]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(160)] -// Release input[796] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[800]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 
-vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[996]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-64)] -// Release input[992] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-320)] -// Release input[928] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release input[864] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[996]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[932]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -76)] -vmul.u32 Q1, Q1, r6 -// input[868]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(176)] -// Release input[800] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[804]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1000]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-48)] -// Release input[996] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release input[868] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1000]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q2, Q2, r6 -// input[872]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(192)] -// Release input[804] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[808]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1004]: Load 
as Q0 -vldrw.u32 Q0, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-32)] -// Release input[1000] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(464)] -// Release input[872] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1004]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q0, Q0, r6 -// input[876]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release input[808] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[812]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[1008]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-16)] -// Release input[1004] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(480)] -// Release input[876] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1008]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[944]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -64)] -vmul.u32 Q1, Q1, r6 -// input[880]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 124)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(224)] -// Release input[812] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[816]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 60)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1012]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 
-vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(0)] -// Release input[1008] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-256)] -// Release input[944] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(496)] -// Release input[880] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1012]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[948]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -60)] -vmul.u32 Q2, Q2, r6 -// input[884]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -124)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(240)] -// Release input[816] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[820]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1016]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(16)] -// Release input[1012] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-240)] -// Release input[948] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-496)] -// Release input[884] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1016]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q0, Q0, r6 -// input[888]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -120)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(256)] -// Release input[820] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[824]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[1020]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q0, [r10,#(32)] -// Release input[1016] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-480)] -// Release input[888] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1020]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[956]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -52)] -vmul.u32 Q1, Q1, r6 -// input[892]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -116)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release input[824] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[828]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(48)] -// Release input[1020] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-208)] -// Release input[956] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-464)] -// Release input[892] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q2, Q2, r6 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(288)] -// Release input[828] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 
Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q0, Q0, r6 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] 
from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q2, Q2, r6 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[112]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q0, Q0, r6 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 
-vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q1, Q1, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[68]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(464)] -// Release input[116] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r6 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(272)] -// Release input[68] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: 
Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r6 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[176]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[180]: Already 
loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r6 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r6 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[172]: 
Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q1, Q1, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q2, Q2, r6 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// 
input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q0, Q0, r6 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q1, Q1, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[200]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 
Q2, Q2, r6
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vmul.u32 Q0, Q0, r6
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[308]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[308]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q1, Q1, r6
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[260]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[312]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(224)]
-// Release input[308] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[312]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q2, Q2, r6
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(32)]
-// Release input[260] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(240)]
-// Release input[312] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[368]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[368]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q1, Q1, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(464)]
-// Release input[368] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q2, Q2, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[380]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[380]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[332]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-496)]
-// Release input[380] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[432]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q2, Q2, r6
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(320)]
-// Release input[332] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[436]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[440]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-464)]
-// Release input[388] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[444]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-256)]
-// Release input[440] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[444]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q2, Q2, r6
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[396]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-240)]
-// Release input[444] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q0, Q0, r6
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(-432)]
-// Release input[396] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[500]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[500]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[504]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-16)]
-// Release input[500] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q2, Q2, r6
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[508]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(0)]
-// Release input[504] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[508]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(-192)]
-// Release input[456] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[560]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[560]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q1, Q1, r6
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[512]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[564]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(224)]
-// Release input[560] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[564]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q2, Q2, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(32)]
-// Release input[512] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[568]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(240)]
-// Release input[564] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[568]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(48)]
-// Release input[516] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[520]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[572]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(256)]
-// Release input[568] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[572]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q1, Q1, r6
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(64)]
-// Release input[520] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[524]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[624]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(272)]
-// Release input[572] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[624]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q2, Q2, r6
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(80)]
-// Release input[524] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[576]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[628]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(480)]
-// Release input[624] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[628]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(288)]
-// Release input[576] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[580]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[632]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(496)]
-// Release input[628] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[632]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(304)]
-// Release input[580] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[584]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[636]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-496)]
-// Release input[632] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[636]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q2, Q2, r6
-// input[604]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 100)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(320)]
-// Release input[584] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[588]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 84)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-480)]
-// Release input[636] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(400)]
-// Release input[604] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[688]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vmul.u32 Q0, Q0, r6
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(336)]
-// Release input[588] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[692]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[692]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[644]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[696]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-256)]
-// Release input[692] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[696]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q2, Q2, r6
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-448)]
-// Release input[644] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[648]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[700]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-240)]
-// Release input[696] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[700]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q0, Q0, r6
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-432)]
-// Release input[648] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[652]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -104)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[752]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-224)]
-// Release input[700] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[752]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q1, Q1, r6
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-416)]
-// Release input[652] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[756]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-16)]
-// Release input[752] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[756]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q2, Q2, r6
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-208)]
-// Release input[704] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[708]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[760]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(0)]
-// Release input[756] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[760]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q0, Q0, r6
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-192)]
-// Release input[708] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[712]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[764]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(16)]
-// Release input[760] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-112)]
-// Release input[728] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[764]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vmul.u32 Q1, Q1, r6
-// input[732]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -24)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-176)]
-// Release input[712] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[716]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[816]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(32)]
-// Release input[764] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release input[748] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-96)]
-// Release input[732] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[816]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[800]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 44)]
-vmul.u32 Q2, Q2, r6
-// input[784]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 28)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-160)]
-// Release input[716] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[768]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[820]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(240)]
-// Release input[816] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(176)]
-// Release input[800] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(112)]
-// Release input[784] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[820]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[804]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 48)]
-vmul.u32 Q0, Q0, r6
-// input[788]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 32)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release input[768] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[772]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[824]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(256)]
-// Release input[820] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(192)]
-// Release input[804] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(128)]
-// Release input[788] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[824]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[808]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vmul.u32 Q1, Q1, r6
-// input[792]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(64)]
-// Release input[772] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[776]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[828]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[824] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release input[808] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(144)]
-// Release input[792] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[828]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[812]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vmul.u32 Q2, Q2, r6
-// input[796]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 40)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(80)]
-// Release input[776] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[780]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 24)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[880]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(288)]
-// Release input[828] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release input[812] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(160)]
-// Release input[796] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[880]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[864]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 108)]
-vmul.u32 Q0, Q0, r6
-// input[848]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 92)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(96)]
-// Release input[780] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[832]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[884]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release input[880] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(432)]
-// Release input[864] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(368)]
-// Release input[848] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[884]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[868]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 112)]
-vmul.u32 Q1, Q1, r6
-// input[852]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(304)]
-// Release input[832] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[836]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[888]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-496)]
-// Release input[884] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(448)]
-// Release input[868] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(384)]
-// Release input[852] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[888]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[872]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vmul.u32 Q2, Q2, r6
-// input[856]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 100)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(320)]
-// Release input[836] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[840]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 84)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[892]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-480)]
-// Release input[888] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release input[872] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(400)]
-// Release input[856] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[892]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[876]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 120)]
-vmul.u32 Q0, Q0, r6
-// input[860]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(336)]
-// Release input[840] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[844]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[944]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-464)]
-// Release input[892] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(480)]
-// Release input[876] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(416)]
-// Release input[860] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[944]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[928]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -80)]
-vmul.u32 Q1, Q1, r6
-// input[912]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(352)]
-// Release input[844] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[896]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[948]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-256)]
-// Release input[944] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-320)]
-// Release input[928] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-384)]
-// Release input[912] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[948]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[932]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -76)]
-vmul.u32 Q2, Q2, r6
-// input[916]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r10,#(-448)]
-// Release input[896] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[900]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -108)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[952]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-240)]
-// Release input[948] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-304)] -// Release input[932] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[952]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q0, Q0, r6 -// input[920]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-432)] -// Release input[900] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[904]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[956]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-224)] -// Release input[952] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-352)] -// Release input[920] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[956]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[940]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -68)] -vmul.u32 Q1, Q1, r6 -// input[924]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r10,#(-416)] -// Release input[904] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[908]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1008]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-208)] -// Release input[956] from Q1 -vqrdmlah.s32 Q6, Q4, r9 
-vstrw.u32 Q3, [r10,#(-272)] -// Release input[940] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-336)] -// Release input[924] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1008]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[992]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -16)] -vmul.u32 Q2, Q2, r6 -// input[976]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -32)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r10,#(-400)] -// Release input[908] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[960]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1012]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(0)] -// Release input[1008] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-64)] -// Release input[992] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-128)] -// Release input[976] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[1012]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[996]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -12)] -vmul.u32 Q0, Q0, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-192)] -// Release input[960] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[964]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[1016]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(16)] -// Release input[1012] from Q0 -vqrdmlah.s32 Q6, 
Q4, r9 -vstrw.u32 Q3, [r10,#(-48)] -// Release input[996] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[1016]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q1, Q1, r6 -// input[984]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -24)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r10,#(-176)] -// Release input[964] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[968]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1020]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(32)] -// Release input[1016] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-96)] -// Release input[984] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[1020]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[1004]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -4)] -vmul.u32 Q2, Q2, r6 -// input[988]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -20)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r10,#(-160)] -// Release input[968] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[972]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -36)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(48)] -// Release input[1020] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-16)] -// Release input[1004] from Q3 
-vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-80)] -// Release input[988] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[12]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-144)] -// Release input[972] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[28]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r6 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(96)] -// 
Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r6 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r9 
-vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[76]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q1, Q1, r6 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[80]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release 
input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[108]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r6 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r0,#(320)] -// Release input[80] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(432)] -// Release input[108] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r6 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r6 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[172]: Load as Q1 -vldrw.u32 Q1, 
[r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-384)] -// Release input[156] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[172]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r6 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 
-vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q0, Q0, r6 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q1, Q1, r6 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[208]: Load as Q0 -vldrw.u32 Q0, 
[r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(-176)] -// Release input[208] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q0, Q0, r6 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 
Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[268]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vmul.u32 Q1, Q1, r6 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[284]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q2, Q2, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, 
Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[272]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[300]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q0, Q0, r6 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r14,#(80)] -// Release input[272] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[316]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[312]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 60)] -vmul.u32 Q1, Q1, r6 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 
Q2, [r14,#(144)]
-// Release input[288] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[332]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q2, Q2, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[348]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[348]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r6
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[364]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(384)]
-// Release input[348] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[364]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q1, Q1, r6
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r14,#(336)]
-// Release input[336] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[380]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(448)]
-// Release input[364] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q2, Q2, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r14,#(400)]
-// Release input[352] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[368]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[396]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[396]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r6
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r14,#(464)]
-// Release input[368] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[384]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[412]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[412]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q1, Q1, r6
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-480)]
-// Release input[384] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[428]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-368)]
-// Release input[412] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[428]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q2, Q2, r6
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-416)]
-// Release input[400] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[416]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[444]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-304)]
-// Release input[428] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[444]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q0, Q0, r6
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(-352)]
-// Release input[416] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-240)]
-// Release input[444] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[460]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q1, Q1, r6
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[476]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-208)]
-// Release input[452] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[476]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q2, Q2, r6
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-112)]
-// Release input[476] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q0, Q0, r6
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[480]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[508]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[508]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vmul.u32 Q1, Q1, r6
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(-96)]
-// Release input[480] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[524]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 20)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(16)]
-// Release input[508] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(0)]
-// Release input[504] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-16)]
-// Release input[500] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[524]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vmul.u32 Q2, Q2, r6
-// input[516]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[512]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[540]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 36)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(80)]
-// Release input[524] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(48)]
-// Release input[516] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[540]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r6
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(32)]
-// Release input[512] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[528]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[556]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(144)]
-// Release input[540] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[556]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q1, Q1, r6
-// input[548]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(96)]
-// Release input[528] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[544]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[572]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(208)]
-// Release input[556] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[572]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vmul.u32 Q2, Q2, r6
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(160)]
-// Release input[544] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[560]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[588]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(272)]
-// Release input[572] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(240)]
-// Release input[564] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[588]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q0, Q0, r6
-// input[580]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 76)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(224)]
-// Release input[560] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[576]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[604]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(336)]
-// Release input[588] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(304)]
-// Release input[580] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[604]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q1, Q1, r6
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[592]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 88)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[620]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(400)]
-// Release input[604] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[620]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q2, Q2, r6
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r12,#(352)]
-// Release input[592] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[608]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 104)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[636]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(464)]
-// Release input[620] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[636]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q0, Q0, r6
-// input[628]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 124)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r12,#(416)]
-// Release input[608] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[624]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 120)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[652]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-480)]
-// Release input[636] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(496)]
-// Release input[628] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[652]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[648]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vmul.u32 Q1, Q1, r6
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r12,#(480)]
-// Release input[624] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[640]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[668]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-416)]
-// Release input[652] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-432)]
-// Release input[648] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[668]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vmul.u32 Q2, Q2, r6
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-464)]
-// Release input[640] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[656]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -100)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[684]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-352)]
-// Release input[668] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[684]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q0, Q0, r6
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-400)]
-// Release input[656] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[672]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -84)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[700]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-288)]
-// Release input[684] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[700]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[696]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -60)]
-vmul.u32 Q1, Q1, r6
-// input[692]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -64)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-336)]
-// Release input[672] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[716]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-224)]
-// Release input[700] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-240)]
-// Release input[696] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-256)]
-// Release input[692] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[716]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[712]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vmul.u32 Q2, Q2, r6
-// input[708]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[704]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[732]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-160)]
-// Release input[716] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release input[712] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release input[708] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[732]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vmul.u32 Q0, Q0, r6
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-208)]
-// Release input[704] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[720]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -36)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[748]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-96)]
-// Release input[732] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[748]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q1, Q1, r6
-// input[740]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-144)]
-// Release input[720] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[736]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -20)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[764]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-32)]
-// Release input[748] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[764]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[760]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vmul.u32 Q2, Q2, r6
-// input[756]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-80)]
-// Release input[736] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[752]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[780]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 24)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(32)]
-// Release input[764] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(16)]
-// Release input[760] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(0)]
-// Release input[756] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[780]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[776]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vmul.u32 Q0, Q0, r6
-// input[772]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-16)]
-// Release input[752] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[768]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[796]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 40)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(96)]
-// Release input[780] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release input[776] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release input[772] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[796]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[792]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 36)]
-vmul.u32 Q1, Q1, r6
-// input[788]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 32)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(48)]
-// Release input[768] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[784]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[812]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(160)]
-// Release input[796] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(144)]
-// Release input[792] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(128)]
-// Release input[788] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[812]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[808]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vmul.u32 Q2, Q2, r6
-// input[804]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(112)]
-// Release input[784] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[800]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 44)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[828]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 72)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(224)]
-// Release input[812] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release input[808] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release input[804] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[828]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[824]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 68)]
-vmul.u32 Q0, Q0, r6
-// input[820]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 64)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(176)]
-// Release input[800] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[816]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[844]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(288)]
-// Release input[828] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(272)]
-// Release input[824] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(256)]
-// Release input[820] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[844]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[840]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vmul.u32 Q1, Q1, r6
-// input[836]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 80)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(240)]
-// Release input[816] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[832]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[860]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 104)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(352)]
-// Release input[844] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release input[840] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(320)]
-// Release input[836] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[860]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[856]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 100)]
-vmul.u32 Q2, Q2, r6
-// input[852]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(304)]
-// Release input[832] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[848]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 92)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[876]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(416)]
-// Release input[860] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(400)]
-// Release input[856] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(384)]
-// Release input[852] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[876]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[872]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vmul.u32 Q0, Q0, r6
-// input[868]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 112)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(368)]
-// Release input[848] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[864]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 108)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[892]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -116)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(480)]
-// Release input[876] from Q0
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release input[872] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(448)]
-// Release input[868] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[892]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r7
-// input[888]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -120)]
-vmul.u32 Q1, Q1, r6
-// input[884]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -124)]
-vqrdmlah.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(432)]
-// Release input[864] from Q2
-vqrdmulh.s32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r9
-// input[880]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vqrdmulh.s32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r9
-// input[908]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r10,#(-464)]
-// Release input[892] from Q1
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-480)]
-// Release input[888] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r10,#(-496)]
-// Release input[884] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[908]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r7
-// input[904]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -104)]
-vmul.u32 Q2, Q2, r6
-// input[900]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(496)]
-// Release input[880] from Q0
-vqrdmulh.s32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r9
-// input[896]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r9
-// input[924]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -84)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r10,#(-400)]
-// Release input[908] from Q2
-vqrdmlah.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r10,#(-416)]
-// Release input[904] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r10,#(-432)]
-// Release input[900] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// input[924]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r7
-// input[920]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -88)]
-vmul.u32 Q0, Q0, r6
-// input[916]: Load as Q4
-vldrw.u32 Q4, [r10, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r10,#(-448)]
-// Release input[896] from Q1
-vqrdmulh.s32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r9
-// input[912]: Load as Q2
-vldrw.u32 Q2, [r10, #(4 * -96)]
-vqrdmulh.s32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r9
-// input[940]: Load as Q1
-vldrw.u32 Q1, [r10, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-336)]
-// Release
input[924] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-352)] -// Release input[920] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-368)] -// Release input[916] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[940]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[936]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -72)] -vmul.u32 Q1, Q1, r6 -// input[932]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -76)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r10,#(-384)] -// Release input[912] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[928]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -80)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[956]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-272)] -// Release input[940] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-288)] -// Release input[936] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-304)] -// Release input[932] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[956]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[952]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -56)] -vmul.u32 Q2, Q2, r6 -// input[948]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r10,#(-320)] -// Release input[928] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[944]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -64)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[972]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 
-vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-208)] -// Release input[956] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-224)] -// Release input[952] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-240)] -// Release input[948] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[972]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[968]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -40)] -vmul.u32 Q0, Q0, r6 -// input[964]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -44)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-256)] -// Release input[944] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[960]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -// input[988]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-144)] -// Release input[972] from Q0 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-160)] -// Release input[968] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r10,#(-176)] -// Release input[964] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[988]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r7 -// input[984]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -24)] -vmul.u32 Q1, Q1, r6 -// input[980]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -28)] -vqrdmlah.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r10,#(-192)] -// Release input[960] from Q2 -vqrdmulh.s32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r9 -// input[976]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -32)] -vqrdmulh.s32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r9 -// input[1004]: Load 
as Q2 -vldrw.u32 Q2, [r10, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r10,#(-80)] -// Release input[988] from Q1 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-96)] -// Release input[984] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r10,#(-112)] -// Release input[980] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1004]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r7 -// input[1000]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -8)] -vmul.u32 Q2, Q2, r6 -// input[996]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r10,#(-128)] -// Release input[976] from Q0 -vqrdmulh.s32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r9 -// input[992]: Load as Q1 -vldrw.u32 Q1, [r10, #(4 * -16)] -vqrdmulh.s32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r9 -// input[1020]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r10,#(-16)] -// Release input[1004] from Q2 -vqrdmlah.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r10,#(-32)] -// Release input[1000] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r10,#(-48)] -// Release input[996] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// input[1020]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r7 -// input[1016]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * 8)] -vmul.u32 Q0, Q0, r6 -// input[1012]: Load as Q4 -vldrw.u32 Q4, [r10, #(4 * 4)] -vqrdmlah.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r10,#(-64)] -// Release input[992] from Q1 -vqrdmulh.s32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r9 -// input[1008]: Load as Q2 -vldrw.u32 Q2, [r10, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vmul.u32 
Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r9 -vqrdmulh.s32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(48)] -// Release input[1020] from Q0 -vqrdmlah.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r10,#(32)] -// Release input[1016] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r10,#(16)] -// Release input[1012] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r10,#(0)] -// Release input[1008] from Q2 -.equ modulus_inv, 4223674367 -movw r7, #:lower16:modulus_inv -movt r7, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 9439 -// Instruction count: 7128 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good.s deleted file mode 100644 index 85dcc66..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_106117153_62524596_incomplete_good_twiddles -ntt_192_u32_106117153_62524596_incomplete_good_twiddles: // For base multiplication -.word 181897243 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 3242424133 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 112049651 // zeta^160 * 2^31 = 62524596^160 * 2^31 = 54660581 * 2^31 -.word 3804748909 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 62524596^160 * 2586463201 * 2^31 -.word 21893595 // zeta^ 80 * 2^31 = 62524596^ 80 * 2^31 = 91733486 * 2^31 -.word 1329145221 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 80 * 2586463201 * 2^31 -.word 167711653 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31 -.word 16540923 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31 -.word 200606947 // zeta^136 * 2^31 = 62524596^136 * 2^31 = 105862549 * 2^31 -.word 1509569405 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 62524596^136 * 2586463201 * 2^31 -.word 139952163 // zeta^104 * 2^31 = 62524596^104 * 2^31 = 37582414 * 2^31 -.word 4107280445 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 62524596^104 * 2586463201 * 2^31 -.word 31068557 // zeta^ 24 * 2^31 = 62524596^ 24 * 2^31 = 36202838 * 2^31 -.word 3373286419 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 24 * 2586463201 * 2^31 -.word 105143799 // zeta^184 * 2^31 = 62524596^184 * 2^31 = 52822457 * 2^31 -.word 1977355497 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 62524596^184 * 2586463201 * 2^31 -.word 55615889 // zeta^ 68 * 2^31 = 62524596^ 68 * 2^31 = 39384089 * 2^31 -.word 3137504399 // zeta^ 68 * f(q^(-1) mod 
2^32) * 2^31 = 62524596^ 68 * 2586463201 * 2^31 -.word 146912053 // zeta^ 36 * 2^31 = 62524596^ 36 * 2^31 = 101908685 * 2^31 -.word 1166997355 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 36 * 2586463201 * 2^31 -.word 122622335 // zeta^148 * 2^31 = 62524596^148 * 2^31 = 17280056 * 2^31 -.word 613116513 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 62524596^148 * 2586463201 * 2^31 -.word 164824309 // zeta^116 * 2^31 = 62524596^116 * 2^31 = 51886295 * 2^31 -.word 1491262891 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 62524596^116 * 2586463201 * 2^31 -.word 138976211 // zeta^ 12 * 2^31 = 62524596^ 12 * 2^31 = 87659826 * 2^31 -.word 4092213901 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 12 * 2586463201 * 2^31 -.word 96098887 // zeta^172 * 2^31 = 62524596^172 * 2^31 = 27892831 * 2^31 -.word 3642612377 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 62524596^172 * 2586463201 * 2^31 -.word 60018025 // zeta^ 92 * 2^31 = 62524596^ 92 * 2^31 = 45785556 * 2^31 -.word 2964370359 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 92 * 2586463201 * 2^31 -.word 54839591 // zeta^ 60 * 2^31 = 62524596^ 60 * 2^31 = 66124790 * 2^31 -.word 2201405369 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 60 * 2586463201 * 2^31 -.word 100184655 // zeta^ 64 * 2^31 = 62524596^ 64 * 2^31 = 51456572 * 2^31 -.word 490218385 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 64 * 2586463201 * 2^31 -.word 175964745 // zeta^ 32 * 2^31 = 62524596^ 32 * 2^31 = 51456573 * 2^31 -.word 3732642519 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 32 * 2586463201 * 2^31 -.word 44522653 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 4278426371 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 172533401 // zeta^112 * 2^31 = 62524596^112 * 2^31 = 34864379 * 2^31 -.word 1312604295 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 62524596^112 * 2586463201 * 2^31 -.word 72282143 // zeta^ 8 * 2^31 = 62524596^ 8 * 2^31 = 68534739 * 2^31 -.word 187686849 // 
zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 8 * 2586463201 * 2^31 -.word 166771937 // zeta^168 * 2^31 = 62524596^168 * 2^31 = 68280135 * 2^31 -.word 1697256255 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 62524596^168 * 2586463201 * 2^31 -.word 107090507 // zeta^ 88 * 2^31 = 62524596^ 88 * 2^31 = 53294696 * 2^31 -.word 2317611797 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 88 * 2586463201 * 2^31 -.word 32041911 // zeta^ 56 * 2^31 = 62524596^ 56 * 2^31 = 89497534 * 2^31 -.word 1395930921 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 56 * 2586463201 * 2^31 -.word 65322253 // zeta^132 * 2^31 = 62524596^132 * 2^31 = 4208468 * 2^31 -.word 3127969939 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 62524596^132 * 2586463201 * 2^31 -.word 14820989 // zeta^100 * 2^31 = 62524596^100 * 2^31 = 43592557 * 2^31 -.word 1970507043 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 62524596^100 * 2586463201 * 2^31 -.word 47409997 // zeta^ 20 * 2^31 = 62524596^ 20 * 2^31 = 54230858 * 2^31 -.word 2803704403 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 20 * 2586463201 * 2^31 -.word 63915179 // zeta^180 * 2^31 = 62524596^180 * 2^31 = 71510914 * 2^31 -.word 3416820917 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 62524596^180 * 2586463201 * 2^31 -.word 116135419 // zeta^ 76 * 2^31 = 62524596^ 76 * 2^31 = 78224322 * 2^31 -.word 652354917 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 76 * 2586463201 * 2^31 -.word 148994477 // zeta^ 44 * 2^31 = 62524596^ 44 * 2^31 = 59766995 * 2^31 -.word 449601523 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 44 * 2586463201 * 2^31 -.word 157394715 // zeta^156 * 2^31 = 62524596^156 * 2^31 = 39992363 * 2^31 -.word 2093561925 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 62524596^156 * 2586463201 * 2^31 -.word 111295587 // zeta^124 * 2^31 = 62524596^124 * 2^31 = 85777919 * 2^31 -.word 762964989 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 62524596^124 * 2586463201 * 2^31 -.word 36269561 // zeta^128 * 2^31 = 62524596^128 * 2^31 = 54660580 * 2^31 
-.word 562324775 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 62524596^128 * 2586463201 * 2^31 -.word 30337063 // zeta^ 96 * 2^31 = 62524596^ 96 * 2^31 = 106117152 * 2^31 -.word 1052543161 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 96 * 2586463201 * 2^31 -.word 39700905 // zeta^ 16 * 2^31 = 62524596^ 16 * 2^31 = 71252774 * 2^31 -.word 2982362999 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 16 * 2586463201 * 2^31 -.word 190340711 // zeta^176 * 2^31 = 62524596^176 * 2^31 = 14383667 * 2^31 -.word 2965822073 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 62524596^176 * 2586463201 * 2^31 -.word 45462369 // zeta^ 72 * 2^31 = 62524596^ 72 * 2^31 = 37837018 * 2^31 -.word 2597711039 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 72 * 2586463201 * 2^31 -.word 11627359 // zeta^ 40 * 2^31 = 62524596^ 40 * 2^31 = 254604 * 2^31 -.word 2785397889 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 40 * 2586463201 * 2^31 -.word 180192395 // zeta^152 * 2^31 = 62524596^152 * 2^31 = 16619619 * 2^31 -.word 2899036373 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 62524596^152 * 2586463201 * 2^31 -.word 181165749 // zeta^120 * 2^31 = 62524596^120 * 2^31 = 69914315 * 2^31 -.word 921680875 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 62524596^120 * 2586463201 * 2^31 -.word 197413317 // zeta^ 4 * 2^31 = 62524596^ 4 * 2^31 = 62524596 * 2^31 -.word 2324460251 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 4 * 2586463201 * 2^31 -.word 156618417 // zeta^164 * 2^31 = 62524596^164 * 2^31 = 66733064 * 2^31 -.word 1157462895 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 62524596^164 * 2586463201 * 2^31 -.word 148319127 // zeta^ 84 * 2^31 = 62524596^ 84 * 2^31 = 34606239 * 2^31 -.word 878146377 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 84 * 2586463201 * 2^31 -.word 89611971 // zeta^ 52 * 2^31 = 62524596^ 52 * 2^31 = 88837097 * 2^31 -.word 3681850781 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 52 * 2586463201 * 2^31 -.word 63239829 // zeta^140 * 2^31 = 62524596^140 * 2^31 = 
46350158 * 2^31 -.word 3845365771 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 62524596^140 * 2586463201 * 2^31 -.word 73258095 // zeta^108 * 2^31 = 62524596^108 * 2^31 = 18457327 * 2^31 -.word 202753393 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 62524596^108 * 2586463201 * 2^31 -.word 100938719 // zeta^ 28 * 2^31 = 62524596^ 28 * 2^31 = 20339234 * 2^31 -.word 3532002305 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 28 * 2586463201 * 2^31 -.word 152216281 // zeta^188 * 2^31 = 62524596^188 * 2^31 = 60331597 * 2^31 -.word 1330596935 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 62524596^188 * 2586463201 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_106117153_62524596_incomplete_good_scale -ntt_192_u32_106117153_62524596_incomplete_good_scale: // Constants for scaling by 1/N -.word 181897243 // 1/48 -.word 3242424133 // 1/48 twisted -.data -roots: -.word 50789515 /// zeta^ 64 * 2^31 = 62524596^ 64 * 2^31 = 51456572 * 2^31 -.word 1041322197 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 64 * 2586463201 * 2^31 -.word 136304203 /// zeta^128 * 2^31 = 62524596^128 * 2^31 = 54660580 * 2^31 -.word 1106161429 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 62524596^128 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 86500417 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 
62524596^ 0 * 2586463201 * 2^31 -.word 86500417 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 86500417 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 3362131 // zeta^ 72 * 2^31 = 62524596^ 72 * 2^31 = 37837018 * 2^31 -.word 765704461 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 72 * 2586463201 * 2^31 -.word 74219771 // zeta^ 24 * 2^31 = 62524596^ 24 * 2^31 = 36202838 * 2^31 -.word 732633701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 24 * 2586463201 * 2^31 -.word 3362131 // zeta^ 72 * 2^31 = 62524596^ 72 * 2^31 = 37837018 * 2^31 -.word 765704461 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 72 * 2586463201 * 2^31 -.word 207754911 // zeta^132 * 2^31 = 62524596^132 * 2^31 = 4208468 * 2^31 -.word 85166401 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 62524596^132 * 2586463201 * 2^31 -.word 86384727 // zeta^ 84 * 2^31 = 62524596^ 84 * 2^31 = 34606239 * 2^31 -.word 2847807113 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 84 * 2586463201 * 2^31 -.word 74219771 // zeta^ 24 * 2^31 = 62524596^ 24 * 2^31 = 36202838 * 2^31 -.word 732633701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 24 * 2586463201 * 2^31 -.word 77895747 // zeta^ 12 * 2^31 = 62524596^ 12 * 2^31 = 87659826 * 2^31 -.word 1773964317 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 12 * 2586463201 * 2^31 -.word 42168601 // zeta^156 * 2^31 = 62524596^156 * 2^31 = 39992363 * 2^31 -.word 2956805639 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 62524596^156 * 2586463201 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_106117153_62524596_incomplete_good, %function -.global ntt_192_u32_106117153_62524596_incomplete_good -ntt_192_u32_106117153_62524596_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush 
{d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 106117153 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] 
-vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 
-vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 
-vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 
-// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 
-vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] 
from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 
-vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] 
-// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 
-vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already 
loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] 
-vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] 
-vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] 
-vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] 
from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from 
Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r12
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-.equ modulus_inv, 1708504095
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1357
-// Instruction count: 998
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
deleted file mode 100644
index 25257ff..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,1285 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -roots: -.word 54660580 /// zeta^128 * 2^31 = 62524596^128 * 2^31 = 54660580 * 2^31 -.word 1106161430 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 62524596^128 * 2586463201 * 2^31 -.word 51456572 /// zeta^ 64 * 2^31 = 62524596^ 64 * 2^31 = 51456572 * 2^31 -.word 1041322197 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 64 * 2586463201 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 56869107 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31 -.word 1150855200 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 56869107 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31 -.word 1150855200 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31 -.word 56869107 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31 -.word 1150855200 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31 -.word 69914315 // zeta^120 * 2^31 = 62524596^120 * 2^31 = 69914315 * 2^31 -.word 1414849946 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 62524596^120 * 2586463201 * 2^31 -.word 68280135 // zeta^168 * 2^31 = 62524596^168 * 2^31 = 68280135 * 2^31 -.word 1381779187 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 62524596^168 * 2586463201 * 2^31 -.word 69914315 // zeta^120 * 2^31 = 62524596^120 * 2^31 = 69914315 * 2^31 -.word 1414849946 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 62524596^120 * 2586463201 * 2^31 -.word 66124790 // zeta^ 60 * 2^31 = 62524596^ 60 * 2^31 = 
66124790 * 2^31 -.word 1338161657 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 60 * 2586463201 * 2^31 -.word 18457327 // zeta^108 * 2^31 = 62524596^108 * 2^31 = 18457327 * 2^31 -.word 373519330 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 62524596^108 * 2586463201 * 2^31 -.word 68280135 // zeta^168 * 2^31 = 62524596^168 * 2^31 = 68280135 * 2^31 -.word 1381779187 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 62524596^168 * 2586463201 * 2^31 -.word 71510914 // zeta^180 * 2^31 = 62524596^180 * 2^31 = 71510914 * 2^31 -.word 1447160182 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 62524596^180 * 2586463201 * 2^31 -.word 101908685 // zeta^ 36 * 2^31 = 62524596^ 36 * 2^31 = 101908685 * 2^31 -.word 2062317245 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 36 * 2586463201 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_106117153_62524596_incomplete_good_bitrev, %function -.global ntt_192_u32_106117153_62524596_incomplete_good_bitrev -ntt_192_u32_106117153_62524596_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -106117153 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] 
-// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, 
Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 
-vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] 
-vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// 
input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from 
Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, 
Q5 -vmul.u32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 
-vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release 
input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// 
Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q2, Q2, r9 -// input[160]: Load 
as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] 
-vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vmul.u32 Q1, Q3, r10 
-vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 
Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 
Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 
-vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 1708504095 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good.s deleted file mode 100644 index b077a5d..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_108643009_1793055_incomplete_good_twiddles -ntt_192_u32_108643009_1793055_incomplete_good_twiddles: // For base multiplication -.word 125819369 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 3200325335 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 7219049 // zeta^160 * 2^31 = 1793055^160 * 2^31 = 40973034 * 2^31 -.word 3635407191 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 1793055^160 * 3479293249 * 2^31 -.word 20524789 // zeta^ 80 * 2^31 = 1793055^ 80 * 2^31 = 13028154 * 2^31 -.word 778955 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 80 * 3479293249 * 2^31 -.word 41573363 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 255463501 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 90655441 // zeta^136 * 2^31 = 1793055^136 * 2^31 = 21310129 * 2^31 -.word 2443944943 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 1793055^136 * 3479293249 * 2^31 -.word 147417303 // zeta^104 * 2^31 = 1793055^104 * 2^31 = 26332312 * 2^31 -.word 3916510825 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 1793055^104 * 3479293249 * 2^31 -.word 11354681 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31 -.word 3881929351 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31 -.word 183168985 // zeta^184 * 2^31 = 1793055^184 * 2^31 = 38250802 * 2^31 -.word 3222230247 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 1793055^184 * 3479293249 * 2^31 -.word 10759601 // zeta^ 68 * 2^31 = 1793055^ 68 * 2^31 = 106639146 * 2^31 -.word 4004030735 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 68 * 
3479293249 * 2^31 -.word 48748081 // zeta^ 36 * 2^31 = 1793055^ 36 * 2^31 = 108432201 * 2^31 -.word 3574358159 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 36 * 3479293249 * 2^31 -.word 118657223 // zeta^148 * 2^31 = 1793055^148 * 2^31 = 62017780 * 2^31 -.word 2448614009 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 1793055^148 * 3479293249 * 2^31 -.word 135399931 // zeta^116 * 2^31 = 1793055^116 * 2^31 = 56179088 * 2^31 -.word 3293739077 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 1793055^116 * 3479293249 * 2^31 -.word 22236245 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31 -.word 1317309803 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31 -.word 173577835 // zeta^172 * 2^31 = 1793055^172 * 2^31 = 42747918 * 2^31 -.word 3951239125 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 1793055^172 * 3479293249 * 2^31 -.word 97528185 // zeta^ 92 * 2^31 = 1793055^ 92 * 2^31 = 105229554 * 2^31 -.word 1192661831 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 92 * 3479293249 * 2^31 -.word 73825049 // zeta^ 60 * 2^31 = 1793055^ 60 * 2^31 = 14289518 * 2^31 -.word 4126063015 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 60 * 3479293249 * 2^31 -.word 210066969 // zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31 -.word 659560103 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31 -.word 9957311 // zeta^ 32 * 2^31 = 1793055^ 32 * 2^31 = 67669976 * 2^31 -.word 3859885441 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 32 * 3479293249 * 2^31 -.word 175712655 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 4039503793 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31 -.word 87594435 // zeta^112 * 2^31 = 1793055^112 * 2^31 = 100073230 * 2^31 -.word 4040282749 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 1793055^112 * 3479293249 * 2^31 -.word 69868715 // zeta^ 8 * 2^31 = 1793055^ 8 * 2^31 = 82310697 * 2^31 -.word 378456469 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 8 * 
3479293249 * 2^31 -.word 51881147 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31 -.word 2822401413 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31 -.word 34117033 // zeta^ 88 * 2^31 = 1793055^ 88 * 2^31 = 70392207 * 2^31 -.word 1072737047 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 88 * 3479293249 * 2^31 -.word 154114723 // zeta^ 56 * 2^31 = 1793055^ 56 * 2^31 = 44058032 * 2^31 -.word 659699101 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 56 * 3479293249 * 2^31 -.word 168537937 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31 -.word 720609135 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31 -.word 70654529 // zeta^100 * 2^31 = 1793055^100 * 2^31 = 106849954 * 2^31 -.word 429672575 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 1793055^100 * 3479293249 * 2^31 -.word 81886087 // zeta^ 20 * 2^31 = 1793055^ 20 * 2^31 = 52463921 * 2^31 -.word 1001228217 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 20 * 3479293249 * 2^31 -.word 91900301 // zeta^180 * 2^31 = 1793055^180 * 2^31 = 5838692 * 2^31 -.word 3449842227 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1793055^180 * 3479293249 * 2^31 -.word 43708183 // zeta^ 76 * 2^31 = 1793055^ 76 * 2^31 = 65895091 * 2^31 -.word 343728169 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 76 * 3479293249 * 2^31 -.word 174587437 // zeta^ 44 * 2^31 = 1793055^ 44 * 2^31 = 56126250 * 2^31 -.word 1661037971 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 44 * 3479293249 * 2^31 -.word 143460969 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31 -.word 168904279 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31 -.word 132346145 // zeta^124 * 2^31 = 1793055^124 * 2^31 = 90940036 * 2^31 -.word 1361566111 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 1793055^124 * 3479293249 * 2^31 -.word 207328707 // zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31 -.word 435081853 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 
3479293249 * 2^31 -.word 91466649 // zeta^ 96 * 2^31 = 1793055^ 96 * 2^31 = 108643008 * 2^31 -.word 1094641959 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 96 * 3479293249 * 2^31 -.word 129691583 // zeta^ 16 * 2^31 = 1793055^ 16 * 2^31 = 8569779 * 2^31 -.word 254684545 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 16 * 3479293249 * 2^31 -.word 196761229 // zeta^176 * 2^31 = 1793055^176 * 2^31 = 95614855 * 2^31 -.word 4294188339 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 1793055^176 * 3479293249 * 2^31 -.word 165404871 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31 -.word 1472565881 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31 -.word 126630577 // zeta^ 40 * 2^31 = 1793055^ 40 * 2^31 = 87332880 * 2^31 -.word 1851022351 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 40 * 3479293249 * 2^31 -.word 63171295 // zeta^152 * 2^31 = 1793055^152 * 2^31 = 64584977 * 2^31 -.word 3635268193 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 1793055^152 * 3479293249 * 2^31 -.word 205931337 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31 -.word 413037943 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31 -.word 146631489 // zeta^ 4 * 2^31 = 1793055^ 4 * 2^31 = 1793055 * 2^31 -.word 3865294719 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 4 * 3479293249 * 2^31 -.word 206526417 // zeta^164 * 2^31 = 1793055^164 * 2^31 = 2003863 * 2^31 -.word 290936559 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 1793055^164 * 3479293249 * 2^31 -.word 125385717 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31 -.word 845125067 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31 -.word 98628795 // zeta^ 52 * 2^31 = 1793055^ 52 * 2^31 = 46625229 * 2^31 -.word 1846353285 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 52 * 3479293249 * 2^31 -.word 42698581 // zeta^140 * 2^31 = 1793055^140 * 2^31 = 52516759 * 2^31 -.word 2633929323 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 1793055^140 * 
3479293249 * 2^31 -.word 195049773 // zeta^108 * 2^31 = 1793055^108 * 2^31 = 9768841 * 2^31 -.word 2977657491 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1793055^108 * 3479293249 * 2^31 -.word 84939873 // zeta^ 28 * 2^31 = 1793055^ 28 * 2^31 = 17702973 * 2^31 -.word 2933401183 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 28 * 3479293249 * 2^31 -.word 119757833 // zeta^188 * 2^31 = 1793055^188 * 2^31 = 3413455 * 2^31 -.word 3102305463 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 1793055^188 * 3479293249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_108643009_1793055_incomplete_good_scale -ntt_192_u32_108643009_1793055_incomplete_good_scale: // Constants for scaling by 1/N -.word 125819369 // 1/48 -.word 3200325335 // 1/48 twisted -.data -roots: -.word 67669975 /// zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31 -.word 40973033 /// zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 
3479293249 * 2^31 -.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31 -.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31 -.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31 -.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31 -.word 210808 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31 -.word 102804317 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31 -.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31 -.word 98874168 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31 -.word 94353491 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_108643009_1793055_incomplete_good, %function -.global ntt_192_u32_108643009_1793055_incomplete_good -ntt_192_u32_108643009_1793055_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -108643009 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, 
r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// 
input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, 
Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as 
Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 
-// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] 
-vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 
-vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, 
Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 
Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, 
Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[92]: Load as Q0 
-vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] 
-vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r0,#(176)] -// Release input[44] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release 
input[124] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-vmul.u32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vmla.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-.equ modulus_inv, 815674047
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1357
-// Instruction count: 998
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s
deleted file mode 100644
index a66e58d..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,1285 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -roots: -.word 40973033 /// zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31 -.word 67669975 /// zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 21597933 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 26334175 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31 -.word 103620826 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31 -.word 26334175 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31 -.word 14289518 // zeta^ 60 * 2^31 = 1793055^ 60 * 2^31 = 14289518 * 2^31 -.word 
282452654 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 60 * 3479293249 * 2^31 -.word 9768841 // zeta^108 * 2^31 = 1793055^108 * 2^31 = 9768841 * 2^31 -.word 193095041 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1793055^108 * 3479293249 * 2^31 -.word 103620826 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31 -.word 5838692 // zeta^180 * 2^31 = 1793055^180 * 2^31 = 5838692 * 2^31 -.word 115410055 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1793055^180 * 3479293249 * 2^31 -.word 108432201 // zeta^ 36 * 2^31 = 1793055^ 36 * 2^31 = 108432201 * 2^31 -.word 2143316728 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 36 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_108643009_1793055_incomplete_good_bitrev, %function -.global ntt_192_u32_108643009_1793055_incomplete_good_bitrev -ntt_192_u32_108643009_1793055_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -108643009 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 
-// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 
-vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// 
input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 
Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, 
[r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 
Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// 
input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] 
-vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: 
Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, 
Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] 
-vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, 
[r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, 
Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 
-vmla.s32 Q0, Q3, r12
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(208)]
-// Release input[52] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-vmul.u32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vmla.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-.equ modulus_inv, 815674047
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1253
-// Instruction count: 895
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s b/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
deleted file mode 100644
index 8f2c6ae..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,1237 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_twiddles
-ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_twiddles: // For base multiplication
-.word 125819369 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 3200325335 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 7219049 // zeta^160 * 2^31 = 1793055^160 * 2^31 = 40973034 * 2^31
-.word 3635407191 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 1793055^160 * 3479293249 * 2^31
-.word 20524789 // zeta^ 80 * 2^31 = 1793055^ 80 * 2^31 = 13028154 * 2^31
-.word 778955 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 80 * 3479293249 * 2^31
-.word 41573363 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31
-.word 255463501 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31
-.word 90655441 // zeta^136 * 2^31 = 1793055^136 * 2^31 = 21310129 * 2^31
-.word 2443944943 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 1793055^136 * 3479293249 * 2^31
-.word 147417303 // zeta^104 * 2^31 = 1793055^104 * 2^31 = 26332312 * 2^31
-.word 3916510825 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 1793055^104 * 3479293249 * 2^31
-.word 11354681 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31
-.word 3881929351 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31
-.word 183168985 // zeta^184 * 2^31 = 1793055^184 * 2^31 = 38250802 * 2^31
-.word 3222230247 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 1793055^184 * 3479293249 * 2^31
-.word 10759601 // zeta^ 68 * 2^31 = 1793055^ 68 * 2^31 = 106639146 * 2^31
-.word 4004030735 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 68 * 3479293249 * 2^31
-.word 48748081 // zeta^ 36 * 2^31 = 1793055^ 36 * 2^31 = 108432201 * 2^31
-.word 3574358159 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 36 * 3479293249 * 2^31
-.word 118657223 // zeta^148 * 2^31 = 1793055^148 * 2^31 = 62017780 * 2^31
-.word 2448614009 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 1793055^148 * 3479293249 * 2^31
-.word 135399931 // zeta^116 * 2^31 = 1793055^116 * 2^31 = 56179088 * 2^31
-.word 3293739077 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 1793055^116 * 3479293249 * 2^31
-.word 22236245 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31
-.word 1317309803 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31
-.word 173577835 // zeta^172 * 2^31 = 1793055^172 * 2^31 = 42747918 * 2^31
-.word 3951239125 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 1793055^172 * 3479293249 * 2^31
-.word 97528185 // zeta^ 92 * 2^31 = 1793055^ 92 * 2^31 = 105229554 * 2^31
-.word 1192661831 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 92 * 3479293249 * 2^31
-.word 73825049 // zeta^ 60 * 2^31 = 1793055^ 60 * 2^31 = 14289518 * 2^31
-.word 4126063015 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 60 * 3479293249 * 2^31
-.word 210066969 // zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31
-.word 659560103 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31
-.word 9957311 // zeta^ 32 * 2^31 = 1793055^ 32 * 2^31 = 67669976 * 2^31
-.word 3859885441 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 32 * 3479293249 * 2^31
-.word 175712655 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 4039503793 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 87594435 // zeta^112 * 2^31 = 1793055^112 * 2^31 = 100073230 * 2^31
-.word 4040282749 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 1793055^112 * 3479293249 * 2^31
-.word 69868715 // zeta^ 8 * 2^31 = 1793055^ 8 * 2^31 = 82310697 * 2^31
-.word 378456469 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 8 * 3479293249 * 2^31
-.word 51881147 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31
-.word 2822401413 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31
-.word 34117033 // zeta^ 88 * 2^31 = 1793055^ 88 * 2^31 = 70392207 * 2^31
-.word 1072737047 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 88 * 3479293249 * 2^31
-.word 154114723 // zeta^ 56 * 2^31 = 1793055^ 56 * 2^31 = 44058032 * 2^31
-.word 659699101 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 56 * 3479293249 * 2^31
-.word 168537937 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31
-.word 720609135 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31
-.word 70654529 // zeta^100 * 2^31 = 1793055^100 * 2^31 = 106849954 * 2^31
-.word 429672575 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 1793055^100 * 3479293249 * 2^31
-.word 81886087 // zeta^ 20 * 2^31 = 1793055^ 20 * 2^31 = 52463921 * 2^31
-.word 1001228217 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 20 * 3479293249 * 2^31
-.word 91900301 // zeta^180 * 2^31 = 1793055^180 * 2^31 = 5838692 * 2^31
-.word 3449842227 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1793055^180 * 3479293249 * 2^31
-.word 43708183 // zeta^ 76 * 2^31 = 1793055^ 76 * 2^31 = 65895091 * 2^31
-.word 343728169 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 76 * 3479293249 * 2^31
-.word 174587437 // zeta^ 44 * 2^31 = 1793055^ 44 * 2^31 = 56126250 * 2^31
-.word 1661037971 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 44 * 3479293249 * 2^31
-.word 143460969 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31
-.word 168904279 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31
-.word 132346145 // zeta^124 * 2^31 = 1793055^124 * 2^31 = 90940036 * 2^31
-.word 1361566111 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 1793055^124 * 3479293249 * 2^31
-.word 207328707 // zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31
-.word 435081853 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31
-.word 91466649 // zeta^ 96 * 2^31 = 1793055^ 96 * 2^31 = 108643008 * 2^31
-.word 1094641959 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 96 * 3479293249 * 2^31
-.word 129691583 // zeta^ 16 * 2^31 = 1793055^ 16 * 2^31 = 8569779 * 2^31
-.word 254684545 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 16 * 3479293249 * 2^31
-.word 196761229 // zeta^176 * 2^31 = 1793055^176 * 2^31 = 95614855 * 2^31
-.word 4294188339 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 1793055^176 * 3479293249 * 2^31
-.word 165404871 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31
-.word 1472565881 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31
-.word 126630577 // zeta^ 40 * 2^31 = 1793055^ 40 * 2^31 = 87332880 * 2^31
-.word 1851022351 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 40 * 3479293249 * 2^31
-.word 63171295 // zeta^152 * 2^31 = 1793055^152 * 2^31 = 64584977 * 2^31
-.word 3635268193 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 1793055^152 * 3479293249 * 2^31
-.word 205931337 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31
-.word 413037943 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31
-.word 146631489 // zeta^ 4 * 2^31 = 1793055^ 4 * 2^31 = 1793055 * 2^31
-.word 3865294719 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 4 * 3479293249 * 2^31
-.word 206526417 // zeta^164 * 2^31 = 1793055^164 * 2^31 = 2003863 * 2^31
-.word 290936559 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 1793055^164 * 3479293249 * 2^31
-.word 125385717 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31
-.word 845125067 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31
-.word 98628795 // zeta^ 52 * 2^31 = 1793055^ 52 * 2^31 = 46625229 * 2^31
-.word 1846353285 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 52 * 3479293249 * 2^31
-.word 42698581 // zeta^140 * 2^31 = 1793055^140 * 2^31 = 52516759 * 2^31
-.word 2633929323 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 1793055^140 * 3479293249 * 2^31
-.word 195049773 // zeta^108 * 2^31 = 1793055^108 * 2^31 = 9768841 * 2^31
-.word 2977657491 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1793055^108 * 3479293249 * 2^31
-.word 84939873 // zeta^ 28 * 2^31 = 1793055^ 28 * 2^31 = 17702973 * 2^31
-.word 2933401183 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 28 * 3479293249 * 2^31
-.word 119757833 // zeta^188 * 2^31 = 1793055^188 * 2^31 = 3413455 * 2^31
-.word 3102305463 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 1793055^188 * 3479293249 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_scale
-ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N
-.word 125819369 // 1/48
-.word 3200325335 // 1/48 twisted
-.data
-roots:
-.word 67669975 /// zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31
-.word 1337593335 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31
-.word 40973033 /// zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31
-.word 809890293 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31
-.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31
-.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31
-.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31
-.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31
-.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31
-.word 210808 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31
-.word 4166920 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31
-.word 102804317 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31
-.word 2032073593 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31
-.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31
-.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31
-.word 98874168 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31
-.word 1954388607 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31
-.word 94353491 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31
-.word 1865030994 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input, %function
-.global ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input
-ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 256
-add r14, r0, #256
-// Use r12 as marker for r0 + 512
-add r12, r14, #256
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-.equ modulus, -108643009
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vmul.u32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q0
-// input[4]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q0, r7
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q5, Q1, Q0
-vmla.s32 Q2, Q3, r10
-vstrw.u32 Q4, [r1,#(0)]
-vadd.s32 Q3, Q1, Q2
-vstrw.u32 Q3, [r1,#(256)]
-vsub.s32 Q5, Q5, Q2
-vstrw.u32 Q5, [r11,#(-496)]
-// Release input[0] from Q1
-// Release input[64] from Q0
-// input[4]: Already loaded as Q6
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-vmul.u32 Q1, Q0, r8
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vadd.s32 Q2, Q6, Q7
-vqrdmulh.s32 Q0, Q0, r7
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q6
-// Release input[4] from Q6
-vstrw.u32 Q2, [r11,#(-480)]
-vsub.s32 Q5, Q1, Q7
-// Release input[68] from Q7
-vstrw.u32 Q5, [r1,#(16)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(272)]
-// input[8]: Already loaded as Q4
-// input[72]: Already loaded as Q3
-vmul.u32 Q0, Q4, r8
-vadd.s32 Q2, Q3, Q4
-// input[12]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 12)]
-vqrdmulh.s32 Q1, Q4, r7
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vsub.s32 Q5, Q3, Q4
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(288)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r1,#(32)]
-vsub.s32 Q5, Q5, Q0
-vstrw.u32 Q5, [r11,#(-464)]
-// Release input[72] from Q3
-// Release input[8] from Q4
-// input[76]: Already loaded as Q7
-// input[12]: Already loaded as Q6
-vmul.u32 Q0, Q7, r8
-vadd.s32 Q2, Q6, Q7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmulh.s32 Q1, Q7, r7
-// input[80]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 80)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(48)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(304)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[12] from Q6
-// Release input[76] from Q7
-// input[16]: Already loaded as Q4
-// input[80]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r8
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r7
-// input[20]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 20)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q4
-// Release input[16] from Q4
-vstrw.u32 Q2, [r11,#(-432)]
-vsub.s32 Q4, Q1, Q5
-// Release input[80] from Q5
-vstrw.u32 Q4, [r1,#(64)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(320)]
-// input[20]: Already loaded as Q6
-// input[84]: Already loaded as Q3
-vmul.u32 Q0, Q6, r8
-vadd.s32 Q2, Q3, Q6
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q6, r7
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vsub.s32 Q4, Q3, Q6
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(336)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r1,#(80)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[84] from Q3
-// Release input[20] from Q6
-// input[88]: Already loaded as Q7
-// input[24]: Already loaded as Q5
-vmul.u32 Q0, Q7, r8
-vadd.s32 Q2, Q5, Q7
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmulh.s32 Q1, Q7, r7
-// input[92]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 92)]
-vsub.s32 Q3, Q5, Q7
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(96)]
-vadd.s32 Q1, Q5, Q0
-vstrw.u32 Q1, [r1,#(352)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[24] from Q5
-// Release input[88] from Q7
-// input[28]: Already loaded as Q4
-// input[92]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q2, Q4, Q6
-vqrdmulh.s32 Q0, Q0, r7
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q4
-// Release input[28] from Q4
-vstrw.u32 Q2, [r11,#(-384)]
-vsub.s32 Q4, Q1, Q6
-// Release input[92] from Q6
-vstrw.u32 Q4, [r1,#(112)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(368)]
-// input[32]: Already loaded as Q3
-vmul.u32 Q0, Q3, r8
-vneg.s32 Q1, Q3
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmulh.s32 Q2, Q3, r7
-vstrw.u32 Q3, [r1,#(384)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(128)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-368)]
-// Release input[32] from Q3
-// input[36]: Already loaded as Q4
-vstrw.u32 Q4, [r1,#(144)]
-vstrw.u32 Q4, [r1,#(400)]
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[36] from Q4
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vmul.u32 Q1, Q0, r8
-vneg.s32 Q2, Q0
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmulh.s32 Q3, Q0, r7
-vstrw.u32 Q0, [r11,#(-336)]
-vmla.s32 Q1, Q3, r10
-vstrw.u32 Q1, [r1,#(160)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r1,#(416)]
-// Release input[40] from Q0
-// input[44]: Already loaded as Q4
-vmul.u32 Q0, Q4, r8
-vneg.s32 Q1, Q4
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q2, Q4, r7
-vstrw.u32 Q4, [r1,#(432)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(176)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-320)]
-// Release input[44] from Q4
-// input[48]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(192)]
-vstrw.u32 Q3, [r1,#(448)]
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[48] from Q3
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vmul.u32 Q1, Q0, r8
-vneg.s32 Q2, Q0
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmulh.s32 Q3, Q0, r7
-vstrw.u32 Q0, [r11,#(-288)]
-vmla.s32 Q1, Q3, r10
-vstrw.u32 Q1, [r1,#(208)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r1,#(464)]
-// Release input[52] from Q0
-// input[56]: Already loaded as Q4
-vmul.u32 Q0, Q4, r8
-vneg.s32 Q1, Q4
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q2, Q4, r7
-vstrw.u32 Q4, [r1,#(480)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(224)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-272)]
-// Release input[56] from Q4
-// input[60]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(240)]
-vstrw.u32 Q3, [r1,#(496)]
-vstrw.u32 Q3, [r11,#(-256)]
-// Release input[60] from Q3
-//////////// END OF RADIX 3 //////////////////////////
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[144]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-// output[48]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 48)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r4
-// output[96]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 96)]
-vadd.s32 Q0, Q0, Q1
-// Release output[48] from Q1
-// output[0]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// output[180]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -72)]
-vadd.s32 Q1, Q1, Q4
-// Release output[96] from Q4
-vqrdmulh.s32 Q2, Q2, r3
-vsub.s32 Q4, Q1, Q0
-// output[84]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 84)]
-vmla.s32 Q3, Q2, r10
-vstrw.u32 Q4, [r11,#(-432)]
-vadd.s32 Q1, Q1, Q0
-// Release output[144] from Q0
-vstrw.u32 Q1, [r1,#(0)]
-// Release output[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r1,#(384)]
-// output[84]: Already loaded as Q7
-// output[180]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r4
-// output[36]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 36)]
-vadd.s32 Q7, Q7, Q6
-// Release output[180] from Q6
-// output[132]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// output[120]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[36] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[24]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 24)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(336)]
-vadd.s32 Q3, Q3, Q7
-// Release output[84] from Q7
-vstrw.u32 Q3, [r11,#(-480)]
-// Release output[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-288)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(144)]
-// output[24]: Already loaded as Q6
-// output[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[168]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -84)]
-vadd.s32 Q6, Q6, Q5
-// Release output[120] from Q5
-// output[72]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 72)]
-vsub.s32 Q4, Q3, Q2
-// output[60]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release output[168] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[156]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -96)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release output[24] from Q6
-vstrw.u32 Q3, [r1,#(288)]
-// Release output[72] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-336)]
-// output[156]: Already loaded as Q7
-// output[60]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[108]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 108)]
-vadd.s32 Q7, Q7, Q5
-// Release output[60] from Q5
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[108] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[16]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 16)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-384)]
-vadd.s32 Q3, Q3, Q7
-// Release output[156] from Q7
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(432)]
-// output[16]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[64]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// output[52]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 52)]
-vadd.s32 Q3, Q3, Q2
-// Release output[160] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[148]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -104)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(64)]
-vadd.s32 Q3, Q3, Q6
-// Release output[16] from Q6
-vstrw.u32 Q3, [r1,#(256)]
-// Release output[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-368)]
-// output[148]: Already loaded as Q7
-// output[52]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[100]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release output[52] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[184]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -68)]
-vadd.s32 Q3, Q3, Q2
-// Release output[100] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[88]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 88)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-416)]
-vadd.s32 Q3, Q3, Q7
-// Release output[148] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(400)]
-// output[88]: Already loaded as Q6
-// output[184]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[40]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release output[184] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[40] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[28]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 28)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release output[88] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-272)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(160)]
-// output[28]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[172]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[76]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 76)]
-vsub.s32 Q4, Q3, Q2
-// output[176]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -76)]
-vadd.s32 Q3, Q3, Q2
-// Release output[172] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[80]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 80)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(112)]
-vadd.s32 Q3, Q3, Q7
-// Release output[28] from Q7
-vstrw.u32 Q3, [r1,#(304)]
-// Release output[76] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-320)]
-// output[80]: Already loaded as Q6
-// output[176]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[32]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 32)]
-vadd.s32 Q6, Q6, Q5
-// Release output[176] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[32] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[20]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 20)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release output[80] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(128)]
-// output[20]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[68]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// output[56]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 56)]
-vadd.s32 Q3, Q3, Q2
-// Release output[164] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[152]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -100)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(80)]
-vadd.s32 Q3, Q3, Q7
-// Release output[20] from Q7
-vstrw.u32 Q3, [r1,#(272)]
-// Release output[68] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-352)]
-// output[152]: Already loaded as Q6
-// output[56]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[104]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 104)]
-vadd.s32 Q6, Q6, Q5
-// Release output[56] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[188]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release output[104] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[92]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 92)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release output[152] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(416)]
-// output[92]: Already loaded as Q7
-// output[188]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[44]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 44)]
-vadd.s32 Q7, Q7, Q5
-// Release output[188] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[12]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 12)]
-vadd.s32 Q3, Q3, Q2
-// Release output[44] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[132]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -120)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release output[92] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(176)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[12]: Already loaded as Q5
-vmul.u32 Q0, Q5, r8
-// output[72]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 72)]
-vqrdmulh.s32 Q5, Q5, r7
-// output[132]: Already loaded as Q6
-vmla.s32 Q0, Q5, r10
-vmul.u32 Q2, Q1, r8
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r7
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r10
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r4
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r3
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r10
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vmul.u32 Q4, Q6, r6
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r5
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(48)]
-// Release output[12] from Q5
-vmla.s32 Q4, Q6, r10
-vstrw.u32 Q1, [r1,#(288)]
-// Release output[72] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(-480)]
-// Release output[132] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[76]: Already loaded as Q2
-vmul.u32 Q1, Q2, r8
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vqrdmulh.s32 Q2, Q2, r7
-// output[4]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 4)]
-vmla.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r10
-// output[64]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 64)]
-vmul.u32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r10
-// output[140]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -112)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(304)]
-// Release output[76] from Q2
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(16)]
-// Release output[4] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[140]: Already loaded as Q0
-vmul.u32 Q2, Q0, r8
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vqrdmulh.s32 Q0, Q0, r7
-// output[68]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 68)]
-vmla.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r1,#(256)]
-// Release output[64] from Q1
-vmul.u32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r10
-// output[128]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vmul.u32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r10
-// output[156]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -96)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-448)]
-// Release output[140] from Q0
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(272)]
-// Release output[68] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[156]: Already loaded as Q1
-vmul.u32 Q0, Q1, r8
-// output[24]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r7
-// output[84]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 84)]
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r11,#(-496)]
-// Release output[128] from Q2
-vmul.u32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r10
-// output[144]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-vmul.u32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r10
-// output[28]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-384)]
-// Release output[156] from Q1
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(96)]
-// Release output[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(336)]
-// Release output[84] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[28]: Already loaded as Q2
-vmul.u32 Q1, Q2, r8
-// output[88]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 88)]
-vqrdmulh.s32 Q2, Q2, r7
-// output[148]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vmla.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r11,#(-432)]
-// Release output[144] from Q0
-vmul.u32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r10
-// output[16]: Load
as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 
96)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 
-vmla.s32 Q5, Q0, r10 -// output[60]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 
-// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(496)] -// Release output[124] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release output[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(208)] -// Release output[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[56]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r7 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -vmul.u32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r1,#(224)] -// Release output[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -.equ modulus_inv, 815674047 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1204 -// Instruction count: 900 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good.s deleted file mode 100644 index 04e148f..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// 
-/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
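Throughout the deleted kernels, each modular product is computed with a `vmul.u32` / `vqrdmulh.s32` / `vmla.s32` triple against twiddle pairs that carry a precomputed `f(q^(-1) mod 2^32)` companion. As a hedged illustration only (the MVE sequence uses a signed, rounding variant of this idea, not the exact code below), textbook Montgomery reduction with R = 2^32 and the modulus from this deleted file looks like:

```python
# Textbook Montgomery reduction sketch, NOT the exact MVE instruction
# sequence: reduces t to a value congruent to t * R^-1 mod q.
q = 114826273                      # modulus from the deleted ntt_192 file
R = 1 << 32
q_inv_neg = (-pow(q, -1, R)) % R   # precomputed -q^-1 mod R

def montgomery_reduce(t):
    """Return u with u = t * R^-1 (mod q) and 0 <= u < 2*q, for 0 <= t < q*R."""
    m = (t * q_inv_neg) % R        # low-half multiply, analogous to vmul.u32
    u = (t + m * q) >> 32          # t + m*q is divisible by 2^32 by construction
    return u

a, b = 12345678, 42887727          # sample operand and a twiddle from the tables below
u = montgomery_reduce(a * b)
assert (u * R - a * b) % q == 0    # u really is a*b*R^-1 mod q
assert 0 <= u < 2 * q
```

The precomputed `b * q^-1 mod 2^32` companion stored next to each twiddle lets the assembly fold the `m = t * q_inv_neg` step into a single low-half multiply by the twisted constant, which is why every `.word` entry in the tables comes in pairs.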
-/// - -.data -.global ntt_192_u32_114826273_107284677_incomplete_good_twiddles -ntt_192_u32_114826273_107284677_incomplete_good_twiddles: // For base multiplication -.word 172045843 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 3105084493 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 89562401 // zeta^160 * 2^31 = 107284677^160 * 2^31 = 71938546 * 2^31 -.word 1860804351 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 107284677^160 * 553543649 * 2^31 -.word 78731003 // zeta^ 80 * 2^31 = 107284677^ 80 * 2^31 = 1326612 * 2^31 -.word 3642331237 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 80 * 553543649 * 2^31 -.word 21975303 // zeta^ 48 * 2^31 = 107284677^ 48 * 2^31 = 11544119 * 2^31 -.word 2864576473 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 48 * 553543649 * 2^31 -.word 168650109 // zeta^136 * 2^31 = 107284677^136 * 2^31 = 85313027 * 2^31 -.word 1952350755 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 107284677^136 * 553543649 * 2^31 -.word 92010389 // zeta^104 * 2^31 = 107284677^104 * 2^31 = 79315144 * 2^31 -.word 127431435 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 107284677^104 * 553543649 * 2^31 -.word 121600131 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31 -.word 633364445 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31 -.word 35578973 // zeta^184 * 2^31 = 107284677^184 * 2^31 = 55506216 * 2^31 -.word 2159140675 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 107284677^184 * 553543649 * 2^31 -.word 55680185 // zeta^ 68 * 2^31 = 107284677^ 68 * 2^31 = 46436470 * 2^31 -.word 2678641255 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 68 * 553543649 * 2^31 -.word 184439943 // zeta^ 36 * 2^31 = 107284677^ 36 * 2^31 = 38894874 * 2^31 -.word 1149461593 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 36 * 553543649 * 2^31 -.word 212208961 // zeta^148 * 2^31 = 107284677^148 * 2^31 = 31137870 * 2^31 -.word 3426431711 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 107284677^148 * 
553543649 * 2^31 -.word 36836787 // zeta^116 * 2^31 = 107284677^116 * 2^31 = 72551608 * 2^31 -.word 61204653 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 107284677^116 * 553543649 * 2^31 -.word 24016093 // zeta^ 12 * 2^31 = 107284677^ 12 * 2^31 = 106011292 * 2^31 -.word 3114294979 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 12 * 553543649 * 2^31 -.word 162365217 // zeta^172 * 2^31 = 107284677^172 * 2^31 = 18107895 * 2^31 -.word 1839135999 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 107284677^172 * 553543649 * 2^31 -.word 224634063 // zeta^ 92 * 2^31 = 107284677^ 92 * 2^31 = 77720494 * 2^31 -.word 3199917329 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 92 * 553543649 * 2^31 -.word 44136757 // zeta^ 60 * 2^31 = 107284677^ 60 * 2^31 = 35185048 * 2^31 -.word 3585746283 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 60 * 553543649 * 2^31 -.word 140090145 // zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31 -.word 2434162943 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31 -.word 197309715 // zeta^ 32 * 2^31 = 107284677^ 32 * 2^31 = 42887728 * 2^31 -.word 1244280141 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 32 * 553543649 * 2^31 -.word 207677243 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1430390821 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 171581973 // zeta^112 * 2^31 = 107284677^112 * 2^31 = 104608766 * 2^31 -.word 777754763 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 107284677^112 * 553543649 * 2^31 -.word 137642157 // zeta^ 8 * 2^31 = 107284677^ 8 * 2^31 = 35511129 * 2^31 -.word 4167535859 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 8 * 553543649 * 2^31 -.word 191465993 // zeta^168 * 2^31 = 107284677^168 * 2^31 = 5997883 * 2^31 -.word 1824919319 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 107284677^168 * 553543649 * 2^31 -.word 194073573 // zeta^ 88 * 2^31 = 107284677^ 88 * 2^31 = 59320057 * 2^31 -.word 2135826619 // zeta^ 88 * f(q^(-1) 
mod 2^32) * 2^31 = 107284677^ 88 * 553543649 * 2^31 -.word 200847431 // zeta^ 56 * 2^31 = 107284677^ 56 * 2^31 = 91801134 * 2^31 -.word 2769191065 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 56 * 553543649 * 2^31 -.word 45212603 // zeta^132 * 2^31 = 107284677^132 * 2^31 = 75931399 * 2^31 -.word 3145505701 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 107284677^132 * 553543649 * 2^31 -.word 215719061 // zeta^100 * 2^31 = 107284677^100 * 2^31 = 7541596 * 2^31 -.word 1529179659 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 107284677^100 * 553543649 * 2^31 -.word 192815759 // zeta^ 20 * 2^31 = 107284677^ 20 * 2^31 = 42274665 * 2^31 -.word 4233762641 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 20 * 553543649 * 2^31 -.word 60545901 // zeta^180 * 2^31 = 107284677^180 * 2^31 = 73412535 * 2^31 -.word 3365227059 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 107284677^180 * 553543649 * 2^31 -.word 67287329 // zeta^ 76 * 2^31 = 107284677^ 76 * 2^31 = 96718378 * 2^31 -.word 2455831295 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 76 * 553543649 * 2^31 -.word 206129695 // zeta^ 44 * 2^31 = 107284677^ 44 * 2^31 = 87903397 * 2^31 -.word 1275158977 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 44 * 553543649 * 2^31 -.word 185515789 // zeta^156 * 2^31 = 107284677^156 * 2^31 = 79641225 * 2^31 -.word 709221011 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 107284677^156 * 553543649 * 2^31 -.word 65671033 // zeta^124 * 2^31 = 107284677^124 * 2^31 = 42535446 * 2^31 -.word 3909138343 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 107284677^124 * 553543649 * 2^31 -.word 32342831 // zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31 -.word 3050687153 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31 -.word 57606703 // zeta^ 96 * 2^31 = 107284677^ 96 * 2^31 = 114826272 * 2^31 -.word 1189882801 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 96 * 553543649 * 2^31 -.word 58070573 // zeta^ 16 * 2^31 = 107284677^ 16 * 2^31 = 10217507 * 2^31 -.word 
3517212531 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 16 * 553543649 * 2^31 -.word 150921543 // zeta^176 * 2^31 = 107284677^176 * 2^31 = 113499661 * 2^31 -.word 652636057 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 107284677^176 * 553543649 * 2^31 -.word 38186553 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31 -.word 2470047975 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31 -.word 61002437 // zeta^ 40 * 2^31 = 107284677^ 40 * 2^31 = 29513246 * 2^31 -.word 2342616539 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 40 * 553543649 * 2^31 -.word 28805115 // zeta^152 * 2^31 = 107284677^152 * 2^31 = 23025139 * 2^31 -.word 1525776229 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 107284677^152 * 553543649 * 2^31 -.word 108052415 // zeta^120 * 2^31 = 107284677^120 * 2^31 = 82345196 * 2^31 -.word 3661602849 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 107284677^120 * 553543649 * 2^31 -.word 13933485 // zeta^ 4 * 2^31 = 107284677^ 4 * 2^31 = 107284677 * 2^31 -.word 2765787635 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 4 * 553543649 * 2^31 -.word 173972361 // zeta^164 * 2^31 = 107284677^164 * 2^31 = 68389803 * 2^31 -.word 1616326039 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 107284677^164 * 553543649 * 2^31 -.word 169106645 // zeta^ 84 * 2^31 = 107284677^ 84 * 2^31 = 41413738 * 2^31 -.word 929740235 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 84 * 553543649 * 2^31 -.word 17443585 // zeta^ 52 * 2^31 = 107284677^ 52 * 2^31 = 83688403 * 2^31 -.word 868535583 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 52 * 553543649 * 2^31 -.word 23522851 // zeta^140 * 2^31 = 107284677^140 * 2^31 = 26922876 * 2^31 -.word 3019808317 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 107284677^140 * 553543649 * 2^31 -.word 205636453 // zeta^108 * 2^31 = 107284677^108 * 2^31 = 8814981 * 2^31 -.word 1180672315 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 107284677^108 * 553543649 * 2^31 -.word 163981513 // zeta^ 28 * 2^31 = 107284677^ 28 * 
2^31 = 72290827 * 2^31 -.word 385828951 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 28 * 553543649 * 2^31 -.word 5018483 // zeta^188 * 2^31 = 107284677^188 * 2^31 = 37105779 * 2^31 -.word 1095049965 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 107284677^188 * 553543649 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_114826273_107284677_incomplete_good_scale -ntt_192_u32_114826273_107284677_incomplete_good_scale: // Constants for scaling by 1/N -.word 172045843 // 1/48 -.word 3105084493 // 1/48 twisted -.data -roots: -.word 42887727 /// zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31 -.word 802087275 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31 -.word 71938545 /// zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31 -.word 1345396354 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 108828390 // 
zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31 -.word 2035311100 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31 -.word 32481077 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31 -.word 607461863 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31 -.word 108828390 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31 -.word 2035311100 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31 -.word 75931399 // zeta^132 * 2^31 = 107284677^132 * 2^31 = 75931399 * 2^31 -.word 1420070803 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 107284677^132 * 553543649 * 2^31 -.word 41413738 // zeta^ 84 * 2^31 = 107284677^ 84 * 2^31 = 41413738 * 2^31 -.word 774520698 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 84 * 553543649 * 2^31 -.word 32481077 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31 -.word 607461863 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31 -.word 106011292 // zeta^ 12 * 2^31 = 107284677^ 12 * 2^31 = 106011292 * 2^31 -.word 1982625667 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 12 * 553543649 * 2^31 -.word 79641225 // zeta^156 * 2^31 = 107284677^156 * 2^31 = 79641225 * 2^31 -.word 1489452056 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 107284677^156 * 553543649 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_114826273_107284677_incomplete_good, %function -.global ntt_192_u32_114826273_107284677_incomplete_good -ntt_192_u32_114826273_107284677_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -114826273 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 
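The twiddle tables above can be sanity-checked against the algebraic identities a 192-point NTT requires; the residues quoted in the comments (before the 2^31 Montgomery scaling) must satisfy them exactly. A quick Python check, using values copied from the comments of this deleted file:

```python
q = 114826273  # modulus from the deleted ntt_192_u32_114826273_107284677 source

# Residues copied from the twiddle comments above: zeta^48, ^64, ^96, ^128, ^144.
z48, z64, z96, z128, z144 = 11544119, 42887727, 114826272, 71938545, 103282154

assert z96 == q - 1               # zeta^96 = -1 mod q (192nd root of unity)
assert pow(z64, 2, q) == z128     # exponents add: (zeta^64)^2 = zeta^128
assert pow(z64, 3, q) == 1        # zeta^64 is a nontrivial cube root of unity
assert (1 + z64 + z128) % q == 0  # the three cube roots of unity sum to zero
assert pow(z144, 2, q) == q - 1   # zeta^144 is a square root of -1
assert (z48 + z144) % q == 0      # zeta^48 = -zeta^144, the other square root of -1
```

These identities hold independently of the Montgomery representation, so they are a cheap cross-check that a regenerated table matches the one being deleted here.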
-vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// 
input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, 
Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, 
#(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, 
#(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd 
r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 
Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: 
Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// 
input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[104]: Load as 
Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 
Q2, [r0, #(4 * 76)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 
Q0, [r14,#(-448)] -// Release input[140] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] 
from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vmla.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] 
-// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-vmul.u32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vmla.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-.equ modulus_inv, 3741423647
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1357
-// Instruction count: 998
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_bitrev.s
deleted file mode 100644
index 052e336..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,1285 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 71938545 /// zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31
-.word 1345396354 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31
-.word 42887727 /// zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31
-.word 802087275 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 11544119 // zeta^ 48 * 2^31 = 107284677^ 48 * 2^31 = 11544119 * 2^31
-.word 215898384 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 48 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 11544119 // zeta^ 48 * 2^31 = 107284677^ 48 * 2^31 = 11544119 * 2^31
-.word 215898384 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 48 * 553543649 * 2^31
-.word 11544119 // zeta^ 48 * 2^31 = 107284677^ 48 * 2^31 = 11544119 * 2^31
-.word 215898384 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 48 * 553543649 * 2^31
-.word 82345196 // zeta^120 * 2^31 = 107284677^120 * 2^31 = 82345196 * 2^31
-.word 1540021785 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 107284677^120 * 553543649 * 2^31
-.word 5997883 // zeta^168 * 2^31 = 107284677^168 * 2^31 = 5997883 * 2^31
-.word 112172548 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 107284677^168 * 553543649 * 2^31
-.word 82345196 // zeta^120 * 2^31 = 107284677^120 * 2^31 = 82345196 * 2^31
-.word 1540021785 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 107284677^120 * 553543649 * 2^31
-.word 35185048 // zeta^ 60 * 2^31 = 107284677^ 60 * 2^31 = 35185048 * 2^31
-.word 658031592 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 60 * 553543649 * 2^31
-.word 8814981 // zeta^108 * 2^31 = 107284677^108 * 2^31 = 8814981 * 2^31
-.word 164857981 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 107284677^108 * 553543649 * 2^31
-.word 5997883 // zeta^168 * 2^31 = 107284677^168 * 2^31 = 5997883 * 2^31
-.word 112172548 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 107284677^168 * 553543649 * 2^31
-.word 73412535 // zeta^180 * 2^31 = 107284677^180 * 2^31 = 73412535 * 2^31
-.word 1372962950 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 107284677^180 * 553543649 * 2^31
-.word 38894874 // zeta^ 36 * 2^31 = 107284677^ 36 * 2^31 = 38894874 * 2^31
-.word 727412845 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 36 * 553543649 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_114826273_107284677_incomplete_good_bitrev, %function
-.global ntt_192_u32_114826273_107284677_incomplete_good_bitrev
-ntt_192_u32_114826273_107284677_incomplete_good_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008 -.equ modulus, -114826273 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release 
input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// 
input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 
-vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// 
input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: 
Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// 
input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[52]: Load as 
Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 
-// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vmla.s32 
Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, 
r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[136]: Load as Q1 
-vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] 
-vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// 
input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, 
[r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 3741423647 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop.s b/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop.s deleted file mode 100644 index 04bccf7..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop.s +++ /dev/null @@ -1,1395 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of 
charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_192_u32_114826273_107284677_incomplete_good_oop_twiddles -ntt_192_u32_114826273_107284677_incomplete_good_oop_twiddles: // For base multiplication -.word 172045843 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 3105084493 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 89562401 // zeta^160 * 2^31 = 107284677^160 * 2^31 = 71938546 * 2^31 -.word 1860804351 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 107284677^160 * 553543649 * 2^31 -.word 78731003 // zeta^ 80 * 2^31 = 107284677^ 80 * 2^31 = 1326612 * 2^31 -.word 3642331237 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 80 * 553543649 * 2^31 -.word 21975303 // zeta^ 48 * 2^31 = 107284677^ 48 * 2^31 = 11544119 * 2^31 -.word 2864576473 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 48 * 553543649 * 2^31 -.word 168650109 // zeta^136 * 2^31 = 107284677^136 * 2^31 = 85313027 * 2^31 -.word 1952350755 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 107284677^136 * 553543649 * 2^31 -.word 92010389 // zeta^104 * 2^31 = 107284677^104 * 2^31 = 79315144 * 2^31 -.word 127431435 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 107284677^104 * 553543649 * 2^31 -.word 121600131 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31 -.word 633364445 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31 -.word 35578973 // zeta^184 * 2^31 = 107284677^184 * 2^31 = 55506216 * 2^31 -.word 2159140675 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 107284677^184 * 553543649 * 2^31 -.word 55680185 // zeta^ 68 * 2^31 = 107284677^ 68 * 2^31 = 46436470 * 2^31 -.word 2678641255 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 68 * 553543649 * 2^31 -.word 184439943 // zeta^ 36 * 2^31 = 107284677^ 36 * 2^31 = 38894874 * 2^31 -.word 1149461593 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 36 * 553543649 * 2^31 -.word 212208961 // zeta^148 * 2^31 = 107284677^148 * 2^31 = 31137870 * 2^31 -.word 3426431711 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 
107284677^148 * 553543649 * 2^31 -.word 36836787 // zeta^116 * 2^31 = 107284677^116 * 2^31 = 72551608 * 2^31 -.word 61204653 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 107284677^116 * 553543649 * 2^31 -.word 24016093 // zeta^ 12 * 2^31 = 107284677^ 12 * 2^31 = 106011292 * 2^31 -.word 3114294979 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 12 * 553543649 * 2^31 -.word 162365217 // zeta^172 * 2^31 = 107284677^172 * 2^31 = 18107895 * 2^31 -.word 1839135999 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 107284677^172 * 553543649 * 2^31 -.word 224634063 // zeta^ 92 * 2^31 = 107284677^ 92 * 2^31 = 77720494 * 2^31 -.word 3199917329 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 92 * 553543649 * 2^31 -.word 44136757 // zeta^ 60 * 2^31 = 107284677^ 60 * 2^31 = 35185048 * 2^31 -.word 3585746283 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 60 * 553543649 * 2^31 -.word 140090145 // zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31 -.word 2434162943 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31 -.word 197309715 // zeta^ 32 * 2^31 = 107284677^ 32 * 2^31 = 42887728 * 2^31 -.word 1244280141 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 32 * 553543649 * 2^31 -.word 207677243 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1430390821 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 171581973 // zeta^112 * 2^31 = 107284677^112 * 2^31 = 104608766 * 2^31 -.word 777754763 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 107284677^112 * 553543649 * 2^31 -.word 137642157 // zeta^ 8 * 2^31 = 107284677^ 8 * 2^31 = 35511129 * 2^31 -.word 4167535859 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 8 * 553543649 * 2^31 -.word 191465993 // zeta^168 * 2^31 = 107284677^168 * 2^31 = 5997883 * 2^31 -.word 1824919319 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 107284677^168 * 553543649 * 2^31 -.word 194073573 // zeta^ 88 * 2^31 = 107284677^ 88 * 2^31 = 59320057 * 2^31 -.word 2135826619 // zeta^ 
88 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 88 * 553543649 * 2^31
-.word 200847431 // zeta^ 56 * 2^31 = 107284677^ 56 * 2^31 = 91801134 * 2^31
-.word 2769191065 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 56 * 553543649 * 2^31
-.word 45212603 // zeta^132 * 2^31 = 107284677^132 * 2^31 = 75931399 * 2^31
-.word 3145505701 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 107284677^132 * 553543649 * 2^31
-.word 215719061 // zeta^100 * 2^31 = 107284677^100 * 2^31 = 7541596 * 2^31
-.word 1529179659 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 107284677^100 * 553543649 * 2^31
-.word 192815759 // zeta^ 20 * 2^31 = 107284677^ 20 * 2^31 = 42274665 * 2^31
-.word 4233762641 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 20 * 553543649 * 2^31
-.word 60545901 // zeta^180 * 2^31 = 107284677^180 * 2^31 = 73412535 * 2^31
-.word 3365227059 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 107284677^180 * 553543649 * 2^31
-.word 67287329 // zeta^ 76 * 2^31 = 107284677^ 76 * 2^31 = 96718378 * 2^31
-.word 2455831295 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 76 * 553543649 * 2^31
-.word 206129695 // zeta^ 44 * 2^31 = 107284677^ 44 * 2^31 = 87903397 * 2^31
-.word 1275158977 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 44 * 553543649 * 2^31
-.word 185515789 // zeta^156 * 2^31 = 107284677^156 * 2^31 = 79641225 * 2^31
-.word 709221011 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 107284677^156 * 553543649 * 2^31
-.word 65671033 // zeta^124 * 2^31 = 107284677^124 * 2^31 = 42535446 * 2^31
-.word 3909138343 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 107284677^124 * 553543649 * 2^31
-.word 32342831 // zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31
-.word 3050687153 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31
-.word 57606703 // zeta^ 96 * 2^31 = 107284677^ 96 * 2^31 = 114826272 * 2^31
-.word 1189882801 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 96 * 553543649 * 2^31
-.word 58070573 // zeta^ 16 * 2^31 = 107284677^ 16 * 2^31 = 10217507 * 2^31
-.word 3517212531 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 16 * 553543649 * 2^31
-.word 150921543 // zeta^176 * 2^31 = 107284677^176 * 2^31 = 113499661 * 2^31
-.word 652636057 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 107284677^176 * 553543649 * 2^31
-.word 38186553 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31
-.word 2470047975 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31
-.word 61002437 // zeta^ 40 * 2^31 = 107284677^ 40 * 2^31 = 29513246 * 2^31
-.word 2342616539 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 40 * 553543649 * 2^31
-.word 28805115 // zeta^152 * 2^31 = 107284677^152 * 2^31 = 23025139 * 2^31
-.word 1525776229 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 107284677^152 * 553543649 * 2^31
-.word 108052415 // zeta^120 * 2^31 = 107284677^120 * 2^31 = 82345196 * 2^31
-.word 3661602849 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 107284677^120 * 553543649 * 2^31
-.word 13933485 // zeta^ 4 * 2^31 = 107284677^ 4 * 2^31 = 107284677 * 2^31
-.word 2765787635 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 4 * 553543649 * 2^31
-.word 173972361 // zeta^164 * 2^31 = 107284677^164 * 2^31 = 68389803 * 2^31
-.word 1616326039 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 107284677^164 * 553543649 * 2^31
-.word 169106645 // zeta^ 84 * 2^31 = 107284677^ 84 * 2^31 = 41413738 * 2^31
-.word 929740235 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 84 * 553543649 * 2^31
-.word 17443585 // zeta^ 52 * 2^31 = 107284677^ 52 * 2^31 = 83688403 * 2^31
-.word 868535583 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 52 * 553543649 * 2^31
-.word 23522851 // zeta^140 * 2^31 = 107284677^140 * 2^31 = 26922876 * 2^31
-.word 3019808317 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 107284677^140 * 553543649 * 2^31
-.word 205636453 // zeta^108 * 2^31 = 107284677^108 * 2^31 = 8814981 * 2^31
-.word 1180672315 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 107284677^108 * 553543649 * 2^31
-.word 163981513 // zeta^ 28 * 2^31 = 107284677^ 28 * 2^31 = 72290827 * 2^31
-.word 385828951 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 28 * 553543649 * 2^31
-.word 5018483 // zeta^188 * 2^31 = 107284677^188 * 2^31 = 37105779 * 2^31
-.word 1095049965 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 107284677^188 * 553543649 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_192_u32_114826273_107284677_incomplete_good_oop_scale
-ntt_192_u32_114826273_107284677_incomplete_good_oop_scale: // Constants for scaling by 1/N
-.word 172045843 // 1/48
-.word 3105084493 // 1/48 twisted
-.data
-roots:
-.word 42887727 /// zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31
-.word 802087275 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31
-.word 71938545 /// zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31
-.word 1345396354 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31
-.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31
-.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31
-.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31
-.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31
-.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31
-.word 108828390 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31
-.word 32481077 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31
-.word 108828390 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31
-.word 75931399 // zeta^132 * 2^31 = 107284677^132 * 2^31 = 75931399 * 2^31
-.word 1420070803 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 107284677^132 * 553543649 * 2^31
-.word 41413738 // zeta^ 84 * 2^31 = 107284677^ 84 * 2^31 = 41413738 * 2^31
-.word 774520698 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 84 * 553543649 * 2^31
-.word 32481077 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31
-.word 106011292 // zeta^ 12 * 2^31 = 107284677^ 12 * 2^31 = 106011292 * 2^31
-.word 1982625667 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 12 * 553543649 * 2^31
-.word 79641225 // zeta^156 * 2^31 = 107284677^156 * 2^31 = 79641225 * 2^31
-.word 1489452056 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 107284677^156 * 553543649 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_114826273_107284677_incomplete_good_oop, %function
-.global ntt_192_u32_114826273_107284677_incomplete_good_oop
-ntt_192_u32_114826273_107284677_incomplete_good_oop:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 256
-add r14, r0, #256
-// Use r12 as marker for r0 + 512
-add r12, r14, #256
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-.equ modulus, -114826273
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r8 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r7 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r10 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r1,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r8 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r7 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vmla.s32 Q2, Q3, r10 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 72)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r1,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r1,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded 
as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r1,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r1,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 
-vqrdmulh.s32 Q2, Q0, r7 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r1,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 
Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r1,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r1,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: 
Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r1,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r1,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r1,#(496)] -vsub.s32 Q2, Q2, Q0 
-vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r1,#(240)] -//////////// END OF RADIX 3 ////////////////////////// -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -// output[48]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r4 -// output[96]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release output[48] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[180]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release output[96] from Q4 -vqrdmulh.s32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// output[84]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 84)] -vmla.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r11,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release output[144] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r1,#(384)] -// output[84]: Already loaded as Q7 -// output[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r4 -// output[36]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release output[180] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[36] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[24]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 24)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release output[84] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(144)] -// output[24]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[168]: Load as Q2 -vldrw.u32 
Q2, [r11, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[72]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// output[60]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release output[168] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[156]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release output[24] from Q6 -vstrw.u32 Q3, [r1,#(288)] -// Release output[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-336)] -// output[156]: Already loaded as Q7 -// output[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[108]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release output[60] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[108] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[16]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 16)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release output[156] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(432)] -// output[16]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[64]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// output[52]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release output[160] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[148]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(64)] -vadd.s32 
Q3, Q3, Q6 -// Release output[16] from Q6 -vstrw.u32 Q3, [r1,#(256)] -// Release output[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-368)] -// output[148]: Already loaded as Q7 -// output[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[100]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release output[52] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[184]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release output[100] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[88]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 88)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release output[148] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(400)] -// output[88]: Already loaded as Q6 -// output[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release output[184] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[40] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[28]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 28)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release output[88] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(160)] -// output[28]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// 
output[76]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// output[176]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release output[172] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[80]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 80)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release output[28] from Q7 -vstrw.u32 Q3, [r1,#(304)] -// Release output[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// output[80]: Already loaded as Q6 -// output[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release output[176] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[32] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[20]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release output[80] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(128)] -// output[20]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[68]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// output[56]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release output[164] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[152]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -100)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release output[20] from Q7 -vstrw.u32 Q3, [r1,#(272)] -// 
Release output[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-352)] -// output[152]: Already loaded as Q6 -// output[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[104]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release output[56] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[188]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release output[104] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[92]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 92)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release output[152] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(416)] -// output[92]: Already loaded as Q7 -// output[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[44]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release output[188] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[12]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release output[44] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[132]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -120)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release output[92] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(176)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r8 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r7 -// output[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r10 -vmul.u32 Q2, Q1, r8 -vsub.s32 
Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r7 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r10 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r4 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r10 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vmul.u32 Q4, Q6, r6 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r5 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(48)] -// Release output[12] from Q5 -vmla.s32 Q4, Q6, r10 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(-480)] -// Release output[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r7 -// output[4]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 4)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[140]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(304)] -// Release output[76] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(16)] -// Release output[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r7 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[128]: Load as Q2 
-vldrw.u32 Q2, [r11, #(4 * -124)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[156]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -96)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-448)] -// Release output[140] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r7 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[28]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-384)] -// Release output[156] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[88]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r7 -// output[148]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-432)] -// Release output[144] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, 
#(4 * 16)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r4 
-vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// 
output[60]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[188]: Load as Q0 
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(496)]
-// Release output[124] from Q2
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-272)]
-// Release output[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(208)]
-// Release output[52] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r8
-// output[56]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r7
-// output[116]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 116)]
-vmla.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r1,#(448)]
-// Release output[112] from Q1
-vmul.u32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r10
-// output[176]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vmul.u32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r10
-vmul.u32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q1, Q4, r10
-vstrw.u32 Q3, [r1,#(224)]
-// Release output[56] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r1,#(464)]
-// Release output[116] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-304)]
-// Release output[176] from Q2
-.equ modulus_inv, 3741423647
-movw r14, #:lower16:modulus_inv
-movt r14, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1362
-// Instruction count: 1000
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input.s b/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input.s
deleted file mode 100644
index 9e835cc..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,1237 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
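An editorial note on the auto-generated twiddle tables in the deleted file below: in the `roots` section, each root r (= zeta^k mod q, with q = 114826273) is listed next to a "twisted" companion constant for `vqrdmulh`. Judging from the listed values, the companion appears to be round(2^31 * r / q); that formula is an inference from the constants, not something stated in the patch. A minimal sketch re-checking a few pairs copied from the table:

```python
# Editorial sketch, not part of the patch: check the apparent relation between
# each root r in the `roots` table below and its listed "twisted" companion.
# Assumed formula (inferred from the constants): twisted(r) = round(2^31 * r / q).
Q = 114826273  # modulus of this 192-point NTT

def twisted(r, q=Q):
    # integer round-half-up of 2^31 * r / q
    return (2 * (r << 31) + q) // (2 * q)

# (root, listed twisted constant) pairs copied from the roots table below
pairs = [
    (1, 19),                 # zeta^0
    (42887727, 802087275),   # zeta^64
    (71938545, 1345396354),  # zeta^128
    (103282154, 1931585264), # zeta^144
]
for r, listed in pairs:
    # allow +/-1 in case the generator used a different tie-breaking rule
    assert abs(twisted(r) - listed) <= 1, (r, twisted(r), listed)
print("twisted-constant relation holds for all sampled pairs")
```

The same shape of relation seems to hold for every value/companion pair in the `roots` table; the `// For base multiplication` twiddles additionally fold in the Montgomery factor f(q^(-1) mod 2^32) named in their comments.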
-/// - -.data -.global ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input_twiddles -ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input_twiddles: // For base multiplication -.word 172045843 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 3105084493 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 89562401 // zeta^160 * 2^31 = 107284677^160 * 2^31 = 71938546 * 2^31 -.word 1860804351 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 107284677^160 * 553543649 * 2^31 -.word 78731003 // zeta^ 80 * 2^31 = 107284677^ 80 * 2^31 = 1326612 * 2^31 -.word 3642331237 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 80 * 553543649 * 2^31 -.word 21975303 // zeta^ 48 * 2^31 = 107284677^ 48 * 2^31 = 11544119 * 2^31 -.word 2864576473 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 48 * 553543649 * 2^31 -.word 168650109 // zeta^136 * 2^31 = 107284677^136 * 2^31 = 85313027 * 2^31 -.word 1952350755 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 107284677^136 * 553543649 * 2^31 -.word 92010389 // zeta^104 * 2^31 = 107284677^104 * 2^31 = 79315144 * 2^31 -.word 127431435 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 107284677^104 * 553543649 * 2^31 -.word 121600131 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31 -.word 633364445 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31 -.word 35578973 // zeta^184 * 2^31 = 107284677^184 * 2^31 = 55506216 * 2^31 -.word 2159140675 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 107284677^184 * 553543649 * 2^31 -.word 55680185 // zeta^ 68 * 2^31 = 107284677^ 68 * 2^31 = 46436470 * 2^31 -.word 2678641255 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 68 * 553543649 * 2^31 -.word 184439943 // zeta^ 36 * 2^31 = 107284677^ 36 * 2^31 = 38894874 * 2^31 -.word 1149461593 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 36 * 553543649 * 2^31 -.word 212208961 // zeta^148 * 2^31 = 107284677^148 * 2^31 = 31137870 * 2^31 -.word 3426431711 // zeta^148 * f(q^(-1) mod 
2^32) * 2^31 = 107284677^148 * 553543649 * 2^31 -.word 36836787 // zeta^116 * 2^31 = 107284677^116 * 2^31 = 72551608 * 2^31 -.word 61204653 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 107284677^116 * 553543649 * 2^31 -.word 24016093 // zeta^ 12 * 2^31 = 107284677^ 12 * 2^31 = 106011292 * 2^31 -.word 3114294979 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 12 * 553543649 * 2^31 -.word 162365217 // zeta^172 * 2^31 = 107284677^172 * 2^31 = 18107895 * 2^31 -.word 1839135999 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 107284677^172 * 553543649 * 2^31 -.word 224634063 // zeta^ 92 * 2^31 = 107284677^ 92 * 2^31 = 77720494 * 2^31 -.word 3199917329 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 92 * 553543649 * 2^31 -.word 44136757 // zeta^ 60 * 2^31 = 107284677^ 60 * 2^31 = 35185048 * 2^31 -.word 3585746283 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 60 * 553543649 * 2^31 -.word 140090145 // zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31 -.word 2434162943 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31 -.word 197309715 // zeta^ 32 * 2^31 = 107284677^ 32 * 2^31 = 42887728 * 2^31 -.word 1244280141 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 32 * 553543649 * 2^31 -.word 207677243 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1430390821 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 171581973 // zeta^112 * 2^31 = 107284677^112 * 2^31 = 104608766 * 2^31 -.word 777754763 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 107284677^112 * 553543649 * 2^31 -.word 137642157 // zeta^ 8 * 2^31 = 107284677^ 8 * 2^31 = 35511129 * 2^31 -.word 4167535859 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 8 * 553543649 * 2^31 -.word 191465993 // zeta^168 * 2^31 = 107284677^168 * 2^31 = 5997883 * 2^31 -.word 1824919319 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 107284677^168 * 553543649 * 2^31 -.word 194073573 // zeta^ 88 * 2^31 = 107284677^ 88 * 2^31 = 59320057 * 2^31 -.word 
2135826619 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 88 * 553543649 * 2^31 -.word 200847431 // zeta^ 56 * 2^31 = 107284677^ 56 * 2^31 = 91801134 * 2^31 -.word 2769191065 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 56 * 553543649 * 2^31 -.word 45212603 // zeta^132 * 2^31 = 107284677^132 * 2^31 = 75931399 * 2^31 -.word 3145505701 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 107284677^132 * 553543649 * 2^31 -.word 215719061 // zeta^100 * 2^31 = 107284677^100 * 2^31 = 7541596 * 2^31 -.word 1529179659 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 107284677^100 * 553543649 * 2^31 -.word 192815759 // zeta^ 20 * 2^31 = 107284677^ 20 * 2^31 = 42274665 * 2^31 -.word 4233762641 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 20 * 553543649 * 2^31 -.word 60545901 // zeta^180 * 2^31 = 107284677^180 * 2^31 = 73412535 * 2^31 -.word 3365227059 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 107284677^180 * 553543649 * 2^31 -.word 67287329 // zeta^ 76 * 2^31 = 107284677^ 76 * 2^31 = 96718378 * 2^31 -.word 2455831295 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 76 * 553543649 * 2^31 -.word 206129695 // zeta^ 44 * 2^31 = 107284677^ 44 * 2^31 = 87903397 * 2^31 -.word 1275158977 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 44 * 553543649 * 2^31 -.word 185515789 // zeta^156 * 2^31 = 107284677^156 * 2^31 = 79641225 * 2^31 -.word 709221011 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 107284677^156 * 553543649 * 2^31 -.word 65671033 // zeta^124 * 2^31 = 107284677^124 * 2^31 = 42535446 * 2^31 -.word 3909138343 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 107284677^124 * 553543649 * 2^31 -.word 32342831 // zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31 -.word 3050687153 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31 -.word 57606703 // zeta^ 96 * 2^31 = 107284677^ 96 * 2^31 = 114826272 * 2^31 -.word 1189882801 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 96 * 553543649 * 2^31 -.word 58070573 // zeta^ 16 * 2^31 = 107284677^ 16 
* 2^31 = 10217507 * 2^31 -.word 3517212531 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 16 * 553543649 * 2^31 -.word 150921543 // zeta^176 * 2^31 = 107284677^176 * 2^31 = 113499661 * 2^31 -.word 652636057 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 107284677^176 * 553543649 * 2^31 -.word 38186553 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31 -.word 2470047975 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31 -.word 61002437 // zeta^ 40 * 2^31 = 107284677^ 40 * 2^31 = 29513246 * 2^31 -.word 2342616539 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 40 * 553543649 * 2^31 -.word 28805115 // zeta^152 * 2^31 = 107284677^152 * 2^31 = 23025139 * 2^31 -.word 1525776229 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 107284677^152 * 553543649 * 2^31 -.word 108052415 // zeta^120 * 2^31 = 107284677^120 * 2^31 = 82345196 * 2^31 -.word 3661602849 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 107284677^120 * 553543649 * 2^31 -.word 13933485 // zeta^ 4 * 2^31 = 107284677^ 4 * 2^31 = 107284677 * 2^31 -.word 2765787635 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 4 * 553543649 * 2^31 -.word 173972361 // zeta^164 * 2^31 = 107284677^164 * 2^31 = 68389803 * 2^31 -.word 1616326039 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 107284677^164 * 553543649 * 2^31 -.word 169106645 // zeta^ 84 * 2^31 = 107284677^ 84 * 2^31 = 41413738 * 2^31 -.word 929740235 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 84 * 553543649 * 2^31 -.word 17443585 // zeta^ 52 * 2^31 = 107284677^ 52 * 2^31 = 83688403 * 2^31 -.word 868535583 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 52 * 553543649 * 2^31 -.word 23522851 // zeta^140 * 2^31 = 107284677^140 * 2^31 = 26922876 * 2^31 -.word 3019808317 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 107284677^140 * 553543649 * 2^31 -.word 205636453 // zeta^108 * 2^31 = 107284677^108 * 2^31 = 8814981 * 2^31 -.word 1180672315 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 107284677^108 * 553543649 * 2^31 -.word 163981513 // 
zeta^ 28 * 2^31 = 107284677^ 28 * 2^31 = 72290827 * 2^31 -.word 385828951 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 28 * 553543649 * 2^31 -.word 5018483 // zeta^188 * 2^31 = 107284677^188 * 2^31 = 37105779 * 2^31 -.word 1095049965 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 107284677^188 * 553543649 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input_scale -ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 172045843 // 1/48 -.word 3105084493 // 1/48 twisted -.data -roots: -.word 42887727 /// zeta^ 64 * 2^31 = 107284677^ 64 * 2^31 = 42887727 * 2^31 -.word 802087275 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 64 * 553543649 * 2^31 -.word 71938545 /// zeta^128 * 2^31 = 107284677^128 * 2^31 = 71938545 * 2^31 -.word 1345396354 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 107284677^128 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 107284677^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 107284677^144 * 553543649 * 2^31 -.word 103282154 // zeta^144 * 2^31 = 107284677^144 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^144 * f(q^(-1) mod 
2^32) * 2^31 = 107284677^144 * 553543649 * 2^31
-.word 108828390 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31
-.word 32481077 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31
-.word 108828390 // zeta^ 72 * 2^31 = 107284677^ 72 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 72 * 553543649 * 2^31
-.word 75931399 // zeta^132 * 2^31 = 107284677^132 * 2^31 = 75931399 * 2^31
-.word 1420070803 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 107284677^132 * 553543649 * 2^31
-.word 41413738 // zeta^ 84 * 2^31 = 107284677^ 84 * 2^31 = 41413738 * 2^31
-.word 774520698 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 84 * 553543649 * 2^31
-.word 32481077 // zeta^ 24 * 2^31 = 107284677^ 24 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 24 * 553543649 * 2^31
-.word 106011292 // zeta^ 12 * 2^31 = 107284677^ 12 * 2^31 = 106011292 * 2^31
-.word 1982625667 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 107284677^ 12 * 553543649 * 2^31
-.word 79641225 // zeta^156 * 2^31 = 107284677^156 * 2^31 = 79641225 * 2^31
-.word 1489452056 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 107284677^156 * 553543649 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input, %function
-.global ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input
-ntt_192_u32_114826273_107284677_incomplete_good_oop_half_input:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 256
-add r14, r0, #256
-// Use r12 as marker for r0 + 512
-add r12, r14, #256
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-.equ modulus, -114826273
-movw r10, #:lower16:modulus
-movt r10,
#:upper16:modulus -ldr r9, roots_addr -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r8 -vadd.s32 Q4, Q1, Q0 -// input[4]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r7 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r10 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r1,#(256)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(-496)] -// Release input[0] from Q1 -// Release input[64] from Q0 -// input[4]: Already loaded as Q6 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -vmul.u32 Q1, Q0, r8 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vadd.s32 Q2, Q6, Q7 -vqrdmulh.s32 Q0, Q0, r7 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q6 -// Release input[4] from Q6 -vstrw.u32 Q2, [r11,#(-480)] -vsub.s32 Q5, Q1, Q7 -// Release input[68] from Q7 -vstrw.u32 Q5, [r1,#(16)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(272)] -// input[8]: Already loaded as Q4 -// input[72]: Already loaded as Q3 -vmul.u32 Q0, Q4, r8 -vadd.s32 Q2, Q3, Q4 -// input[12]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 12)] -vqrdmulh.s32 Q1, Q4, r7 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 76)] -vsub.s32 Q5, Q3, Q4 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(288)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r1,#(32)] -vsub.s32 Q5, Q5, Q0 -vstrw.u32 Q5, [r11,#(-464)] -// Release input[72] from Q3 -// Release input[8] from Q4 -// input[76]: Already loaded as Q7 -// input[12]: Already loaded as Q6 -vmul.u32 Q0, Q7, r8 -vadd.s32 Q2, Q6, Q7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmulh.s32 Q1, Q7, r7 -// input[80]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 80)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// 
Release input[12] from Q6 -// Release input[76] from Q7 -// input[16]: Already loaded as Q4 -// input[80]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r8 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r7 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q4 -// Release input[16] from Q4 -vstrw.u32 Q2, [r11,#(-432)] -vsub.s32 Q4, Q1, Q5 -// Release input[80] from Q5 -vstrw.u32 Q4, [r1,#(64)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(320)] -// input[20]: Already loaded as Q6 -// input[84]: Already loaded as Q3 -vmul.u32 Q0, Q6, r8 -vadd.s32 Q2, Q3, Q6 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q6, r7 -// input[88]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 88)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r1,#(80)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[84] from Q3 -// Release input[20] from Q6 -// input[88]: Already loaded as Q7 -// input[24]: Already loaded as Q5 -vmul.u32 Q0, Q7, r8 -vadd.s32 Q2, Q5, Q7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmulh.s32 Q1, Q7, r7 -// input[92]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 92)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[24] from Q5 -// Release input[88] from Q7 -// input[28]: Already loaded as Q4 -// input[92]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r7 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q4 -// Release input[28] from Q4 -vstrw.u32 Q2, [r11,#(-384)] -vsub.s32 Q4, Q1, Q6 -// Release input[92] from Q6 -vstrw.u32 Q4, [r1,#(112)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(368)] -// input[32]: Already loaded as Q3 -vmul.u32 
Q0, Q3, r8 -vneg.s32 Q1, Q3 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmulh.s32 Q2, Q3, r7 -vstrw.u32 Q3, [r1,#(384)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(128)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-368)] -// Release input[32] from Q3 -// input[36]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(144)] -vstrw.u32 Q4, [r1,#(400)] -vstrw.u32 Q4, [r11,#(-352)] -// Release input[36] from Q4 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vmul.u32 Q1, Q0, r8 -vneg.s32 Q2, Q0 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q0, r7 -vstrw.u32 Q0, [r11,#(-336)] -vmla.s32 Q1, Q3, r10 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r1,#(416)] -// Release input[40] from Q0 -// input[44]: Already loaded as Q4 -vmul.u32 Q0, Q4, r8 -vneg.s32 Q1, Q4 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q2, Q4, r7 -vstrw.u32 Q4, [r1,#(432)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(176)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-320)] -// Release input[44] from Q4 -// input[48]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(192)] -vstrw.u32 Q3, [r1,#(448)] -vstrw.u32 Q3, [r11,#(-304)] -// Release input[48] from Q3 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q1, Q0, r8 -vneg.s32 Q2, Q0 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q0, r7 -vstrw.u32 Q0, [r11,#(-288)] -vmla.s32 Q1, Q3, r10 -vstrw.u32 Q1, [r1,#(208)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r1,#(464)] -// Release input[52] from Q0 -// input[56]: Already loaded as Q4 -vmul.u32 Q0, Q4, r8 -vneg.s32 Q1, Q4 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q2, Q4, r7 -vstrw.u32 Q4, [r1,#(480)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(224)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-272)] -// Release input[56] from Q4 -// input[60]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(240)] -vstrw.u32 Q3, [r1,#(496)] -vstrw.u32 Q3, [r11,#(-256)] -// Release input[60] from 
Q3 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -// output[48]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r4 -// output[96]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release output[48] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[180]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release output[96] from Q4 -vqrdmulh.s32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// output[84]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 84)] -vmla.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r11,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release output[144] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r1,#(384)] -// output[84]: Already loaded as Q7 -// output[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r4 -// output[36]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release output[180] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[36] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[24]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 24)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release output[84] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(144)] -// output[24]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[168]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[72]: Load as Q3 
-vldrw.u32 Q3, [r1, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// output[60]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release output[168] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[156]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release output[24] from Q6 -vstrw.u32 Q3, [r1,#(288)] -// Release output[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-336)] -// output[156]: Already loaded as Q7 -// output[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[108]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release output[60] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[108] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[16]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 16)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release output[156] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(432)] -// output[16]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[64]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// output[52]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release output[160] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[148]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release output[16] from Q6 -vstrw.u32 Q3, [r1,#(256)] -// Release output[64] from Q3 
-vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-368)] -// output[148]: Already loaded as Q7 -// output[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[100]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release output[52] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[184]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release output[100] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[88]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 88)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release output[148] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(400)] -// output[88]: Already loaded as Q6 -// output[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release output[184] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[40] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[28]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 28)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release output[88] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(160)] -// output[28]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[76]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// output[176]: Load as 
Q5 -vldrw.u32 Q5, [r11, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release output[172] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[80]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 80)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release output[28] from Q7 -vstrw.u32 Q3, [r1,#(304)] -// Release output[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// output[80]: Already loaded as Q6 -// output[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release output[176] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[32] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[20]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release output[80] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(128)] -// output[20]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[68]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// output[56]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release output[164] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[152]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -100)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release output[20] from Q7 -vstrw.u32 Q3, [r1,#(272)] -// Release output[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r11,#(-352)] -// output[152]: Already loaded as Q6 -// output[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[104]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release output[56] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[188]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release output[104] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[92]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 92)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release output[152] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(416)] -// output[92]: Already loaded as Q7 -// output[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[44]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release output[188] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[12]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release output[44] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[132]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -120)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release output[92] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(176)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r8 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r7 -// output[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r10 -vmul.u32 Q2, Q1, r8 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r7 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r10 -// output[0]: 
Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r4 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r10 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vmul.u32 Q4, Q6, r6 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r5 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(48)] -// Release output[12] from Q5 -vmla.s32 Q4, Q6, r10 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(-480)] -// Release output[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r7 -// output[4]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 4)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[140]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(304)] -// Release output[76] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(16)] -// Release output[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r7 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 
-vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[156]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -96)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-448)] -// Release output[140] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r7 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[28]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-384)] -// Release output[156] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[88]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r7 -// output[148]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-432)] -// Release output[144] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 
-vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// 
output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[60]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, 
Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, 
Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(496)] -// Release output[124] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release output[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(208)] -// Release output[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[56]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r7 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -vmul.u32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r1,#(224)] -// Release output[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -.equ modulus_inv, 3741423647 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1204 -// Instruction count: 900 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good.s deleted file mode 100644 index f931429..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated 
documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_192_u32_128919937_120423310_incomplete_good_twiddles -ntt_192_u32_128919937_120423310_incomplete_good_twiddles: // For base multiplication -.word 151081339 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 2922143493 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 204918473 // zeta^160 * 2^31 = 120423310^160 * 2^31 = 2223848 * 2^31 -.word 1482262199 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 120423310^160 * 1521161857 * 2^31 -.word 64661425 // zeta^ 80 * 2^31 = 120423310^ 80 * 2^31 = 107269247 * 2^31 -.word 120408527 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 80 * 1521161857 * 2^31 -.word 11853537 // zeta^ 48 * 2^31 = 120423310^ 48 * 2^31 = 14136207 * 2^31 -.word 526998175 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 48 * 1521161857 * 2^31 -.word 210375303 // zeta^136 * 2^31 = 120423310^136 * 2^31 = 58374830 * 2^31 -.word 4205318137 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 120423310^136 * 1521161857 * 2^31 -.word 109569101 // zeta^104 * 2^31 = 120423310^104 * 2^31 = 44864068 * 2^31 -.word 1987598131 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 120423310^104 * 1521161857 * 2^31 -.word 170126629 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 1674400347 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 225931657 // zeta^184 * 2^31 = 120423310^184 * 2^31 = 83604053 * 2^31 -.word 4069196791 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 120423310^184 * 1521161857 * 2^31 -.word 255490897 // zeta^ 68 * 2^31 = 120423310^ 68 * 2^31 = 56394291 * 2^31 -.word 1943266863 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 68 * 1521161857 * 2^31 -.word 196963681 // zeta^ 36 * 2^31 = 120423310^ 36 * 2^31 = 47897664 * 2^31 -.word 2757397535 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 36 * 1521161857 * 2^31 -.word 249336559 // zeta^148 * 2^31 = 120423310^148 * 2^31 = 13888621 * 2^31 -.word 1653894033 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 
120423310^148 * 1521161857 * 2^31 -.word 25222977 // zeta^116 * 2^31 = 120423310^116 * 2^31 = 59206896 * 2^31 -.word 812968511 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 120423310^116 * 1521161857 * 2^31 -.word 45743345 // zeta^ 12 * 2^31 = 120423310^ 12 * 2^31 = 74458359 * 2^31 -.word 4119843983 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 12 * 1521161857 * 2^31 -.word 36202655 // zeta^172 * 2^31 = 120423310^172 * 2^31 = 79182254 * 2^31 -.word 2825952737 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 120423310^172 * 1521161857 * 2^31 -.word 221269965 // zeta^ 92 * 2^31 = 120423310^ 92 * 2^31 = 120320932 * 2^31 -.word 3554487219 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 92 * 1521161857 * 2^31 -.word 268193 // zeta^ 60 * 2^31 = 120423310^ 60 * 2^31 = 72023844 * 2^31 -.word 1096630751 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 60 * 1521161857 * 2^31 -.word 52921401 // zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31 -.word 2812705095 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31 -.word 75082803 // zeta^ 32 * 2^31 = 120423310^ 32 * 2^31 = 126696090 * 2^31 -.word 1439881293 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 32 * 1521161857 * 2^31 -.word 245986337 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 3767969119 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 181727825 // zeta^112 * 2^31 = 120423310^112 * 2^31 = 93133040 * 2^31 -.word 3888377647 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 120423310^112 * 1521161857 * 2^31 -.word 148270773 // zeta^ 8 * 2^31 = 120423310^ 8 * 2^31 = 84055869 * 2^31 -.word 2307369163 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 8 * 1521161857 * 2^31 -.word 229726139 // zeta^168 * 2^31 = 120423310^168 * 2^31 = 13510762 * 2^31 -.word 2217720005 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 120423310^168 * 1521161857 * 2^31 -.word 31908217 // zeta^ 88 * 2^31 = 120423310^ 88 * 2^31 = 45315884 * 2^31 -.word 225770503 
// zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 88 * 1521161857 * 2^31 -.word 73114909 // zeta^ 56 * 2^31 = 120423310^ 56 * 2^31 = 83528165 * 2^31 -.word 1900170851 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 56 * 1521161857 * 2^31 -.word 60876193 // zeta^132 * 2^31 = 120423310^132 * 2^31 = 81022273 * 2^31 -.word 1537569759 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 120423310^132 * 1521161857 * 2^31 -.word 187447153 // zeta^100 * 2^31 = 120423310^100 * 2^31 = 8496627 * 2^31 -.word 3480836623 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 120423310^100 * 1521161857 * 2^31 -.word 232616897 // zeta^ 20 * 2^31 = 120423310^ 20 * 2^31 = 69713041 * 2^31 -.word 3481998783 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 20 * 1521161857 * 2^31 -.word 95193645 // zeta^180 * 2^31 = 120423310^180 * 2^31 = 83601662 * 2^31 -.word 840925523 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 120423310^180 * 1521161857 * 2^31 -.word 221637219 // zeta^ 76 * 2^31 = 120423310^ 76 * 2^31 = 49737683 * 2^31 -.word 1469014557 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 76 * 1521161857 * 2^31 -.word 138460627 // zeta^ 44 * 2^31 = 120423310^ 44 * 2^31 = 124196042 * 2^31 -.word 1293891245 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 44 * 1521161857 * 2^31 -.word 257571681 // zeta^156 * 2^31 = 120423310^156 * 2^31 = 56896093 * 2^31 -.word 3198336543 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 120423310^156 * 1521161857 * 2^31 -.word 92081835 // zeta^124 * 2^31 = 120423310^124 * 2^31 = 48297088 * 2^31 -.word 2457856469 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 120423310^124 * 1521161857 * 2^31 -.word 182757071 // zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31 -.word 2855086001 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31 -.word 106758535 // zeta^ 96 * 2^31 = 120423310^ 96 * 2^31 = 128919936 * 2^31 -.word 1372823801 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 96 * 1521161857 * 2^31 -.word 76112049 // zeta^ 16 * 2^31 = 120423310^ 
16 * 2^31 = 35786897 * 2^31 -.word 406589647 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 16 * 1521161857 * 2^31 -.word 193178449 // zeta^176 * 2^31 = 120423310^176 * 2^31 = 21650690 * 2^31 -.word 4174558767 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 120423310^176 * 1521161857 * 2^31 -.word 28113735 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 2077247289 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 47464571 // zeta^ 40 * 2^31 = 120423310^ 40 * 2^31 = 70545107 * 2^31 -.word 89649157 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 40 * 1521161857 * 2^31 -.word 184724965 // zeta^152 * 2^31 = 120423310^152 * 2^31 = 45391772 * 2^31 -.word 2394796443 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 120423310^152 * 1521161857 * 2^31 -.word 87713245 // zeta^120 * 2^31 = 120423310^120 * 2^31 = 90707656 * 2^31 -.word 2620566947 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 120423310^120 * 1521161857 * 2^31 -.word 70392721 // zeta^ 4 * 2^31 = 120423310^ 4 * 2^31 = 120423310 * 2^31 -.word 814130671 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 4 * 1521161857 * 2^31 -.word 2348977 // zeta^164 * 2^31 = 120423310^164 * 2^31 = 72525646 * 2^31 -.word 2351700431 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 120423310^164 * 1521161857 * 2^31 -.word 162646229 // zeta^ 84 * 2^31 = 120423310^ 84 * 2^31 = 45318275 * 2^31 -.word 3454041771 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 84 * 1521161857 * 2^31 -.word 8503315 // zeta^ 52 * 2^31 = 120423310^ 52 * 2^31 = 115031316 * 2^31 -.word 2641073261 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 52 * 1521161857 * 2^31 -.word 119379247 // zeta^140 * 2^31 = 120423310^140 * 2^31 = 4723895 * 2^31 -.word 3001076049 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 120423310^140 * 1521161857 * 2^31 -.word 212096529 // zeta^108 * 2^31 = 120423310^108 * 2^31 = 54461578 * 2^31 -.word 175123311 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 120423310^108 * 1521161857 * 2^31 -.word 
165758039 // zeta^ 28 * 2^31 = 120423310^ 28 * 2^31 = 80622849 * 2^31 -.word 1837110825 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 28 * 1521161857 * 2^31 -.word 36569909 // zeta^188 * 2^31 = 120423310^188 * 2^31 = 8599005 * 2^31 -.word 740480075 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 120423310^188 * 1521161857 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_128919937_120423310_incomplete_good_scale -ntt_192_u32_128919937_120423310_incomplete_good_scale: // Constants for scaling by 1/N -.word 151081339 // 1/48 -.word 2922143493 // 1/48 twisted -.data -roots: -.word 126696089 /// zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31 -.word 2223847 /// zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 
2^31 = 120423310^144 * 1521161857 * 2^31 -.word 115409175 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 38212281 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 115409175 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 81022273 // zeta^132 * 2^31 = 120423310^132 * 2^31 = 81022273 * 2^31 -.word 1349628385 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 120423310^132 * 1521161857 * 2^31 -.word 45318275 // zeta^ 84 * 2^31 = 120423310^ 84 * 2^31 = 45318275 * 2^31 -.word 754889095 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 84 * 1521161857 * 2^31 -.word 38212281 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 74458359 // zeta^ 12 * 2^31 = 120423310^ 12 * 2^31 = 74458359 * 2^31 -.word 1240289998 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 12 * 1521161857 * 2^31 -.word 56896093 // zeta^156 * 2^31 = 120423310^156 * 2^31 = 56896093 * 2^31 -.word 947746580 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 120423310^156 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_128919937_120423310_incomplete_good, %function -.global ntt_192_u32_128919937_120423310_incomplete_good -ntt_192_u32_128919937_120423310_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -128919937 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: 
Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, 
Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release 
input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] 
-vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, 
Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, 
[r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// 
Release input[172] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already 
loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 
-vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vmul.u32 Q6, Q4, 
r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 
Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// 
Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 
-vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 2773805439 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_bitrev.s deleted file mode 100644 index d5cce74..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, 
merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 2223847 /// zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31 -.word 126696089 /// zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 14136207 // zeta^ 48 * 2^31 = 120423310^ 48 * 2^31 = 14136207 * 2^31 -.word 235473846 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 48 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 
2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 14136207 // zeta^ 48 * 2^31 = 120423310^ 48 * 2^31 = 14136207 * 2^31 -.word 235473846 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 48 * 1521161857 * 2^31 -.word 14136207 // zeta^ 48 * 2^31 = 120423310^ 48 * 2^31 = 14136207 * 2^31 -.word 235473846 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 48 * 1521161857 * 2^31 -.word 90707656 // zeta^120 * 2^31 = 120423310^120 * 2^31 = 90707656 * 2^31 -.word 1510962637 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 120423310^120 * 1521161857 * 2^31 -.word 13510762 // zeta^168 * 2^31 = 120423310^168 * 2^31 = 13510762 * 2^31 -.word 225055497 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 120423310^168 * 1521161857 * 2^31 -.word 90707656 // zeta^120 * 2^31 = 120423310^120 * 2^31 = 90707656 * 2^31 -.word 1510962637 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 120423310^120 * 1521161857 * 2^31 -.word 72023844 // zeta^ 60 * 2^31 = 120423310^ 60 * 2^31 = 72023844 * 2^31 -.word 1199737068 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 60 * 1521161857 * 2^31 -.word 54461578 // zeta^108 * 2^31 = 120423310^108 * 2^31 = 54461578 * 2^31 -.word 907193650 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 120423310^108 * 1521161857 * 2^31 -.word 13510762 // zeta^168 * 2^31 = 120423310^168 * 2^31 = 13510762 * 2^31 -.word 225055497 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 120423310^168 * 1521161857 * 2^31 -.word 83601662 // zeta^180 * 2^31 = 120423310^180 * 2^31 = 83601662 * 2^31 -.word 1392594553 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 120423310^180 * 1521161857 * 2^31 -.word 47897664 // zeta^ 36 * 2^31 = 120423310^ 36 * 2^31 = 47897664 * 2^31 -.word 797855263 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 36 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_128919937_120423310_incomplete_good_bitrev, %function -.global ntt_192_u32_128919937_120423310_incomplete_good_bitrev -ntt_192_u32_128919937_120423310_incomplete_good_bitrev: -// Save 
GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -128919937 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 
-// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, 
Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: 
Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 
-vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 
-vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 
* -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded 
as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load 
as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[144]: Load 
as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from 
Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vmul.u32 Q0, Q3, r10 
-vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 
Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 
68)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vmul.u32 Q5, Q2, r6 -vsub.s32 
Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 2773805439 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop.s b/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop.s deleted file mode 100644 index aee14b9..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop.s +++ /dev/null @@ -1,1395 +0,0 @@ - 
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -.global ntt_192_u32_128919937_120423310_incomplete_good_oop_twiddles -ntt_192_u32_128919937_120423310_incomplete_good_oop_twiddles: // For base multiplication -.word 151081339 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 2922143493 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 204918473 // zeta^160 * 2^31 = 120423310^160 * 2^31 = 2223848 * 2^31 -.word 1482262199 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 120423310^160 * 1521161857 * 2^31 -.word 64661425 // zeta^ 80 * 2^31 = 120423310^ 80 * 2^31 = 107269247 * 2^31 -.word 120408527 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 80 * 1521161857 * 2^31 -.word 11853537 // zeta^ 48 * 2^31 = 120423310^ 48 * 2^31 = 14136207 * 2^31 -.word 526998175 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 48 * 1521161857 * 2^31 -.word 210375303 // zeta^136 * 2^31 = 120423310^136 * 2^31 = 58374830 * 2^31 -.word 4205318137 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 120423310^136 * 1521161857 * 2^31 -.word 109569101 // zeta^104 * 2^31 = 120423310^104 * 2^31 = 44864068 * 2^31 -.word 1987598131 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 120423310^104 * 1521161857 * 2^31 -.word 170126629 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 1674400347 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 225931657 // zeta^184 * 2^31 = 120423310^184 * 2^31 = 83604053 * 2^31 -.word 4069196791 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 120423310^184 * 1521161857 * 2^31 -.word 255490897 // zeta^ 68 * 2^31 = 120423310^ 68 * 2^31 = 56394291 * 2^31 -.word 1943266863 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 68 * 1521161857 * 2^31 -.word 196963681 // zeta^ 36 * 2^31 = 120423310^ 36 * 2^31 = 47897664 * 2^31 -.word 2757397535 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 36 * 1521161857 * 2^31 -.word 249336559 // zeta^148 * 2^31 = 120423310^148 * 2^31 = 13888621 * 2^31 -.word 1653894033 // zeta^148 * f(q^(-1) mod 2^32) * 
2^31 = 120423310^148 * 1521161857 * 2^31 -.word 25222977 // zeta^116 * 2^31 = 120423310^116 * 2^31 = 59206896 * 2^31 -.word 812968511 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 120423310^116 * 1521161857 * 2^31 -.word 45743345 // zeta^ 12 * 2^31 = 120423310^ 12 * 2^31 = 74458359 * 2^31 -.word 4119843983 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 12 * 1521161857 * 2^31 -.word 36202655 // zeta^172 * 2^31 = 120423310^172 * 2^31 = 79182254 * 2^31 -.word 2825952737 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 120423310^172 * 1521161857 * 2^31 -.word 221269965 // zeta^ 92 * 2^31 = 120423310^ 92 * 2^31 = 120320932 * 2^31 -.word 3554487219 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 92 * 1521161857 * 2^31 -.word 268193 // zeta^ 60 * 2^31 = 120423310^ 60 * 2^31 = 72023844 * 2^31 -.word 1096630751 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 60 * 1521161857 * 2^31 -.word 52921401 // zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31 -.word 2812705095 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31 -.word 75082803 // zeta^ 32 * 2^31 = 120423310^ 32 * 2^31 = 126696090 * 2^31 -.word 1439881293 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 32 * 1521161857 * 2^31 -.word 245986337 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 3767969119 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 181727825 // zeta^112 * 2^31 = 120423310^112 * 2^31 = 93133040 * 2^31 -.word 3888377647 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 120423310^112 * 1521161857 * 2^31 -.word 148270773 // zeta^ 8 * 2^31 = 120423310^ 8 * 2^31 = 84055869 * 2^31 -.word 2307369163 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 8 * 1521161857 * 2^31 -.word 229726139 // zeta^168 * 2^31 = 120423310^168 * 2^31 = 13510762 * 2^31 -.word 2217720005 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 120423310^168 * 1521161857 * 2^31 -.word 31908217 // zeta^ 88 * 2^31 = 120423310^ 88 * 2^31 = 45315884 * 2^31 -.word 
225770503 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 88 * 1521161857 * 2^31 -.word 73114909 // zeta^ 56 * 2^31 = 120423310^ 56 * 2^31 = 83528165 * 2^31 -.word 1900170851 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 56 * 1521161857 * 2^31 -.word 60876193 // zeta^132 * 2^31 = 120423310^132 * 2^31 = 81022273 * 2^31 -.word 1537569759 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 120423310^132 * 1521161857 * 2^31 -.word 187447153 // zeta^100 * 2^31 = 120423310^100 * 2^31 = 8496627 * 2^31 -.word 3480836623 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 120423310^100 * 1521161857 * 2^31 -.word 232616897 // zeta^ 20 * 2^31 = 120423310^ 20 * 2^31 = 69713041 * 2^31 -.word 3481998783 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 20 * 1521161857 * 2^31 -.word 95193645 // zeta^180 * 2^31 = 120423310^180 * 2^31 = 83601662 * 2^31 -.word 840925523 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 120423310^180 * 1521161857 * 2^31 -.word 221637219 // zeta^ 76 * 2^31 = 120423310^ 76 * 2^31 = 49737683 * 2^31 -.word 1469014557 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 76 * 1521161857 * 2^31 -.word 138460627 // zeta^ 44 * 2^31 = 120423310^ 44 * 2^31 = 124196042 * 2^31 -.word 1293891245 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 44 * 1521161857 * 2^31 -.word 257571681 // zeta^156 * 2^31 = 120423310^156 * 2^31 = 56896093 * 2^31 -.word 3198336543 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 120423310^156 * 1521161857 * 2^31 -.word 92081835 // zeta^124 * 2^31 = 120423310^124 * 2^31 = 48297088 * 2^31 -.word 2457856469 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 120423310^124 * 1521161857 * 2^31 -.word 182757071 // zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31 -.word 2855086001 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31 -.word 106758535 // zeta^ 96 * 2^31 = 120423310^ 96 * 2^31 = 128919936 * 2^31 -.word 1372823801 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 96 * 1521161857 * 2^31 -.word 76112049 // zeta^ 16 * 2^31 = 
120423310^ 16 * 2^31 = 35786897 * 2^31 -.word 406589647 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 16 * 1521161857 * 2^31 -.word 193178449 // zeta^176 * 2^31 = 120423310^176 * 2^31 = 21650690 * 2^31 -.word 4174558767 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 120423310^176 * 1521161857 * 2^31 -.word 28113735 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 2077247289 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 47464571 // zeta^ 40 * 2^31 = 120423310^ 40 * 2^31 = 70545107 * 2^31 -.word 89649157 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 40 * 1521161857 * 2^31 -.word 184724965 // zeta^152 * 2^31 = 120423310^152 * 2^31 = 45391772 * 2^31 -.word 2394796443 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 120423310^152 * 1521161857 * 2^31 -.word 87713245 // zeta^120 * 2^31 = 120423310^120 * 2^31 = 90707656 * 2^31 -.word 2620566947 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 120423310^120 * 1521161857 * 2^31 -.word 70392721 // zeta^ 4 * 2^31 = 120423310^ 4 * 2^31 = 120423310 * 2^31 -.word 814130671 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 4 * 1521161857 * 2^31 -.word 2348977 // zeta^164 * 2^31 = 120423310^164 * 2^31 = 72525646 * 2^31 -.word 2351700431 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 120423310^164 * 1521161857 * 2^31 -.word 162646229 // zeta^ 84 * 2^31 = 120423310^ 84 * 2^31 = 45318275 * 2^31 -.word 3454041771 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 84 * 1521161857 * 2^31 -.word 8503315 // zeta^ 52 * 2^31 = 120423310^ 52 * 2^31 = 115031316 * 2^31 -.word 2641073261 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 52 * 1521161857 * 2^31 -.word 119379247 // zeta^140 * 2^31 = 120423310^140 * 2^31 = 4723895 * 2^31 -.word 3001076049 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 120423310^140 * 1521161857 * 2^31 -.word 212096529 // zeta^108 * 2^31 = 120423310^108 * 2^31 = 54461578 * 2^31 -.word 175123311 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 120423310^108 * 1521161857 * 2^31 
-.word 165758039 // zeta^ 28 * 2^31 = 120423310^ 28 * 2^31 = 80622849 * 2^31
-.word 1837110825 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 28 * 1521161857 * 2^31
-.word 36569909 // zeta^188 * 2^31 = 120423310^188 * 2^31 = 8599005 * 2^31
-.word 740480075 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 120423310^188 * 1521161857 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_192_u32_128919937_120423310_incomplete_good_oop_scale
-ntt_192_u32_128919937_120423310_incomplete_good_oop_scale: // Constants for scaling by 1/N
-.word 151081339 // 1/48
-.word 2922143493 // 1/48 twisted
-.data
-roots:
-.word 126696089 /// zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31
-.word 2110439903 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31
-.word 2223847 /// zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31
-.word 37043728 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31
-.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31
-.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31
-.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31
-.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31
-.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31
-.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31
-.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31
-.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31
-.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31
-.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31
-.word 115409175 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31
-.word 1922428151 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31
-.word 38212281 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31
-.word 636521011 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31
-.word 115409175 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31
-.word 1922428151 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31
-.word 81022273 // zeta^132 * 2^31 = 120423310^132 * 2^31 = 81022273 * 2^31
-.word 1349628385 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 120423310^132 * 1521161857 * 2^31
-.word 45318275 // zeta^ 84 * 2^31 = 120423310^ 84 * 2^31 = 45318275 * 2^31
-.word 754889095 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 84 * 1521161857 * 2^31
-.word 38212281 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31
-.word 636521011 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31
-.word 74458359 // zeta^ 12 * 2^31 = 120423310^ 12 * 2^31 = 74458359 * 2^31
-.word 1240289998 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 12 * 1521161857 * 2^31
-.word 56896093 // zeta^156 * 2^31 = 120423310^156 * 2^31 = 56896093 * 2^31
-.word 947746580 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 120423310^156 * 1521161857 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_128919937_120423310_incomplete_good_oop, %function
-.global ntt_192_u32_128919937_120423310_incomplete_good_oop
-ntt_192_u32_128919937_120423310_incomplete_good_oop:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 256
-add r14, r0, #256
-// Use r12 as marker for r0 + 512
-add r12, r14, #256
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-.equ modulus, -128919937
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r8
-vadd.s32 Q5, Q0, Q1
-// Release input[64] from Q0
-vqrdmulh.s32 Q4, Q2, r7
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r10
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r1,#(256)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r11,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r1,#(0)]
-// input[4]: Already loaded as Q1
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vqrdmulh.s32 Q3, Q0, r7
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vmla.s32 Q2, Q3, r10
-vsub.s32 Q3, Q1, Q7
-// Release input[68] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 72)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r1,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r1,#(272)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, [r11,#(-480)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r1,#(288)]
-// input[76]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r1,#(48)]
-// input[16]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 84)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r11,#(-432)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r1,#(336)]
-// input[88]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r1,#(96)]
-// input[28]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r11,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r1,#(384)]
-// input[100]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r1,#(144)]
-// input[40]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r11,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r1,#(432)]
-// input[112]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r1,#(192)]
-// input[52]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 116)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r11,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r1,#(480)]
-// input[124]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r1,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r1,#(240)]
-//////////// END OF RADIX 3 //////////////////////////
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[144]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-// output[48]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 48)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r4
-// output[96]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 96)]
-vadd.s32 Q0, Q0, Q1
-// Release output[48] from Q1
-// output[0]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// output[180]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -72)]
-vadd.s32 Q1, Q1, Q4
-// Release output[96] from Q4
-vqrdmulh.s32 Q2, Q2, r3
-vsub.s32 Q4, Q1, Q0
-// output[84]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 84)]
-vmla.s32 Q3, Q2, r10
-vstrw.u32 Q4, [r11,#(-432)]
-vadd.s32 Q1, Q1, Q0
-// Release output[144] from Q0
-vstrw.u32 Q1, [r1,#(0)]
-// Release output[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r1,#(384)]
-// output[84]: Already loaded as Q7
-// output[180]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r4
-// output[36]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 36)]
-vadd.s32 Q7, Q7, Q6
-// Release output[180] from Q6
-// output[132]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// output[120]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[36] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[24]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 24)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(336)]
-vadd.s32 Q3, Q3, Q7
-// Release output[84] from Q7
-vstrw.u32 Q3, [r11,#(-480)]
-// Release output[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-288)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(144)]
-// output[24]: Already loaded as Q6
-// output[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[168]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -84)]
-vadd.s32 Q6, Q6, Q5
-// Release output[120] from Q5
-// output[72]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 72)]
-vsub.s32 Q4, Q3, Q2
-// output[60]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release output[168] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[156]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -96)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release output[24] from Q6
-vstrw.u32 Q3, [r1,#(288)]
-// Release output[72] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-336)]
-// output[156]: Already loaded as Q7
-// output[60]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[108]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 108)]
-vadd.s32 Q7, Q7, Q5
-// Release output[60] from Q5
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[108] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[16]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 16)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-384)]
-vadd.s32 Q3, Q3, Q7
-// Release output[156] from Q7
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(432)]
-// output[16]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[64]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// output[52]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 52)]
-vadd.s32 Q3, Q3, Q2
-// Release output[160] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[148]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -104)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(64)]
-vadd.s32 Q3, Q3, Q6
-// Release output[16] from Q6
-vstrw.u32 Q3, [r1,#(256)]
-// Release output[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-368)]
-// output[148]: Already loaded as Q7
-// output[52]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[100]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release output[52] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[184]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -68)]
-vadd.s32 Q3, Q3, Q2
-// Release output[100] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[88]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 88)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-416)]
-vadd.s32 Q3, Q3, Q7
-// Release output[148] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(400)]
-// output[88]: Already loaded as Q6
-// output[184]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[40]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release output[184] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[40] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[28]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 28)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release output[88] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-272)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(160)]
-// output[28]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[172]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[76]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 76)]
-vsub.s32 Q4, Q3, Q2
-// output[176]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -76)]
-vadd.s32 Q3, Q3, Q2
-// Release output[172] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[80]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 80)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(112)]
-vadd.s32 Q3, Q3, Q7
-// Release output[28] from Q7
-vstrw.u32 Q3, [r1,#(304)]
-// Release output[76] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-320)]
-// output[80]: Already loaded as Q6
-// output[176]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[32]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 32)]
-vadd.s32 Q6, Q6, Q5
-// Release output[176] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[32] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[20]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 20)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release output[80] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(128)]
-// output[20]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[68]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// output[56]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 56)]
-vadd.s32 Q3, Q3, Q2
-// Release output[164] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[152]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -100)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(80)]
-vadd.s32 Q3, Q3, Q7
-// Release output[20] from Q7
-vstrw.u32 Q3, [r1,#(272)]
-// Release output[68] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-352)]
-// output[152]: Already loaded as Q6
-// output[56]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-// output[104]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 104)]
-vadd.s32 Q6, Q6, Q5
-// Release output[56] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[188]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release output[104] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// output[92]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 92)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release output[152] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(416)]
-// output[92]: Already loaded as Q7
-// output[188]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r4
-// output[44]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 44)]
-vadd.s32 Q7, Q7, Q5
-// Release output[188] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[12]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 12)]
-vadd.s32 Q3, Q3, Q2
-// Release output[44] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[132]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -120)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release output[92] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(176)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[12]: Already loaded as Q5
-vmul.u32 Q0, Q5, r8
-// output[72]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 72)]
-vqrdmulh.s32 Q5, Q5, r7
-// output[132]: Already loaded as Q6
-vmla.s32 Q0, Q5, r10
-vmul.u32 Q2, Q1, r8
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r7
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r10
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r4
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r3
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r10
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vmul.u32 Q4, Q6, r6
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r5
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(48)]
-// Release output[12] from Q5
-vmla.s32 Q4, Q6, r10
-vstrw.u32 Q1, [r1,#(288)]
-// Release output[72] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(-480)]
-// Release output[132] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[76]: Already loaded as Q2
-vmul.u32 Q1, Q2, r8
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vqrdmulh.s32 Q2, Q2, r7
-// output[4]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 4)]
-vmla.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r10
-// output[64]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 64)]
-vmul.u32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r10
-// output[140]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -112)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(304)]
-// Release output[76] from Q2
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(16)]
-// Release output[4] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[140]: Already loaded as Q0
-vmul.u32 Q2, Q0, r8
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vqrdmulh.s32 Q0, Q0, r7
-// output[68]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 68)]
-vmla.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r1,#(256)]
-// Release output[64] from Q1
-vmul.u32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r10
-// output[128]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vmul.u32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r10
-// output[156]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -96)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-448)]
-// Release output[140] from Q0
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(272)]
-// Release output[68] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[156]: Already loaded as Q1
-vmul.u32 Q0, Q1, r8
-// output[24]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r7
-// output[84]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 84)]
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r11,#(-496)]
-// Release output[128] from Q2
-vmul.u32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r10
-// output[144]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-vmul.u32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r10
-// output[28]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-384)]
-// Release output[156] from Q1
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(96)]
-// Release output[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(336)]
-// Release output[84] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[28]: Already loaded as Q2
-vmul.u32 Q1, Q2, r8
-// output[88]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 88)]
-vqrdmulh.s32 Q2, Q2, r7
-// output[148]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vmla.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r11,#(-432)]
-// Release output[144] from Q0
-vmul.u32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r10
-// output[16]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 16)]
-vmul.u32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r10
-// output[92]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 92)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(112)]
-// Release output[28] from Q2
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(352)]
-// Release output[88] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-416)]
-// Release output[148] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[92]: Already loaded as Q0
-vmul.u32 Q2, Q0, r8
-// output[152]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vqrdmulh.s32 Q0, Q0, r7
-// output[20]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 20)]
-vmla.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r1,#(64)]
-// Release output[16] from Q1
-vmul.u32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r10
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vmul.u32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r10
-// output[108]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 108)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(368)]
-// Release output[92] from Q0
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-400)]
-// Release output[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(80)]
-// Release output[20] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[108]: Already loaded as Q1
-vmul.u32 Q0, Q1, r8
-// output[168]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vqrdmulh.s32 Q1, Q1, r7
-// output[36]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 36)]
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(320)]
-// Release output[80] from Q2
-vmul.u32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r10
-// output[96]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 96)]
-vmul.u32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r10
-// output[172]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -80)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(432)]
-// Release output[108] from Q1
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-336)]
-// Release output[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(144)]
-// Release output[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[172]: Already loaded as Q2
-vmul.u32 Q1, Q2, r8
-// output[40]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 40)]
-vqrdmulh.s32 Q2, Q2, r7
-// output[100]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 100)]
-vmla.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r1,#(384)]
-// Release output[96] from Q0
-vmul.u32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r10
-// output[160]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -92)]
-vmul.u32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r10
-// output[44]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 44)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-320)]
-// Release output[172] from Q2
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(160)]
-// Release output[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(400)]
-// Release output[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[44]: Already loaded as Q0
-vmul.u32 Q2, Q0, r8
-// output[104]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 104)]
-vqrdmulh.s32 Q0, Q0, r7
-// output[164]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vmla.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r11,#(-368)]
-// Release output[160] from Q1
-vmul.u32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r10
-// output[32]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 32)]
-vmul.u32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r10
-// output[60]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(176)]
-// Release output[44] from Q0
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(416)]
-// Release output[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release output[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[60]: Already loaded as Q1
-vmul.u32 Q0, Q1, r8
-// output[120]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 120)]
-vqrdmulh.s32 Q1, Q1, r7
-// output[180]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(128)]
-// Release output[32] from Q2
-vmul.u32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r10
-// output[48]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 48)]
-vmul.u32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r10
-// output[124]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 124)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(240)]
-// Release output[60] from Q1
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r1,#(480)]
-// Release output[120] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-288)]
-// Release output[180] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[124]: Already loaded as Q2
-vmul.u32 Q1, Q2, r8
-// output[184]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -68)]
-vqrdmulh.s32 Q2, Q2, r7
-// output[52]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 52)]
-vmla.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r1,#(192)]
-// Release output[48] from Q0
-vmul.u32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r10
-// output[112]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 112)]
-vmul.u32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r10
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(496)]
-// Release output[124] from Q2
-vmla.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-272)]
-// Release output[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(208)]
-// Release output[52] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r8
-// output[56]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r7
-// output[116]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 116)]
-vmla.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r1,#(448)]
-// Release output[112] from Q1
-vmul.u32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r10
-// output[176]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vmul.u32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r10
-vmul.u32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q1, Q4, r10
-vstrw.u32 Q3, [r1,#(224)]
-// Release output[56] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r1,#(464)]
-// Release output[116] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-304)]
-// Release output[176] from Q2
-.equ modulus_inv, 2773805439
-movw r14, #:lower16:modulus_inv
-movt r14, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1362
-// Instruction count: 1000
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input.s b/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input.s
deleted file mode 100644
index 0a25b20..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,1237 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -.global ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input_twiddles -ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input_twiddles: // For base multiplication -.word 151081339 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 2922143493 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 204918473 // zeta^160 * 2^31 = 120423310^160 * 2^31 = 2223848 * 2^31 -.word 1482262199 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 120423310^160 * 1521161857 * 2^31 -.word 64661425 // zeta^ 80 * 2^31 = 120423310^ 80 * 2^31 = 107269247 * 2^31 -.word 120408527 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 80 * 1521161857 * 2^31 -.word 11853537 // zeta^ 48 * 2^31 = 120423310^ 48 * 2^31 = 14136207 * 2^31 -.word 526998175 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 48 * 1521161857 * 2^31 -.word 210375303 // zeta^136 * 2^31 = 120423310^136 * 2^31 = 58374830 * 2^31 -.word 4205318137 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 120423310^136 * 1521161857 * 2^31 -.word 109569101 // zeta^104 * 2^31 = 120423310^104 * 2^31 = 44864068 * 2^31 -.word 1987598131 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 120423310^104 * 1521161857 * 2^31 -.word 170126629 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 1674400347 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 225931657 // zeta^184 * 2^31 = 120423310^184 * 2^31 = 83604053 * 2^31 -.word 4069196791 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 120423310^184 * 1521161857 * 2^31 -.word 255490897 // zeta^ 68 * 2^31 = 120423310^ 68 * 2^31 = 56394291 * 2^31 -.word 1943266863 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 68 * 1521161857 * 2^31 -.word 196963681 // zeta^ 36 * 2^31 = 120423310^ 36 * 2^31 = 47897664 * 2^31 -.word 2757397535 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 36 * 1521161857 * 2^31 -.word 249336559 // zeta^148 * 2^31 = 120423310^148 * 2^31 = 13888621 * 2^31 -.word 1653894033 // zeta^148 * 
f(q^(-1) mod 2^32) * 2^31 = 120423310^148 * 1521161857 * 2^31 -.word 25222977 // zeta^116 * 2^31 = 120423310^116 * 2^31 = 59206896 * 2^31 -.word 812968511 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 120423310^116 * 1521161857 * 2^31 -.word 45743345 // zeta^ 12 * 2^31 = 120423310^ 12 * 2^31 = 74458359 * 2^31 -.word 4119843983 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 12 * 1521161857 * 2^31 -.word 36202655 // zeta^172 * 2^31 = 120423310^172 * 2^31 = 79182254 * 2^31 -.word 2825952737 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 120423310^172 * 1521161857 * 2^31 -.word 221269965 // zeta^ 92 * 2^31 = 120423310^ 92 * 2^31 = 120320932 * 2^31 -.word 3554487219 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 92 * 1521161857 * 2^31 -.word 268193 // zeta^ 60 * 2^31 = 120423310^ 60 * 2^31 = 72023844 * 2^31 -.word 1096630751 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 60 * 1521161857 * 2^31 -.word 52921401 // zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31 -.word 2812705095 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31 -.word 75082803 // zeta^ 32 * 2^31 = 120423310^ 32 * 2^31 = 126696090 * 2^31 -.word 1439881293 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 32 * 1521161857 * 2^31 -.word 245986337 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 3767969119 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 181727825 // zeta^112 * 2^31 = 120423310^112 * 2^31 = 93133040 * 2^31 -.word 3888377647 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 120423310^112 * 1521161857 * 2^31 -.word 148270773 // zeta^ 8 * 2^31 = 120423310^ 8 * 2^31 = 84055869 * 2^31 -.word 2307369163 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 8 * 1521161857 * 2^31 -.word 229726139 // zeta^168 * 2^31 = 120423310^168 * 2^31 = 13510762 * 2^31 -.word 2217720005 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 120423310^168 * 1521161857 * 2^31 -.word 31908217 // zeta^ 88 * 2^31 = 120423310^ 88 * 2^31 = 
45315884 * 2^31 -.word 225770503 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 88 * 1521161857 * 2^31 -.word 73114909 // zeta^ 56 * 2^31 = 120423310^ 56 * 2^31 = 83528165 * 2^31 -.word 1900170851 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 56 * 1521161857 * 2^31 -.word 60876193 // zeta^132 * 2^31 = 120423310^132 * 2^31 = 81022273 * 2^31 -.word 1537569759 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 120423310^132 * 1521161857 * 2^31 -.word 187447153 // zeta^100 * 2^31 = 120423310^100 * 2^31 = 8496627 * 2^31 -.word 3480836623 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 120423310^100 * 1521161857 * 2^31 -.word 232616897 // zeta^ 20 * 2^31 = 120423310^ 20 * 2^31 = 69713041 * 2^31 -.word 3481998783 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 20 * 1521161857 * 2^31 -.word 95193645 // zeta^180 * 2^31 = 120423310^180 * 2^31 = 83601662 * 2^31 -.word 840925523 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 120423310^180 * 1521161857 * 2^31 -.word 221637219 // zeta^ 76 * 2^31 = 120423310^ 76 * 2^31 = 49737683 * 2^31 -.word 1469014557 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 76 * 1521161857 * 2^31 -.word 138460627 // zeta^ 44 * 2^31 = 120423310^ 44 * 2^31 = 124196042 * 2^31 -.word 1293891245 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 44 * 1521161857 * 2^31 -.word 257571681 // zeta^156 * 2^31 = 120423310^156 * 2^31 = 56896093 * 2^31 -.word 3198336543 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 120423310^156 * 1521161857 * 2^31 -.word 92081835 // zeta^124 * 2^31 = 120423310^124 * 2^31 = 48297088 * 2^31 -.word 2457856469 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 120423310^124 * 1521161857 * 2^31 -.word 182757071 // zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31 -.word 2855086001 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31 -.word 106758535 // zeta^ 96 * 2^31 = 120423310^ 96 * 2^31 = 128919936 * 2^31 -.word 1372823801 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 96 * 1521161857 * 2^31 -.word 
76112049 // zeta^ 16 * 2^31 = 120423310^ 16 * 2^31 = 35786897 * 2^31 -.word 406589647 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 16 * 1521161857 * 2^31 -.word 193178449 // zeta^176 * 2^31 = 120423310^176 * 2^31 = 21650690 * 2^31 -.word 4174558767 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 120423310^176 * 1521161857 * 2^31 -.word 28113735 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 2077247289 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 47464571 // zeta^ 40 * 2^31 = 120423310^ 40 * 2^31 = 70545107 * 2^31 -.word 89649157 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 40 * 1521161857 * 2^31 -.word 184724965 // zeta^152 * 2^31 = 120423310^152 * 2^31 = 45391772 * 2^31 -.word 2394796443 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 120423310^152 * 1521161857 * 2^31 -.word 87713245 // zeta^120 * 2^31 = 120423310^120 * 2^31 = 90707656 * 2^31 -.word 2620566947 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 120423310^120 * 1521161857 * 2^31 -.word 70392721 // zeta^ 4 * 2^31 = 120423310^ 4 * 2^31 = 120423310 * 2^31 -.word 814130671 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 4 * 1521161857 * 2^31 -.word 2348977 // zeta^164 * 2^31 = 120423310^164 * 2^31 = 72525646 * 2^31 -.word 2351700431 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 120423310^164 * 1521161857 * 2^31 -.word 162646229 // zeta^ 84 * 2^31 = 120423310^ 84 * 2^31 = 45318275 * 2^31 -.word 3454041771 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 84 * 1521161857 * 2^31 -.word 8503315 // zeta^ 52 * 2^31 = 120423310^ 52 * 2^31 = 115031316 * 2^31 -.word 2641073261 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 52 * 1521161857 * 2^31 -.word 119379247 // zeta^140 * 2^31 = 120423310^140 * 2^31 = 4723895 * 2^31 -.word 3001076049 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 120423310^140 * 1521161857 * 2^31 -.word 212096529 // zeta^108 * 2^31 = 120423310^108 * 2^31 = 54461578 * 2^31 -.word 175123311 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 
120423310^108 * 1521161857 * 2^31 -.word 165758039 // zeta^ 28 * 2^31 = 120423310^ 28 * 2^31 = 80622849 * 2^31 -.word 1837110825 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 28 * 1521161857 * 2^31 -.word 36569909 // zeta^188 * 2^31 = 120423310^188 * 2^31 = 8599005 * 2^31 -.word 740480075 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 120423310^188 * 1521161857 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input_scale -ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 151081339 // 1/48 -.word 2922143493 // 1/48 twisted -.data -roots: -.word 126696089 /// zeta^ 64 * 2^31 = 120423310^ 64 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 64 * 1521161857 * 2^31 -.word 2223847 /// zeta^128 * 2^31 = 120423310^128 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 120423310^128 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 120423310^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 114783730 // zeta^144 * 2^31 = 120423310^144 * 2^31 = 
114783730 * 2^31 -.word 1912009802 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 120423310^144 * 1521161857 * 2^31 -.word 115409175 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 38212281 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 115409175 // zeta^ 72 * 2^31 = 120423310^ 72 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 72 * 1521161857 * 2^31 -.word 81022273 // zeta^132 * 2^31 = 120423310^132 * 2^31 = 81022273 * 2^31 -.word 1349628385 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 120423310^132 * 1521161857 * 2^31 -.word 45318275 // zeta^ 84 * 2^31 = 120423310^ 84 * 2^31 = 45318275 * 2^31 -.word 754889095 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 84 * 1521161857 * 2^31 -.word 38212281 // zeta^ 24 * 2^31 = 120423310^ 24 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 24 * 1521161857 * 2^31 -.word 74458359 // zeta^ 12 * 2^31 = 120423310^ 12 * 2^31 = 74458359 * 2^31 -.word 1240289998 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 120423310^ 12 * 1521161857 * 2^31 -.word 56896093 // zeta^156 * 2^31 = 120423310^156 * 2^31 = 56896093 * 2^31 -.word 947746580 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 120423310^156 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input, %function -.global ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input -ntt_192_u32_128919937_120423310_incomplete_good_oop_half_input: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 256 -add r14, r0, #256 -// Use r12 as marker for r0 + 512 -add r12, r14, #256 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -.equ 
modulus, -128919937 -movw r10, #:lower16:modulus -movt r10, #:upper16:modulus -ldr r9, roots_addr -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r8 -vadd.s32 Q4, Q1, Q0 -// input[4]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r7 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r10 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r1,#(256)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(-496)] -// Release input[0] from Q1 -// Release input[64] from Q0 -// input[4]: Already loaded as Q6 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -vmul.u32 Q1, Q0, r8 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vadd.s32 Q2, Q6, Q7 -vqrdmulh.s32 Q0, Q0, r7 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q6 -// Release input[4] from Q6 -vstrw.u32 Q2, [r11,#(-480)] -vsub.s32 Q5, Q1, Q7 -// Release input[68] from Q7 -vstrw.u32 Q5, [r1,#(16)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(272)] -// input[8]: Already loaded as Q4 -// input[72]: Already loaded as Q3 -vmul.u32 Q0, Q4, r8 -vadd.s32 Q2, Q3, Q4 -// input[12]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 12)] -vqrdmulh.s32 Q1, Q4, r7 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 76)] -vsub.s32 Q5, Q3, Q4 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(288)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r1,#(32)] -vsub.s32 Q5, Q5, Q0 -vstrw.u32 Q5, [r11,#(-464)] -// Release input[72] from Q3 -// Release input[8] from Q4 -// input[76]: Already loaded as Q7 -// input[12]: Already loaded as Q6 -vmul.u32 Q0, Q7, r8 -vadd.s32 Q2, Q6, Q7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmulh.s32 Q1, Q7, r7 -// input[80]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 80)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, 
[r1,#(304)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[12] from Q6 -// Release input[76] from Q7 -// input[16]: Already loaded as Q4 -// input[80]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r8 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r7 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q4 -// Release input[16] from Q4 -vstrw.u32 Q2, [r11,#(-432)] -vsub.s32 Q4, Q1, Q5 -// Release input[80] from Q5 -vstrw.u32 Q4, [r1,#(64)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(320)] -// input[20]: Already loaded as Q6 -// input[84]: Already loaded as Q3 -vmul.u32 Q0, Q6, r8 -vadd.s32 Q2, Q3, Q6 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q6, r7 -// input[88]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 88)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r1,#(80)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[84] from Q3 -// Release input[20] from Q6 -// input[88]: Already loaded as Q7 -// input[24]: Already loaded as Q5 -vmul.u32 Q0, Q7, r8 -vadd.s32 Q2, Q5, Q7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmulh.s32 Q1, Q7, r7 -// input[92]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 92)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[24] from Q5 -// Release input[88] from Q7 -// input[28]: Already loaded as Q4 -// input[92]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r7 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q4 -// Release input[28] from Q4 -vstrw.u32 Q2, [r11,#(-384)] -vsub.s32 Q4, Q1, Q6 -// Release input[92] from Q6 -vstrw.u32 Q4, [r1,#(112)] -vsub.s32 Q0, Q0, Q1 
-vstrw.u32 Q0, [r1,#(368)] -// input[32]: Already loaded as Q3 -vmul.u32 Q0, Q3, r8 -vneg.s32 Q1, Q3 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmulh.s32 Q2, Q3, r7 -vstrw.u32 Q3, [r1,#(384)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(128)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-368)] -// Release input[32] from Q3 -// input[36]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(144)] -vstrw.u32 Q4, [r1,#(400)] -vstrw.u32 Q4, [r11,#(-352)] -// Release input[36] from Q4 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vmul.u32 Q1, Q0, r8 -vneg.s32 Q2, Q0 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q0, r7 -vstrw.u32 Q0, [r11,#(-336)] -vmla.s32 Q1, Q3, r10 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r1,#(416)] -// Release input[40] from Q0 -// input[44]: Already loaded as Q4 -vmul.u32 Q0, Q4, r8 -vneg.s32 Q1, Q4 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q2, Q4, r7 -vstrw.u32 Q4, [r1,#(432)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(176)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-320)] -// Release input[44] from Q4 -// input[48]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(192)] -vstrw.u32 Q3, [r1,#(448)] -vstrw.u32 Q3, [r11,#(-304)] -// Release input[48] from Q3 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q1, Q0, r8 -vneg.s32 Q2, Q0 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q0, r7 -vstrw.u32 Q0, [r11,#(-288)] -vmla.s32 Q1, Q3, r10 -vstrw.u32 Q1, [r1,#(208)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r1,#(464)] -// Release input[52] from Q0 -// input[56]: Already loaded as Q4 -vmul.u32 Q0, Q4, r8 -vneg.s32 Q1, Q4 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q2, Q4, r7 -vstrw.u32 Q4, [r1,#(480)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(224)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-272)] -// Release input[56] from Q4 -// input[60]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(240)] 
-vstrw.u32 Q3, [r1,#(496)] -vstrw.u32 Q3, [r11,#(-256)] -// Release input[60] from Q3 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -// output[48]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r4 -// output[96]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release output[48] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[180]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release output[96] from Q4 -vqrdmulh.s32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// output[84]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 84)] -vmla.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r11,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release output[144] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r1,#(384)] -// output[84]: Already loaded as Q7 -// output[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r4 -// output[36]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release output[180] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[36] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[24]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 24)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release output[84] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(144)] -// output[24]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[168]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] 
-vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[72]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// output[60]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release output[168] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[156]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release output[24] from Q6 -vstrw.u32 Q3, [r1,#(288)] -// Release output[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-336)] -// output[156]: Already loaded as Q7 -// output[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[108]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release output[60] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[108] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[16]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 16)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release output[156] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(432)] -// output[16]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[64]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// output[52]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release output[160] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[148]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release 
output[16] from Q6 -vstrw.u32 Q3, [r1,#(256)] -// Release output[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-368)] -// output[148]: Already loaded as Q7 -// output[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[100]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release output[52] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[184]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release output[100] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[88]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 88)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release output[148] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(400)] -// output[88]: Already loaded as Q6 -// output[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release output[184] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[40] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[28]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 28)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release output[88] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(160)] -// output[28]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[76]: Load as Q3 
-vldrw.u32 Q3, [r1, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// output[176]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release output[172] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[80]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 80)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release output[28] from Q7 -vstrw.u32 Q3, [r1,#(304)] -// Release output[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// output[80]: Already loaded as Q6 -// output[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release output[176] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[32] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[20]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release output[80] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(128)] -// output[20]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[68]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// output[56]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release output[164] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[152]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -100)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release output[20] from Q7 -vstrw.u32 Q3, [r1,#(272)] -// Release output[68] from 
Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-352)] -// output[152]: Already loaded as Q6 -// output[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[104]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release output[56] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[188]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release output[104] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[92]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 92)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release output[152] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(416)] -// output[92]: Already loaded as Q7 -// output[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[44]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release output[188] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[12]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release output[44] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[132]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -120)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release output[92] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(176)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r8 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r7 -// output[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r10 -vmul.u32 Q2, Q1, r8 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 
Q1, Q1, r7 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r10 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r4 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r10 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vmul.u32 Q4, Q6, r6 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r5 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(48)] -// Release output[12] from Q5 -vmla.s32 Q4, Q6, r10 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(-480)] -// Release output[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r7 -// output[4]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 4)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[140]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(304)] -// Release output[76] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(16)] -// Release output[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r7 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 
-124)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[156]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -96)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-448)] -// Release output[140] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r7 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[28]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-384)] -// Release output[156] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[88]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r7 -// output[148]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-432)] -// Release output[144] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vmul.u32 Q5, Q2, 
r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 
-vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[60]: Load as 
Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, 
#(4 * -64)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(496)] -// Release output[124] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release output[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(208)] -// Release output[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[56]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r7 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -vmul.u32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r1,#(224)] -// Release output[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -.equ modulus_inv, 2773805439 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1204 -// Instruction count: 900 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good.s deleted file mode 100644 index 2f9ae0e..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to 
any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_192_u32_33556993_27792935_incomplete_good_twiddles -ntt_192_u32_33556993_27792935_incomplete_good_twiddles: // For base multiplication -.word 56716939 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2862874485 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 25646259 // zeta^160 * 2^31 = 27792935^160 * 2^31 = 25038562 * 2^31 -.word 3487279437 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 27792935^160 * 375649793 * 2^31 -.word 17110297 // zeta^ 80 * 2^31 = 27792935^ 80 * 2^31 = 2013241 * 2^31 -.word 3456754919 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 80 * 375649793 * 2^31 -.word 35519885 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 11895923 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 43957141 // zeta^136 * 2^31 = 27792935^136 * 2^31 = 29356361 * 2^31 -.word 478221931 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 27792935^136 * 375649793 * 2^31 -.word 35166687 // zeta^104 * 2^31 = 27792935^104 * 2^31 = 32616688 * 2^31 -.word 2932743201 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 27792935^104 * 375649793 * 2^31 -.word 17906339 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 3247252317 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 2473265 // zeta^184 * 2^31 = 27792935^184 * 2^31 = 23624597 * 2^31 -.word 1545219279 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 27792935^184 * 375649793 * 2^31 -.word 62160579 // zeta^ 68 * 2^31 = 27792935^ 68 * 2^31 = 2711401 * 2^31 -.word 3272810301 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 68 * 375649793 * 2^31 -.word 21606357 // zeta^ 36 * 2^31 = 27792935^ 36 * 2^31 = 26823146 * 2^31 -.word 3797983787 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 36 * 375649793 * 2^31 -.word 65117653 // zeta^148 * 2^31 = 27792935^148 * 2^31 = 21166324 * 2^31 -.word 3608458283 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 27792935^148 * 375649793 * 2^31 -.word 14684899 
// zeta^116 * 2^31 = 27792935^116 * 2^31 = 518908 * 2^31 -.word 3710962461 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 27792935^116 * 375649793 * 2^31 -.word 57316651 // zeta^ 12 * 2^31 = 27792935^ 12 * 2^31 = 14745691 * 2^31 -.word 798824661 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 12 * 375649793 * 2^31 -.word 2740735 // zeta^172 * 2^31 = 27792935^172 * 2^31 = 15739856 * 2^31 -.word 2960008193 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 27792935^172 * 375649793 * 2^31 -.word 1288749 // zeta^ 92 * 2^31 = 27792935^ 92 * 2^31 = 33153165 * 2^31 -.word 1828591571 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 92 * 375649793 * 2^31 -.word 50827131 // zeta^ 60 * 2^31 = 27792935^ 60 * 2^31 = 20044445 * 2^31 -.word 2860990085 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 60 * 375649793 * 2^31 -.word 41467727 // zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 807687857 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 64627673 // zeta^ 32 * 2^31 = 27792935^ 32 * 2^31 = 8518432 * 2^31 -.word 3670562343 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 32 * 375649793 * 2^31 -.word 31594101 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 4283071371 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 15147405 // zeta^112 * 2^31 = 27792935^112 * 2^31 = 19715532 * 2^31 -.word 3444858995 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 27792935^112 * 375649793 * 2^31 -.word 31947299 // zeta^ 8 * 2^31 = 27792935^ 8 * 2^31 = 940305 * 2^31 -.word 1362224093 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 8 * 375649793 * 2^31 -.word 42347447 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1840446025 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 64640721 // zeta^ 88 * 2^31 = 27792935^ 88 * 2^31 = 9932396 * 2^31 -.word 2749748015 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 88 * 375649793 * 2^31 -.word 48990067 // zeta^ 
56 * 2^31 = 27792935^ 56 * 2^31 = 24511972 * 2^31 -.word 1702033037 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 56 * 375649793 * 2^31 -.word 45507629 // zeta^132 * 2^31 = 27792935^132 * 2^31 = 6733847 * 2^31 -.word 496983507 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 27792935^132 * 375649793 * 2^31 -.word 6997229 // zeta^100 * 2^31 = 27792935^100 * 2^31 = 9445248 * 2^31 -.word 3769793811 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 27792935^100 * 375649793 * 2^31 -.word 52429087 // zeta^ 20 * 2^31 = 27792935^ 20 * 2^31 = 33038085 * 2^31 -.word 584004833 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 20 * 375649793 * 2^31 -.word 16875761 // zeta^180 * 2^31 = 27792935^180 * 2^31 = 20647416 * 2^31 -.word 4192463119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 27792935^180 * 375649793 * 2^31 -.word 64373251 // zeta^ 76 * 2^31 = 27792935^ 76 * 2^31 = 17817137 * 2^31 -.word 1334959101 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 76 * 375649793 * 2^31 -.word 21018923 // zeta^ 44 * 2^31 = 27792935^ 44 * 2^31 = 32562828 * 2^31 -.word 2133783765 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 44 * 375649793 * 2^31 -.word 16286855 // zeta^156 * 2^31 = 27792935^156 * 2^31 = 13512548 * 2^31 -.word 1433977209 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 27792935^156 * 375649793 * 2^31 -.word 51132597 // zeta^124 * 2^31 = 27792935^124 * 2^31 = 13108720 * 2^31 -.word 3262568779 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 27792935^124 * 375649793 * 2^31 -.word 2486313 // zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 624404951 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 10397047 // zeta^ 96 * 2^31 = 27792935^ 96 * 2^31 = 33556992 * 2^31 -.word 1432092809 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 96 * 375649793 * 2^31 -.word 51966581 // zeta^ 16 * 2^31 = 27792935^ 16 * 2^31 = 13841461 * 2^31 -.word 850108299 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 16 * 375649793 * 2^31 -.word 50003689 // zeta^176 
* 2^31 = 27792935^176 * 2^31 = 31543752 * 2^31 -.word 838212375 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 27792935^176 * 375649793 * 2^31 -.word 24766539 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2454521269 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 23156845 // zeta^ 40 * 2^31 = 27792935^ 40 * 2^31 = 4200632 * 2^31 -.word 3816745363 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 40 * 375649793 * 2^31 -.word 18123919 // zeta^152 * 2^31 = 27792935^152 * 2^31 = 9045021 * 2^31 -.word 2592934257 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 27792935^152 * 375649793 * 2^31 -.word 49207647 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 1047714977 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 60116757 // zeta^ 4 * 2^31 = 27792935^ 4 * 2^31 = 24111745 * 2^31 -.word 525173483 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 4 * 375649793 * 2^31 -.word 4953407 // zeta^164 * 2^31 = 27792935^164 * 2^31 = 30845592 * 2^31 -.word 1022156993 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 27792935^164 * 375649793 * 2^31 -.word 50238225 // zeta^ 84 * 2^31 = 27792935^ 84 * 2^31 = 12909577 * 2^31 -.word 102504175 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 84 * 375649793 * 2^31 -.word 1996333 // zeta^ 52 * 2^31 = 27792935^ 52 * 2^31 = 12390669 * 2^31 -.word 686509011 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 52 * 375649793 * 2^31 -.word 46095063 // zeta^140 * 2^31 = 27792935^140 * 2^31 = 994165 * 2^31 -.word 2161183529 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 27792935^140 * 375649793 * 2^31 -.word 9797335 // zeta^108 * 2^31 = 27792935^108 * 2^31 = 18811302 * 2^31 -.word 3496142633 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 27792935^108 * 375649793 * 2^31 -.word 15981389 // zeta^ 28 * 2^31 = 27792935^ 28 * 2^31 = 20448273 * 2^31 -.word 1032398515 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 28 * 375649793 * 2^31 -.word 65825237 // zeta^188 * 2^31 = 
27792935^188 * 2^31 = 403828 * 2^31 -.word 2466375723 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 27792935^188 * 375649793 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_33556993_27792935_incomplete_good_scale -ntt_192_u32_33556993_27792935_incomplete_good_scale: // Constants for scaling by 1/N -.word 56716939 // 1/48 -.word 2862874485 // 1/48 twisted -.data -roots: -.word 893127 /// zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 66384763 /// zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 
72 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 18598075 // zeta^132 * 2^31 = 27792935^132 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 27792935^132 * 375649793 * 2^31 -.word 4885007 // zeta^ 84 * 2^31 = 27792935^ 84 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 84 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 64683161 // zeta^ 12 * 2^31 = 27792935^ 12 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 12 * 375649793 * 2^31 -.word 34427601 // zeta^156 * 2^31 = 27792935^156 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 27792935^156 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_33556993_27792935_incomplete_good, %function -.global ntt_192_u32_33556993_27792935_incomplete_good -ntt_192_u32_33556993_27792935_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] 
-vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// 
input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// 
input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, 
Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// 
input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release 
input[168] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already 
loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, 
[r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, 
[r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, 
[r0, #(4 * 76)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 
-vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from 
Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s deleted file mode 100644 index 0107319..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the 
Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 66384763 /// zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 893127 /// zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 32686385 
// zeta^ 60 * 2^31 = 27792935^ 60 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 60 * 375649793 * 2^31 -.word 2430825 // zeta^108 * 2^31 = 27792935^108 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 27792935^108 * 375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 62228979 // zeta^180 * 2^31 = 27792935^180 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 27792935^180 * 375649793 * 2^31 -.word 48515911 // zeta^ 36 * 2^31 = 27792935^ 36 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 36 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_33556993_27792935_incomplete_good_bitrev, %function -.global ntt_192_u32_33556993_27792935_incomplete_good_bitrev -ntt_192_u32_33556993_27792935_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release 
input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vmul.u32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded 
as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 
Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 
Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] 
-vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: 
Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, 
[r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, 
Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 
-vqrdmulh.s32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, 
Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: 
Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: 
Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[148]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[156]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r9 -// input[44]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-.equ modulus_inv, 3919317503
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1253
-// Instruction count: 895
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good.s
deleted file mode 100644
index bb1b225..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good.s
+++ /dev/null
@@ -1,1390 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_192_u32_45387457_16877098_incomplete_good_twiddles
-ntt_192_u32_45387457_16877098_incomplete_good_twiddles: // For base multiplication
-.word 3050923 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31
-.word 1370315925 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31
-.word 58792077 // zeta^160 * 2^31 = 16877098^160 * 2^31 = 27201077 * 2^31
-.word 2040215347 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 16877098^160 * 450429249 * 2^31
-.word 9560897 // zeta^ 80 * 2^31 = 16877098^ 80 * 2^31 = 43749424 * 2^31
-.word 217647999 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 80 * 450429249 * 2^31
-.word 16322801 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31
-.word 4050174415 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31
-.word 65472817 // zeta^136 * 2^31 = 16877098^136 * 2^31 = 6908982 * 2^31
-.word 1455716751 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 16877098^136 * 450429249 * 2^31
-.word 34650023 // zeta^104 * 2^31 = 16877098^104 * 2^31 = 38432301 * 2^31
-.word 3329931161 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 16877098^104 * 450429249 * 2^31
-.word 22720737 // zeta^ 24 * 2^31 = 16877098^ 24 * 2^31 = 40340716 * 2^31
-.word 2223977951 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 24 * 450429249 * 2^31
-.word 41602881 // zeta^184 * 2^31 = 
16877098^184 * 2^31 = 24079121 * 2^31 -.word 2615746431 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 16877098^184 * 450429249 * 2^31 -.word 16163777 // zeta^ 68 * 2^31 = 16877098^ 68 * 2^31 = 4138342 * 2^31 -.word 1646504703 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 68 * 450429249 * 2^31 -.word 34282441 // zeta^ 36 * 2^31 = 16877098^ 36 * 2^31 = 21015440 * 2^31 -.word 2024876279 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 36 * 450429249 * 2^31 -.word 63685223 // zeta^148 * 2^31 = 16877098^148 * 2^31 = 12104035 * 2^31 -.word 4264705241 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 16877098^148 * 450429249 * 2^31 -.word 43044525 // zeta^116 * 2^31 = 16877098^116 * 2^31 = 41757216 * 2^31 -.word 96983315 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 16877098^116 * 450429249 * 2^31 -.word 87208369 // zeta^ 12 * 2^31 = 16877098^ 12 * 2^31 = 38013065 * 2^31 -.word 3798348047 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 12 * 450429249 * 2^31 -.word 8670255 // zeta^172 * 2^31 = 16877098^172 * 2^31 = 4764854 * 2^31 -.word 4552977 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 16877098^172 * 450429249 * 2^31 -.word 26490773 // zeta^ 92 * 2^31 = 16877098^ 92 * 2^31 = 34257499 * 2^31 -.word 3319352875 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 92 * 450429249 * 2^31 -.word 25361455 // zeta^ 60 * 2^31 = 16877098^ 60 * 2^31 = 20563366 * 2^31 -.word 2431306001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 60 * 450429249 * 2^31 -.word 31982837 // zeta^ 64 * 2^31 = 16877098^ 64 * 2^31 = 18186380 * 2^31 -.word 2254751947 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 64 * 450429249 * 2^31 -.word 80421217 // zeta^ 32 * 2^31 = 16877098^ 32 * 2^31 = 18186381 * 2^31 -.word 3625067871 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 32 * 450429249 * 2^31 -.word 74452113 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 244792879 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 38625553 // zeta^112 * 2^31 = 
16877098^112 * 2^31 = 29011006 * 2^31 -.word 462440879 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 16877098^112 * 450429249 * 2^31 -.word 56124891 // zeta^ 8 * 2^31 = 16877098^ 8 * 2^31 = 6955156 * 2^31 -.word 965036133 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 8 * 450429249 * 2^31 -.word 76210251 // zeta^168 * 2^31 = 16877098^168 * 2^31 = 13864138 * 2^31 -.word 2420752885 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 16877098^168 * 450429249 * 2^31 -.word 49172033 // zeta^ 88 * 2^31 = 16877098^ 88 * 2^31 = 21308336 * 2^31 -.word 1679220863 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 88 * 450429249 * 2^31 -.word 26505313 // zeta^ 56 * 2^31 = 16877098^ 56 * 2^31 = 16261595 * 2^31 -.word 3903198815 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 56 * 450429249 * 2^31 -.word 56492473 // zeta^132 * 2^31 = 16877098^132 * 2^31 = 24372017 * 2^31 -.word 2270091015 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 16877098^132 * 450429249 * 2^31 -.word 27268793 // zeta^100 * 2^31 = 16877098^100 * 2^31 = 28510359 * 2^31 -.word 3916595719 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 16877098^100 * 450429249 * 2^31 -.word 47730389 // zeta^ 20 * 2^31 = 16877098^ 20 * 2^31 = 3630241 * 2^31 -.word 4197983979 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 20 * 450429249 * 2^31 -.word 66028155 // zeta^180 * 2^31 = 16877098^180 * 2^31 = 15734276 * 2^31 -.word 4167721925 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 16877098^180 * 450429249 * 2^31 -.word 82104659 // zeta^ 76 * 2^31 = 16877098^ 76 * 2^31 = 40622603 * 2^31 -.word 4290414317 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 76 * 450429249 * 2^31 -.word 33150657 // zeta^ 44 * 2^31 = 16877098^ 44 * 2^31 = 33248211 * 2^31 -.word 3793795071 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 44 * 450429249 * 2^31 -.word 65413459 // zeta^156 * 2^31 = 16877098^156 * 2^31 = 24824091 * 2^31 -.word 1863661293 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 16877098^156 * 450429249 * 2^31 -.word 46516775 // zeta^124 * 2^31 = 
16877098^124 * 2^31 = 13694133 * 2^31 -.word 888046873 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 16877098^124 * 450429249 * 2^31 -.word 10353697 // zeta^128 * 2^31 = 16877098^128 * 2^31 = 27201076 * 2^31 -.word 669899423 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 16877098^128 * 450429249 * 2^31 -.word 87723991 // zeta^ 96 * 2^31 = 16877098^ 96 * 2^31 = 45387456 * 2^31 -.word 2924651369 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 96 * 450429249 * 2^31 -.word 52149361 // zeta^ 16 * 2^31 = 16877098^ 16 * 2^31 = 16376451 * 2^31 -.word 3832526415 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 16 * 450429249 * 2^31 -.word 81214017 // zeta^176 * 2^31 = 16877098^176 * 2^31 = 1638033 * 2^31 -.word 4077319295 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 16877098^176 * 450429249 * 2^31 -.word 14564663 // zeta^ 72 * 2^31 = 16877098^ 72 * 2^31 = 31523319 * 2^31 -.word 1874214409 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 72 * 450429249 * 2^31 -.word 25302097 // zeta^ 40 * 2^31 = 16877098^ 40 * 2^31 = 38478475 * 2^31 -.word 2839250543 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 40 * 450429249 * 2^31 -.word 64269601 // zeta^152 * 2^31 = 16877098^152 * 2^31 = 29125862 * 2^31 -.word 391768479 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 16877098^152 * 450429249 * 2^31 -.word 68054177 // zeta^120 * 2^31 = 16877098^120 * 2^31 = 5046741 * 2^31 -.word 2070989343 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 16877098^120 * 450429249 * 2^31 -.word 63506121 // zeta^ 4 * 2^31 = 16877098^ 4 * 2^31 = 16877098 * 2^31 -.word 378371575 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 4 * 450429249 * 2^31 -.word 74611137 // zeta^164 * 2^31 = 16877098^164 * 2^31 = 41249115 * 2^31 -.word 2648462591 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 16877098^164 * 450429249 * 2^31 -.word 24746759 // zeta^ 84 * 2^31 = 16877098^ 84 * 2^31 = 29653181 * 2^31 -.word 127245369 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 84 * 450429249 * 2^31 -.word 27089691 // zeta^ 52 * 2^31 = 
16877098^ 52 * 2^31 = 33283422 * 2^31 -.word 30262053 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 52 * 450429249 * 2^31 -.word 57624257 // zeta^140 * 2^31 = 16877098^140 * 2^31 = 12139246 * 2^31 -.word 501172223 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 16877098^140 * 450429249 * 2^31 -.word 3566545 // zeta^108 * 2^31 = 16877098^108 * 2^31 = 7374392 * 2^31 -.word 496619247 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 16877098^108 * 450429249 * 2^31 -.word 44258139 // zeta^ 28 * 2^31 = 16877098^ 28 * 2^31 = 31693324 * 2^31 -.word 3406920421 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 28 * 450429249 * 2^31 -.word 64284141 // zeta^188 * 2^31 = 16877098^188 * 2^31 = 11129958 * 2^31 -.word 975614419 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 16877098^188 * 450429249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_45387457_16877098_incomplete_good_scale -ntt_192_u32_45387457_16877098_incomplete_good_scale: // Constants for scaling by 1/N -.word 3050923 // 1/48 -.word 1370315925 // 1/48 twisted -.data -roots: -.word 9023783 /// zeta^ 64 * 2^31 = 16877098^ 64 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 64 * 450429249 * 2^31 -.word 22090505 /// zeta^128 * 2^31 = 16877098^128 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 16877098^128 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 78782351 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 
16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 78782351 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 78782351 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 88323005 // zeta^ 72 * 2^31 = 16877098^ 72 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 72 * 450429249 * 2^31 -.word 84188761 // zeta^ 24 * 2^31 = 16877098^ 24 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 24 * 450429249 * 2^31 -.word 88323005 // zeta^ 72 * 2^31 = 16877098^ 72 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 72 * 450429249 * 2^31 -.word 16804439 // zeta^132 * 2^31 = 16877098^132 * 2^31 = 24372017 * 2^31 -.word 3300632809 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 16877098^132 * 450429249 * 2^31 -.word 19157039 // zeta^ 84 * 2^31 = 16877098^ 84 * 2^31 = 29653181 * 2^31 -.word 3550508305 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 84 * 450429249 * 2^31 -.word 84188761 // zeta^ 24 * 2^31 = 16877098^ 24 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 24 * 450429249 * 2^31 -.word 65804887 // zeta^ 12 * 2^31 = 16877098^ 12 * 2^31 = 38013065 * 2^31 -.word 3946051817 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 12 * 450429249 * 2^31 -.word 82969997 // zeta^156 * 2^31 = 16877098^156 * 2^31 = 24824091 * 2^31 -.word 3322022451 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 16877098^156 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_45387457_16877098_incomplete_good, %function -.global 
ntt_192_u32_45387457_16877098_incomplete_good -ntt_192_u32_45387457_16877098_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 45387457 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// 
Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 
88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, 
#(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, 
Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] 
-vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] 
-vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 
-vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 
* 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as 
Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 
-vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 
-vqrdmulh.s32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 
104)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 
* -68)] -vmul.u32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 3844538047 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - 
-// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s deleted file mode 100644 index b2d471f..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 22090505 /// zeta^128 * 2^31 = 16877098^128 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 16877098^128 * 450429249 * 2^31 -.word 9023783 /// zeta^ 64 * 2^31 = 16877098^ 64 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 64 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 11992563 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 6586153 // zeta^120 * 2^31 = 16877098^120 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 16877098^120 * 450429249 * 2^31 -.word 2451909 // zeta^168 * 2^31 = 16877098^168 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 16877098^168 * 450429249 * 2^31 -.word 6586153 // zeta^120 * 2^31 = 16877098^120 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 16877098^120 * 450429249 * 2^31 -.word 7804917 // zeta^ 
60 * 2^31 = 16877098^ 60 * 2^31 = 20563366 * 2^31 -.word 972944843 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 60 * 450429249 * 2^31 -.word 24970027 // zeta^108 * 2^31 = 16877098^108 * 2^31 = 7374392 * 2^31 -.word 348915477 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 16877098^108 * 450429249 * 2^31 -.word 2451909 // zeta^168 * 2^31 = 16877098^168 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 16877098^168 * 450429249 * 2^31 -.word 71617875 // zeta^180 * 2^31 = 16877098^180 * 2^31 = 15734276 * 2^31 -.word 744458989 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 16877098^180 * 450429249 * 2^31 -.word 73970475 // zeta^ 36 * 2^31 = 16877098^ 36 * 2^31 = 21015440 * 2^31 -.word 994334485 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 36 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_45387457_16877098_incomplete_good_bitrev, %function -.global ntt_192_u32_45387457_16877098_incomplete_good_bitrev -ntt_192_u32_45387457_16877098_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 45387457 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 
-vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vmul.u32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// 
input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, 
r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, 
r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] 
-vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: 
Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, 
[r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, 
Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 
-vqrdmulh.s32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, 
Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: 
Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: 
Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[148]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[156]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r9 -// input[44]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 3844538047 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good.s b/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good.s deleted file mode 100644 index 6713e33..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// 
copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_88299073_9670361_incomplete_good_twiddles -ntt_192_u32_88299073_9670361_incomplete_good_twiddles: // For base multiplication -.word 62163489 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 3607823391 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 31257951 // zeta^160 * 2^31 = 9670361^160 * 2^31 = 2534357 * 2^31 -.word 3835714657 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 9670361^160 * 2066201025 * 2^31 -.word 149330681 // zeta^ 80 * 2^31 = 9670361^ 80 * 2^31 = 5579523 * 2^31 -.word 1698183495 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 80 * 2066201025 * 2^31 -.word 12706985 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 1188395927 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 128346011 // zeta^136 * 2^31 = 9670361^136 * 2^31 = 41822566 * 2^31 -.word 3104825637 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 9670361^136 * 2066201025 * 2^31 -.word 75665841 // zeta^104 * 2^31 = 9670361^104 * 2^31 = 76960665 * 2^31 -.word 22036623 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 9670361^104 * 2066201025 * 2^31 -.word 153143541 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 415003211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 63785177 // zeta^184 * 2^31 = 
9670361^184 * 2^31 = 22220342 * 2^31 -.word 1820869479 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 9670361^184 * 2066201025 * 2^31 -.word 135693903 // zeta^ 68 * 2^31 = 9670361^ 68 * 2^31 = 55309930 * 2^31 -.word 887924593 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 68 * 2066201025 * 2^31 -.word 159293731 // zeta^ 36 * 2^31 = 9670361^ 36 * 2^31 = 64980291 * 2^31 -.word 4156635549 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 36 * 2066201025 * 2^31 -.word 96170719 // zeta^148 * 2^31 = 9670361^148 * 2^31 = 62204288 * 2^31 -.word 2461155041 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 9670361^148 * 2066201025 * 2^31 -.word 63279659 // zeta^116 * 2^31 = 9670361^116 * 2^31 = 32274711 * 2^31 -.word 1943976597 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 9670361^116 * 2066201025 * 2^31 -.word 53905215 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31 -.word 1284407937 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31 -.word 146011323 // zeta^172 * 2^31 = 9670361^172 * 2^31 = 41675533 * 2^31 -.word 1123834885 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 9670361^172 * 2066201025 * 2^31 -.word 24870969 // zeta^ 92 * 2^31 = 9670361^ 92 * 2^31 = 67630520 * 2^31 -.word 3062637575 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 92 * 2066201025 * 2^31 -.word 150915321 // zeta^ 60 * 2^31 = 9670361^ 60 * 2^31 = 78801296 * 2^31 -.word 3619654471 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 60 * 2066201025 * 2^31 -.word 145340195 // zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 459252637 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 119204611 // zeta^ 32 * 2^31 = 9670361^ 32 * 2^31 = 85764717 * 2^31 -.word 4067076029 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 32 * 2066201025 * 2^31 -.word 163891161 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 3106571367 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 48324623 // zeta^112 * 2^31 = 
9670361^112 * 2^31 = 69154324 * 2^31 -.word 509787569 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 9670361^112 * 2066201025 * 2^31 -.word 100932305 // zeta^ 8 * 2^31 = 9670361^ 8 * 2^31 = 11338408 * 2^31 -.word 4272930671 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 8 * 2066201025 * 2^31 -.word 140979243 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 3082789013 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 112812969 // zeta^ 88 * 2^31 = 9670361^ 88 * 2^31 = 66078731 * 2^31 -.word 2474097815 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 88 * 2066201025 * 2^31 -.word 1059291 // zeta^ 56 * 2^31 = 9670361^ 56 * 2^31 = 43898970 * 2^31 -.word 2889101029 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 56 * 2066201025 * 2^31 -.word 17304415 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31 -.word 138331745 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31 -.word 64699245 // zeta^100 * 2^31 = 9670361^100 * 2^31 = 78628712 * 2^31 -.word 1026256339 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 9670361^100 * 2066201025 * 2^31 -.word 113318487 // zeta^ 20 * 2^31 = 9670361^ 20 * 2^31 = 56024362 * 2^31 -.word 2350990697 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 20 * 2066201025 * 2^31 -.word 121190133 // zeta^180 * 2^31 = 9670361^180 * 2^31 = 29929577 * 2^31 -.word 517178443 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 9670361^180 * 2066201025 * 2^31 -.word 30586823 // zeta^ 76 * 2^31 = 9670361^ 76 * 2^31 = 46623540 * 2^31 -.word 3171132409 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 76 * 2066201025 * 2^31 -.word 172791111 // zeta^ 44 * 2^31 = 9670361^ 44 * 2^31 = 23363129 * 2^31 -.word 160573049 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 44 * 2066201025 * 2^31 -.word 25682825 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31 -.word 675312823 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31 -.word 138852867 // zeta^124 * 2^31 = 9670361^124 
* 2^31 = 77128297 * 2^31 -.word 3737950397 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 9670361^124 * 2066201025 * 2^31 -.word 57393535 // zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 227891265 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 114434657 // zeta^ 96 * 2^31 = 9670361^ 96 * 2^31 = 88299072 * 2^31 -.word 687143903 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 96 * 2066201025 * 2^31 -.word 128273523 // zeta^ 16 * 2^31 = 9670361^ 16 * 2^31 = 19144749 * 2^31 -.word 3785179725 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 16 * 2066201025 * 2^31 -.word 27267465 // zeta^176 * 2^31 = 9670361^176 * 2^31 = 82719550 * 2^31 -.word 2596783799 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 9670361^176 * 2066201025 * 2^31 -.word 35618903 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 1212178281 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 48252135 // zeta^ 40 * 2^31 = 9670361^ 40 * 2^31 = 46476507 * 2^31 -.word 1190141657 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 40 * 2066201025 * 2^31 -.word 175538855 // zeta^152 * 2^31 = 9670361^152 * 2^31 = 44400103 * 2^31 -.word 1405866265 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 9670361^152 * 2066201025 * 2^31 -.word 23454605 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 3879964083 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 111898901 // zeta^ 4 * 2^31 = 9670361^ 4 * 2^31 = 9670361 * 2^31 -.word 3268710955 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 4 * 2066201025 * 2^31 -.word 40904243 // zeta^164 * 2^31 = 9670361^164 * 2^31 = 32989143 * 2^31 -.word 3407042701 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 9670361^164 * 2066201025 * 2^31 -.word 55408013 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31 -.word 3777788851 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31 -.word 80427427 // zeta^ 52 * 2^31 = 9670361^ 52 * 2^31 = 
26094785 * 2^31 -.word 1833812253 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 52 * 2066201025 * 2^31 -.word 3807035 // zeta^140 * 2^31 = 9670361^140 * 2^31 = 64935944 * 2^31 -.word 4134394245 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 9670361^140 * 2066201025 * 2^31 -.word 122692931 // zeta^108 * 2^31 = 9670361^108 * 2^31 = 23260411 * 2^31 -.word 3010559357 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 9670361^108 * 2066201025 * 2^31 -.word 37745279 // zeta^ 28 * 2^31 = 9670361^ 28 * 2^31 = 11170776 * 2^31 -.word 557016897 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 28 * 2066201025 * 2^31 -.word 151727177 // zeta^188 * 2^31 = 9670361^188 * 2^31 = 20668553 * 2^31 -.word 1232329719 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 9670361^188 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_88299073_9670361_incomplete_good_scale -ntt_192_u32_88299073_9670361_incomplete_good_scale: // Constants for scaling by 1/N -.word 62163489 // 1/48 -.word 3607823391 // 1/48 twisted -.data -roots: -.word 85764716 /// zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 2534356 /// zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 
2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 23318782 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31 -.word 58369496 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31 -.word 1419579322 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31 -.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 65038662 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31 -.word 1581777230 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31 -.word 9497777 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_88299073_9670361_incomplete_good, %function -.global ntt_192_u32_88299073_9670361_incomplete_good -ntt_192_u32_88299073_9670361_incomplete_good: -// Save GPRs -push 
{r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -88299073 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as 
Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release 
input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as 
Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, 
r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 
-vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, 
Q3, Q2 -// Release input[160] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: 
Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 
Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 
Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 
-vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 
-vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, 
Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q5, Q0, r6 
-vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 
Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 2228766271 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s b/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s deleted file mode 100644 index 4e2abfa..0000000 --- a/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 
2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 2534356 /// zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 85764716 /// zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 24724272 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 22179761 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 53160974 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 22179761 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 78801296 // zeta^ 60 * 2^31 = 9670361^ 60 * 2^31 = 78801296 * 2^31 -.word 1916492312 // 
zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 60 * 2066201025 * 2^31 -.word 23260411 // zeta^108 * 2^31 = 9670361^108 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 9670361^108 * 2066201025 * 2^31 -.word 53160974 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 29929577 // zeta^180 * 2^31 = 9670361^180 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 9670361^180 * 2066201025 * 2^31 -.word 64980291 // zeta^ 36 * 2^31 = 9670361^ 36 * 2^31 = 64980291 * 2^31 -.word 1580357614 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 36 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_88299073_9670361_incomplete_good_bitrev, %function -.global ntt_192_u32_88299073_9670361_incomplete_good_bitrev -ntt_192_u32_88299073_9670361_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -88299073 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: 
Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 
-// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release 
input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 
Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r10
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r9
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r12
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[124]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r10
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r9
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r12
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r6
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[108]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 108)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vqrdmulh.s32 Q2, Q2, r5
-vsub.s32 Q4, Q1, Q0
-// input[168]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -84)]
-vmla.s32 Q3, Q2, r12
-vstrw.u32 Q4, [r0,#(288)]
-vadd.s32 Q1, Q1, Q0
-// Release input[72] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[168]: Already loaded as Q7
-// input[108]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r6
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q7, Q7, Q6
-// Release input[108] from Q6
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q7
-// Release input[168] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[120]: Already loaded as Q7
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q7, Q7, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[136]: Already loaded as Q6
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[40]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[40]: Already loaded as Q7
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[40] from Q7
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[88]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[184]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[88] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[184]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q7
-// Release input[184] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[104]: Already loaded as Q7
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[152]: Already loaded as Q6
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[56]: Already loaded as Q7
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[96]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 96)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[56] from Q7
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q5
-vmul.u32 Q0, Q5, r10
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r9
-// input[96]: Already loaded as Q6
-vmla.s32 Q0, Q5, r12
-vmul.u32 Q2, Q1, r10
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r9
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r6
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r12
-// input[112]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 112)]
-vmul.u32 Q4, Q6, r8
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r7
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vmla.s32 Q4, Q6, r12
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r0,#(384)]
-// Release input[96] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[112]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(448)]
-// Release input[112] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[176]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[56]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[180]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(208)]
-// Release input[52] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-vmul.u32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vmla.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-.equ modulus_inv, 2228766271
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1253
-// Instruction count: 895
\ No newline at end of file
diff --git a/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s b/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s
deleted file mode 100644
index e258229..0000000
--- a/tests/ntt_192/auto/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,1237 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_twiddles
-ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_twiddles: // For base multiplication
-.word 62163489 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31
-.word 3607823391 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31
-.word 31257951 // zeta^160 * 2^31 = 9670361^160 * 2^31 = 2534357 * 2^31
-.word 3835714657 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 9670361^160 * 2066201025 * 2^31
-.word 149330681 // zeta^ 80 * 2^31 = 9670361^ 80 * 2^31 = 5579523 * 2^31
-.word 1698183495 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 80 * 2066201025 * 2^31
-.word 12706985 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31
-.word 1188395927 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31
-.word 128346011 // zeta^136 * 2^31 = 9670361^136 * 2^31 = 41822566 * 2^31
-.word 3104825637 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 9670361^136 * 2066201025 * 2^31
-.word 75665841 // zeta^104 * 2^31 = 9670361^104 * 2^31 = 76960665 * 2^31
-.word 22036623 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 9670361^104 * 2066201025 * 2^31
-.word 153143541 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31
-.word 415003211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31
-.word 63785177 // zeta^184 * 2^31 = 9670361^184 * 2^31 = 22220342 * 2^31
-.word 1820869479 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 9670361^184 * 2066201025 * 2^31
-.word 135693903 // zeta^ 68 * 2^31 = 9670361^ 68 * 2^31 = 55309930 * 2^31
-.word 887924593 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 68 * 2066201025 * 2^31
-.word 159293731 // zeta^ 36 * 2^31 = 9670361^ 36 * 2^31 = 64980291 * 2^31
-.word 4156635549 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 36 * 2066201025 * 2^31
-.word 96170719 // zeta^148 * 2^31 = 9670361^148 * 2^31 = 62204288 * 2^31
-.word 2461155041 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 9670361^148 * 2066201025 * 2^31
-.word 63279659 // zeta^116 * 2^31 = 9670361^116 * 2^31 = 32274711 * 2^31
-.word 1943976597 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 9670361^116 * 2066201025 * 2^31
-.word 53905215 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31
-.word 1284407937 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31
-.word 146011323 // zeta^172 * 2^31 = 9670361^172 * 2^31 = 41675533 * 2^31
-.word 1123834885 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 9670361^172 * 2066201025 * 2^31
-.word 24870969 // zeta^ 92 * 2^31 = 9670361^ 92 * 2^31 = 67630520 * 2^31
-.word 3062637575 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 92 * 2066201025 * 2^31
-.word 150915321 // zeta^ 60 * 2^31 = 9670361^ 60 * 2^31 = 78801296 * 2^31
-.word 3619654471 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 60 * 2066201025 * 2^31
-.word 145340195 // zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31
-.word 459252637 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31
-.word 119204611 // zeta^ 32 * 2^31 = 9670361^ 32 * 2^31 = 85764717 * 2^31
-.word 4067076029 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 32 * 2066201025 * 2^31
-.word 163891161 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31
-.word 3106571367 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31
-.word 48324623 // zeta^112 * 2^31 = 9670361^112 * 2^31 = 69154324 * 2^31
-.word 509787569 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 9670361^112 * 2066201025 * 2^31
-.word 100932305 // zeta^ 8 * 2^31 = 9670361^ 8 * 2^31 = 11338408 * 2^31
-.word 4272930671 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 8 * 2066201025 * 2^31
-.word 140979243 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31
-.word 3082789013 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31
-.word 112812969 // zeta^ 88 * 2^31 = 9670361^ 88 * 2^31 = 66078731 * 2^31
-.word 2474097815 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 88 * 2066201025 * 2^31
-.word 1059291 // zeta^ 56 * 2^31 = 9670361^ 56 * 2^31 = 43898970 * 2^31
-.word 2889101029 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 56 * 2066201025 * 2^31
-.word 17304415 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31
-.word 138331745 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31
-.word 64699245 // zeta^100 * 2^31 = 9670361^100 * 2^31 = 78628712 * 2^31
-.word 1026256339 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 9670361^100 * 2066201025 * 2^31
-.word 113318487 // zeta^ 20 * 2^31 = 9670361^ 20 * 2^31 = 56024362 * 2^31
-.word 2350990697 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 20 * 2066201025 * 2^31
-.word 121190133 // zeta^180 * 2^31 = 9670361^180 * 2^31 = 29929577 * 2^31
-.word 517178443 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 9670361^180 * 2066201025 * 2^31
-.word 30586823 // zeta^ 76 * 2^31 = 9670361^ 76 * 2^31 = 46623540 * 2^31
-.word 3171132409 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 76 * 2066201025 * 2^31
-.word 172791111 // zeta^ 44 * 2^31 = 9670361^ 44 * 2^31 = 23363129 * 2^31
-.word 160573049 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 44 * 2066201025 * 2^31
-.word 25682825 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31
-.word 675312823 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31
-.word 138852867 // zeta^124 * 2^31 = 9670361^124 * 2^31 = 77128297 * 2^31
-.word 3737950397 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 9670361^124 * 2066201025 * 2^31
-.word 57393535 // zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31
-.word 227891265 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31
-.word 114434657 // zeta^ 96 * 2^31 = 9670361^ 96 * 2^31 = 88299072 * 2^31
-.word 687143903 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 96 * 2066201025 * 2^31
-.word 128273523 // zeta^ 16 * 2^31 = 9670361^ 16 * 2^31 = 19144749 * 2^31
-.word 3785179725 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 16 * 2066201025 * 2^31
-.word 27267465 // zeta^176 * 2^31 = 9670361^176 * 2^31 = 82719550 * 2^31
-.word 2596783799 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 9670361^176 * 2066201025 * 2^31
-.word 35618903 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31
-.word 1212178281 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31
-.word 48252135 // zeta^ 40 * 2^31 = 9670361^ 40 * 2^31 = 46476507 * 2^31
-.word 1190141657 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 40 * 2066201025 * 2^31
-.word 175538855 // zeta^152 * 2^31 = 9670361^152 * 2^31 = 44400103 * 2^31
-.word 1405866265 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 9670361^152 * 2066201025 * 2^31
-.word 23454605 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31
-.word 3879964083 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31
-.word 111898901 // zeta^ 4 * 2^31 = 9670361^ 4 * 2^31 = 9670361 * 2^31
-.word 3268710955 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 4 * 2066201025 * 2^31
-.word 40904243 // zeta^164 * 2^31 = 9670361^164 * 2^31 = 32989143 * 2^31
-.word 3407042701 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 9670361^164 * 2066201025 * 2^31
-.word 55408013 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31
-.word 3777788851 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31
-.word 80427427 // zeta^ 52 * 2^31 = 9670361^ 52 * 2^31 = 26094785 * 2^31
-.word 1833812253 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 52 * 2066201025 * 2^31
-.word 3807035 // zeta^140 * 2^31 = 9670361^140 * 2^31 = 64935944 * 2^31
-.word 4134394245 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 9670361^140 * 2066201025 * 2^31
-.word 122692931 // zeta^108 * 2^31 = 9670361^108 * 2^31 = 23260411 * 2^31
-.word 3010559357 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 9670361^108 * 2066201025 * 2^31
-.word 37745279 // zeta^ 28 * 2^31 = 9670361^ 28 * 2^31 = 11170776 * 2^31
-.word 557016897 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 28 * 2066201025 * 2^31
-.word 151727177 // zeta^188 * 2^31 = 9670361^188 * 2^31 = 20668553 * 2^31
-.word 1232329719 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 9670361^188 * 2066201025 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_scale
-ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N
-.word 62163489 // 1/48
-.word 3607823391 // 1/48 twisted
-.data
-roots:
-.word 85764716 /// zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31
-.word 2085846645 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31
-.word 2534356 /// zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31
-.word 61636979 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31
-.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31
-.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31
-.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31
-.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31
-.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31
-.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31
-.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31
-.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31
-.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31
-.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31
-.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31
-.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31
-.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31
-.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31
-.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31
-.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31
-.word 23318782 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31
-.word 567126034 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31
-.word 58369496 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31
-.word 1419579322 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31
-.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31
-.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31
-.word 65038662 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31
-.word 1581777230 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31
-.word 9497777 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31
-.word 230991336 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input, %function
-.global ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input
-ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 256
-add r14, r0, #256
-// Use r12 as marker for r0 + 512
-add r12, r14, #256
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-.equ modulus, -88299073
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vmul.u32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q0
-// input[4]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q0, r7
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q5, Q1, Q0
-vmla.s32 Q2, Q3, r10
-vstrw.u32 Q4, [r1,#(0)]
-vadd.s32 Q3, Q1, Q2
-vstrw.u32 Q3, [r1,#(256)]
-vsub.s32 Q5, Q5, Q2
-vstrw.u32 Q5, [r11,#(-496)]
-// Release input[0] from Q1
-// Release input[64] from Q0
-// input[4]: Already loaded as Q6
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-vmul.u32 Q1, Q0, r8
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vadd.s32 Q2, Q6, Q7
-vqrdmulh.s32 Q0, Q0, r7
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q6
-// Release input[4] from Q6
-vstrw.u32 Q2, [r11,#(-480)]
-vsub.s32 Q5, Q1, Q7
-// Release input[68] from Q7
-vstrw.u32 Q5, [r1,#(16)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(272)]
-// input[8]: Already loaded as Q4
-// input[72]: Already loaded as Q3
-vmul.u32 Q0, Q4, r8
-vadd.s32 Q2, Q3, Q4
-// input[12]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 12)]
-vqrdmulh.s32 Q1, Q4, r7
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vsub.s32 Q5, Q3, Q4
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(288)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r1,#(32)]
-vsub.s32 Q5, Q5, Q0
-vstrw.u32 Q5, [r11,#(-464)]
-// Release input[72] from Q3
-// Release input[8] from Q4
-// input[76]: Already loaded as Q7
-// input[12]: Already loaded as Q6
-vmul.u32 Q0, Q7, r8
-vadd.s32 Q2, Q6, Q7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmulh.s32 Q1, Q7, r7
-// input[80]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 80)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(48)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(304)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[12] from Q6
-// Release input[76] from Q7
-// input[16]: Already loaded as Q4
-// input[80]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r8
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r7
-// input[20]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 20)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q4
-// Release input[16] from Q4
-vstrw.u32 Q2, [r11,#(-432)]
-vsub.s32 Q4, Q1, Q5
-// Release input[80] from Q5
-vstrw.u32 Q4, [r1,#(64)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(320)]
-// input[20]: Already loaded as Q6
-// input[84]: Already loaded as Q3
-vmul.u32 Q0, Q6, r8
-vadd.s32 Q2, Q3, Q6
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q6, r7
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vsub.s32 Q4, Q3, Q6
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(336)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r1,#(80)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[84] from Q3
-// Release input[20] from Q6
-// input[88]: Already loaded as Q7
-// input[24]: Already loaded as Q5
-vmul.u32 Q0, Q7, r8
-vadd.s32 Q2, Q5, Q7
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmulh.s32 Q1, Q7, r7
-// input[92]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 92)]
-vsub.s32 Q3, Q5, Q7
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(96)]
-vadd.s32 Q1, Q5, Q0
-vstrw.u32 Q1, [r1,#(352)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[24] from Q5
-// Release input[88] from Q7
-// input[28]: Already loaded as Q4
-// input[92]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q2, Q4, Q6
-vqrdmulh.s32 Q0, Q0, r7
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q4
-// Release input[28] from Q4
-vstrw.u32 Q2, [r11,#(-384)]
-vsub.s32 Q4, Q1, Q6
-// Release input[92] from Q6
-vstrw.u32 Q4, [r1,#(112)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(368)]
-// input[32]: Already loaded as Q3
-vmul.u32 Q0, Q3, r8
-vneg.s32 Q1, Q3
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmulh.s32 Q2, Q3, r7
-vstrw.u32 Q3, [r1,#(384)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(128)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-368)]
-// Release input[32] from Q3
-// input[36]: Already loaded as Q4
-vstrw.u32 Q4, [r1,#(144)]
-vstrw.u32 Q4, [r1,#(400)]
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[36] from Q4
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vmul.u32 Q1, Q0, r8
-vneg.s32 Q2, Q0
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmulh.s32 Q3, Q0, r7
-vstrw.u32 Q0, [r11,#(-336)]
-vmla.s32 Q1, Q3, r10
-vstrw.u32 Q1, [r1,#(160)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r1,#(416)]
-// Release input[40] from Q0
-// input[44]: Already loaded as Q4
-vmul.u32 Q0, Q4, r8
-vneg.s32 Q1, Q4
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q2, Q4, r7
-vstrw.u32 Q4, [r1,#(432)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(176)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-320)]
-// Release input[44] from Q4
-// input[48]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(192)]
-vstrw.u32 Q3, [r1,#(448)]
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[48] from Q3
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vmul.u32 Q1, Q0, r8
-vneg.s32 Q2, Q0
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmulh.s32 Q3, Q0, r7
-vstrw.u32 Q0, [r11,#(-288)]
-vmla.s32 Q1, Q3, r10
-vstrw.u32 Q1, [r1,#(208)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r1,#(464)]
-// Release input[52] from Q0
-// input[56]: Already loaded as Q4
-vmul.u32 Q0, Q4, r8
-vneg.s32 Q1, Q4
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q2, Q4, r7
-vstrw.u32 Q4, [r1,#(480)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(224)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-272)]
-// Release input[56] from Q4
-// input[60]: Already
loaded as Q3 -vstrw.u32 Q3, [r1,#(240)] -vstrw.u32 Q3, [r1,#(496)] -vstrw.u32 Q3, [r11,#(-256)] -// Release input[60] from Q3 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -// output[48]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r4 -// output[96]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release output[48] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[180]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release output[96] from Q4 -vqrdmulh.s32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// output[84]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 84)] -vmla.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r11,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release output[144] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r1,#(384)] -// output[84]: Already loaded as Q7 -// output[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r4 -// output[36]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release output[180] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[36] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[24]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 24)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release output[84] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(144)] -// output[24]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[168]: 
Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[72]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// output[60]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release output[168] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[156]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release output[24] from Q6 -vstrw.u32 Q3, [r1,#(288)] -// Release output[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-336)] -// output[156]: Already loaded as Q7 -// output[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[108]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release output[60] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[108] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[16]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 16)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release output[156] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(432)] -// output[16]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[64]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// output[52]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release output[160] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[148]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, 
[r1,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release output[16] from Q6 -vstrw.u32 Q3, [r1,#(256)] -// Release output[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-368)] -// output[148]: Already loaded as Q7 -// output[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[100]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release output[52] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[184]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release output[100] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[88]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 88)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release output[148] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(400)] -// output[88]: Already loaded as Q6 -// output[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release output[184] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[40] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[28]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 28)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release output[88] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(160)] -// output[28]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release 
output[124] from Q5 -// output[76]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// output[176]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release output[172] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[80]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 80)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release output[28] from Q7 -vstrw.u32 Q3, [r1,#(304)] -// Release output[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// output[80]: Already loaded as Q6 -// output[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release output[176] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[32] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[20]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release output[80] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(128)] -// output[20]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[68]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// output[56]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release output[164] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[152]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -100)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release output[20] from Q7 
-vstrw.u32 Q3, [r1,#(272)] -// Release output[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-352)] -// output[152]: Already loaded as Q6 -// output[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[104]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release output[56] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[188]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release output[104] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[92]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 92)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release output[152] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(416)] -// output[92]: Already loaded as Q7 -// output[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[44]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release output[188] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[12]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release output[44] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[132]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -120)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release output[92] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(176)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r8 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r7 -// output[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r10 
-vmul.u32 Q2, Q1, r8 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r7 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r10 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r4 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r10 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vmul.u32 Q4, Q6, r6 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r5 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(48)] -// Release output[12] from Q5 -vmla.s32 Q4, Q6, r10 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(-480)] -// Release output[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r7 -// output[4]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 4)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[140]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(304)] -// Release output[76] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(16)] -// Release output[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r7 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// 
output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[156]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -96)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-448)] -// Release output[140] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r7 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[28]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-384)] -// Release output[156] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[88]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r7 -// output[148]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-432)] -// Release output[144] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[16]: Load as 
Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 
96)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 
-vmla.s32 Q5, Q0, r10 -// output[60]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 
-// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(496)] -// Release output[124] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release output[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(208)] -// Release output[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[56]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r7 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -vmul.u32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r1,#(224)] -// Release output[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -.equ modulus_inv, 2228766271 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1204 -// Instruction count: 900 \ No newline at end of file diff --git a/tests/ntt_256/auto/ntt_256_u32_33556993_26036764_complete.s b/tests/ntt_256/auto/ntt_256_u32_33556993_26036764_complete.s deleted file mode 100644 index bfb3c13..0000000 --- a/tests/ntt_256/auto/ntt_256_u32_33556993_26036764_complete.s +++ /dev/null @@ -1,2907 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby 
granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31 -.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31 -.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31 -.word 43317805 // 
zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31
-.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31
-.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31
-.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31
-.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31
-.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31
-.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31
-.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31
-.word 39999747 // zeta^ 8 * 2^31 = 26036764^ 8 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 8 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 26036764^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 72 * 375649793 * 2^31
-.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31
-.word 48811299 // zeta^ 40 * 2^31 = 26036764^ 40 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 40 * 375649793 * 2^31
-.word 54571669 // zeta^104 * 2^31 = 26036764^104 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 26036764^104 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 26036764^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 24 * 375649793 * 2^31
-.word 40500013 // zeta^ 88 * 2^31 = 26036764^ 88 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 88 * 375649793 * 2^31
-.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31
-.word 25917637 // zeta^ 56 * 2^31 = 26036764^ 56 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 56 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 26036764^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 26036764^120 * 375649793 * 2^31
-.word 39999747 // zeta^ 8 * 2^31 = 26036764^ 8 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 8 * 375649793 * 2^31
-.word 31719253 // zeta^ 4 * 2^31 = 26036764^ 4 * 2^31 = 23825509 * 2^31
-.word 3672199851 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 4 * 375649793 * 2^31
-.word 54842419 // zeta^ 68 * 2^31 = 26036764^ 68 * 2^31 = 27028662 * 2^31
-.word 1729702349 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 68 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 26036764^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 72 * 375649793 * 2^31
-.word 1316163 // zeta^ 36 * 2^31 = 26036764^ 36 * 2^31 = 14833295 * 2^31
-.word 3096742077 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 36 * 375649793 * 2^31
-.word 10391631 // zeta^100 * 2^31 = 26036764^100 * 2^31 = 2138810 * 2^31
-.word 136873393 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 26036764^100 * 375649793 * 2^31
-.word 48811299 // zeta^ 40 * 2^31 = 26036764^ 40 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 40 * 375649793 * 2^31
-.word 54335767 // zeta^ 20 * 2^31 = 26036764^ 20 * 2^31 = 6490403 * 2^31
-.word 2562837737 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 20 * 375649793 * 2^31
-.word 46002083 // zeta^ 84 * 2^31 = 26036764^ 84 * 2^31 = 19648405 * 2^31
-.word 3404885597 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 84 * 375649793 * 2^31
-.word 54571669 // zeta^104 * 2^31 = 26036764^104 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 26036764^104 * 375649793 * 2^31
-.word 35733845 // zeta^ 52 * 2^31 = 26036764^ 52 * 2^31 = 31254932 * 2^31
-.word 2000162987 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 52 * 375649793 * 2^31
-.word 61099389 // zeta^116 * 2^31 = 26036764^116 * 2^31 = 26362414 * 2^31
-.word 1687065731 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 26036764^116 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 26036764^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 24 * 375649793 * 2^31
-.word 26241327 // zeta^ 12 * 2^31 = 26036764^ 12 * 2^31 = 572895 * 2^31
-.word 2184146129 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 12 * 375649793 * 2^31
-.word 5033605 // zeta^ 76 * 2^31 = 26036764^ 76 * 2^31 = 26691971 * 2^31
-.word 3855639419 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 76 * 375649793 * 2^31
-.word 40500013 // zeta^ 88 * 2^31 = 26036764^ 88 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 88 * 375649793 * 2^31
-.word 8316793 // zeta^ 44 * 2^31 = 26036764^ 44 * 2^31 = 9249292 * 2^31
-.word 591909511 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 44 * 375649793 * 2^31
-.word 16634213 // zeta^108 * 2^31 = 26036764^108 * 2^31 = 29292862 * 2^31
-.word 1874600091 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 26036764^108 * 375649793 * 2^31
-.word 25917637 // zeta^ 56 * 2^31 = 26036764^ 56 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 56 * 375649793 * 2^31
-.word 63329695 // zeta^ 28 * 2^31 = 26036764^ 28 * 2^31 = 8247799 * 2^31
-.word 2675302497 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 28 * 375649793 * 2^31
-.word 9983051 // zeta^ 92 * 2^31 = 26036764^ 92 * 2^31 = 5086187 * 2^31
-.word 2472974773 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 92 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 26036764^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 26036764^120 * 375649793 * 2^31
-.word 7721125 // zeta^ 60 * 2^31 = 26036764^ 60 * 2^31 = 28113639 * 2^31
-.word 3946619227 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 60 * 375649793 * 2^31
-.word 9383201 // zeta^124 * 2^31 = 26036764^124 * 2^31 = 8471290 * 2^31
-.word 542121183 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 26036764^124 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31
-.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31
-.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31
-.word 2147483711 // zeta^ 0 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31
-.word 3280343807 // zeta^ 64 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31
-.word 2356128651 // zeta^ 32 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31
-.word 933021651 // zeta^ 96 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31
-.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31
-.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31
-.word 2147483711 // zeta^ 0 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31
-.word 2356128651 // zeta^ 32 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31
-.word 2578416965 // zeta^ 16 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31
-.word 3091135847 // zeta^ 48 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31
-.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31
-.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31
-.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31
-.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31
-.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31
-.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31
-.word 2578416965 // zeta^ 16 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31
-.word 2973633521 // zeta^ 80 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31
-.word 3091135847 // zeta^ 48 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31
-.word 864737071 // zeta^112 * (q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31
-.word 39999747 // zeta^ 8 * 2^31 = 26036764^ 8 * 2^31
-.word 48811299 // zeta^ 40 * 2^31 = 26036764^ 40 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 26036764^ 24 * 2^31
-.word 25917637 // zeta^ 56 * 2^31 = 26036764^ 56 * 2^31
-.word 3454780669 // zeta^ 8 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 8 * 375649793 * 2^31
-.word 4050555101 // zeta^ 40 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 40 * 375649793 * 2^31
-.word 3509906701 // zeta^ 24 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 24 * 375649793 * 2^31
-.word 1446525243 // zeta^ 56 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 56 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 26036764^ 72 * 2^31
-.word 54571669 // zeta^104 * 2^31 = 26036764^104 * 2^31
-.word 40500013 // zeta^ 88 * 2^31 = 26036764^ 88 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 26036764^120 * 2^31
-.word 39999747 // zeta^ 8 * 2^31 = 26036764^ 8 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 26036764^ 72 * 2^31
-.word 48811299 // zeta^ 40 * 2^31 = 26036764^ 40 * 2^31
-.word 54571669 // zeta^104 * 2^31 = 26036764^104 * 2^31
-.word 3454780669 // zeta^ 8 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 8 * 375649793 * 2^31
-.word 3083517997 // zeta^ 72 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 72 * 375649793 * 2^31
-.word 4050555101 // zeta^ 40 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 40 * 375649793 * 2^31
-.word 4085587819 // zeta^104 * (q^(-1) mod 2^32) * 2^31 = 26036764^104 * 375649793 * 2^31
-.word 31719253 // zeta^ 4 * 2^31 = 26036764^ 4 * 2^31
-.word 1316163 // zeta^ 36 * 2^31 = 26036764^ 36 * 2^31
-.word 54335767 // zeta^ 20 * 2^31 = 26036764^ 20 * 2^31
-.word 35733845 // zeta^ 52 * 2^31 = 26036764^ 52 * 2^31
-.word 3672199851 // zeta^ 4 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 4 * 375649793 * 2^31
-.word 3096742077 // zeta^ 36 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 36 * 375649793 * 2^31
-.word 2562837737 // zeta^ 20 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 20 * 375649793 * 2^31
-.word 2000162987 // zeta^ 52 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 52 * 375649793 * 2^31
-.word 54842419 // zeta^ 68 * 2^31 = 26036764^ 68 * 2^31
-.word 10391631 // zeta^100 * 2^31 = 26036764^100 * 2^31
-.word 46002083 // zeta^ 84 * 2^31 = 26036764^ 84 * 2^31
-.word 61099389 // zeta^116 * 2^31 = 26036764^116 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 26036764^ 24 * 2^31
-.word 40500013 // zeta^ 88 * 2^31 = 26036764^ 88 * 2^31
-.word 25917637 // zeta^ 56 * 2^31 = 26036764^ 56 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 26036764^120 * 2^31
-.word 3509906701 // zeta^ 24 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 24 * 375649793 * 2^31
-.word 634504915 // zeta^ 88 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 88 * 375649793 * 2^31
-.word 1446525243 // zeta^ 56 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 56 * 375649793 * 2^31
-.word 1036987221 // zeta^120 * (q^(-1) mod 2^32) * 2^31 = 26036764^120 * 375649793 * 2^31
-.word 26241327 // zeta^ 12 * 2^31 = 26036764^ 12 * 2^31
-.word 8316793 // zeta^ 44 * 2^31 = 26036764^ 44 * 2^31
-.word 63329695 // zeta^ 28 * 2^31 = 26036764^ 28 * 2^31
-.word 7721125 // zeta^ 60 * 2^31 = 26036764^ 60 * 2^31
-.word 2184146129 // zeta^ 12 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 12 * 375649793 * 2^31
-.word 591909511 // zeta^ 44 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 44 * 375649793 * 2^31
-.word 2675302497 // zeta^ 28 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 28 * 375649793 * 2^31
-.word 3946619227 // zeta^ 60 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 60 * 375649793 * 2^31
-.word 5033605 // zeta^ 76 * 2^31 = 26036764^ 76 * 2^31
-.word 16634213 // zeta^108 * 2^31 = 26036764^108 * 2^31
-.word 9983051 // zeta^ 92 * 2^31 = 26036764^ 92 * 2^31
-.word 9383201 // zeta^124 * 2^31 = 26036764^124 * 2^31
-.word 31719253 // zeta^ 4 * 2^31 = 26036764^ 4 * 2^31
-.word 54842419 // zeta^ 68 * 2^31 = 26036764^ 68 * 2^31
-.word 1316163 // zeta^ 36 * 2^31 = 26036764^ 36 * 2^31
-.word 10391631 // zeta^100 * 2^31 = 26036764^100 * 2^31
-.word 3672199851 // zeta^ 4 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 4 * 375649793 * 2^31
-.word 1729702349 // zeta^ 68 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 68 * 375649793 * 2^31
-.word 3096742077 // zeta^ 36 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 36 * 375649793 * 2^31
-.word 136873393 // zeta^100 * (q^(-1) mod 2^32) * 2^31 = 26036764^100 * 375649793 * 2^31
-.word 5075563 // zeta^ 2 * 2^31 = 26036764^ 2 * 2^31
-.word 35131011 // zeta^ 34 * 2^31 = 26036764^ 34 * 2^31
-.word 65968403 // zeta^ 18 * 2^31 = 26036764^ 18 * 2^31
-.word 52363231 // zeta^ 50 * 2^31 = 26036764^ 50 * 2^31
-.word 576633749 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 2 * 375649793 * 2^31
-.word 21827453 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 34 * 375649793 * 2^31
-.word 3768591597 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 18 * 375649793 * 2^31
-.word 365147681 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 50 * 375649793 * 2^31
-.word 43115375 // zeta^ 66 * 2^31 = 26036764^ 66 * 2^31
-.word 44664611 // zeta^ 98 * 2^31 = 26036764^ 98 * 2^31
-.word 53949037 // zeta^ 82 * 2^31 = 26036764^ 82 * 2^31
-.word 39928117 // zeta^114 * 2^31 = 26036764^114 * 2^31
-.word 54335767 // zeta^ 20 * 2^31 = 26036764^ 20 * 2^31
-.word 46002083 // zeta^ 84 * 2^31 = 26036764^ 84 * 2^31
-.word 35733845 // zeta^ 52 * 2^31 = 26036764^ 52 * 2^31
-.word 61099389 // zeta^116 * 2^31 = 26036764^116 * 2^31
-.word 2562837737 // zeta^ 20 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 20 * 375649793 * 2^31
-.word 3404885597 // zeta^ 84 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 84 * 375649793 * 2^31
-.word 2000162987 // zeta^ 52 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 52 * 375649793 * 2^31
-.word 1687065731 // zeta^116 * (q^(-1) mod 2^32) * 2^31 = 26036764^116 * 375649793 * 2^31
-.word 54457727 // zeta^ 10 * 2^31 = 26036764^ 10 * 2^31
-.word 14847715 // zeta^ 42 * 2^31 = 26036764^ 42 * 2^31
-.word 54563587 // zeta^ 26 * 2^31 = 26036764^ 26 * 2^31
-.word 52947923 // zeta^ 58 * 2^31 = 26036764^ 58 * 2^31
-.word 2730229889 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 10 * 375649793 * 2^31
-.word 2248560413 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 42 * 375649793 * 2^31
-.word 3545336573 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 26 * 375649793 * 2^31
-.word 1268929069 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 58 * 375649793 * 2^31
-.word 27596809 // zeta^ 74 * 2^31 = 26036764^ 74 * 2^31
-.word 1129279 // zeta^106 * 2^31 = 26036764^106 * 2^31
-.word 35404977 // zeta^ 90 * 2^31 = 26036764^ 90 * 2^31
-.word 41822583 // zeta^122 * 2^31 = 26036764^122 * 2^31
-.word 26241327 // zeta^ 12 * 2^31 = 26036764^ 12 * 2^31
-.word 5033605 // zeta^ 76 * 2^31 = 26036764^ 76 * 2^31
-.word 8316793 // zeta^ 44 * 2^31 = 26036764^ 44 * 2^31
-.word 16634213 // zeta^108 * 2^31 = 26036764^108 * 2^31
-.word 2184146129 // zeta^ 12 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 12 * 375649793 * 2^31
-.word 3855639419 // zeta^ 76 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 76 * 375649793 * 2^31
-.word 591909511 // zeta^ 44 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 44 * 375649793 * 2^31
-.word 1874600091 // zeta^108 * (q^(-1) mod 2^32) * 2^31 = 26036764^108 * 375649793 * 2^31
-.word 12770159 // zeta^ 6 * 2^31 = 26036764^ 6 * 2^31
-.word 61827033 // zeta^ 38 * 2^31 = 26036764^ 38 * 2^31
-.word 19091691 // zeta^ 22 * 2^31 = 26036764^ 22 * 2^31
-.word 20871313 // zeta^ 54 * 2^31 = 26036764^ 54 * 2^31
-.word 1517517457 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 6 * 375649793 * 2^31
-.word 2677740071 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 38 * 375649793 * 2^31
-.word 2453265685 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 22 * 375649793 * 2^31
-.word 3771937135 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 54 * 375649793 * 2^31
-.word 24980679 // zeta^ 70 * 2^31 = 26036764^ 70 * 2^31
-.word 11221523 // zeta^102 * 2^31 = 26036764^102 * 2^31
-.word 32210035 // zeta^ 86 * 2^31 = 26036764^ 86 * 2^31
-.word 46581651 // zeta^118 * 2^31 = 26036764^118 * 2^31
-.word 63329695 // zeta^ 28 * 2^31 = 26036764^ 28 * 2^31
-.word 9983051 // zeta^ 92 * 2^31 = 26036764^ 92 * 2^31
-.word 7721125 // zeta^ 60 * 2^31 = 26036764^ 60 * 2^31
-.word 9383201 // zeta^124 * 2^31 = 26036764^124 * 2^31
-.word 2675302497 // zeta^ 28 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 28 * 375649793 * 2^31
-.word 2472974773 // zeta^ 92 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 92 * 375649793 * 2^31
-.word 3946619227 // zeta^ 60 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 60 * 375649793 * 2^31
-.word 542121183 // zeta^124 * (q^(-1) mod 2^32) * 2^31 = 26036764^124 * 375649793 * 2^31
-.word 51221435 // zeta^ 14 * 2^31 = 26036764^ 14 * 2^31
-.word 37083207 // zeta^ 46 * 2^31 = 26036764^ 46 * 2^31
-.word 8896309 // zeta^ 30 * 2^31 = 26036764^ 30 * 2^31
-.word 23761465 // zeta^ 62 * 2^31 = 26036764^ 62 * 2^31
-.word 3182148165 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 14 * 375649793 * 2^31
-.word 2189487545 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 46 * 375649793 * 2^31
-.word 238834379 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 30 * 375649793 * 2^31
-.word 604481479 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 62 * 375649793 * 2^31
-.word 18467171 // zeta^ 78 * 2^31 = 26036764^ 78 * 2^31
-.word 52674527 // zeta^110 * 2^31 = 26036764^110 * 2^31
-.word 2061353 // zeta^ 94 * 2^31 = 26036764^ 94 * 2^31
-.word 24512363 // zeta^126 * 2^31 = 26036764^126 * 2^31
-.word 5075563 // zeta^ 2 * 2^31 = 26036764^ 2 * 2^31
-.word 43115375 // zeta^ 66 * 2^31 = 26036764^ 66 * 2^31
-.word 35131011 // zeta^ 34 * 2^31 = 26036764^ 34 * 2^31
-.word 44664611 // zeta^ 98 * 2^31 = 26036764^ 98 * 2^31
-.word 576633749 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 2 * 375649793 * 2^31
-.word 1324642961 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 66 * 375649793 * 2^31
-.word 21827453 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 34 * 375649793 * 2^31
-.word 3505510109 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 98 * 375649793 * 2^31
-.word 13704133 // zeta^ 1 * 2^31 = 26036764^ 1 * 2^31
-.word 26703739 // zeta^ 33 * 2^31 = 26036764^ 33 * 2^31
-.word 5481609 // zeta^ 17 * 2^31 = 26036764^ 17 * 2^31
-.word 54494203 // zeta^ 49 * 2^31 = 26036764^ 49 * 2^31
-.word 1666225723 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 1 * 375649793 * 2^31
-.word 2869384837 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 33 * 375649793 * 2^31
-.word 949335415 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 17 * 375649793 * 2^31
-.word 1474054661 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 49 * 375649793 * 2^31
-.word 41177999 // zeta^ 65 * 2^31 = 26036764^ 65 * 2^31
-.word 65289035 // zeta^ 97 * 2^31 = 26036764^ 97 * 2^31
-.word 12552175 // zeta^ 81 * 2^31 = 26036764^ 81 * 2^31
-.word 32704019 // zeta^113 * 2^31 = 26036764^113 * 2^31
-.word 65968403 // zeta^ 18 * 2^31 = 26036764^ 18 * 2^31
-.word 53949037 // zeta^ 82 * 2^31 = 26036764^ 82 * 2^31
-.word 52363231 // zeta^ 50 * 2^31 = 26036764^ 50 * 2^31
-.word 39928117 // zeta^114 * 2^31 = 26036764^114 * 2^31
-.word 3768591597 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 18 * 375649793 * 2^31
-.word 338497427 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 82 * 375649793 * 2^31
-.word 365147681 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 50 * 375649793 * 2^31
-.word 3279343819 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 26036764^114 * 375649793 * 2^31
-.word 43973433 // zeta^ 9 * 2^31 = 26036764^ 9 * 2^31
-.word 14937153 // zeta^ 41 * 2^31 = 26036764^ 41 * 2^31
-.word 40841465 // zeta^ 25 * 2^31 = 26036764^ 25 * 2^31
-.word 33845545 // zeta^ 57 * 2^31 = 26036764^ 57 * 2^31
-.word 720191175 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 9 * 375649793 * 2^31
-.word 116563391 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 41 * 375649793 * 2^31
-.word 3459680519 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 25 * 375649793 * 2^31
-.word 1885546711 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 57 * 375649793 * 2^31
-.word 14453865 // zeta^ 73 * 2^31 = 26036764^ 73 * 2^31
-.word 39701997 // zeta^105 * 2^31 = 26036764^105 * 2^31
-.word 3577749 // zeta^ 89 * 2^31 = 26036764^ 89 * 2^31
-.word 19555165 // zeta^121 * 2^31 = 26036764^121 * 2^31
-.word 54457727 // zeta^ 10 * 2^31 = 26036764^ 10 * 2^31
-.word 27596809 // zeta^ 74 * 2^31 = 26036764^ 74 * 2^31
-.word 14847715 // zeta^ 42 * 2^31 = 26036764^ 42 * 2^31
-.word 1129279 // zeta^106 * 2^31 = 26036764^106 * 2^31
-.word 2730229889 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 10 * 375649793 * 2^31
-.word 1204240887 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 74 * 375649793 * 2^31
-.word 2248560413 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 42 * 375649793 * 2^31
-.word 497236673 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 26036764^106 * 375649793 * 2^31
-.word 12974361 // zeta^ 5 * 2^31 = 26036764^ 5 * 2^31
-.word 56379967 // zeta^ 37 * 2^31 = 26036764^ 37 * 2^31
-.word 52771617 // zeta^ 21 * 2^31 = 26036764^ 21 * 2^31
-.word 51483005 // zeta^ 53 * 2^31 = 26036764^ 53 * 2^31
-.word 1194393831 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 5 * 375649793 * 2^31
-.word 753806273 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 37 * 375649793 * 2^31
-.word 2185629407 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 21 * 375649793 * 2^31
-.word 432623747 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 53 * 375649793 * 2^31
-.word 41807515 // zeta^ 69 * 2^31 = 26036764^ 69 * 2^31
-.word 13380915 // zeta^101 * 2^31 = 26036764^101 * 2^31
-.word 23396495 // zeta^ 85 * 2^31 = 26036764^ 85 * 2^31
-.word 11487943 // zeta^117 * 2^31 = 26036764^117 * 2^31
-.word 54563587 // zeta^ 26 * 2^31 = 26036764^ 26 * 2^31
-.word 35404977 // zeta^ 90 * 2^31 = 26036764^ 90 * 2^31
-.word 52947923 // zeta^ 58 * 2^31 = 26036764^ 58 * 2^31
-.word 41822583 // zeta^122 * 2^31 = 26036764^122 * 2^31
-.word 3545336573 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 26 * 375649793 * 2^31
-.word 756985167 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 90 * 375649793 * 2^31
-.word 1268929069 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 58 * 375649793 * 2^31
-.word 2124709001 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 26036764^122 * 375649793 * 2^31
-.word 35192755 // zeta^ 13 * 2^31 = 26036764^ 13 * 2^31
-.word 6787663 // zeta^ 45 * 2^31 = 26036764^ 45 * 2^31
-.word 55869129 // zeta^ 29 * 2^31 = 26036764^ 29 * 2^31
-.word 43560065 // zeta^ 61 * 2^31 = 26036764^ 61 * 2^31
-.word 3019374157 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 13 * 375649793 * 2^31
-.word 443777969 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 45 * 375649793 * 2^31
-.word 2098944823 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 29 * 375649793 * 2^31
-.word 2076204415 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 61 * 375649793 * 2^31
-.word 36544119 // zeta^ 77 * 2^31 = 26036764^ 77 * 2^31
-.word 63484749 // zeta^109 * 2^31 = 26036764^109 * 2^31
-.word 16038683 // zeta^ 93 * 2^31 = 26036764^ 93 * 2^31
-.word 25949329 // zeta^125 * 2^31 = 26036764^125 * 2^31
-.word 12770159 // zeta^ 6 * 2^31 = 26036764^ 6 * 2^31
-.word 24980679 // zeta^ 70 * 2^31 = 26036764^ 70 * 2^31
-.word 61827033 // zeta^ 38 * 2^31 = 26036764^ 38 * 2^31
-.word 11221523 // zeta^102 * 2^31 = 26036764^102 * 2^31
-.word 1517517457 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 6 * 375649793 * 2^31
-.word 1250335033 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 70 * 375649793 * 2^31
-.word 2677740071 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 38 * 375649793 * 2^31
-.word 1580041197 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 26036764^102 * 375649793 * 2^31
-.word 5623923 // zeta^ 3 * 2^31 = 26036764^ 3 * 2^31
-.word 18571677 // zeta^ 35 * 2^31 = 26036764^ 35 * 2^31
-.word 26799603 // zeta^ 19 * 2^31 = 26036764^ 19 * 2^31
-.word 39332725 // zeta^ 51 * 2^31 = 26036764^ 51 * 2^31
-.word 182627725 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 3 * 375649793 * 2^31
-.word 1902166115 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 35 * 375649793 * 2^31
-.word 583438349 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 19 * 375649793 * 2^31
-.word 1738958475 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 51 * 375649793 * 2^31
-.word 38701067 // zeta^ 67 * 2^31 = 26036764^ 67 * 2^31
-.word 14491707 // zeta^ 99 * 2^31 = 26036764^ 99 * 2^31
-.word 33463463 // zeta^ 83 * 2^31 = 26036764^ 83 * 2^31
-.word 61125067 // zeta^115 * 2^31 = 26036764^115 * 2^31
-.word 19091691 // zeta^ 22 * 2^31 = 26036764^ 22 * 2^31
-.word 32210035 // zeta^ 86 * 2^31 = 26036764^ 86 * 2^31
-.word 20871313 // zeta^ 54 * 2^31 = 26036764^ 54 * 2^31
-.word 46581651 // zeta^118 * 2^31 = 26036764^118 * 2^31
-.word 2453265685 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 22 * 375649793 * 2^31
-.word 2986672525 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 86 * 375649793 * 2^31
-.word 3771937135 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 54 * 375649793 * 2^31
-.word 697890413 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 26036764^118 * 375649793 * 2^31
-.word 65375453 // zeta^ 11 * 2^31 = 26036764^ 11 * 2^31
-.word 59835311 // zeta^ 43 * 2^31 = 26036764^ 43 * 2^31
-.word 12921459 // zeta^ 27 * 2^31 = 26036764^ 27 * 2^31
-.word 61505033 // zeta^ 59 * 2^31 = 26036764^ 59 * 2^31
-.word 4014413091 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 11 * 375649793 * 2^31
-.word 741855825 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 43 * 375649793 * 2^31
-.word 1006064525 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 27 * 375649793 * 2^31
-.word 2747128823 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 59 * 375649793 * 2^31
-.word 40797001 // zeta^ 75 * 2^31 = 26036764^ 75 * 2^31
-.word 32875577 // zeta^107 * 2^31 = 26036764^107 * 2^31
-.word 63769677 // zeta^ 91 * 2^31 = 26036764^ 91 * 2^31
-.word 65692461 // zeta^123 * 2^31 = 26036764^123 * 2^31
-.word 51221435 // zeta^ 14 * 2^31 = 26036764^ 14 * 2^31
-.word 18467171 // zeta^ 78 * 2^31 = 26036764^ 78 * 2^31
-.word 37083207 // zeta^ 46 * 2^31 = 26036764^ 46 * 2^31
-.word 52674527 // zeta^110 * 2^31 = 26036764^110 * 2^31
-.word 3182148165 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 14 * 375649793 * 2^31
-.word 3558347933 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 78 * 375649793 * 2^31
-.word 2189487545 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 46 * 375649793 * 2^31
-.word 1161754145 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 26036764^110 * 375649793 * 2^31
-.word 23458751 // zeta^ 7 * 2^31 = 26036764^ 7 * 2^31
-.word 33711991 // zeta^ 39 * 2^31 = 26036764^ 39 * 2^31
-.word 41803191 // zeta^ 23 * 2^31 = 26036764^ 23 * 2^31
-.word 9664027 // zeta^ 55 * 2^31 = 26036764^ 55 * 2^31
-.word 1501790785 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 7 * 375649793 * 2^31
-.word 1905016457 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 39 * 375649793 * 2^31
-.word 2460960841 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 23 * 375649793 * 2^31
-.word 1300076517 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 55 * 375649793 * 2^31
-.word 9406759 // zeta^ 71 * 2^31 = 26036764^ 71 * 2^31
-.word 32167773 // zeta^103 * 2^31 = 26036764^103 * 2^31
-.word 19377381 // zeta^ 87 * 2^31 = 26036764^ 87 * 2^31
-.word 55794235 // zeta^119 * 2^31 = 26036764^119 * 2^31
-.word 8896309 // zeta^ 30 * 2^31 = 26036764^ 30 * 2^31
-.word 2061353 // zeta^ 94 * 2^31 = 26036764^ 94 * 2^31
-.word 23761465 // zeta^ 62 * 2^31 = 26036764^ 62 * 2^31
-.word 24512363 // zeta^126 * 2^31 = 26036764^126 * 2^31
-.word 238834379 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 30 * 375649793 * 2^31
-.word 1415980503 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 94 * 375649793 * 2^31
-.word 604481479 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 62 * 375649793 * 2^31
-.word 2198349461 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 26036764^126 * 375649793 * 2^31
-.word 43355169 // zeta^ 15 * 2^31 = 26036764^ 15 * 2^31
-.word 40694335 // zeta^ 47 * 2^31 = 26036764^ 47 * 2^31
-.word 27553395 // zeta^ 31 * 2^31 = 26036764^ 31 * 2^31
-.word 689375 // zeta^ 63 * 2^31 = 26036764^ 63 * 2^31
-.word 1107279327 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 15 * 375649793 * 2^31
-.word 879592385 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 47 * 375649793 * 2^31
-.word 1673531277 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 31 * 375649793 * 2^31
-.word 1477062945 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 26036764^ 63 * 375649793 * 2^31
-.word 5591977 // zeta^ 79 * 2^31 = 26036764^ 79 * 2^31
-.word 25071607 // zeta^111 * 2^31 = 26036764^111 * 2^31
-.word 7648471 // zeta^ 95 * 2^31 = 26036764^ 95 * 2^31
-.word 46555773 // zeta^127 * 2^31 = 26036764^127 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_256_u32_33556993_26036764, %function
-.global ntt_256_u32_33556993_26036764
-ntt_256_u32_33556993_26036764:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d0-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Using modulus 33556993
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #-240]
-vqrdmulh.s32 Q1, Q0, r10
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #-496]
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #256]
-vqrdmlah.s32 Q1, Q0, r12
-vqrdmulh.s32 Q4, Q2, r10
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r12
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #0]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #-480]
-vmul.u32 Q4, Q4, r9
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r10
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r12
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #16]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #-208]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #-464]
-vmul.u32 Q1, Q1, r9
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #288]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #32]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #-192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vmul.u32 Q0, Q0, r9
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #-176]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #-432]
-vmul.u32 Q2, Q2, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #64]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #-160]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #-416]
-vmul.u32 Q1, Q1, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #80]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #-144]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #-400]
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #352]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #96]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #-128]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #-384]
-vmul.u32 Q2, Q2, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #368]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #112]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #-112]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #-368]
-vmul.u32 Q1, Q1, r9
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #384]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #-96]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3,
[r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 
-vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 
Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] 
-vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 
-vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 
-vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 
-vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, 
Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd 
r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already 
loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 
Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, 
Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #144] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #224] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #208] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 
-vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #288] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #272] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #368] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #352] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #432] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: 
Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #-448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, 
r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #-480] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #-384] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #-432] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #-320] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 
-ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #-352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #-368] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #-288] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, 
[r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #-240] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #-144] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// 
Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #-96] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #-32] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release 
input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vldrw.s32 Q5, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q5 -vldrw.s32 Q6, [r11, #-64] -vmul.u32 Q3, Q3, Q6 -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q5, Q2, Q5 -vsub.s32 Q7, Q1, Q4 -vmul.u32 Q2, Q2, Q6 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q5 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q5 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q5, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q6, Q5, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q7, Q5 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q7, Q7, Q6 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q5, Q7, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
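An aside on the constants set up in the block above: the comment gives the modular inverse of 33556993 mod 2^32 as 375649793, while `modulus_inv` is assigned 3919317503 — numerically the negation of that inverse mod 2^32, presumably the sign convention expected by the `vqrdmlah`-based reduction. A quick plain-Python sanity check (an editorial sketch, not part of the build):

```python
# Sanity-check the Montgomery-style constants from the assembly above.
Q = 33556993            # .equ modulus
QINV = 375649793        # per the comment: Q^-1 mod 2^32
MOD_INV = 3919317503    # value actually assigned to modulus_inv

assert (Q * QINV) % 2**32 == 1        # QINV really is Q^-1 mod 2^32
assert MOD_INV == (-QINV) % 2**32     # modulus_inv is the negated inverse
```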
-vsub.s32 Q7, Q4, Q5 -vstrw.s32 Q7, [r0, #-80] -vadd.s32 Q4, Q4, Q5 -// Butterfly [0, 1, 2, 3] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [16, 17, 18, 19] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
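The `vld40`–`vld43` group used throughout these butterflies is MVE's 4-way de-interleaving structure load: executed as a set, the four instructions transpose 16 contiguous words into four vectors so that lane k of Qj receives word 4k + j. A minimal Python model of the net effect (a sketch of the combined semantics only, not of the per-beat behavior):

```python
def vld4x_s32(mem, base=0):
    """Net effect of vld40..vld43 {Q0-Q3}: lane k of Qj = mem[base + 4*k + j]."""
    return [[mem[base + 4 * k + j] for k in range(4)] for j in range(4)]

q = vld4x_s32(list(range(16)))
# q[0] gathers words 0, 4, 8, 12; q[1] gathers 1, 5, 9, 13; and so on.
```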
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [32, 33, 34, 35] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [48, 49, 50, 51] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [64, 65, 66, 67] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [80, 81, 82, 83] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [96, 97, 98, 99] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [112, 113, 114, 115] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [128, 129, 130, 131] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [144, 145, 146, 147] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [160, 161, 162, 163] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [176, 177, 178, 179] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [192, 193, 194, 195] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [208, 209, 210, 211] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
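Each repeated `vqrdmulh`/`vmul`/`vqrdmlah` group above performs a twiddle multiplication feeding the add/sub pair of a Cooley–Tukey butterfly. As a plain-integer reference for that arithmetic (a hedged sketch; the assembly computes the same map in a fixed-point rounding form, which this does not model):

```python
Q = 33556993  # modulus used throughout these files

def ct_butterfly(a, b, zeta):
    """Cooley-Tukey butterfly: (a, b) -> (a + zeta*b, a - zeta*b) mod Q."""
    t = (zeta * b) % Q
    return (a + t) % Q, (a - t) % Q

# 17702291 appears as zeta^64 in the companion twiddle tables.
hi, lo = ct_butterfly(5, 7, 17702291)
assert (hi + lo) % Q == 10                       # sum recovers 2*a
assert (hi - lo) % Q == (2 * 7 * 17702291) % Q   # difference recovers 2*zeta*b
```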
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-// Butterfly [224, 225, 226, 227]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vqrdmulh.s32 Q6, Q4, Q6
-vmul.u32 Q4, Q4, Q7
-vqrdmlah.s32 Q6, Q4, r12
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-16]
-vadd.s32 Q5, Q5, Q6
-vstrw.s32 Q5, [r0, #-32]
-// Butterfly [240, 241, 242, 243]
-// Restore MVE vector registers
-vpop {d0-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 2875
-// Instruction count: 2421
\ No newline at end of file
diff --git a/tests/ntt_256/auto/ntt_256_u32_33556993_26036764_incomplete.s b/tests/ntt_256/auto/ntt_256_u32_33556993_26036764_incomplete.s
deleted file mode 100644
index 9cc8dca..0000000
--- a/tests/ntt_256/auto/ntt_256_u32_33556993_26036764_incomplete.s
+++ /dev/null
@@ -1,2027 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial
portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 
14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31 -.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31 -.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31 -.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 26036764^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 29095681 // zeta^ 64 * 2^31 = 26036764^ 64 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 64 * 375649793 * 2^31 -.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31 -.word 
933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^ 32 * 2^31 = 26036764^ 32 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 32 * 375649793 * 2^31 -.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31 -.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 26036764^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 96 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31 -.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31 -.word 18598075 // zeta^ 16 * 2^31 = 26036764^ 16 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 16 * 375649793 * 2^31 -.word 39999747 // zeta^ 8 * 2^31 = 26036764^ 8 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 8 * 375649793 * 2^31 -.word 45317587 // zeta^ 72 * 2^31 = 26036764^ 72 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 72 * 375649793 * 2^31 -.word 4885007 // zeta^ 80 * 2^31 = 26036764^ 80 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 80 * 375649793 * 2^31 -.word 48811299 // zeta^ 40 * 2^31 = 26036764^ 40 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 40 * 375649793 * 2^31 -.word 54571669 // zeta^104 * 2^31 = 26036764^104 * 2^31 = 30285189 * 2^31 -.word 
4085587819 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 26036764^104 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 26036764^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 48 * 375649793 * 2^31 -.word 59281651 // zeta^ 24 * 2^31 = 26036764^ 24 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 24 * 375649793 * 2^31 -.word 40500013 // zeta^ 88 * 2^31 = 26036764^ 88 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 88 * 375649793 * 2^31 -.word 34427601 // zeta^112 * 2^31 = 26036764^112 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 26036764^112 * 375649793 * 2^31 -.word 25917637 // zeta^ 56 * 2^31 = 26036764^ 56 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 56 * 375649793 * 2^31 -.word 8356523 // zeta^120 * 2^31 = 26036764^120 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 26036764^120 * 375649793 * 2^31 -.word 39999747 // zeta^ 8 * 2^31 = 26036764^ 8 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 8 * 375649793 * 2^31 -.word 31719253 // zeta^ 4 * 2^31 = 26036764^ 4 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 4 * 375649793 * 2^31 -.word 54842419 // zeta^ 68 * 2^31 = 26036764^ 68 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 68 * 375649793 * 2^31 -.word 45317587 // zeta^ 72 * 2^31 = 26036764^ 72 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 72 * 375649793 * 2^31 -.word 1316163 // zeta^ 36 * 2^31 = 26036764^ 36 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 36 * 375649793 * 2^31 -.word 10391631 // zeta^100 * 2^31 = 26036764^100 * 2^31 = 2138810 * 2^31 -.word 136873393 
// zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 26036764^100 * 375649793 * 2^31 -.word 48811299 // zeta^ 40 * 2^31 = 26036764^ 40 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 40 * 375649793 * 2^31 -.word 54335767 // zeta^ 20 * 2^31 = 26036764^ 20 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 20 * 375649793 * 2^31 -.word 46002083 // zeta^ 84 * 2^31 = 26036764^ 84 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 84 * 375649793 * 2^31 -.word 54571669 // zeta^104 * 2^31 = 26036764^104 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 26036764^104 * 375649793 * 2^31 -.word 35733845 // zeta^ 52 * 2^31 = 26036764^ 52 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 52 * 375649793 * 2^31 -.word 61099389 // zeta^116 * 2^31 = 26036764^116 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 26036764^116 * 375649793 * 2^31 -.word 59281651 // zeta^ 24 * 2^31 = 26036764^ 24 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 24 * 375649793 * 2^31 -.word 26241327 // zeta^ 12 * 2^31 = 26036764^ 12 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 12 * 375649793 * 2^31 -.word 5033605 // zeta^ 76 * 2^31 = 26036764^ 76 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 76 * 375649793 * 2^31 -.word 40500013 // zeta^ 88 * 2^31 = 26036764^ 88 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 88 * 375649793 * 2^31 -.word 8316793 // zeta^ 44 * 2^31 = 26036764^ 44 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 44 * 375649793 * 2^31 -.word 16634213 // zeta^108 * 2^31 = 26036764^108 * 2^31 = 29292862 * 2^31 -.word 1874600091 // 
zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 26036764^108 * 375649793 * 2^31 -.word 25917637 // zeta^ 56 * 2^31 = 26036764^ 56 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 56 * 375649793 * 2^31 -.word 63329695 // zeta^ 28 * 2^31 = 26036764^ 28 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 28 * 375649793 * 2^31 -.word 9983051 // zeta^ 92 * 2^31 = 26036764^ 92 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 92 * 375649793 * 2^31 -.word 8356523 // zeta^120 * 2^31 = 26036764^120 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 26036764^120 * 375649793 * 2^31 -.word 7721125 // zeta^ 60 * 2^31 = 26036764^ 60 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 26036764^ 60 * 375649793 * 2^31 -.word 9383201 // zeta^124 * 2^31 = 26036764^124 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 26036764^124 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_256_u32_33556993_26036764_incomplete, %function -.global ntt_256_u32_33556993_26036764_incomplete -ntt_256_u32_33556993_26036764_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 
-vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 
-vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 
Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 
-vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 
-vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as 
Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// 
input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release 
input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 
-vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 
-vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, 
Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, 
Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, 
Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 
-vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, 
Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, 
[r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, 
[r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #144] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #224] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #208] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 
-vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #288] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #272] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #368] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #352] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #432] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 
-vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #-448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] 
-// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #-480] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #-384] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #-432] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #-320] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from 
Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #-352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #-368] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #-288] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, 
Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #-240] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #-144] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vqrdmulh.s32 Q6, 
Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #-96] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #-32] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, 
r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1995 -// Instruction count: 1557 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good.s b/tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good.s deleted file mode 100644 index 6361fbd..0000000 --- a/tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good.s +++ /dev/null @@ -1,1484 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_33556993_27792935_incomplete_good_twiddles -ntt_192_u32_33556993_27792935_incomplete_good_twiddles: // For base multiplication -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 66220859 // zeta^160 * 2^31 = 27792935^160 * 2^31 = 25038562 * 2^31 -.word 1602345669 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 27792935^160 * 375649793 * 2^31 -.word 54773291 // zeta^ 80 * 2^31 = 27792935^ 80 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 80 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 11561947 // zeta^136 * 2^31 = 27792935^136 * 2^31 = 29356361 * 2^31 -.word 4026147365 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 27792935^136 * 375649793 * 2^31 -.word 59595857 // zeta^104 * 2^31 = 27792935^104 * 2^31 = 32616688 * 2^31 -.word 2087308719 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 27792935^104 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 9032575 // zeta^184 * 2^31 = 27792935^184 * 2^31 = 23624597 * 2^31 -.word 3659342465 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 27792935^184 * 375649793 * 2^31 -.word 52902781 // zeta^ 68 * 2^31 = 27792935^ 68 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 
27792935^ 68 * 375649793 * 2^31 -.word 48515911 // zeta^ 36 * 2^31 = 27792935^ 36 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 36 * 375649793 * 2^31 -.word 44552409 // zeta^148 * 2^31 = 27792935^148 * 2^31 = 21166324 * 2^31 -.word 1354541351 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 27792935^148 * 375649793 * 2^31 -.word 15880423 // zeta^116 * 2^31 = 27792935^116 * 2^31 = 518908 * 2^31 -.word 33207577 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 27792935^116 * 375649793 * 2^31 -.word 64683161 // zeta^ 12 * 2^31 = 27792935^ 12 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 12 * 375649793 * 2^31 -.word 2707023 // zeta^172 * 2^31 = 27792935^172 * 2^31 = 15739856 * 2^31 -.word 1007273905 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 27792935^172 * 375649793 * 2^31 -.word 48191309 // zeta^ 92 * 2^31 = 27792935^ 92 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 92 * 375649793 * 2^31 -.word 32686385 // zeta^ 60 * 2^31 = 27792935^ 60 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 60 * 375649793 * 2^31 -.word 893127 // zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 729223 // zeta^ 32 * 2^31 = 27792935^ 32 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 32 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 50311979 // zeta^112 * 2^31 = 27792935^112 * 2^31 = 19715532 * 2^31 -.word 1261697749 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 27792935^112 * 375649793 * 2^31 -.word 7518129 // zeta^ 8 * 2^31 = 27792935^ 8 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 8 * 
375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 58081411 // zeta^ 88 * 2^31 = 27792935^ 88 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 88 * 375649793 * 2^31 -.word 728237 // zeta^ 56 * 2^31 = 27792935^ 56 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 56 * 375649793 * 2^31 -.word 18598075 // zeta^132 * 2^31 = 27792935^132 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 27792935^132 * 375649793 * 2^31 -.word 37943863 // zeta^100 * 2^31 = 27792935^100 * 2^31 = 9445248 * 2^31 -.word 604449737 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 27792935^100 * 375649793 * 2^31 -.word 51233563 // zeta^ 20 * 2^31 = 27792935^ 20 * 2^31 = 33038085 * 2^31 -.word 4261759717 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 20 * 375649793 * 2^31 -.word 62228979 // zeta^180 * 2^31 = 27792935^180 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 27792935^180 * 375649793 * 2^31 -.word 64406963 // zeta^ 76 * 2^31 = 27792935^ 76 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 76 * 375649793 * 2^31 -.word 28419145 // zeta^ 44 * 2^31 = 27792935^ 44 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 44 * 375649793 * 2^31 -.word 34427601 // zeta^156 * 2^31 = 27792935^156 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 27792935^156 * 375649793 * 2^31 -.word 49061917 // zeta^124 * 2^31 = 27792935^124 * 2^31 = 13108720 * 2^31 -.word 838894051 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 27792935^124 * 375649793 * 2^31 -.word 66384763 // zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 3749829253 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 
375649793 * 2^31 -.word 33720897 // zeta^ 96 * 2^31 = 27792935^ 96 * 2^31 = 33556992 * 2^31 -.word 2147483583 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 96 * 375649793 * 2^31 -.word 16802007 // zeta^ 16 * 2^31 = 27792935^ 16 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 16 * 375649793 * 2^31 -.word 12340695 // zeta^176 * 2^31 = 27792935^176 * 2^31 = 31543752 * 2^31 -.word 2018646057 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 27792935^176 * 375649793 * 2^31 -.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 55552039 // zeta^ 40 * 2^31 = 27792935^ 40 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 40 * 375649793 * 2^31 -.word 66385749 // zeta^152 * 2^31 = 27792935^152 * 2^31 = 9045021 * 2^31 -.word 2726320811 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 27792935^152 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 29170123 // zeta^ 4 * 2^31 = 27792935^ 4 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 4 * 375649793 * 2^31 -.word 14211205 // zeta^164 * 2^31 = 27792935^164 * 2^31 = 30845592 * 2^31 -.word 1973967227 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 27792935^164 * 375649793 * 2^31 -.word 4885007 // zeta^ 84 * 2^31 = 27792935^ 84 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 84 * 375649793 * 2^31 -.word 22561577 // zeta^ 52 * 2^31 = 27792935^ 52 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 52 * 375649793 * 2^31 -.word 38694841 // zeta^140 * 2^31 = 27792935^140 * 2^31 = 994165 * 2^31 -.word 2211105351 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 27792935^140 * 
375649793 * 2^31 -.word 2430825 // zeta^108 * 2^31 = 27792935^108 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 27792935^108 * 375649793 * 2^31 -.word 18052069 // zeta^ 28 * 2^31 = 27792935^ 28 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 28 * 375649793 * 2^31 -.word 18922677 // zeta^188 * 2^31 = 27792935^188 * 2^31 = 403828 * 2^31 -.word 25843019 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 27792935^188 * 375649793 * 2^31 -// End of twiddles for base multiplication - -.data -roots: -.word 893127 /// zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 66384763 /// zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 
= 27792935^144 * 375649793 * 2^31 -.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 18598075 // zeta^132 * 2^31 = 27792935^132 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 27792935^132 * 375649793 * 2^31 -.word 4885007 // zeta^ 84 * 2^31 = 27792935^ 84 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 84 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 64683161 // zeta^ 12 * 2^31 = 27792935^ 12 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 12 * 375649793 * 2^31 -.word 34427601 // zeta^156 * 2^31 = 27792935^156 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 27792935^156 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_33556993_27792935_incomplete_good, %function -.global ntt_192_u32_33556993_27792935_incomplete_good -ntt_192_u32_33556993_27792935_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -// 
input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(0)] -vsub.s32 Q3, Q0, Q2 -// Release input[0] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-480)] -vsub.s32 Q3, Q0, Q2 -// Release input[132] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(288)] -vsub.s32 Q3, Q0, Q2 -// Release input[72] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(48)] -vsub.s32 Q3, Q0, Q2 -// 
Release input[12] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-432)] -vsub.s32 Q3, Q0, Q2 -// Release input[144] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -// input[84]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 84)] -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(336)] -vsub.s32 Q3, Q0, Q2 -// Release input[84] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -// input[24]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 24)] -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(96)] -vsub.s32 Q3, Q0, Q2 -// Release input[24] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 
-vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-384)] -vsub.s32 Q3, Q0, Q2 -// Release input[156] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(384)] -vsub.s32 Q3, Q0, Q2 -// Release input[96] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -// input[36]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 36)] -// input[100]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 100)] -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(144)] -vsub.s32 Q3, Q0, Q2 -// Release input[36] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 
Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -// input[168]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -84)] -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-336)] -vsub.s32 Q3, Q0, Q2 -// Release input[168] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(160)] -// Release input[40] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -// input[172]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -80)] -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(432)] -vsub.s32 Q3, Q0, Q2 -// Release input[108] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(192)] -vsub.s32 Q3, Q0, Q2 -// Release input[48] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -// input[180]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -72)] 
-// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-288)] -vsub.s32 Q3, Q0, Q2 -// Release input[180] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(480)] -vsub.s32 Q3, Q0, Q2 -// Release input[120] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(240)] -vsub.s32 Q3, Q0, Q2 -// Release input[60] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q1, Q0, r10 -// input[96]: Load as Q2 
-vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vmul.u32 Q4, Q4, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r12 
-vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[112]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, 
r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, 
Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[176]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] 
-vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, 
Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vmul.u32 Q0, Q0, r9 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, 
Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r9 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[140]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q1, Q1, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] 
-vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-384)] -// Release input[156] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 92)] 
-vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[92]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(368)] -// Release input[92] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] 
-vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(432)] -// Release input[108] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[44]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q1, Q1, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 
Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q0, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 
Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q1, Q1, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1451 -// Instruction count: 1134 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s deleted file mode 100644 index fa8df86..0000000 --- a/tests/ntt_384/auto/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s +++ /dev/null @@ -1,1383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and 
associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 66384763 /// zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 893127 /// zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 32686385 
// zeta^ 60 * 2^31 = 27792935^ 60 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 60 * 375649793 * 2^31 -.word 2430825 // zeta^108 * 2^31 = 27792935^108 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 27792935^108 * 375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 62228979 // zeta^180 * 2^31 = 27792935^180 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 27792935^180 * 375649793 * 2^31 -.word 48515911 // zeta^ 36 * 2^31 = 27792935^ 36 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 36 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_33556993_27792935_incomplete_good_bitrev, %function -.global ntt_192_u32_33556993_27792935_incomplete_good_bitrev -ntt_192_u32_33556993_27792935_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(0)] -vsub.s32 Q3, Q0, Q2 -// Release input[0] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -// input[96]: Load as Q0 
-vldrw.u32 Q0, [r0, #(4 * 96)] -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(384)] -vsub.s32 Q3, Q0, Q2 -// Release input[96] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-432)] -vsub.s32 Q3, Q0, Q2 -// Release input[144] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(192)] -vsub.s32 Q3, Q0, Q2 -// Release input[48] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, 
#(4 * 8)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(288)] -vsub.s32 Q3, Q0, Q2 -// Release input[72] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -// input[168]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -84)] -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-336)] -vsub.s32 Q3, Q0, Q2 -// Release input[168] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(160)] -// Release input[40] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -// input[24]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 24)] -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(96)] -vsub.s32 Q3, Q0, Q2 -// Release input[24] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(480)] -vsub.s32 Q3, Q0, Q2 -// Release input[120] from Q0 
-vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-480)] -vsub.s32 Q3, Q0, Q2 -// Release input[132] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -// input[36]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 36)] -// input[100]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 100)] -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(144)] -vsub.s32 Q3, Q0, Q2 -// Release input[36] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -// input[84]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 84)] -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(336)] -vsub.s32 Q3, Q0, Q2 -// Release input[84] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, 
[r14,#(-416)] -// Release input[148] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -// input[180]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -72)] -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-288)] -vsub.s32 Q3, Q0, Q2 -// Release input[180] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(48)] -vsub.s32 Q3, Q0, Q2 -// Release input[12] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -// input[172]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -80)] -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(432)] -vsub.s32 Q3, Q0, Q2 -// Release input[108] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 
-vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-384)] -vsub.s32 Q3, Q0, Q2 -// Release input[156] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vadd.s32 Q3, Q0, Q1 -vadd.s32 Q3, Q3, Q2 -vstrw.u32 Q3, [r0,#(240)] -vsub.s32 Q3, Q0, Q2 -// Release input[60] from Q0 -vsub.s32 Q0, Q1, Q2 -vqrdmulh.s32 Q4, Q0, r10 -vmul.u32 Q5, Q0, r9 -vqrdmlah.s32 Q4, Q5, r12 -vadd.s32 Q1, Q3, Q4 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmulh.s32 Q1, Q0, r8 -vmul.u32 Q0, Q0, r7 -vqrdmlah.s32 Q1, Q0, r12 -vadd.s32 Q2, Q3, Q1 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q1, Q0, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q0, Q0, r9 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 
Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[108]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vmul.u32 Q4, Q4, r9 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q1, Q1, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] 
-// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q0, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmul.u32 Q2, Q2, r9 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[172]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(16)] 
-// Release input[4] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q1, Q1, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, 
[r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmul.u32 Q2, Q2, r9 -// input[184]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -68)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-272)] -// Release input[184] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[140]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vmul.u32 Q1, Q1, r9 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[44]: 
Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q2, Q2, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[116]: Load as Q3 
-vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q1, Q1, r9 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(224)] -// Release input[56] from Q4 -vadd.s32 Q2, Q2, Q6 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q0, Q0, r9 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[112]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[16]: Load as Q2 
-vldrw.u32 Q2, [r0, #(4 * 16)] -vmul.u32 Q4, Q4, r9 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r0,#(64)] -// Release input[16] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[176]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q1, Q1, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 
-vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q0, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(480)] -// Release input[120] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[136]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q1, Q1, r9 -// input[104]: Load as 
Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-464)] -// Release input[136] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[180]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-288)] -// Release input[180] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[116]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q1, Q1, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(464)] -// Release input[116] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q0, Q0, r9 -// input[108]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q1, Q1, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, 
[r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1351 -// Instruction count: 1035 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good.s deleted file mode 100644 index ff18dcc..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_384_u32_106117153_1392340_incomplete_good_twiddles
-ntt_384_u32_106117153_1392340_incomplete_good_twiddles: // For base multiplication
-.word 37890045 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 3768695715 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 141040949 // zeta^ 64 * 2^31 = 1392340^ 64 * 2^31 = 51456573 * 2^31
-.word 1866321259 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 64 * 2586463201 * 2^31
-.word 72909029 // zeta^ 32 * 2^31 = 1392340^ 32 * 2^31 = 71252774 * 2^31
-.word 1491181499 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 32 * 2586463201 * 2^31
-.word 136914403 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31
-.word 8270461 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31
-.word 195316801 // zeta^ 16 * 2^31 = 1392340^ 16 * 2^31 = 68534739 * 2^31
-.word 93843423 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 16 * 2586463201 * 2^31
-.word 164989409 // zeta^ 80 * 2^31 = 1392340^ 80 * 2^31 = 254604 * 2^31
-.word 3540182591 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 80 * 2586463201 * 2^31
-.word 68592855 // zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31
-.word 1686643209 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31
-.word 175196685 // zeta^112 * 2^31 = 1392340^112 * 2^31 = 89497534 * 2^31
-.word 2845449107 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 1392340^112 * 2586463201 * 2^31
-.word 151765235 // zeta^ 8 * 2^31 = 1392340^ 8 * 2^31 = 62524596 * 2^31
-.word 3309713773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 8 * 2586463201 * 2^31
-.word 126514603 // zeta^ 72 * 2^31 = 1392340^ 72 * 2^31 = 101908685 * 2^31
-.word 2730982325 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 72 * 2586463201 * 2^31
-.word 76763575 // zeta^ 40 * 2^31 = 1392340^ 40 * 2^31 = 54230858 * 2^31
-.word 1401852201 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 40 * 2586463201 * 2^31
-.word 203981715 // zeta^104 * 2^31 = 1392340^104 * 2^31 = 88837097 * 2^31
-.word 1840925389 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 1392340^104 * 2586463201 * 2^31
-.word 16429529 // zeta^ 24 * 2^31 = 1392340^ 24 * 2^31 = 87659826 * 2^31
-.word 4193590599 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 24 * 2586463201 * 2^31
-.word 127555815 // zeta^ 88 * 2^31 = 1392340^ 88 * 2^31 = 59766995 * 2^31
-.word 2372284409 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 88 * 2586463201 * 2^31
-.word 209645089 // zeta^ 56 * 2^31 = 1392340^ 56 * 2^31 = 20339234 * 2^31
-.word 3913484799 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 56 * 2586463201 * 2^31
-.word 186595525 // zeta^120 * 2^31 = 1392340^120 * 2^31 = 66124790 * 2^31
-.word 1100702683 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1392340^120 * 2586463201 * 2^31
-.word 992809 // zeta^ 4 * 2^31 = 1392340^ 4 * 2^31 = 1392340 * 2^31
-.word 2512876279 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 4 * 2586463201 * 2^31
-.word 53454909 // zeta^ 68 * 2^31 = 1392340^ 68 * 2^31 = 49002870 * 2^31
-.word 4040246115 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 68 * 2586463201 * 2^31
-.word 48183541 // zeta^ 36 * 2^31 = 1392340^ 36 * 2^31 = 9948684 * 2^31
-.word 1508714923 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 36 * 2586463201 * 2^31
-.word 105529301 // zeta^100 * 2^31 = 1392340^100 * 2^31 = 14737829 * 2^31
-.word 488144587 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 1392340^100 * 2586463201 * 2^31
-.word 11656863 // zeta^ 20 * 2^31 = 1392340^ 20 * 2^31 = 37124223 * 2^31
-.word 459063617 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 20 * 2586463201 * 2^31
-.word 108201343 // zeta^ 84 * 2^31 = 1392340^ 84 * 2^31 = 64042340 * 2^31
-.word 1433794145 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 84 * 2586463201 * 2^31
-.word 93085077 // zeta^ 52 * 2^31 = 1392340^ 52 * 2^31 = 56731543 * 2^31
-.word 63248651 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 52 * 2586463201 * 2^31
-.word 48800199 // zeta^116 * 2^31 = 1392340^116 * 2^31 = 64416179 * 2^31
-.word 159286041 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 1392340^116 * 2586463201 * 2^31
-.word 161225519 // zeta^ 12 * 2^31 = 1392340^ 12 * 2^31 = 61070877 * 2^31
-.word 371152561 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 12 * 2586463201 * 2^31
-.word 157992763 // zeta^ 76 * 2^31 = 1392340^ 76 * 2^31 = 64736387 * 2^31
-.word 1125817381 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 76 * 2586463201 * 2^31
-.word 117865359 // zeta^ 44 * 2^31 = 1392340^ 44 * 2^31 = 26493417 * 2^31
-.word 2711913041 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 44 * 2586463201 * 2^31
-.word 58891053 // zeta^108 * 2^31 = 1392340^108 * 2^31 = 16694344 * 2^31
-.word 526216819 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1392340^108 * 2586463201 * 2^31
-.word 134087109 // zeta^ 28 * 2^31 = 1392340^ 28 * 2^31 = 46852595 * 2^31
-.word 3270097627 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 28 * 2586463201 * 2^31
-.word 106564557 // zeta^ 92 * 2^31 = 1392340^ 92 * 2^31 = 73724383 * 2^31
-.word 3351548371 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 92 * 2586463201 * 2^31
-.word 47641089 // zeta^ 60 * 2^31 = 1392340^ 60 * 2^31 = 68915062 * 2^31
-.word 973406751 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 60 * 2586463201 * 2^31
-.word 16048813 // zeta^124 * 2^31 = 1392340^124 * 2^31 = 99228576 * 2^31
-.word 670701299 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 1392340^124 * 2586463201 * 2^31
-.word 209268057 // zeta^128 * 2^31 = 1392340^128 * 2^31 = 51456572 * 2^31
-.word 2392592839 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1392340^128 * 2586463201 * 2^31
-.word 174344261 // zeta^192 * 2^31 = 1392340^192 * 2^31 = 106117152 * 2^31
-.word 526271579 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 1392340^192 * 2586463201 * 2^31
-.word 170122527 // zeta^160 * 2^31 = 1392340^160 * 2^31 = 91733486 * 2^31
-.word 2812056257 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 1392340^160 * 2586463201 * 2^31
-.word 139325277 // zeta^224 * 2^31 = 1392340^224 * 2^31 = 34864379 * 2^31
-.word 2803785795 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 1392340^224 * 2586463201 * 2^31
-.word 75789761 // zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31
-.word 3446339167 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31
-.word 16917505 // zeta^208 * 2^31 = 1392340^208 * 2^31 = 37582414 * 2^31
-.word 4201123871 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 1392340^208 * 2586463201 * 2^31
-.word 486677 // zeta^176 * 2^31 = 1392340^176 * 2^31 = 53294696 * 2^31
-.word 1158805899 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 1392340^176 * 2586463201 * 2^31
-.word 143641451 // zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31
-.word 2608324085 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31
-.word 80866521 // zeta^136 * 2^31 = 1392340^136 * 2^31 = 39384089 * 2^31
-.word 3716235847 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 1392340^136 * 2586463201 * 2^31
-.word 60469071 // zeta^200 * 2^31 = 1392340^200 * 2^31 = 43592557 * 2^31
-.word 985253521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 1392340^200 * 2586463201 * 2^31
-.word 21100987 // zeta^168 * 2^31 = 1392340^168 * 2^31 = 34606239 * 2^31
-.word 439073189 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1392340^168 * 2586463201 * 2^31
-.word 135470731 // zeta^232 * 2^31 = 1392340^232 * 2^31 = 51886295 * 2^31
-.word 2893115093 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 1392340^232 * 2586463201 * 2^31
-.word 5009133 // zeta^152 * 2^31 = 1392340^152 * 2^31 = 78224322 * 2^31
-.word 2473661107 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 1392340^152 * 2586463201 * 2^31
-.word 195804777 // zeta^216 * 2^31 = 1392340^216 * 2^31 = 18457327 * 2^31
-.word 101376695 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 1392340^216 * 2586463201 * 2^31
-.word 83067589 // zeta^184 * 2^31 = 1392340^184 * 2^31 = 45785556 * 2^31
-.word 1482185179 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 1392340^184 * 2586463201 * 2^31
-.word 2589217 // zeta^248 * 2^31 = 1392340^248 * 2^31 = 85777919 * 2^31
-.word 381482495 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 1392340^248 * 2586463201 * 2^31
-.word 158579253 // zeta^132 * 2^31 = 1392340^132 * 2^31 = 47610530 * 2^31
-.word 1527369835 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1392340^132 * 2586463201 * 2^31
-.word 211241497 // zeta^196 * 2^31 = 1392340^196 * 2^31 = 104724813 * 2^31
-.word 1782091015 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 1392340^196 * 2586463201 * 2^31
-.word 163462913 // zeta^164 * 2^31 = 1392340^164 * 2^31 = 4789145 * 2^31
-.word 3274396959 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 1392340^164 * 2586463201 * 2^31
-.word 164050765 // zeta^228 * 2^31 = 1392340^228 * 2^31 = 96168469 * 2^31
-.word 2786252371 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 1392340^228 * 2586463201 * 2^31
-.word 202661633 // zeta^148 * 2^31 = 1392340^148 * 2^31 = 26918117 * 2^31
-.word 974730527 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 1392340^148 * 2586463201 * 2^31
-.word 200577443 // zeta^212 * 2^31 = 1392340^212 * 2^31 = 68992930 * 2^31
-.word 3835903677 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 1392340^212 * 2586463201 * 2^31
-.word 61832275 // zeta^180 * 2^31 = 1392340^180 * 2^31 = 7684636 * 2^31
-.word 96037389 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1392340^180 * 2586463201 * 2^31
-.word 119149229 // zeta^244 * 2^31 = 1392340^244 * 2^31 = 49385610 * 2^31
-.word 4231718643 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 1392340^244 * 2586463201 * 2^31
-.word 102884397 // zeta^140 * 2^31 = 1392340^140 * 2^31 = 3665510 * 2^31
-.word 754664819 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 1392340^140 * 2586463201 * 2^31
-.word 51008787 // zeta^204 * 2^31 = 1392340^204 * 2^31 = 45046276 * 2^31
-.word 3923814733 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 1392340^204 * 2586463201 * 2^31
-.word 47142847 // zeta^172 * 2^31 = 1392340^172 * 2^31 = 96318080 * 2^31
-.word 2109271073 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 1392340^172 * 2586463201 * 2^31
-.word 94368947 // zeta^236 * 2^31 = 1392340^236 * 2^31 = 79623736 * 2^31
-.word 1583054253 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 1392340^236 * 2586463201 * 2^31
-.word 78594601 // zeta^156 * 2^31 = 1392340^156 * 2^31 = 26871788 * 2^31
-.word 81450743 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1392340^156 * 2586463201 * 2^31
-.word 78147197 // zeta^220 * 2^31 = 1392340^220 * 2^31 = 59264558 * 2^31
-.word 1024869667 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 1392340^220 * 2586463201 * 2^31
-.word 74524877 // zeta^188 * 2^31 = 1392340^188 * 2^31 = 30313514 * 2^31
-.word 3992261843 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 1392340^188 * 2586463201 * 2^31
-.word 164593217 // zeta^252 * 2^31 = 1392340^252 * 2^31 = 37202091 * 2^31
-.word 3321560543 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 1392340^252 * 2586463201 * 2^31
-.word 71193357 // zeta^256 * 2^31 = 1392340^256 * 2^31 = 54660580 * 2^31
-.word 2428646035 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 1392340^256 * 2586463201 * 2^31
-.word 2966249 // zeta^320 * 2^31 = 1392340^320 * 2^31 = 54660581 * 2^31
-.word 1902374455 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 1392340^320 * 2586463201 * 2^31
-.word 75319903 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31
-.word 4286696833 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31
-.word 42111779 // zeta^352 * 2^31 = 1392340^352 * 2^31 = 14383667 * 2^31
-.word 1482911037 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 1392340^352 * 2586463201 * 2^31
-.word 47244897 // zeta^272 * 2^31 = 1392340^272 * 2^31 = 105862549 * 2^31
-.word 754784703 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 1392340^272 * 2586463201 * 2^31
-.word 136444545 // zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31
-.word 848628127 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31
-.word 37037621 // zeta^304 * 2^31 = 1392340^304 * 2^31 = 16619619 * 2^31
-.word 1449518187 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 1392340^304 * 2586463201 * 2^31
-.word 211747629 // zeta^368 * 2^31 = 1392340^368 * 2^31 = 52822457 * 2^31
-.word 3136161395 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 1392340^368 * 2586463201 * 2^31
-.word 85719703 // zeta^264 * 2^31 = 1392340^264 * 2^31 = 4208468 * 2^31
-.word 1563984969 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 1392340^264 * 2586463201 * 2^31
-.word 131367785 // zeta^328 * 2^31 = 1392340^328 * 2^31 = 66733064 * 2^31
-.word 578731447 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 1392340^328 * 2586463201 * 2^31
-.word 8252591 // zeta^296 * 2^31 = 1392340^296 * 2^31 = 17280056 * 2^31
-.word 2454041905 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 1392340^296 * 2586463201 * 2^31
-.word 191133319 // zeta^360 * 2^31 = 1392340^360 * 2^31 = 71510914 * 2^31
-.word 3855894105 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 1392340^360 * 2586463201 * 2^31
-.word 84678491 // zeta^280 * 2^31 = 1392340^280 * 2^31 = 46350158 * 2^31
-.word 1922682885 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 1392340^280 * 2586463201 * 2^31
-.word 207225173 // zeta^344 * 2^31 = 1392340^344 * 2^31 = 27892831 * 2^31
-.word 1821306187 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 1392340^344 * 2586463201 * 2^31
-.word 25638781 // zeta^312 * 2^31 = 1392340^312 * 2^31 = 39992363 * 2^31
-.word 3194264611 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 1392340^312 * 2586463201 * 2^31
-.word 129166717 // zeta^376 * 2^31 = 1392340^376 * 2^31 = 60331597 * 2^31
-.word 2812782115 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 1392340^376 * 2586463201 * 2^31
-.word 158779397 // zeta^260 * 2^31 = 1392340^260 * 2^31 = 57114283 * 2^31
-.word 254721179 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 1392340^260 * 2586463201 * 2^31
-.word 53655053 // zeta^324 * 2^31 = 1392340^324 * 2^31 = 58506623 * 2^31
-.word 2767597459 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 1392340^324 * 2586463201 * 2^31
-.word 106705005 // zeta^292 * 2^31 = 1392340^292 * 2^31 = 91379324 * 2^31
-.word 3806822707 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 1392340^292 * 2586463201 * 2^31
-.word 48771393 // zeta^356 * 2^31 = 1392340^356 * 2^31 = 101328008 * 2^31
-.word 1020570335 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 1392340^356 * 2586463201 * 2^31
-.word 104032963 // zeta^276 * 2^31 = 1392340^276 * 2^31 = 42074813 * 2^31
-.word 2861173149 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 1392340^276 * 2586463201 * 2^31
-.word 9572673 // zeta^340 * 2^31 = 1392340^340 * 2^31 = 79199036 * 2^31
-.word 3320236767 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 1392340^340 * 2586463201 * 2^31
-.word 163434107 // zeta^308 * 2^31 = 1392340^308 * 2^31 = 41700974 * 2^31
-.word 4135681253 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 1392340^308 * 2586463201 * 2^31
-.word 150402031 // zeta^372 * 2^31 = 1392340^372 * 2^31 = 98432517 * 2^31
-.word 4198929905 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 1392340^372 * 2586463201 * 2^31
-.word 54241543 // zeta^268 * 2^31 = 1392340^268 * 2^31 = 41380766 * 2^31
-.word 3169149913 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 1392340^268 * 2586463201 * 2^31
-.word 109349909 // zeta^332 * 2^31 = 1392340^332 * 2^31 = 102451643 * 2^31
-.word 3540302475 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 1392340^332 * 2586463201 * 2^31
-.word 153343253 // zeta^300 * 2^31 = 1392340^300 * 2^31 = 89422809 * 2^31
-.word 3768750475 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 1392340^300 * 2586463201 * 2^31
-.word 165091459 // zeta^364 * 2^31 = 1392340^364 * 2^31 = 9799073 * 2^31
-.word 2185696221 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 1392340^364 * 2586463201 * 2^31
-.word 105669749 // zeta^284 * 2^31 = 1392340^284 * 2^31 = 32392770 * 2^31
-.word 943418923 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 1392340^284 * 2586463201 * 2^31
-.word 133639705 // zeta^348 * 2^31 = 1392340^348 * 2^31 = 79245365 * 2^31
-.word 4213516551 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 1392340^348 * 2586463201 * 2^31
-.word 196185493 // zeta^316 * 2^31 = 1392340^316 * 2^31 = 6888577 * 2^31
-.word 3624265995 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 1392340^316 * 2586463201 * 2^31
-.word 137709429 // zeta^380 * 2^31 = 1392340^380 * 2^31 = 75803639 * 2^31
-.word 302705451 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 1392340^380 * 2586463201 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_384_u32_106117153_1392340_incomplete_good_scale
-ntt_384_u32_106117153_1392340_incomplete_good_scale: // Constants for scaling by 1/N
-.word 37890045 // 1/96
-.word 3768695715 // 1/96 twisted
-.data
-roots:
-.word 136304203 /// zeta^256 * 2^31 = 1392340^256 * 2^31 = 54660580 * 2^31
-.word 1106161429 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 1392340^256 * 2586463201 * 2^31
-.word 50789515 /// zeta^128 * 2^31 = 1392340^128 * 2^31 = 51456572 * 2^31
-.word 1041322197 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1392340^128 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 86500417 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31
-.word 996628447 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 86500417 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31
-.word 996628447 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31
-.word 86500417 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31
-.word 996628447 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31
-.word 3362131 // zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31
-.word 765704461 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31
-.word 74219771 // zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31
-.word 732633701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31
-.word 3362131 // zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31
-.word 765704461 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31
-.word 207754911 // zeta^264 * 2^31 = 1392340^264 * 2^31 = 4208468 * 2^31
-.word 85166401 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 1392340^264 * 2586463201 * 2^31
-.word 86384727 // zeta^168 * 2^31 = 1392340^168 * 2^31 = 34606239 * 2^31
-.word 2847807113 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1392340^168 * 2586463201 * 2^31
-.word 74219771 // zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31
-.word 732633701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31
-.word 77895747 // zeta^ 24 * 2^31 = 1392340^ 24 * 2^31 = 87659826 * 2^31
-.word 1773964317 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 24 * 2586463201 * 2^31
-.word 42168601 // zeta^312 * 2^31 = 1392340^312 * 2^31 = 39992363 * 2^31
-.word 2956805639 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 1392340^312 * 2586463201 * 2^31
-.word 131257741 // XX: zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 86500417 // XX: zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31
-.word 996628447 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31
-.word 3362131 // XX: zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31
-.word 765704461 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31
-.word 74219771 // XX: zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31
-.word 732633701 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31
-.word 207754911 // XX: zeta^264 * 2^31 = 1392340^264 * 2^31 = 4208468 * 2^31
-.word 85166401 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 1392340^264 * 2586463201 * 2^31
-.word 86384727 // XX: zeta^168 * 2^31 = 1392340^168 * 2^31 = 34606239 * 2^31
-.word 2847807113 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1392340^168 * 2586463201 * 2^31
-.word 77895747 // XX: zeta^ 24 * 2^31 = 1392340^ 24 * 2^31 = 87659826 * 2^31
-.word 1773964317 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 24 * 2586463201 * 2^31
-.word 42168601 // XX: zeta^312 * 2^31 = 1392340^312 * 2^31 = 39992363 * 2^31
-.word 2956805639 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 1392340^312 * 2586463201 * 2^31
-.word 120907359 // XX: zeta^132 * 2^31 = 1392340^132 * 2^31 = 47610530 * 2^31
-.word 963490177 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1392340^132 * 2586463201 * 2^31
-.word 76659711 // XX: zeta^ 36 * 2^31 = 1392340^ 36 * 2^31 = 9948684 * 2^31
-.word 201330657 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 36 * 2586463201 * 2^31
-.word 101045121 // XX: zeta^276 * 2^31 = 1392340^276 * 2^31 = 42074813 * 2^31
-.word 2998947999 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 1392340^276 * 2586463201 * 2^31
-.word 121674239 // XX: zeta^180 * 2^31 = 1392340^180 * 2^31 = 7684636 * 2^31
-.word 155513313 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1392340^180 * 2586463201 * 2^31
-.word 137517881 // XX: zeta^ 12 * 2^31 = 1392340^ 12 * 2^31 = 61070877 * 2^31
-.word 3383369703 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 12 * 2586463201 * 2^31
-.word 131387629 // XX: zeta^300 * 2^31 = 1392340^300 * 2^31 = 89422809 * 2^31
-.word 3957125299 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 1392340^300 * 2586463201 * 2^31
-.word 87076127 // XX: zeta^156 * 2^31 = 1392340^156 * 2^31 = 26871788 * 2^31
-.word 543802049 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1392340^156 * 2586463201 * 2^31
-.word 80366379 // XX: zeta^ 60 * 2^31 = 1392340^ 60 * 2^31 = 68915062 * 2^31
-.word 1394628149 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 60 * 2586463201 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_106117153_1392340_incomplete_good, %function
-.global ntt_384_u32_106117153_1392340_incomplete_good
-ntt_384_u32_106117153_1392340_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, 106117153
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vmul.u32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vmul.u32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmlah.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, [r14,#(-480)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vmul.u32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vmul.u32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vmul.u32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vmul.u32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vmul.u32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vmul.u32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vmul.u32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vmul.u32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vmul.u32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vmul.u32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vmul.u32 Q2, Q0, r8
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vmul.u32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vmul.u32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vmul.u32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32
Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 
-// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// 
input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release input[96] from Q1 -// input[0]: Load as Q1 
-vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release input[192] from Q4 -vmul.u32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[36]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 36)] -vqrdmlah.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release input[288] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-240)] -// input[36]: Already loaded as Q7 -// input[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[228] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release input[36] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release input[360] from Q5 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[72] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[300]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 
-vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(288)] -// input[300]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[240]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release input[204] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[48]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release input[300] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-192)] -// input[48]: Already loaded as Q6 -// input[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[240] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[336] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[180]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release input[48] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(336)] -// input[180]: Already loaded as Q7 -// input[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[372] from Q5 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as 
Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release input[180] from Q7 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[312]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[252]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release input[216] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[60]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release input[312] from Q6 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-144)] -// input[60]: Already loaded as Q7 -// input[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release input[252] from Q5 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release input[348] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[160]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release input[60] from Q7 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, 
[r14,#(384)] -// input[160]: Already loaded as Q6 -// input[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release input[352] from Q5 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release input[64] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[292]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release input[160] from Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(256)] -// input[292]: Already loaded as Q7 -// input[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release input[100] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release input[196] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[40]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[292] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -// input[40]: Already loaded as Q6 -// input[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release input[232] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[328] from Q2 
-vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[172]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release input[40] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(304)] -// input[172]: Already loaded as Q7 -// input[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release input[364] from Q5 -// input[268]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[76] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[304]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release input[172] from Q7 -vstrw.u32 Q3, [r14,#(64)] -// Release input[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(304)] -// input[304]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release input[208] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release input[304] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-176)] -// input[52]: Already loaded as Q7 -// input[244]: Already loaded as Q5 
-vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release input[244] from Q5 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[340] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[184]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release input[52] from Q7 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(352)] -// input[184]: Already loaded as Q6 -// input[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release input[376] from Q5 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[88] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[316]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release input[184] from Q6 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(352)] -// input[316]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// input[224]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release input[220] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[32]: Load as Q6 -vldrw.u32 
Q6, [r0, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release input[316] from Q7 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-128)] -// input[32]: Already loaded as Q6 -// input[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[224] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release input[320] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release input[32] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(272)] -// input[164]: Already loaded as Q7 -// input[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release input[356] from Q5 -// input[260]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[104]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[296]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release input[164] from Q7 -vstrw.u32 Q3, [r14,#(32)] -// Release input[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[296]: Already loaded as Q6 -// input[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[200]: Load as Q2 
-vldrw.u32 Q2, [r14, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release input[104] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release input[200] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release input[296] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-208)] -// input[44]: Already loaded as Q7 -// input[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release input[236] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[332] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[176]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -76)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release input[44] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(320)] -// input[176]: Already loaded as Q6 -// input[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release input[368] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[80] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-304)] 
-vadd.s32 Q3, Q3, Q6 -// Release input[176] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(320)] -// input[308]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[248]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[308] from Q7 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already loaded as Q6 -// input[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release input[248] from Q5 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[344] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(368)] -// input[188]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] 
from Q5 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release input[92] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[264]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release input[188] from Q7 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(368)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[24]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r8 -// input[264]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r11 -vqrdmulh.s32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r11 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(96)] -// Release input[24] from Q5 -vqrdmlah.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(48)] -// Release input[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[156]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q2, Q2, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r5 
-vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[280]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(48)] -// Release input[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[280]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q0, Q0, r8 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, 
Q1, r11 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[152]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vmul.u32 Q2, Q2, r8 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r8 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] 
-vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vmul.u32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] 
-vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vmul.u32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmul.u32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 
Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vmul.u32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 
-vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vmul.u32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 
-vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmul.u32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release 
input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// 
Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmul.u32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// 
Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vmul.u32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: 
Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmul.u32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 
-120)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vmul.u32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q3, Q3, r8 -// 
input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vmul.u32 Q4, Q4, r8 -// input[284]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, 
[r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vmul.u32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 
Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] 
-// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vmul.u32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vmul.u32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] 
-// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release 
input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 1708504095 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s deleted file mode 100644 index 06677c5..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this 
permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 50789515 /// zeta^128 * 2^31 = 1392340^128 * 2^31 = 51456572 * 2^31 -.word 1041322197 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1392340^128 * 2586463201 * 2^31 -.word 136304203 /// zeta^256 * 2^31 = 1392340^256 * 2^31 = 54660580 * 2^31 -.word 1106161429 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 1392340^256 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 125733889 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31 -.word 3298338847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 125733889 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31 -.word 3298338847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 
2586463201 * 2^31 -.word 125733889 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31 -.word 3298338847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31 -.word 138014535 // zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31 -.word 3562333593 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31 -.word 208872175 // zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31 -.word 3529262833 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31 -.word 138014535 // zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31 -.word 3562333593 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31 -.word 170065705 // zeta^120 * 2^31 = 1392340^120 * 2^31 = 66124790 * 2^31 -.word 1338161655 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1392340^120 * 2586463201 * 2^31 -.word 134338559 // zeta^216 * 2^31 = 1392340^216 * 2^31 = 18457327 * 2^31 -.word 2521002977 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 1392340^216 * 2586463201 * 2^31 -.word 208872175 // zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31 -.word 3529262833 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31 -.word 125849579 // zeta^360 * 2^31 = 1392340^360 * 2^31 = 71510914 * 2^31 -.word 1447160181 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 1392340^360 * 2586463201 * 2^31 -.word 4479395 // zeta^ 72 * 2^31 = 1392340^ 72 * 2^31 = 101908685 * 2^31 -.word 4209800893 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 72 * 2586463201 * 2^31 -.word 131257741 // XX: zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 125733889 // XX: zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31 -.word 3298338847 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31 -.word 138014535 // XX: zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31 -.word 3562333593 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 
1392340^240 * 2586463201 * 2^31 -.word 208872175 // XX: zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31 -.word 3529262833 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31 -.word 170065705 // XX: zeta^120 * 2^31 = 1392340^120 * 2^31 = 66124790 * 2^31 -.word 1338161655 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1392340^120 * 2586463201 * 2^31 -.word 134338559 // XX: zeta^216 * 2^31 = 1392340^216 * 2^31 = 18457327 * 2^31 -.word 2521002977 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 1392340^216 * 2586463201 * 2^31 -.word 125849579 // XX: zeta^360 * 2^31 = 1392340^360 * 2^31 = 71510914 * 2^31 -.word 1447160181 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 1392340^360 * 2586463201 * 2^31 -.word 4479395 // XX: zeta^ 72 * 2^31 = 1392340^ 72 * 2^31 = 101908685 * 2^31 -.word 4209800893 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 72 * 2586463201 * 2^31 -.word 131867927 // XX: zeta^252 * 2^31 = 1392340^252 * 2^31 = 37202091 * 2^31 -.word 2900339145 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 1392340^252 * 2586463201 * 2^31 -.word 125158179 // XX: zeta^348 * 2^31 = 1392340^348 * 2^31 = 79245365 * 2^31 -.word 3751165245 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 1392340^348 * 2586463201 * 2^31 -.word 80846677 // XX: zeta^108 * 2^31 = 1392340^108 * 2^31 = 16694344 * 2^31 -.word 337841995 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1392340^108 * 2586463201 * 2^31 -.word 74716425 // XX: zeta^204 * 2^31 = 1392340^204 * 2^31 = 45046276 * 2^31 -.word 911597591 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 1392340^204 * 2586463201 * 2^31 -.word 90560067 // XX: zeta^372 * 2^31 = 1392340^372 * 2^31 = 98432517 * 2^31 -.word 4139453981 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 1392340^372 * 2586463201 * 2^31 -.word 111189185 // XX: zeta^ 84 * 2^31 = 1392340^ 84 * 2^31 = 64042340 * 2^31 -.word 1296019295 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 84 * 2586463201 * 2^31 -.word 135574595 // XX: zeta^228 * 2^31 = 1392340^228 * 2^31 = 96168469 * 2^31 
-.word 4093636637 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 1392340^228 * 2586463201 * 2^31 -.word 91326947 // XX: zeta^324 * 2^31 = 1392340^324 * 2^31 = 58506623 * 2^31 -.word 3331477117 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 1392340^324 * 2586463201 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_106117153_1392340_incomplete_good_bitrev, %function -.global ntt_384_u32_106117153_1392340_incomplete_good_bitrev -ntt_384_u32_106117153_1392340_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 106117153 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vmul.u32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 
32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 
Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// 
Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// 
input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded 
as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, 
Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// 
input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] 
-vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release 
input[44] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, 
[r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[204]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, 
Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vqrdmlah.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vmul.u32 Q0, Q0, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q1, Q1, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vmul.u32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vmul.u32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vmul.u32 Q2, Q2, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vqrdmlah.s32 Q1, Q4, r11
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-ldrd r9, r8, [r10], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vmul.u32 Q0, Q0, r8
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q4, Q4, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release 
input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vmul.u32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 
-vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// 
input[120]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 
-vqrdmulh.s32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// 
input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[12]: Load 
as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] 
-vmul.u32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vmul.u32 Q4, Q4, r8 -// 
input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 1708504095 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good.s deleted file mode 100644 index 33ec980..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_108643009_640922_incomplete_good_twiddles -ntt_384_u32_108643009_640922_incomplete_good_twiddles: // For base multiplication -.word 117231189 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 3747646315 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 167943169 // zeta^ 64 * 2^31 = 640922^ 64 * 2^31 = 67669976 * 2^31 -.word 1929942719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 64 * 3479293249 * 2^31 -.word 10524287 // zeta^ 32 * 2^31 = 640922^ 32 * 2^31 = 8569779 * 2^31 -.word 2274825921 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 32 * 3479293249 * 2^31 -.word 183751195 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 2275215397 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 197898871 // zeta^ 16 * 2^31 = 640922^ 16 * 2^31 = 82310697 * 2^31 -.word 189228233 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 16 * 3479293249 * 2^31 -.word 117636793 // zeta^ 80 * 2^31 = 640922^ 80 * 2^31 = 87332880 * 2^31 -.word 3072994823 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 80 * 3479293249 * 2^31 -.word 59998845 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1940964675 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 22735857 // zeta^112 * 2^31 = 640922^112 * 2^31 = 44058032 * 2^31 -.word 2477333199 // zeta^112 
* f(q^(-1) mod 2^32) * 2^31 = 640922^112 * 3479293249 * 2^31 -.word 127637249 // zeta^ 8 * 2^31 = 640922^ 8 * 2^31 = 1793055 * 2^31 -.word 1932647359 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 8 * 3479293249 * 2^31 -.word 78695545 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 3934662727 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 203907557 // zeta^ 40 * 2^31 = 640922^ 40 * 2^31 = 52463921 * 2^31 -.word 500614107 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 40 * 3479293249 * 2^31 -.word 212278911 // zeta^104 * 2^31 = 640922^104 * 2^31 = 46625229 * 2^31 -.word 3070660289 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 640922^104 * 3479293249 * 2^31 -.word 65439627 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 2806138549 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 141615223 // zeta^ 88 * 2^31 = 640922^ 88 * 2^31 = 56126250 * 2^31 -.word 830518985 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 88 * 3479293249 * 2^31 -.word 96791441 // zeta^ 56 * 2^31 = 640922^ 56 * 2^31 = 17702973 * 2^31 -.word 1466700591 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 56 * 3479293249 * 2^31 -.word 91234029 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 2063031507 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 172736993 // zeta^ 4 * 2^31 = 640922^ 4 * 2^31 = 640922 * 2^31 -.word 1396807903 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 4 * 3479293249 * 2^31 -.word 84666041 // zeta^ 68 * 2^31 = 640922^ 68 * 2^31 = 18021000 * 2^31 -.word 757024263 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 68 * 3479293249 * 2^31 -.word 145858849 // zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 3495799199 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 183858709 // zeta^100 * 2^31 = 640922^100 * 2^31 = 58708509 * 2^31 -.word 4012454827 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 
640922^100 * 3479293249 * 2^31 -.word 177838823 // zeta^ 20 * 2^31 = 640922^ 20 * 2^31 = 81518432 * 2^31 -.word 3547181145 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 20 * 3479293249 * 2^31 -.word 41900335 // zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 2540746769 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 60770513 // zeta^ 52 * 2^31 = 640922^ 52 * 2^31 = 82553845 * 2^31 -.word 4044236271 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 52 * 3479293249 * 2^31 -.word 167358029 // zeta^116 * 2^31 = 640922^116 * 2^31 = 31587287 * 2^31 -.word 953816435 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 640922^116 * 3479293249 * 2^31 -.word 51201803 // zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 3348244277 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 80521231 // zeta^ 76 * 2^31 = 640922^ 76 * 2^31 = 40418220 * 2^31 -.word 382095665 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 76 * 3479293249 * 2^31 -.word 99504283 // zeta^ 44 * 2^31 = 640922^ 44 * 2^31 = 52603644 * 2^31 -.word 3359009189 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 44 * 3479293249 * 2^31 -.word 40810197 // zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 935723755 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 171634653 // zeta^ 28 * 2^31 = 640922^ 28 * 2^31 = 31497268 * 2^31 -.word 2671255523 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 28 * 3479293249 * 2^31 -.word 139731691 // zeta^ 92 * 2^31 = 640922^ 92 * 2^31 = 87621537 * 2^31 -.word 1117909845 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 92 * 3479293249 * 2^31 -.word 62594557 // zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1184680387 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.word 164673767 // zeta^124 * 2^31 = 640922^124 * 2^31 = 78082914 * 2^31 -.word 2238255705 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 640922^124 * 3479293249 * 2^31 
-.word 159354989 // zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 2477263699 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 100054829 // zeta^192 * 2^31 = 640922^192 * 2^31 = 108643008 * 2^31 -.word 547320979 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 640922^192 * 3479293249 * 2^31 -.word 64583899 // zeta^160 * 2^31 = 640922^160 * 2^31 = 13028154 * 2^31 -.word 389477 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 640922^160 * 3479293249 * 2^31 -.word 206761731 // zeta^224 * 2^31 = 640922^224 * 2^31 = 100073230 * 2^31 -.word 2020141373 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 640922^224 * 3479293249 * 2^31 -.word 28380931 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 2883766589 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 19387147 // zeta^208 * 2^31 = 640922^208 * 2^31 = 26332312 * 2^31 -.word 4105739061 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 640922^208 * 3479293249 * 2^31 -.word 71380021 // zeta^176 * 2^31 = 640922^176 * 2^31 = 70392207 * 2^31 -.word 536368523 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 640922^176 * 3479293249 * 2^31 -.word 157287173 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 2354002619 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 59701305 // zeta^136 * 2^31 = 640922^136 * 2^31 = 106639146 * 2^31 -.word 2002015367 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 640922^136 * 3479293249 * 2^31 -.word 89648769 // zeta^200 * 2^31 = 640922^200 * 2^31 = 106849954 * 2^31 -.word 2362319935 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 640922^200 * 3479293249 * 2^31 -.word 117014363 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2570046181 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 13378461 // zeta^232 * 2^31 = 640922^232 * 2^31 = 56179088 * 2^31 -.word 3794353187 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 640922^232 * 3479293249 * 2^31 -.word 184818605 // zeta^152 
* 2^31 = 640922^152 * 2^31 = 65895091 * 2^31 -.word 2319347731 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 640922^152 * 3479293249 * 2^31 -.word 151846391 // zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31 -.word 1488828745 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103085597 // zeta^184 * 2^31 = 640922^184 * 2^31 = 105229554 * 2^31 -.word 596330915 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 640922^184 * 3479293249 * 2^31 -.word 120494577 // zeta^248 * 2^31 = 640922^248 * 2^31 = 90940036 * 2^31 -.word 2828266703 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 640922^248 * 3479293249 * 2^31 -.word 20572057 // zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 3655183655 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 44549025 // zeta^196 * 2^31 = 640922^196 * 2^31 = 108002087 * 2^31 -.word 2898159391 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 640922^196 * 3479293249 * 2^31 -.word 146642869 // zeta^164 * 2^31 = 640922^164 * 2^31 = 54775275 * 2^31 -.word 516655627 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 640922^164 * 3479293249 * 2^31 -.word 71427169 // zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 799168095 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 189990539 // zeta^148 * 2^31 = 640922^148 * 2^31 = 61145083 * 2^31 -.word 3288532917 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 640922^148 * 3479293249 * 2^31 -.word 39447195 // zeta^212 * 2^31 = 640922^212 * 2^31 = 27124577 * 2^31 -.word 747786149 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 640922^212 * 3479293249 * 2^31 -.word 215230525 // zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1204547459 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 156515505 // zeta^244 * 2^31 = 640922^244 * 2^31 = 26089164 * 2^31 -.word 250731023 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 640922^244 * 3479293249 * 2^31 -.word 137962437 // zeta^140 * 2^31 = 640922^140 * 2^31 
= 57770712 * 2^31 -.word 1328818683 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 640922^140 * 3479293249 * 2^31 -.word 166084215 // zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 946723017 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 49948923 // zeta^172 * 2^31 = 640922^172 * 2^31 = 62290981 * 2^31 -.word 1871681861 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 640922^172 * 3479293249 * 2^31 -.word 117781735 // zeta^236 * 2^31 = 640922^236 * 2^31 = 56039365 * 2^31 -.word 935958105 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 640922^236 * 3479293249 * 2^31 -.word 76740047 // zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 2741621617 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 45651365 // zeta^220 * 2^31 = 640922^220 * 2^31 = 77145741 * 2^31 -.word 1623711771 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 640922^220 * 3479293249 * 2^31 -.word 210722219 // zeta^188 * 2^31 = 640922^188 * 2^31 = 94509732 * 2^31 -.word 1053575317 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 640922^188 * 3479293249 * 2^31 -.word 154691461 // zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 3110286907 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 49342849 // zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 2365024575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 57931029 // zeta^320 * 2^31 = 640922^320 * 2^31 = 40973034 * 2^31 -.word 1817703595 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 640922^320 * 3479293249 * 2^31 -.word 33534823 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 2019751897 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 152702119 // zeta^352 * 2^31 = 640922^352 * 2^31 = 95614855 * 2^31 -.word 4294577817 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 640922^352 * 3479293249 * 2^31 -.word 99649225 // zeta^272 * 2^31 = 640922^272 * 2^31 = 21310129 * 2^31 -.word 
1221972471 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 640922^272 * 3479293249 * 2^31 -.word 188905087 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 1411200705 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 194550161 // zeta^304 * 2^31 = 640922^304 * 2^31 = 64584977 * 2^31 -.word 1817634095 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 640922^304 * 3479293249 * 2^31 -.word 145905997 // zeta^368 * 2^31 = 640922^368 * 2^31 = 38250802 * 2^31 -.word 3758598771 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 640922^368 * 3479293249 * 2^31 -.word 138590473 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 360304567 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 157584713 // zeta^328 * 2^31 = 640922^328 * 2^31 = 2003863 * 2^31 -.word 2292951927 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 640922^328 * 3479293249 * 2^31 -.word 5007107 // zeta^296 * 2^31 = 640922^296 * 2^31 = 62017780 * 2^31 -.word 1224307005 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 640922^296 * 3479293249 * 2^31 -.word 100271655 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 1724921113 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 75670795 // zeta^280 * 2^31 = 640922^280 * 2^31 = 52516759 * 2^31 -.word 3464448309 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 640922^280 * 3479293249 * 2^31 -.word 32467413 // zeta^344 * 2^31 = 640922^344 * 2^31 = 42747918 * 2^31 -.word 1975619563 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 640922^344 * 3479293249 * 2^31 -.word 126051989 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 2231935787 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 114200421 // zeta^376 * 2^31 = 640922^376 * 2^31 = 3413455 * 2^31 -.word 3698636379 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 640922^376 * 3479293249 * 2^31 -.word 132619977 // zeta^260 * 2^31 = 640922^260 * 2^31 = 90622009 * 2^31 -.word 3537943031 // zeta^260 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^260 * 3479293249 * 2^31 -.word 196713961 // zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31 -.word 639783639 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 2^31 -.word 33427309 // zeta^292 * 2^31 = 640922^292 * 2^31 = 49934500 * 2^31 -.word 282512467 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 640922^292 * 3479293249 * 2^31 -.word 70643149 // zeta^356 * 2^31 = 640922^356 * 2^31 = 53867734 * 2^31 -.word 3778311667 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 640922^356 * 3479293249 * 2^31 -.word 175385683 // zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1754220525 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 27295479 // zeta^340 * 2^31 = 640922^340 * 2^31 = 47497926 * 2^31 -.word 1006434377 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 640922^340 * 3479293249 * 2^31 -.word 49927989 // zeta^308 * 2^31 = 640922^308 * 2^31 = 77055722 * 2^31 -.word 3341150859 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 640922^308 * 3479293249 * 2^31 -.word 2055493 // zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31 -.word 3090419835 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31 -.word 136764787 // zeta^268 * 2^31 = 640922^268 * 2^31 = 68224789 * 2^31 -.word 3912871629 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 640922^268 * 3479293249 * 2^31 -.word 79323581 // zeta^332 * 2^31 = 640922^332 * 2^31 = 50872297 * 2^31 -.word 2966148611 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 640922^332 * 3479293249 * 2^31 -.word 176475821 // zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 3359243539 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 167337095 // zeta^364 * 2^31 = 640922^364 * 2^31 = 46352028 * 2^31 -.word 2423285433 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 640922^364 * 3479293249 * 2^31 -.word 77554327 // zeta^284 * 2^31 = 640922^284 * 2^31 = 21021472 * 2^31 -.word 3177057449 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 
640922^284 * 3479293249 * 2^31
-.word 140545971 // zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31
-.word 1553345677 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31
-.word 52612251 // zeta^316 * 2^31 = 640922^316 * 2^31 = 30560095 * 2^31
-.word 2056711589 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 640922^316 * 3479293249 * 2^31
-.word 6563799 // zeta^380 * 2^31 = 640922^380 * 2^31 = 14133277 * 2^31
-.word 3241391977 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 640922^380 * 3479293249 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_384_u32_108643009_640922_incomplete_good_scale
-ntt_384_u32_108643009_640922_incomplete_good_scale: // Constants for scaling by 1/N
-.word 117231189 // 1/96
-.word 3747646315 // 1/96 twisted
-.data
-roots:
-.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31
-.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31
-.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31
-.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31
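The deleted `roots` table stores each twiddle factor as a pair of words: the root itself and a companion constant consumed by `vqrdmulh`. A minimal sketch of how such a pair could be regenerated — assuming (my reading, not the actual generator) that the companion word is round(zeta^k * 2^31 / q), which matches the (1, 20) pair printed for zeta^0:

```python
# Sketch of the twiddle-pair layout as I read the generated table.
# Assumptions: the second word is round(zeta^k * 2^31 / q); q and zeta
# are taken from the function name ntt_384_u32_108643009_640922.
q = 108643009      # modulus of this NTT
zeta = 640922      # 384th root of unity mod q
R = 1 << 31

def root_pair(k):
    b = pow(zeta, k, q)
    b_tw = (2 * b * R + q) // (2 * q)   # round(b * R / q), the vqrdmulh companion
    return b, b_tw

print(root_pair(0))   # -> (1, 20), the pair at the start of the roots table
```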
-.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 210808 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 98874168 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // XX: zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // XX: zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // XX: zeta^ 48 * 2^31 = 640922^ 48 * 2^31 
= 82308834 * 2^31 -.word 1626951211 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 210808 // XX: zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // XX: zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 98874168 // XX: zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // XX: zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 17380078 // XX: zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 343541970 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 3933234 // XX: zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 77745966 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 74622503 // XX: zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1475019943 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 57676451 // XX: zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1140057115 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 91290517 // XX: zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 1804486955 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 102391393 // XX: zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 2023911563 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 56124269 // XX: zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 1109376029 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 92216191 // XX: zeta^ 60 * 2^31 
= 640922^ 60 * 2^31 = 92216191 * 2^31
-.word 1822784218 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_108643009_640922_incomplete_good, %function
-.global ntt_384_u32_108643009_640922_incomplete_good
-ntt_384_u32_108643009_640922_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, -108643009
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vqrdmulh.s32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vqrdmulh.s32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmla.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, 
[r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as 
Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release 
input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[300]: Load as Q4 
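Throughout these deleted butterflies, each twiddle multiplication is the three-instruction sequence `vmul.u32` (low 32 bits of a*b), `vqrdmulh.s32` with the precomputed companion word, and `vmla.s32` with r11 holding -q to subtract the quotient estimate. A scalar Python model of my reading of that sequence — a sketch, not the generated code itself; `mont_mul` and its argument names are mine:

```python
# Scalar model of the vmul/vqrdmulh/vmla modular multiply (a sketch).
q = 108643009
M32 = 1 << 32

def s32(x):
    # wrap to the signed 32-bit representative, as the vector lanes do
    x &= M32 - 1
    return x - M32 if x >= (1 << 31) else x

def vqrdmulh(a, b):
    # rounding doubling multiply-high: round(2*a*b / 2^32)
    return s32((2 * a * b + (1 << 31)) >> 32)

def mont_mul(a, b, b_tw):
    # b_tw = round(b * 2^31 / q) is the table's companion word for b
    lo = s32(a * b)              # vmul.u32: low 32 bits of a*b
    hi = vqrdmulh(a, b_tw)       # ~ round(a*b / q)
    return s32(lo + hi * (-q))   # vmla.s32: a*b - q*round(a*b/q)

# e.g. with the zeta^256 pair (40973033, 809890293) from the roots table:
mont_mul(2, 40973033, 809890293)   # -> -26696943, congruent to 2*40973033 mod q
```

The result stays within roughly [-q, q] rather than [0, q), which is why the surrounding code works with signed `vsub.s32`/`vadd.s32` throughout.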
-vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// 
Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 
* 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, 
#(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, 
Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release 
input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[376]: 
Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release input[348] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[88]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(352)]
-// Release input[88] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[220]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[344]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[92]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 92)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(368)]
-// Release input[344] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[92]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(368)]
-// Release input[92] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-//
Release input[120] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] 
from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 
Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, 
Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// 
Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// 
input[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 
-vmul.u32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, 
#(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r8 -// 
input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, 
[r14, #(4 * -32)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] 
-// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from 
Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 815674047 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git 
a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s deleted file mode 100644 index 9c316f8..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 21597933 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 26334175 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 103620826 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 26334175 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 14289518 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 282452654 // zeta^120 * f(q^(-1) 
mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31
-.word 9768841 // zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31
-.word 193095041 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31
-.word 103620826 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31
-.word 2048213056 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31
-.word 5838692 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31
-.word 115410055 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31
-.word 108432201 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31
-.word 2143316728 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31
-.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 21597933 // XX: zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31
-.word 426913875 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31
-.word 26334175 // XX: zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31
-.word 520532437 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31
-.word 103620826 // XX: zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31
-.word 2048213056 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31
-.word 14289518 // XX: zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31
-.word 282452654 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31
-.word 9768841 // XX: zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31
-.word 193095041 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31
-.word 5838692 // XX: zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31
-.word 115410055 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31
-.word 108432201 // XX: zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31
-.word 2143316728 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31
-.word 16426818 // XX: zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31
-.word 324699430 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31
-.word 52518740 // XX: zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31
-.word 1038107619 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31
-.word 6251616 // XX: zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31
-.word 123572085 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31
-.word 17352492 // XX: zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31
-.word 342996693 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31
-.word 50966558 // XX: zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31
-.word 1007426533 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31
-.word 34020506 // XX: zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31
-.word 672463705 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31
-.word 104709775 // XX: zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31
-.word 2069737682 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31
-.word 91262931 // XX: zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31
-.word 1803941678 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_108643009_640922_incomplete_good_bitrev, %function
-.global ntt_384_u32_108643009_640922_incomplete_good_bitrev
-ntt_384_u32_108643009_640922_incomplete_good_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, -108643009
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vqrdmulh.s32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vadd.s32 Q6, Q4, Q3
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[64]: Already loaded as Q1
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[64] from Q1
-vqrdmulh.s32 Q3, Q0, r8
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vmla.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[320] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q3, Q2
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(272)]
-vadd.s32 Q4, Q4, Q1
-// Release input[192] from Q1
-vstrw.u32 Q4, [r14,#(-240)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[4]: Already loaded as Q5
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[4] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[260] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[132] from Q4
-vstrw.u32 Q3, [r14,#(-480)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[204]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -48)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-//
Release input[224] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(416)] -// input[152]: Already loaded as Q6 -// input[284]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[284] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(128)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[344]: Already loaded as Q7 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[344] from Q7 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already loaded as Q6 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vadd.s32 Q6, Q6, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[380]: 
Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[308] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(224)] -// input[248]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] from Q5 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[288]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q7 -// Release input[248] from Q7 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r8 -// input[288]: Already loaded as Q6 -vmla.s32 Q0, Q5, r11 -vmul.u32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r11 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 
-vmla.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(144)] -// Release input[288] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[240]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[304]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r8 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 
Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[112]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q1, Q1, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[176]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[368]: Already loaded as 
Q0 -vmul.u32 Q2, Q0, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r8 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -28)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(464)] -// Release input[368] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// 
input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vqrdmulh.s32 Q2, Q2, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[232]: Load as 
Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(496)] -// Release input[376] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] 
from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, 
r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q0, Q0, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load 
as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 
-vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, 
Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, 
Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// 
Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q4, Q4, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 
-vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, 
Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 
-vmul.u32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// 
input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 
24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[204]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 
Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// 
Release input[284] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[124]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q3, Q3, r8
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-.equ modulus_inv, 815674047
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3150
-// Instruction count: 2196
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop.s b/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop.s
deleted file mode 100644
index 00d4e3a..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop.s
+++ /dev/null
@@ -1,3388 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following
conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_384_u32_108643009_640922_incomplete_good_oop_twiddles
-ntt_384_u32_108643009_640922_incomplete_good_oop_twiddles: // For base multiplication
-.word 117231189 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 3747646315 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 167943169 // zeta^ 64 * 2^31 = 640922^ 64 * 2^31 = 67669976 * 2^31
-.word 1929942719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 64 * 3479293249 * 2^31
-.word 10524287 // zeta^ 32 * 2^31 = 640922^ 32 * 2^31 = 8569779 * 2^31
-.word 2274825921 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 32 * 3479293249 * 2^31
-.word 183751195 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31
-.word 2275215397 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31
-.word 197898871 // zeta^ 16 * 2^31 = 640922^ 16 * 2^31 = 82310697 * 2^31
-.word 189228233 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 16 * 3479293249 * 2^31
-.word 117636793 // zeta^ 80 * 2^31 = 640922^ 80 * 2^31 = 87332880 * 2^31
-.word 3072994823 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 80 * 3479293249 * 2^31
-.word 59998845 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31
-.word 1940964675 // zeta^ 48 *
f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 22735857 // zeta^112 * 2^31 = 640922^112 * 2^31 = 44058032 * 2^31 -.word 2477333199 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 640922^112 * 3479293249 * 2^31 -.word 127637249 // zeta^ 8 * 2^31 = 640922^ 8 * 2^31 = 1793055 * 2^31 -.word 1932647359 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 8 * 3479293249 * 2^31 -.word 78695545 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 3934662727 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 203907557 // zeta^ 40 * 2^31 = 640922^ 40 * 2^31 = 52463921 * 2^31 -.word 500614107 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 40 * 3479293249 * 2^31 -.word 212278911 // zeta^104 * 2^31 = 640922^104 * 2^31 = 46625229 * 2^31 -.word 3070660289 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 640922^104 * 3479293249 * 2^31 -.word 65439627 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 2806138549 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 141615223 // zeta^ 88 * 2^31 = 640922^ 88 * 2^31 = 56126250 * 2^31 -.word 830518985 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 88 * 3479293249 * 2^31 -.word 96791441 // zeta^ 56 * 2^31 = 640922^ 56 * 2^31 = 17702973 * 2^31 -.word 1466700591 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 56 * 3479293249 * 2^31 -.word 91234029 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 2063031507 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 172736993 // zeta^ 4 * 2^31 = 640922^ 4 * 2^31 = 640922 * 2^31 -.word 1396807903 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 4 * 3479293249 * 2^31 -.word 84666041 // zeta^ 68 * 2^31 = 640922^ 68 * 2^31 = 18021000 * 2^31 -.word 757024263 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 68 * 3479293249 * 2^31 -.word 145858849 // zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 3495799199 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 
* 3479293249 * 2^31 -.word 183858709 // zeta^100 * 2^31 = 640922^100 * 2^31 = 58708509 * 2^31 -.word 4012454827 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 640922^100 * 3479293249 * 2^31 -.word 177838823 // zeta^ 20 * 2^31 = 640922^ 20 * 2^31 = 81518432 * 2^31 -.word 3547181145 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 20 * 3479293249 * 2^31 -.word 41900335 // zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 2540746769 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 60770513 // zeta^ 52 * 2^31 = 640922^ 52 * 2^31 = 82553845 * 2^31 -.word 4044236271 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 52 * 3479293249 * 2^31 -.word 167358029 // zeta^116 * 2^31 = 640922^116 * 2^31 = 31587287 * 2^31 -.word 953816435 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 640922^116 * 3479293249 * 2^31 -.word 51201803 // zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 3348244277 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 80521231 // zeta^ 76 * 2^31 = 640922^ 76 * 2^31 = 40418220 * 2^31 -.word 382095665 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 76 * 3479293249 * 2^31 -.word 99504283 // zeta^ 44 * 2^31 = 640922^ 44 * 2^31 = 52603644 * 2^31 -.word 3359009189 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 44 * 3479293249 * 2^31 -.word 40810197 // zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 935723755 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 171634653 // zeta^ 28 * 2^31 = 640922^ 28 * 2^31 = 31497268 * 2^31 -.word 2671255523 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 28 * 3479293249 * 2^31 -.word 139731691 // zeta^ 92 * 2^31 = 640922^ 92 * 2^31 = 87621537 * 2^31 -.word 1117909845 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 92 * 3479293249 * 2^31 -.word 62594557 // zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1184680387 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.word 
164673767 // zeta^124 * 2^31 = 640922^124 * 2^31 = 78082914 * 2^31 -.word 2238255705 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 640922^124 * 3479293249 * 2^31 -.word 159354989 // zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 2477263699 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 100054829 // zeta^192 * 2^31 = 640922^192 * 2^31 = 108643008 * 2^31 -.word 547320979 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 640922^192 * 3479293249 * 2^31 -.word 64583899 // zeta^160 * 2^31 = 640922^160 * 2^31 = 13028154 * 2^31 -.word 389477 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 640922^160 * 3479293249 * 2^31 -.word 206761731 // zeta^224 * 2^31 = 640922^224 * 2^31 = 100073230 * 2^31 -.word 2020141373 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 640922^224 * 3479293249 * 2^31 -.word 28380931 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 2883766589 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 19387147 // zeta^208 * 2^31 = 640922^208 * 2^31 = 26332312 * 2^31 -.word 4105739061 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 640922^208 * 3479293249 * 2^31 -.word 71380021 // zeta^176 * 2^31 = 640922^176 * 2^31 = 70392207 * 2^31 -.word 536368523 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 640922^176 * 3479293249 * 2^31 -.word 157287173 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 2354002619 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 59701305 // zeta^136 * 2^31 = 640922^136 * 2^31 = 106639146 * 2^31 -.word 2002015367 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 640922^136 * 3479293249 * 2^31 -.word 89648769 // zeta^200 * 2^31 = 640922^200 * 2^31 = 106849954 * 2^31 -.word 2362319935 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 640922^200 * 3479293249 * 2^31 -.word 117014363 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2570046181 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 13378461 // zeta^232 * 2^31 
= 640922^232 * 2^31 = 56179088 * 2^31 -.word 3794353187 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 640922^232 * 3479293249 * 2^31 -.word 184818605 // zeta^152 * 2^31 = 640922^152 * 2^31 = 65895091 * 2^31 -.word 2319347731 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 640922^152 * 3479293249 * 2^31 -.word 151846391 // zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31 -.word 1488828745 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103085597 // zeta^184 * 2^31 = 640922^184 * 2^31 = 105229554 * 2^31 -.word 596330915 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 640922^184 * 3479293249 * 2^31 -.word 120494577 // zeta^248 * 2^31 = 640922^248 * 2^31 = 90940036 * 2^31 -.word 2828266703 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 640922^248 * 3479293249 * 2^31 -.word 20572057 // zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 3655183655 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 44549025 // zeta^196 * 2^31 = 640922^196 * 2^31 = 108002087 * 2^31 -.word 2898159391 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 640922^196 * 3479293249 * 2^31 -.word 146642869 // zeta^164 * 2^31 = 640922^164 * 2^31 = 54775275 * 2^31 -.word 516655627 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 640922^164 * 3479293249 * 2^31 -.word 71427169 // zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 799168095 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 189990539 // zeta^148 * 2^31 = 640922^148 * 2^31 = 61145083 * 2^31 -.word 3288532917 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 640922^148 * 3479293249 * 2^31 -.word 39447195 // zeta^212 * 2^31 = 640922^212 * 2^31 = 27124577 * 2^31 -.word 747786149 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 640922^212 * 3479293249 * 2^31 -.word 215230525 // zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1204547459 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 156515505 // zeta^244 * 2^31 = 640922^244 * 2^31 = 
26089164 * 2^31 -.word 250731023 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 640922^244 * 3479293249 * 2^31 -.word 137962437 // zeta^140 * 2^31 = 640922^140 * 2^31 = 57770712 * 2^31 -.word 1328818683 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 640922^140 * 3479293249 * 2^31 -.word 166084215 // zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 946723017 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 49948923 // zeta^172 * 2^31 = 640922^172 * 2^31 = 62290981 * 2^31 -.word 1871681861 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 640922^172 * 3479293249 * 2^31 -.word 117781735 // zeta^236 * 2^31 = 640922^236 * 2^31 = 56039365 * 2^31 -.word 935958105 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 640922^236 * 3479293249 * 2^31 -.word 76740047 // zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 2741621617 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 45651365 // zeta^220 * 2^31 = 640922^220 * 2^31 = 77145741 * 2^31 -.word 1623711771 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 640922^220 * 3479293249 * 2^31 -.word 210722219 // zeta^188 * 2^31 = 640922^188 * 2^31 = 94509732 * 2^31 -.word 1053575317 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 640922^188 * 3479293249 * 2^31 -.word 154691461 // zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 3110286907 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 49342849 // zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 2365024575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 57931029 // zeta^320 * 2^31 = 640922^320 * 2^31 = 40973034 * 2^31 -.word 1817703595 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 640922^320 * 3479293249 * 2^31 -.word 33534823 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 2019751897 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 152702119 // zeta^352 * 2^31 = 640922^352 * 2^31 = 95614855 * 2^31 -.word 
4294577817 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 640922^352 * 3479293249 * 2^31 -.word 99649225 // zeta^272 * 2^31 = 640922^272 * 2^31 = 21310129 * 2^31 -.word 1221972471 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 640922^272 * 3479293249 * 2^31 -.word 188905087 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 1411200705 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 194550161 // zeta^304 * 2^31 = 640922^304 * 2^31 = 64584977 * 2^31 -.word 1817634095 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 640922^304 * 3479293249 * 2^31 -.word 145905997 // zeta^368 * 2^31 = 640922^368 * 2^31 = 38250802 * 2^31 -.word 3758598771 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 640922^368 * 3479293249 * 2^31 -.word 138590473 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 360304567 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 157584713 // zeta^328 * 2^31 = 640922^328 * 2^31 = 2003863 * 2^31 -.word 2292951927 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 640922^328 * 3479293249 * 2^31 -.word 5007107 // zeta^296 * 2^31 = 640922^296 * 2^31 = 62017780 * 2^31 -.word 1224307005 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 640922^296 * 3479293249 * 2^31 -.word 100271655 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 1724921113 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 75670795 // zeta^280 * 2^31 = 640922^280 * 2^31 = 52516759 * 2^31 -.word 3464448309 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 640922^280 * 3479293249 * 2^31 -.word 32467413 // zeta^344 * 2^31 = 640922^344 * 2^31 = 42747918 * 2^31 -.word 1975619563 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 640922^344 * 3479293249 * 2^31 -.word 126051989 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 2231935787 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 114200421 // zeta^376 * 2^31 = 640922^376 * 2^31 = 3413455 * 2^31 -.word 3698636379 // zeta^376 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^376 * 3479293249 * 2^31 -.word 132619977 // zeta^260 * 2^31 = 640922^260 * 2^31 = 90622009 * 2^31 -.word 3537943031 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 640922^260 * 3479293249 * 2^31 -.word 196713961 // zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31 -.word 639783639 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 2^31 -.word 33427309 // zeta^292 * 2^31 = 640922^292 * 2^31 = 49934500 * 2^31 -.word 282512467 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 640922^292 * 3479293249 * 2^31 -.word 70643149 // zeta^356 * 2^31 = 640922^356 * 2^31 = 53867734 * 2^31 -.word 3778311667 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 640922^356 * 3479293249 * 2^31 -.word 175385683 // zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1754220525 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 27295479 // zeta^340 * 2^31 = 640922^340 * 2^31 = 47497926 * 2^31 -.word 1006434377 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 640922^340 * 3479293249 * 2^31 -.word 49927989 // zeta^308 * 2^31 = 640922^308 * 2^31 = 77055722 * 2^31 -.word 3341150859 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 640922^308 * 3479293249 * 2^31 -.word 2055493 // zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31 -.word 3090419835 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31 -.word 136764787 // zeta^268 * 2^31 = 640922^268 * 2^31 = 68224789 * 2^31 -.word 3912871629 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 640922^268 * 3479293249 * 2^31 -.word 79323581 // zeta^332 * 2^31 = 640922^332 * 2^31 = 50872297 * 2^31 -.word 2966148611 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 640922^332 * 3479293249 * 2^31 -.word 176475821 // zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 3359243539 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 167337095 // zeta^364 * 2^31 = 640922^364 * 2^31 = 46352028 * 2^31 -.word 2423285433 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 
640922^364 * 3479293249 * 2^31 -.word 77554327 // zeta^284 * 2^31 = 640922^284 * 2^31 = 21021472 * 2^31 -.word 3177057449 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 640922^284 * 3479293249 * 2^31 -.word 140545971 // zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31 -.word 1553345677 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31 -.word 52612251 // zeta^316 * 2^31 = 640922^316 * 2^31 = 30560095 * 2^31 -.word 2056711589 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 640922^316 * 3479293249 * 2^31 -.word 6563799 // zeta^380 * 2^31 = 640922^380 * 2^31 = 14133277 * 2^31 -.word 3241391977 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 640922^380 * 3479293249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_108643009_640922_incomplete_good_oop_scale -ntt_384_u32_108643009_640922_incomplete_good_oop_scale: // Constants for scaling by 1/N -.word 117231189 // 1/96 -.word 3747646315 // 1/96 twisted -.data -roots: -.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 
* 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 210808 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 98874168 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // XX: zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // XX: zeta^144 * 2^31 = 640922^144 * 
2^31 = 5022183 * 2^31 -.word 99270592 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // XX: zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 210808 // XX: zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // XX: zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 98874168 // XX: zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // XX: zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 17380078 // XX: zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 343541970 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 3933234 // XX: zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 77745966 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 74622503 // XX: zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1475019943 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 57676451 // XX: zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1140057115 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 91290517 // XX: zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 1804486955 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 102391393 // XX: zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 2023911563 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 56124269 // XX: zeta^156 * 
2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 1109376029 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 92216191 // XX: zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1822784218 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_108643009_640922_incomplete_good_oop, %function -.global ntt_384_u32_108643009_640922_incomplete_good_oop -ntt_384_u32_108643009_640922_incomplete_good_oop: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -108643009 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r7 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r6 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r9 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r6 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmla.s32 Q2, Q3, r9 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: 
Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r11,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r1,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 
-vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r11,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(112)] -vsub.s32 
Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r11,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r1,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] 
from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r11,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r1,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[184]: Already loaded as Q5 -// 
input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r11,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r1,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 
Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r11,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r1,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[336]: 
Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r11,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r1,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 
-// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r11,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r1,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, 
[r12, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r11,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r1,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r11,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 
-vldrw.u32 Q7, [r14, #(4 * 120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r11,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r1,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r1,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r10,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r11,#(0)] -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 
-vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 
Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 
Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release output[352] from Q5
-// output[256]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[100]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[64] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[292]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release output[160] from Q6
-vstrw.u32 Q3, [r11,#(16)]
-// Release output[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(256)]
-// output[292]: Already loaded as Q7
-// output[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[196]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release output[100] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[232]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release output[196] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[40]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release output[292] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// output[40]: Already loaded as Q6
-// output[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release output[232] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[364]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[328] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[172]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -80)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release output[40] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(304)]
-// output[172]: Already loaded as Q7
-// output[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release output[364] from Q5
-// output[268]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[76] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[304]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release output[172] from Q7
-vstrw.u32 Q3, [r11,#(64)]
-// Release output[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(304)]
-// output[304]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[208]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[244]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release output[208] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[52]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release output[304] from Q6
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-176)]
-// output[52]: Already loaded as Q7
-// output[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[340]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[244] from Q5
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// output[376]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[340] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[184]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -68)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release output[52] from Q7
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(352)]
-// output[184]: Already loaded as Q6
-// output[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[88]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release output[376] from Q5
-// output[280]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[88] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[316]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release output[184] from Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release output[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(352)]
-// output[316]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[220]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[28]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[224]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release output[220] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[32]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 32)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[316] from Q7
-vstrw.u32 Q3, [r1,#(112)]
-// Release output[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-128)]
-// output[32]: Already loaded as Q6
-// output[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release output[224] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[356]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[320] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[164]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -88)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release output[32] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(272)]
-// output[164]: Already loaded as Q7
-// output[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release output[356] from Q5
-// output[260]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[104]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[68] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[296]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release output[164] from Q7
-vstrw.u32 Q3, [r11,#(32)]
-// Release output[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(272)]
-// output[296]: Already loaded as Q6
-// output[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[200]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release output[104] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[236]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release output[200] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[44]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release output[296] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-208)]
-// output[44]: Already loaded as Q7
-// output[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[332]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[236] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[368]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[332] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[176]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -76)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release output[44] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(320)]
-// output[176]: Already loaded as Q6
-// output[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release output[368] from Q5
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[80] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[308]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release output[176] from Q6
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(320)]
-// output[308]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[212]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[248]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release output[212] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[56]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release output[308] from Q7
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-160)]
-// output[56]: Already loaded as Q6
-// output[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[248] from Q5
-// output[152]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// output[380]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[344] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[188]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release output[56] from Q6
-vstrw.u32 Q3, [r11,#(-400)]
-// Release output[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(368)]
-// output[188]: Already loaded as Q7
-// output[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[92]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release output[380] from Q5
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// output[24]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release output[92] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[264]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[188] from Q7
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r10,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(368)]
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r7
-// output[144]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r6
-// output[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r9
-vmul.u32 Q2, Q1, r7
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r6
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r9
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r3
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r9
-// output[156]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -96)]
-vmul.u32 Q4, Q6, r5
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r4
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(96)]
-// Release output[24] from Q5
-vmla.s32 Q4, Q6, r9
-vstrw.u32 Q1, [r11,#(-432)]
-// Release output[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(48)]
-// Release output[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[12]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 12)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[132]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[280]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-384)]
-// Release output[156] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(48)]
-// Release output[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[136]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-480)]
-// Release output[132] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[256]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[28]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(112)]
-// Release output[280] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release output[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(16)]
-// Release output[256] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[4]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 4)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[152]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(112)]
-// Release output[28] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[8]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 8)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(16)]
-// Release output[4] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[128]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[284]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-400)]
-// Release output[152] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(32)]
-// Release output[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[140]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-496)]
-// Release output[128] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[260]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(128)]
-// Release output[284] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release output[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 36)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[184]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(240)]
-// Release output[60] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release output[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release output[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[304]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[40]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 40)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(144)]
-// Release output[36] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[316]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release output[184] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release output[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(160)]
-// Release output[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-368)]
-// Release output[160] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[292]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 40)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[56]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 56)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(256)]
-// Release output[316] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[176]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[296]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(160)]
-// Release output[292] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[32]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 32)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(224)]
-// Release output[56] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release output[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release output[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[308]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[44]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 44)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(128)]
-// Release output[32] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release output[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(176)]
-// Release output[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[336]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[72]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 72)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-352)]
-// Release output[164] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[192]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release output[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(288)]
-// Release output[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[84]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[204]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-240)]
-// Release output[192] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[324]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[88]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 88)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(384)]
-// Release output[348] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(336)]
-// Release output[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release output[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[88]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[208]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[328]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 76)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(288)]
-// Release output[324] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[220]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(352)]
-// Release output[88] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release output[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(304)]
-// Release output[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[220]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[340]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[76]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 76)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(256)]
-// Release output[64] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[196]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-128)]
-// Release output[220] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(352)]
-// Release output[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(304)]
-// Release output[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[344]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[80]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 80)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[200]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -52)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-224)]
-// Release output[196] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[320]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[92]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(368)]
-// Release output[344] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(320)]
-// Release output[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-208)]
-// Release output[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[92]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[212]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[332]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 80)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(272)]
-// Release output[320] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[120]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 120)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(368)]
-// Release output[92] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-160)]
-// Release output[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(320)]
-// Release output[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[240]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[360]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 108)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(272)]
-// Release output[68] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[96]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 96)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[252]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 0)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(480)]
-// Release output[120] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release output[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(432)]
-// Release output[360] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[252]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[372]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 120)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[108]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 108)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(384)]
-// Release output[96] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[228]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -24)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[376]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(0)]
-// Release output[252] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(480)]
-// Release output[372] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(432)]
-// Release output[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[376]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[112]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 112)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[232]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-96)]
-// Release output[228] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[352]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 100)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[124]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release output[376] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(448)]
-// Release output[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release output[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[124]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[244]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[364]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 112)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(400)]
-// Release output[352] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[100]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 100)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[248]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(496)]
-// Release output[124] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release output[244] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(448)]
-// Release output[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[248]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[368]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[104]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 104)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(400)]
-// Release output[100] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[224]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -28)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[380]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release output[248] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release output[368] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(416)]
-// Release output[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[380]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[116]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 116)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[236]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-112)]
-// Release output[224] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[356]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 104)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-vmul.u32 Q1, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-496)]
-// Release output[380] from Q0
-vmla.s32 Q1, Q4, r9
-vstrw.u32 Q3, [r1,#(464)]
-// Release output[116] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r11,#(-64)]
-// Release output[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(416)]
-// Release output[356] from Q2
-ldrd r7, r6, [r8], #+8
-// output[132]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vmul.u32 Q1, Q0, r7
-// output[0]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 0)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vmla.s32 Q1, Q0, r9
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r11,#(-480)]
-// Release output[132] from Q0
-vadd.s32 Q2, Q2, Q1
-// output[4]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[256]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 4)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[260]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 8)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(0)]
-// Release output[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[260]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[128]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(16)]
-// Release output[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(32)]
-// Release output[260] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[12]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[264]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-496)]
-// Release output[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[268]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[136]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release output[264] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[140]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[8]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 8)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[276]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 24)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release output[136] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[276]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[144]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -108)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(32)]
-// Release output[8] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(96)]
-// Release output[276] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[148]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[16]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 16)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[20]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 20)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-432)]
-// Release output[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[20]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[272]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(64)]
-// Release output[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(80)]
-// Release output[20] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[156]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[24]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 24)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[28]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 28)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(80)]
-// Release output[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[28]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[280]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 28)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(96)]
-// Release output[24] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(112)]
-// Release output[28] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[284]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[152]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -100)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load 
as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, 
#(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 
-vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] 
-// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 
-vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, 
[r1,#(496)]
-// Release output[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[248]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q3, Q3, r6
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(496)]
-// Release output[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r10,#(-496)]
-// Release output[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-16)]
-// Release output[248] from Q1
-.equ modulus_inv, 815674047
-movw r14, #:lower16:modulus_inv
-movt r14, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3355
-// Instruction count: 2397
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s b/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s
deleted file mode 100644
index 509f96a..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,3075 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_twiddles
-ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_twiddles: // For base multiplication
-.word 117231189 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 3747646315 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 167943169 // zeta^ 64 * 2^31 = 640922^ 64 * 2^31 = 67669976 * 2^31
-.word 1929942719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 64 * 3479293249 * 2^31
-.word 10524287 // zeta^ 32 * 2^31 = 640922^ 32 * 2^31 = 8569779 * 2^31
-.word 2274825921 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 32 * 3479293249 * 2^31
-.word 183751195 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31
-.word 2275215397 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31
-.word 197898871 // zeta^ 16 * 2^31 = 640922^ 16 * 2^31 = 82310697 * 2^31
-.word 189228233 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 16 * 3479293249 * 2^31
-.word 117636793 // zeta^ 80 * 2^31 = 640922^ 80 * 2^31 = 87332880 * 2^31
-.word 3072994823 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 80 * 3479293249 * 2^31
-.word 59998845 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31
-.word 1940964675 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31
-.word 22735857 // zeta^112 * 2^31 = 640922^112 * 2^31 = 44058032 * 2^31
-.word 2477333199 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 640922^112 * 3479293249 * 2^31
-.word 127637249 // zeta^ 8 * 2^31 = 640922^ 8 * 2^31 = 1793055 * 2^31
-.word 1932647359 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 =
640922^ 8 * 3479293249 * 2^31 -.word 78695545 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 3934662727 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 203907557 // zeta^ 40 * 2^31 = 640922^ 40 * 2^31 = 52463921 * 2^31 -.word 500614107 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 40 * 3479293249 * 2^31 -.word 212278911 // zeta^104 * 2^31 = 640922^104 * 2^31 = 46625229 * 2^31 -.word 3070660289 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 640922^104 * 3479293249 * 2^31 -.word 65439627 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 2806138549 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 141615223 // zeta^ 88 * 2^31 = 640922^ 88 * 2^31 = 56126250 * 2^31 -.word 830518985 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 88 * 3479293249 * 2^31 -.word 96791441 // zeta^ 56 * 2^31 = 640922^ 56 * 2^31 = 17702973 * 2^31 -.word 1466700591 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 56 * 3479293249 * 2^31 -.word 91234029 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 2063031507 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 172736993 // zeta^ 4 * 2^31 = 640922^ 4 * 2^31 = 640922 * 2^31 -.word 1396807903 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 4 * 3479293249 * 2^31 -.word 84666041 // zeta^ 68 * 2^31 = 640922^ 68 * 2^31 = 18021000 * 2^31 -.word 757024263 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 68 * 3479293249 * 2^31 -.word 145858849 // zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 3495799199 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 183858709 // zeta^100 * 2^31 = 640922^100 * 2^31 = 58708509 * 2^31 -.word 4012454827 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 640922^100 * 3479293249 * 2^31 -.word 177838823 // zeta^ 20 * 2^31 = 640922^ 20 * 2^31 = 81518432 * 2^31 -.word 3547181145 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 20 * 3479293249 * 2^31 
-.word 41900335 // zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 2540746769 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 60770513 // zeta^ 52 * 2^31 = 640922^ 52 * 2^31 = 82553845 * 2^31 -.word 4044236271 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 52 * 3479293249 * 2^31 -.word 167358029 // zeta^116 * 2^31 = 640922^116 * 2^31 = 31587287 * 2^31 -.word 953816435 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 640922^116 * 3479293249 * 2^31 -.word 51201803 // zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 3348244277 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 80521231 // zeta^ 76 * 2^31 = 640922^ 76 * 2^31 = 40418220 * 2^31 -.word 382095665 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 76 * 3479293249 * 2^31 -.word 99504283 // zeta^ 44 * 2^31 = 640922^ 44 * 2^31 = 52603644 * 2^31 -.word 3359009189 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 44 * 3479293249 * 2^31 -.word 40810197 // zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 935723755 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 171634653 // zeta^ 28 * 2^31 = 640922^ 28 * 2^31 = 31497268 * 2^31 -.word 2671255523 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 28 * 3479293249 * 2^31 -.word 139731691 // zeta^ 92 * 2^31 = 640922^ 92 * 2^31 = 87621537 * 2^31 -.word 1117909845 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 92 * 3479293249 * 2^31 -.word 62594557 // zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1184680387 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.word 164673767 // zeta^124 * 2^31 = 640922^124 * 2^31 = 78082914 * 2^31 -.word 2238255705 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 640922^124 * 3479293249 * 2^31 -.word 159354989 // zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 2477263699 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 100054829 // zeta^192 * 
2^31 = 640922^192 * 2^31 = 108643008 * 2^31 -.word 547320979 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 640922^192 * 3479293249 * 2^31 -.word 64583899 // zeta^160 * 2^31 = 640922^160 * 2^31 = 13028154 * 2^31 -.word 389477 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 640922^160 * 3479293249 * 2^31 -.word 206761731 // zeta^224 * 2^31 = 640922^224 * 2^31 = 100073230 * 2^31 -.word 2020141373 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 640922^224 * 3479293249 * 2^31 -.word 28380931 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 2883766589 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 19387147 // zeta^208 * 2^31 = 640922^208 * 2^31 = 26332312 * 2^31 -.word 4105739061 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 640922^208 * 3479293249 * 2^31 -.word 71380021 // zeta^176 * 2^31 = 640922^176 * 2^31 = 70392207 * 2^31 -.word 536368523 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 640922^176 * 3479293249 * 2^31 -.word 157287173 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 2354002619 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 59701305 // zeta^136 * 2^31 = 640922^136 * 2^31 = 106639146 * 2^31 -.word 2002015367 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 640922^136 * 3479293249 * 2^31 -.word 89648769 // zeta^200 * 2^31 = 640922^200 * 2^31 = 106849954 * 2^31 -.word 2362319935 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 640922^200 * 3479293249 * 2^31 -.word 117014363 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2570046181 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 13378461 // zeta^232 * 2^31 = 640922^232 * 2^31 = 56179088 * 2^31 -.word 3794353187 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 640922^232 * 3479293249 * 2^31 -.word 184818605 // zeta^152 * 2^31 = 640922^152 * 2^31 = 65895091 * 2^31 -.word 2319347731 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 640922^152 * 3479293249 * 2^31 -.word 151846391 // zeta^216 * 2^31 = 640922^216 * 2^31 = 
9768841 * 2^31 -.word 1488828745 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103085597 // zeta^184 * 2^31 = 640922^184 * 2^31 = 105229554 * 2^31 -.word 596330915 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 640922^184 * 3479293249 * 2^31 -.word 120494577 // zeta^248 * 2^31 = 640922^248 * 2^31 = 90940036 * 2^31 -.word 2828266703 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 640922^248 * 3479293249 * 2^31 -.word 20572057 // zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 3655183655 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 44549025 // zeta^196 * 2^31 = 640922^196 * 2^31 = 108002087 * 2^31 -.word 2898159391 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 640922^196 * 3479293249 * 2^31 -.word 146642869 // zeta^164 * 2^31 = 640922^164 * 2^31 = 54775275 * 2^31 -.word 516655627 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 640922^164 * 3479293249 * 2^31 -.word 71427169 // zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 799168095 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 189990539 // zeta^148 * 2^31 = 640922^148 * 2^31 = 61145083 * 2^31 -.word 3288532917 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 640922^148 * 3479293249 * 2^31 -.word 39447195 // zeta^212 * 2^31 = 640922^212 * 2^31 = 27124577 * 2^31 -.word 747786149 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 640922^212 * 3479293249 * 2^31 -.word 215230525 // zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1204547459 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 156515505 // zeta^244 * 2^31 = 640922^244 * 2^31 = 26089164 * 2^31 -.word 250731023 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 640922^244 * 3479293249 * 2^31 -.word 137962437 // zeta^140 * 2^31 = 640922^140 * 2^31 = 57770712 * 2^31 -.word 1328818683 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 640922^140 * 3479293249 * 2^31 -.word 166084215 // zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 
946723017 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 49948923 // zeta^172 * 2^31 = 640922^172 * 2^31 = 62290981 * 2^31 -.word 1871681861 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 640922^172 * 3479293249 * 2^31 -.word 117781735 // zeta^236 * 2^31 = 640922^236 * 2^31 = 56039365 * 2^31 -.word 935958105 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 640922^236 * 3479293249 * 2^31 -.word 76740047 // zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 2741621617 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 45651365 // zeta^220 * 2^31 = 640922^220 * 2^31 = 77145741 * 2^31 -.word 1623711771 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 640922^220 * 3479293249 * 2^31 -.word 210722219 // zeta^188 * 2^31 = 640922^188 * 2^31 = 94509732 * 2^31 -.word 1053575317 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 640922^188 * 3479293249 * 2^31 -.word 154691461 // zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 3110286907 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 49342849 // zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 2365024575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 57931029 // zeta^320 * 2^31 = 640922^320 * 2^31 = 40973034 * 2^31 -.word 1817703595 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 640922^320 * 3479293249 * 2^31 -.word 33534823 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 2019751897 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 152702119 // zeta^352 * 2^31 = 640922^352 * 2^31 = 95614855 * 2^31 -.word 4294577817 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 640922^352 * 3479293249 * 2^31 -.word 99649225 // zeta^272 * 2^31 = 640922^272 * 2^31 = 21310129 * 2^31 -.word 1221972471 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 640922^272 * 3479293249 * 2^31 -.word 188905087 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 1411200705 // zeta^336 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31
-.word 194550161 // zeta^304 * 2^31 = 640922^304 * 2^31 = 64584977 * 2^31
-.word 1817634095 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 640922^304 * 3479293249 * 2^31
-.word 145905997 // zeta^368 * 2^31 = 640922^368 * 2^31 = 38250802 * 2^31
-.word 3758598771 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 640922^368 * 3479293249 * 2^31
-.word 138590473 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31
-.word 360304567 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31
-.word 157584713 // zeta^328 * 2^31 = 640922^328 * 2^31 = 2003863 * 2^31
-.word 2292951927 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 640922^328 * 3479293249 * 2^31
-.word 5007107 // zeta^296 * 2^31 = 640922^296 * 2^31 = 62017780 * 2^31
-.word 1224307005 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 640922^296 * 3479293249 * 2^31
-.word 100271655 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31
-.word 1724921113 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31
-.word 75670795 // zeta^280 * 2^31 = 640922^280 * 2^31 = 52516759 * 2^31
-.word 3464448309 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 640922^280 * 3479293249 * 2^31
-.word 32467413 // zeta^344 * 2^31 = 640922^344 * 2^31 = 42747918 * 2^31
-.word 1975619563 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 640922^344 * 3479293249 * 2^31
-.word 126051989 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31
-.word 2231935787 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31
-.word 114200421 // zeta^376 * 2^31 = 640922^376 * 2^31 = 3413455 * 2^31
-.word 3698636379 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 640922^376 * 3479293249 * 2^31
-.word 132619977 // zeta^260 * 2^31 = 640922^260 * 2^31 = 90622009 * 2^31
-.word 3537943031 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 640922^260 * 3479293249 * 2^31
-.word 196713961 // zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31
-.word 639783639 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 2^31
-.word 33427309 // zeta^292 * 2^31 = 640922^292 * 2^31 = 49934500 * 2^31
-.word 282512467 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 640922^292 * 3479293249 * 2^31
-.word 70643149 // zeta^356 * 2^31 = 640922^356 * 2^31 = 53867734 * 2^31
-.word 3778311667 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 640922^356 * 3479293249 * 2^31
-.word 175385683 // zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31
-.word 1754220525 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31
-.word 27295479 // zeta^340 * 2^31 = 640922^340 * 2^31 = 47497926 * 2^31
-.word 1006434377 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 640922^340 * 3479293249 * 2^31
-.word 49927989 // zeta^308 * 2^31 = 640922^308 * 2^31 = 77055722 * 2^31
-.word 3341150859 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 640922^308 * 3479293249 * 2^31
-.word 2055493 // zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31
-.word 3090419835 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31
-.word 136764787 // zeta^268 * 2^31 = 640922^268 * 2^31 = 68224789 * 2^31
-.word 3912871629 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 640922^268 * 3479293249 * 2^31
-.word 79323581 // zeta^332 * 2^31 = 640922^332 * 2^31 = 50872297 * 2^31
-.word 2966148611 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 640922^332 * 3479293249 * 2^31
-.word 176475821 // zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31
-.word 3359243539 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31
-.word 167337095 // zeta^364 * 2^31 = 640922^364 * 2^31 = 46352028 * 2^31
-.word 2423285433 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 640922^364 * 3479293249 * 2^31
-.word 77554327 // zeta^284 * 2^31 = 640922^284 * 2^31 = 21021472 * 2^31
-.word 3177057449 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 640922^284 * 3479293249 * 2^31
-.word 140545971 // zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31
-.word 1553345677 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31
-.word 52612251 // zeta^316 * 2^31 = 640922^316 * 2^31 = 30560095 * 2^31
-.word 2056711589 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 640922^316 * 3479293249 * 2^31
-.word 6563799 // zeta^380 * 2^31 = 640922^380 * 2^31 = 14133277 * 2^31
-.word 3241391977 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 640922^380 * 3479293249 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_scale
-ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N
-.word 117231189 // 1/96
-.word 3747646315 // 1/96 twisted
-.data
-roots:
-.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31
-.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31
-.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31
-.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31
-.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31
-.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31
-.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31
-.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31
-.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31
-.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31
-.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31
-.word 210808 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31
-.word 4166920 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31
-.word 102804317 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31
-.word 2032073593 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31
-.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31
-.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31
-.word 98874168 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31
-.word 1954388607 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31
-.word 94353491 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31
-.word 1865030994 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31
-.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 87045076 // XX: zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31
-.word 1720569773 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31
-.word 5022183 // XX: zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31
-.word 99270592 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31
-.word 82308834 // XX: zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31
-.word 1626951211 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31
-.word 210808 // XX: zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31
-.word 4166920 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31
-.word 102804317 // XX: zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31
-.word 2032073593 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31
-.word 98874168 // XX: zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31
-.word 1954388607 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31
-.word 94353491 // XX: zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31
-.word 1865030994 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31
-.word 17380078 // XX: zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31
-.word 343541970 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31
-.word 3933234 // XX: zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31
-.word 77745966 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31
-.word 74622503 // XX: zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31
-.word 1475019943 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31
-.word 57676451 // XX: zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31
-.word 1140057115 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31
-.word 91290517 // XX: zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31
-.word 1804486955 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31
-.word 102391393 // XX: zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31
-.word 2023911563 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31
-.word 56124269 // XX: zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31
-.word 1109376029 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31
-.word 92216191 // XX: zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31
-.word 1822784218 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31
-.text
-.align 4
-roots_addr:
.word roots
-.syntax unified
-.type ntt_384_u32_108643009_640922_incomplete_good_oop_half_input, %function
-.global ntt_384_u32_108643009_640922_incomplete_good_oop_half_input
-ntt_384_u32_108643009_640922_incomplete_good_oop_half_input:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 512
-add r14, r0, #512
-// Use r12 as marker for r0 + 1024
-add r12, r14, #512
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-// Use r10 as marker for r1 + 2016
-add r10, r11, #1008
-.equ modulus, -108643009
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vmul.u32 Q2, Q0, r7
-vadd.s32 Q4, Q1, Q0
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 4)]
-vqrdmulh.s32 Q3, Q0, r6
-// input[4]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 4)]
-vsub.s32 Q5, Q1, Q0
-vmla.s32 Q2, Q3, r9
-vstrw.u32 Q4, [r1,#(0)]
-vadd.s32 Q3, Q1, Q2
-vstrw.u32 Q3, [r11,#(-496)]
-vsub.s32 Q5, Q5, Q2
-vstrw.u32 Q5, [r11,#(16)]
-// Release input[0] from Q1
-// Release input[128] from Q0
-// input[4]: Already loaded as Q7
-// input[132]: Already loaded as Q6
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q6, Q7
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[8]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 8)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-480)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(32)]
-// Release input[132] from Q6
-// Release input[4] from Q7
-// input[136]: Already loaded as Q4
-// input[8]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r7
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r6
-// input[140]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[136] from Q4 -vstrw.u32 Q2, [r11,#(48)] -vsub.s32 Q4, Q1, Q5 -// Release input[8] from Q5 -vstrw.u32 Q4, [r11,#(-464)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(32)] -// input[140]: Already loaded as Q6 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vqrdmulh.s32 Q1, Q6, r6 -// input[16]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-448)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release input[12] from Q3 -// Release input[140] from Q6 -// input[16]: Already loaded as Q7 -// input[144]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmulh.s32 Q1, Q7, r6 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(64)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(80)] -// Release input[144] from Q5 -// Release input[16] from Q7 -// input[148]: Already loaded as Q4 -// input[20]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[148] from Q4 -vstrw.u32 Q2, [r11,#(96)] -vsub.s32 Q4, Q1, Q6 -// Release input[20] from Q6 -vstrw.u32 Q4, [r11,#(-416)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(80)] -// input[152]: Already loaded as Q5 -// input[24]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[156]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmulh.s32 Q1, Q5, r6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q3, Q0 
-vstrw.u32 Q1, [r11,#(-400)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(112)] -// Release input[24] from Q3 -// Release input[152] from Q5 -// input[28]: Already loaded as Q7 -// input[156]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmulh.s32 Q1, Q7, r6 -// input[32]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 32)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(112)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release input[156] from Q6 -// Release input[28] from Q7 -// input[160]: Already loaded as Q4 -// input[32]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[160] from Q4 -vstrw.u32 Q2, [r11,#(144)] -vsub.s32 Q4, Q1, Q5 -// Release input[32] from Q5 -vstrw.u32 Q4, [r11,#(-368)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(128)] -// input[164]: Already loaded as Q6 -// input[36]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vqrdmulh.s32 Q1, Q6, r6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-352)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(160)] -// Release input[36] from Q3 -// Release input[164] from Q6 -// input[40]: Already loaded as Q7 -// input[168]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmulh.s32 Q1, Q7, r6 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 44)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q3, 
Q3, Q0 -vstrw.u32 Q3, [r11,#(176)] -// Release input[168] from Q5 -// Release input[40] from Q7 -// input[172]: Already loaded as Q4 -// input[44]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[172] from Q4 -vstrw.u32 Q2, [r11,#(192)] -vsub.s32 Q4, Q1, Q6 -// Release input[44] from Q6 -vstrw.u32 Q4, [r11,#(-320)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(176)] -// input[176]: Already loaded as Q5 -// input[48]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 52)] -vqrdmulh.s32 Q1, Q5, r6 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-304)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(208)] -// Release input[48] from Q3 -// Release input[176] from Q5 -// input[52]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[184]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmulh.s32 Q1, Q7, r6 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(208)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(224)] -// Release input[180] from Q6 -// Release input[52] from Q7 -// input[184]: Already loaded as Q4 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[188]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[184] from Q4 -vstrw.u32 Q2, [r11,#(240)] -vsub.s32 Q4, Q1, Q5 -// Release input[56] from Q5 
-vstrw.u32 Q4, [r11,#(-272)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(224)] -// input[188]: Already loaded as Q6 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -vqrdmulh.s32 Q1, Q6, r6 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-256)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release input[60] from Q3 -// Release input[188] from Q6 -// input[64]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -vneg.s32 Q1, Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q5, r6 -vstrw.u32 Q5, [r11,#(-240)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(256)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(272)] -// Release input[64] from Q5 -// input[68]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -vneg.s32 Q1, Q3 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmulh.s32 Q2, Q3, r6 -vstrw.u32 Q3, [r11,#(288)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(272)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[68] from Q3 -// input[72]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(288)] -vstrw.u32 Q4, [r11,#(304)] -vstrw.u32 Q4, [r11,#(-208)] -// Release input[72] from Q4 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-192)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(320)] -// Release input[76] from Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(336)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(320)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[80] from Q4 -// input[84]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(336)] -vstrw.u32 Q3, 
[r11,#(352)] -vstrw.u32 Q3, [r11,#(-160)] -// Release input[84] from Q3 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-144)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(368)] -// Release input[88] from Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(384)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(368)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[92] from Q4 -// input[96]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(384)] -vstrw.u32 Q3, [r11,#(400)] -vstrw.u32 Q3, [r11,#(-112)] -// Release input[96] from Q3 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-96)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(400)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release input[100] from Q0 -// input[104]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(432)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(416)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[104] from Q4 -// input[108]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(432)] -vstrw.u32 Q3, [r11,#(448)] -vstrw.u32 Q3, [r11,#(-64)] -// Release input[108] from Q3 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-48)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(448)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(464)] -// Release input[112] 
from Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(480)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(464)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[116] from Q4 -// input[120]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(480)] -vstrw.u32 Q3, [r11,#(496)] -vstrw.u32 Q3, [r11,#(-16)] -// Release input[120] from Q3 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(0)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(496)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[124] from Q0 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] 
-vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 
-vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 
-vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// 
Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already 
loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, 
r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 
Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 
-vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 
-// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[132]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[280]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-384)] -// Release output[156] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(48)] -// Release output[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r6 -// output[136]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, 
[r11,#(-480)] -// Release output[132] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[256]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[28]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(112)] -// Release output[280] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release output[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(16)] -// Release output[256] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[4]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 4)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[152]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(112)] -// Release output[28] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r6 -// output[8]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release output[4] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, 
Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[128]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[284]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-400)] -// Release output[152] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(32)] -// Release output[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r6 -// output[140]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-496)] -// Release output[128] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[260]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(128)] -// Release output[284] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release output[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[48]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r6 -// output[168]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(32)] -// Release output[260] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, 
Q0 -vmla.s32 Q2, Q3, r9 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[60]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(192)] -// Release output[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release output[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[180]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r6 -// output[300]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release output[288] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[36]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 36)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[184]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(240)] -// Release output[60] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release output[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release output[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[304]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r6 -// output[40]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 40)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(144)] -// Release output[36] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vmul.u32 Q5, 
Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[316]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release output[184] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release output[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(160)] -// Release output[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-368)] -// Release output[160] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[292]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[56]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 56)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// Release output[316] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[176]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r6 -// output[296]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release output[292] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[32]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 32)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 
-// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(224)] -// Release output[56] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release output[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(176)] -// Release output[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[308]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r6 -// output[44]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 44)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(128)] -// Release output[32] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release output[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(176)] -// Release output[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[336]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r6 -// output[72]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 72)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-352)] -// Release output[164] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[192]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[348]: Load 
as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release output[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(288)] -// Release output[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[84]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r6 -// output[204]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-240)] -// Release output[192] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[324]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 
Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 
-vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 
-vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 
Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: 
Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 
Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// 
Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 
-vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 
-// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 
Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// 
output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as 
Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 
Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 
-// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 815674047 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - 
-// Line count: 3042
-// Instruction count: 2201
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good.s
deleted file mode 100644
index 44e101d..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good.s
+++ /dev/null
@@ -1,3383 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -.global ntt_384_u32_114826273_2551686_incomplete_good_twiddles -ntt_384_u32_114826273_2551686_incomplete_good_twiddles: // For base multiplication -.word 28609785 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 3700025895 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 41241721 // zeta^ 64 * 2^31 = 2551686^ 64 * 2^31 = 42887728 * 2^31 -.word 2769623719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 64 * 553543649 * 2^31 -.word 86448423 // zeta^ 32 * 2^31 = 2551686^ 32 * 2^31 = 10217507 * 2^31 -.word 3906089913 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 32 * 553543649 * 2^31 -.word 183227061 // zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31 -.word 3579771883 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31 -.word 126234215 // zeta^ 16 * 2^31 = 2551686^ 16 * 2^31 = 35511129 * 2^31 -.word 2083767929 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 16 * 553543649 * 2^31 -.word 87914355 // zeta^ 80 * 2^31 = 2551686^ 80 * 2^31 = 29513246 * 2^31 -.word 3318791917 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 80 * 553543649 * 2^31 -.word 3386929 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31 -.word 316682223 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31 -.word 43010579 // zeta^112 * 2^31 = 2551686^112 * 2^31 = 91801134 * 2^31 -.word 3532079181 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 2551686^112 * 553543649 * 2^31 -.word 64379879 // zeta^ 8 * 2^31 = 2551686^ 8 * 2^31 = 107284677 * 2^31 -.word 1382893817 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 8 * 553543649 * 2^31 -.word 34806835 // zeta^ 72 * 2^31 = 2551686^ 72 * 2^31 = 38894874 * 2^31 -.word 574730797 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 72 * 553543649 * 2^31 -.word 38994743 // zeta^ 40 * 2^31 = 2551686^ 40 * 2^31 = 42274665 * 2^31 -.word 2116881321 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 40 * 553543649 * 2^31 -.word 66134929 // zeta^104 * 2^31 = 
2551686^104 * 2^31 = 83688403 * 2^31 -.word 434267791 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 2551686^104 * 553543649 * 2^31 -.word 69421183 // zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31 -.word 1557147489 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31 -.word 45651711 // zeta^ 88 * 2^31 = 2551686^ 88 * 2^31 = 87903397 * 2^31 -.word 2785063137 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 88 * 553543649 * 2^31 -.word 139403893 // zeta^ 56 * 2^31 = 2551686^ 56 * 2^31 = 72290827 * 2^31 -.word 192914475 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 56 * 553543649 * 2^31 -.word 79481515 // zeta^120 * 2^31 = 2551686^120 * 2^31 = 35185048 * 2^31 -.word 1792873141 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 2551686^120 * 553543649 * 2^31 -.word 203088573 // zeta^ 4 * 2^31 = 2551686^ 4 * 2^31 = 2551686 * 2^31 -.word 4087262435 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 4 * 553543649 * 2^31 -.word 54238839 // zeta^ 68 * 2^31 = 2551686^ 68 * 2^31 = 31842847 * 2^31 -.word 298178665 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 68 * 553543649 * 2^31 -.word 37245341 // zeta^ 36 * 2^31 = 2551686^ 36 * 2^31 = 104977060 * 2^31 -.word 3387169283 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 36 * 553543649 * 2^31 -.word 135079565 // zeta^100 * 2^31 = 2551686^100 * 2^31 = 8890579 * 2^31 -.word 3255329555 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 2551686^100 * 553543649 * 2^31 -.word 221900801 // zeta^ 20 * 2^31 = 2551686^ 20 * 2^31 = 49422185 * 2^31 -.word 3481643039 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 20 * 553543649 * 2^31 -.word 54737445 // zeta^ 84 * 2^31 = 2551686^ 84 * 2^31 = 69964525 * 2^31 -.word 3301900923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 84 * 553543649 * 2^31 -.word 209527495 // zeta^ 52 * 2^31 = 2551686^ 52 * 2^31 = 20420695 * 2^31 -.word 1268014617 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 52 * 553543649 * 2^31 -.word 33294343 // zeta^116 * 2^31 = 2551686^116 * 2^31 = 
4619010 * 2^31 -.word 55582937 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 2551686^116 * 553543649 * 2^31 -.word 224674633 // zeta^ 12 * 2^31 = 2551686^ 12 * 2^31 = 64987487 * 2^31 -.word 23100887 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 12 * 553543649 * 2^31 -.word 173107497 // zeta^ 76 * 2^31 = 2551686^ 76 * 2^31 = 57394293 * 2^31 -.word 3061126135 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 76 * 553543649 * 2^31 -.word 207049913 // zeta^ 44 * 2^31 = 2551686^ 44 * 2^31 = 80711981 * 2^31 -.word 3566869095 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 44 * 553543649 * 2^31 -.word 106889387 // zeta^108 * 2^31 = 2551686^108 * 2^31 = 87479803 * 2^31 -.word 596745397 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 2551686^108 * 553543649 * 2^31 -.word 62148987 // zeta^ 28 * 2^31 = 2551686^ 28 * 2^31 = 25357458 * 2^31 -.word 4189185509 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 28 * 553543649 * 2^31 -.word 218880525 // zeta^ 92 * 2^31 = 2551686^ 92 * 2^31 = 110972869 * 2^31 -.word 3404996499 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 92 * 553543649 * 2^31 -.word 47821729 // zeta^ 60 * 2^31 = 2551686^ 60 * 2^31 = 21139561 * 2^31 -.word 2376423551 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 60 * 553543649 * 2^31 -.word 79224313 // zeta^124 * 2^31 = 2551686^124 * 2^31 = 24273777 * 2^31 -.word 455588135 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 2551686^124 * 553543649 * 2^31 -.word 127458209 // zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31 -.word 3364565119 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31 -.word 201042761 // zeta^192 * 2^31 = 2551686^192 * 2^31 = 114826272 * 2^31 -.word 594941399 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 2551686^192 * 553543649 * 2^31 -.word 211604911 // zeta^160 * 2^31 = 2551686^160 * 2^31 = 1326612 * 2^31 -.word 3968649265 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 2551686^160 * 553543649 * 2^31 -.word 143204123 // zeta^224 * 2^31 = 2551686^224 * 2^31 = 104608766 * 2^31 
-.word 388877381 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 2551686^224 * 553543649 * 2^31 -.word 76506413 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31 -.word 1235023987 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31 -.word 103418331 // zeta^208 * 2^31 = 2551686^208 * 2^31 = 79315144 * 2^31 -.word 2211199365 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 2551686^208 * 553543649 * 2^31 -.word 154449923 // zeta^176 * 2^31 = 2551686^176 * 2^31 = 59320057 * 2^31 -.word 3215396957 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 2551686^176 * 553543649 * 2^31 -.word 226265617 // zeta^240 * 2^31 = 2551686^240 * 2^31 = 82345196 * 2^31 -.word 3978285071 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 2551686^240 * 553543649 * 2^31 -.word 85253229 // zeta^136 * 2^31 = 2551686^136 * 2^31 = 46436470 * 2^31 -.word 3486804275 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 2551686^136 * 553543649 * 2^31 -.word 165272667 // zeta^200 * 2^31 = 2551686^200 * 2^31 = 7541596 * 2^31 -.word 2912073477 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 2551686^200 * 553543649 * 2^31 -.word 141966459 // zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31 -.word 2612353765 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31 -.word 190657803 // zeta^232 * 2^31 = 2551686^232 * 2^31 = 72551608 * 2^31 -.word 2178085973 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 2551686^232 * 553543649 * 2^31 -.word 91056801 // zeta^152 * 2^31 = 2551686^152 * 2^31 = 96718378 * 2^31 -.word 1227915647 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 2551686^152 * 553543649 * 2^31 -.word 160231363 // zeta^216 * 2^31 = 2551686^216 * 2^31 = 8814981 * 2^31 -.word 2737819805 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 2551686^216 * 553543649 * 2^31 -.word 54903895 // zeta^184 * 2^31 = 2551686^184 * 2^31 = 77720494 * 2^31 -.word 1599958665 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 2551686^184 * 553543649 * 2^31 -.word 90248653 // zeta^248 * 2^31 = 2551686^248 * 2^31 = 42535446 * 2^31 -.word 
4102052819 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 2551686^248 * 553543649 * 2^31 -.word 195629085 // zeta^132 * 2^31 = 2551686^132 * 2^31 = 29291161 * 2^31 -.word 505883523 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 2551686^132 * 553543649 * 2^31 -.word 26563973 // zeta^196 * 2^31 = 2551686^196 * 2^31 = 112274587 * 2^31 -.word 207704859 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 2551686^196 * 553543649 * 2^31 -.word 212660497 // zeta^164 * 2^31 = 2551686^164 * 2^31 = 18739792 * 2^31 -.word 4163127567 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 2551686^164 * 553543649 * 2^31 -.word 192407205 // zeta^228 * 2^31 = 2551686^228 * 2^31 = 9849213 * 2^31 -.word 907798011 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 2551686^228 * 553543649 * 2^31 -.word 177315463 // zeta^148 * 2^31 = 2551686^148 * 2^31 = 20542340 * 2^31 -.word 4115225177 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 2551686^148 * 553543649 * 2^31 -.word 7751745 // zeta^212 * 2^31 = 2551686^212 * 2^31 = 65404088 * 2^31 -.word 813324255 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 2551686^212 * 553543649 * 2^31 -.word 168245667 // zeta^180 * 2^31 = 2551686^180 * 2^31 = 99024588 * 2^31 -.word 3082535613 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 2551686^180 * 553543649 * 2^31 -.word 20125051 // zeta^244 * 2^31 = 2551686^244 * 2^31 = 94405578 * 2^31 -.word 3026952677 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 2551686^244 * 553543649 * 2^31 -.word 63259137 // zeta^140 * 2^31 = 2551686^140 * 2^31 = 107233079 * 2^31 -.word 3038025247 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 2551686^140 * 553543649 * 2^31 -.word 4977913 // zeta^204 * 2^31 = 2551686^204 * 2^31 = 49838786 * 2^31 -.word 4271866407 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 2551686^204 * 553543649 * 2^31 -.word 14665747 // zeta^172 * 2^31 = 2551686^172 * 2^31 = 6767822 * 2^31 -.word 1324843597 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 2551686^172 * 553543649 * 2^31 -.word 22602633 // zeta^236 * 2^31 = 2551686^236 * 2^31 = 34114292 * 2^31 -.word 728098199 // 
zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 2551686^236 * 553543649 * 2^31 -.word 41905265 // zeta^156 * 2^31 = 2551686^156 * 2^31 = 85615411 * 2^31 -.word 3510778287 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 2551686^156 * 553543649 * 2^31 -.word 167503559 // zeta^220 * 2^31 = 2551686^220 * 2^31 = 89468815 * 2^31 -.word 105781785 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 2551686^220 * 553543649 * 2^31 -.word 146228857 // zeta^188 * 2^31 = 2551686^188 * 2^31 = 3134216 * 2^31 -.word 2374131879 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 2551686^188 * 553543649 * 2^31 -.word 181830817 // zeta^252 * 2^31 = 2551686^252 * 2^31 = 93686712 * 2^31 -.word 1918543743 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 2551686^252 * 553543649 * 2^31 -.word 188410825 // zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31 -.word 1525343575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31 -.word 102194337 // zeta^320 * 2^31 = 2551686^320 * 2^31 = 71938546 * 2^31 -.word 930402175 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 2551686^320 * 553543649 * 2^31 -.word 46425485 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 715195411 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 18047635 // zeta^352 * 2^31 = 2551686^352 * 2^31 = 113499661 * 2^31 -.word 326318029 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 2551686^352 * 553543649 * 2^31 -.word 141738191 // zeta^272 * 2^31 = 2551686^272 * 2^31 = 85313027 * 2^31 -.word 976175377 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 2551686^272 * 553543649 * 2^31 -.word 153146133 // zeta^336 * 2^31 = 2551686^336 * 2^31 = 5997883 * 2^31 -.word 3059943307 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 2551686^336 * 553543649 * 2^31 -.word 186641967 // zeta^304 * 2^31 = 2551686^304 * 2^31 = 23025139 * 2^31 -.word 762888113 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 2551686^304 * 553543649 * 2^31 -.word 75202623 // zeta^368 * 2^31 = 2551686^368 * 2^31 = 55506216 * 2^31 -.word 1079570337 // zeta^368 * 
f(q^(-1) mod 2^32) * 2^31 = 2551686^368 * 553543649 * 2^31 -.word 194845711 // zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31 -.word 3720236497 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31 -.word 144399317 // zeta^328 * 2^31 = 2551686^328 * 2^31 = 68389803 * 2^31 -.word 808163019 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 2551686^328 * 553543649 * 2^31 -.word 163517617 // zeta^296 * 2^31 = 2551686^296 * 2^31 = 31137870 * 2^31 -.word 3860699503 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 2551686^296 * 553543649 * 2^31 -.word 87686087 // zeta^360 * 2^31 = 2551686^360 * 2^31 = 73412535 * 2^31 -.word 1682613529 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 2551686^360 * 553543649 * 2^31 -.word 184000835 // zeta^280 * 2^31 = 2551686^280 * 2^31 = 26922876 * 2^31 -.word 1509904157 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 2551686^280 * 553543649 * 2^31 -.word 138595745 // zeta^344 * 2^31 = 2551686^344 * 2^31 = 18107895 * 2^31 -.word 3067051647 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 2551686^344 * 553543649 * 2^31 -.word 150171031 // zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31 -.word 2502094153 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31 -.word 174748651 // zeta^376 * 2^31 = 2551686^376 * 2^31 = 37105779 * 2^31 -.word 2695008629 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 2551686^376 * 553543649 * 2^31 -.word 175413707 // zeta^260 * 2^31 = 2551686^260 * 2^31 = 82983426 * 2^31 -.word 3996788629 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 2551686^260 * 553543649 * 2^31 -.word 34023461 // zeta^324 * 2^31 = 2551686^324 * 2^31 = 85535112 * 2^31 -.word 3789083771 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 2551686^324 * 553543649 * 2^31 -.word 94572981 // zeta^292 * 2^31 = 2551686^292 * 2^31 = 105935694 * 2^31 -.word 1039637739 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 2551686^292 * 553543649 * 2^31 -.word 16992049 // zeta^356 * 2^31 = 2551686^356 * 2^31 = 96086481 * 2^31 -.word 131839727 // zeta^356 * f(q^(-1) mod 
2^32) * 2^31 = 2551686^356 * 553543649 * 2^31 -.word 174915101 // zeta^276 * 2^31 = 2551686^276 * 2^31 = 44861748 * 2^31 -.word 993066371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 2551686^276 * 553543649 * 2^31 -.word 52337083 // zeta^340 * 2^31 = 2551686^340 * 2^31 = 94283933 * 2^31 -.word 179742117 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 2551686^340 * 553543649 * 2^31 -.word 196358203 // zeta^308 * 2^31 = 2551686^308 * 2^31 = 110207263 * 2^31 -.word 4239384357 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 2551686^308 * 553543649 * 2^31 -.word 61406879 // zeta^372 * 2^31 = 2551686^372 * 2^31 = 15801685 * 2^31 -.word 1212431681 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 2551686^372 * 553543649 * 2^31 -.word 56545049 // zeta^268 * 2^31 = 2551686^268 * 2^31 = 57431980 * 2^31 -.word 1233841159 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 2551686^268 * 553543649 * 2^31 -.word 166393409 // zeta^332 * 2^31 = 2551686^332 * 2^31 = 7593194 * 2^31 -.word 1256942047 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 2551686^332 * 553543649 * 2^31 -.word 122763159 // zeta^300 * 2^31 = 2551686^300 * 2^31 = 27346470 * 2^31 -.word 3698221897 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 2551686^300 * 553543649 * 2^31 -.word 214986799 // zeta^364 * 2^31 = 2551686^364 * 2^31 = 108058451 * 2^31 -.word 2970123697 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 2551686^364 * 553543649 * 2^31 -.word 10772021 // zeta^284 * 2^31 = 2551686^284 * 2^31 = 3853404 * 2^31 -.word 889970795 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 2551686^284 * 553543649 * 2^31 -.word 187747281 // zeta^348 * 2^31 = 2551686^348 * 2^31 = 29210862 * 2^31 -.word 784189007 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 2551686^348 * 553543649 * 2^31 -.word 150428233 // zeta^316 * 2^31 = 2551686^316 * 2^31 = 90552496 * 2^31 -.word 3839379159 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 2551686^316 * 553543649 * 2^31 -.word 83423689 // zeta^380 * 2^31 = 2551686^380 * 2^31 = 111692057 * 2^31 -.word 1920835415 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 
2551686^380 * 553543649 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_114826273_2551686_incomplete_good_scale -ntt_384_u32_114826273_2551686_incomplete_good_scale: // Constants for scaling by 1/N -.word 28609785 // 1/96 -.word 3700025895 // 1/96 twisted -.data -roots: -.word 71938545 /// zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31 -.word 1345396354 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31 -.word 42887727 /// zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31 -.word 802087275 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 108828390 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31 -.word 2035311100 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31 -.word 32481077 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31 -.word 607461863 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 
* 553543649 * 2^31
-.word 108828390 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31
-.word 75931399 // zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31
-.word 1420070803 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31
-.word 41413738 // zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31
-.word 774520698 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31
-.word 32481077 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31
-.word 106011292 // zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31
-.word 1982625667 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31
-.word 79641225 // zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31
-.word 1489452056 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31
-.word 1 // XX: zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 103282154 // XX: zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31
-.word 1931585264 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31
-.word 108828390 // XX: zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31
-.word 2035311100 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31
-.word 32481077 // XX: zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31
-.word 607461863 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31
-.word 75931399 // XX: zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31
-.word 1420070803 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31
-.word 41413738 // XX: zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31
-.word 774520698 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31
-.word 106011292 // XX: zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31
-.word 1982625667 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31
-.word 79641225 // XX: zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31
-.word 1489452056 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31
-.word 29291161 // XX: zeta^132 * 2^31 = 2551686^132 * 2^31 = 29291161 * 2^31
-.word 547803979 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 2551686^132 * 553543649 * 2^31
-.word 104977060 // XX: zeta^ 36 * 2^31 = 2551686^ 36 * 2^31 = 104977060 * 2^31
-.word 1963283436 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 36 * 553543649 * 2^31
-.word 44861748 // XX: zeta^276 * 2^31 = 2551686^276 * 2^31 = 44861748 * 2^31
-.word 839005462 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 2551686^276 * 553543649 * 2^31
-.word 99024588 // XX: zeta^180 * 2^31 = 2551686^180 * 2^31 = 99024588 * 2^31
-.word 1851960165 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 2551686^180 * 553543649 * 2^31
-.word 64987487 // XX: zeta^ 12 * 2^31 = 2551686^ 12 * 2^31 = 64987487 * 2^31
-.word 1215397505 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 12 * 553543649 * 2^31
-.word 27346470 // XX: zeta^300 * 2^31 = 2551686^300 * 2^31 = 27346470 * 2^31
-.word 511434323 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 2551686^300 * 553543649 * 2^31
-.word 85615411 // XX: zeta^156 * 2^31 = 2551686^156 * 2^31 = 85615411 * 2^31
-.word 1601181422 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 2551686^156 * 553543649 * 2^31
-.word 21139561 // XX: zeta^ 60 * 2^31 = 2551686^ 60 * 2^31 = 21139561 * 2^31
-.word 395352565 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 60 * 553543649 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_114826273_2551686_incomplete_good, %function
-.global ntt_384_u32_114826273_2551686_incomplete_good
-ntt_384_u32_114826273_2551686_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, -114826273
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vqrdmulh.s32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vqrdmulh.s32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmla.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, [r14,#(-480)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-//
Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(272)] -// input[164]: Already loaded as Q7 -// input[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release input[356] from Q5 -// input[260]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[104]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[296]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 44)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release input[164] from Q7 -vstrw.u32 Q3, [r14,#(32)] -// Release input[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[296]: Already loaded as Q6 -// input[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release input[104] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release input[200] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release input[296] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-208)] -// input[44]: Already loaded as Q7 -// input[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release input[236] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// 
input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[332] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[176]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -76)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release input[44] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(320)] -// input[176]: Already loaded as Q6 -// input[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release input[368] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[80] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release input[176] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(320)] -// input[308]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[248]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[308] from Q7 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already loaded as Q6 -// input[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release input[248] from Q5 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[344] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(368)] -// input[188]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] from Q5 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release input[92] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[264]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release input[188] from Q7 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(368)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r8 -// input[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r11 -vmul.u32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r11 -// input[0]: Load 
as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r11 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(96)] -// Release input[24] from Q5 -vmla.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(48)] -// Release input[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[280]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(48)] -// Release input[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r8 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 
-vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vmul.u32 Q6, 
Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r8 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 
-vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 
-vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 
-vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] 
-// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, 
r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[332]: Load 
as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 
108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vmul.u32 Q2, 
Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 
-// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, 
Q0
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[268]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[140]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[276]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[148]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[20]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[156]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[28]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[280]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 28)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[284]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -100)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(112)]
-// Release input[280] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[36]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[288]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 36)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-400)]
-// Release input[152] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[292]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -92)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(144)]
-// Release input[288] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[164]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[300]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[172]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-336)]
-// Release input[168] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[44]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[180]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(176)]
-// Release input[296] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[52]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[304]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 52)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[308]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(208)]
-// Release input[304] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[60]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[316]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 64)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[316]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(256)]
-// Release input[316] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[188]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[324]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[196]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -56)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(224)]
-// Release input[56] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[196]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-224)]
-// Release input[196] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[68]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[204]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -48)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[204]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-192)]
-// Release input[204] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[76]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[332]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 80)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[332]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(320)]
-// Release input[332] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[84]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[340]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(336)]
-// Release input[336] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[212]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[348]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(320)]
-// Release input[80] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[220]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[92]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[344]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 92)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[228]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(368)]
-// Release input[344] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[100]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[352]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 100)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[356]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(400)]
-// Release input[352] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[108]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[360]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 108)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[364]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(432)]
-// Release input[360] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[236]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 104)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[372]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[244]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -8)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(416)]
-// Release input[104] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[244]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-32)]
-// Release input[244] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[116]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[252]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[124]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[376]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 124)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q3, Q3, r8
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(496)]
-// Release input[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-.equ modulus_inv, 3741423647
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3350
-// Instruction count: 2395
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_bitrev.s
deleted file mode 100644
index 390b03a..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,3182 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 42887727 /// zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31
-.word 802087275 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31
-.word 71938545 /// zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31
-.word 1345396354 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 11544119 // zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31
-.word 215898384 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 11544119 // zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31
-.word 215898384 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31
-.word 11544119 // zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31
-.word 215898384 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31
-.word 82345196 // zeta^240 * 2^31 = 2551686^240 * 2^31 = 82345196 * 2^31
-.word 1540021785 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 2551686^240 * 553543649 * 2^31
-.word 5997883 // zeta^336 * 2^31 = 2551686^336 * 2^31 = 5997883 * 2^31
-.word 112172548 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 2551686^336 * 553543649 * 2^31
-.word 82345196 // zeta^240 * 2^31 = 2551686^240 * 2^31 = 82345196 * 2^31
-.word 1540021785 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 2551686^240 * 553543649 * 2^31
-.word 35185048 // zeta^120 * 2^31 = 2551686^120 * 2^31 = 35185048 * 2^31
-.word 658031592 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 2551686^120 * 553543649 * 2^31
-.word 8814981 // zeta^216 * 2^31 = 2551686^216 * 2^31 = 8814981 * 2^31
-.word 164857981 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 2551686^216 * 553543649 * 2^31
-.word 5997883 // zeta^336 * 2^31 = 2551686^336 * 2^31 = 5997883 * 2^31
-.word 112172548 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 2551686^336 * 553543649 * 2^31
-.word 73412535 // zeta^360 * 2^31 = 2551686^360 * 2^31 = 73412535 * 2^31
-.word 1372962950 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 2551686^360 * 553543649 * 2^31
-.word 38894874 // zeta^ 72 * 2^31 = 2551686^ 72 * 2^31 = 38894874 * 2^31
-.word 727412845 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 72 * 553543649 * 2^31
-.word 1 // XX: zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 11544119 // XX: zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31
-.word 215898384 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31
-.word 82345196 // XX: zeta^240 * 2^31 = 2551686^240 * 2^31 = 82345196 * 2^31
-.word 1540021785 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 2551686^240 * 553543649 * 2^31
-.word 5997883 // XX: zeta^336 * 2^31 = 2551686^336 * 2^31 = 5997883 * 2^31
-.word 112172548 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 2551686^336 * 553543649 * 2^31
-.word 35185048 // XX: zeta^120 * 2^31 = 2551686^120 * 2^31 = 35185048 * 2^31
-.word 658031592 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 2551686^120 * 553543649 * 2^31
-.word 8814981 // XX: zeta^216 * 2^31 = 2551686^216 * 2^31 = 8814981 * 2^31
-.word 164857981 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 2551686^216 * 553543649 * 2^31
-.word 73412535 // XX: zeta^360 * 2^31 = 2551686^360 * 2^31 = 73412535 * 2^31
-.word 1372962950 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 2551686^360 * 553543649 * 2^31
-.word 38894874 // XX: zeta^ 72 * 2^31 = 2551686^ 72 * 2^31 = 38894874 * 2^31
-.word 727412845 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 72 * 553543649 * 2^31
-.word 93686712 // XX: zeta^252 * 2^31 = 2551686^252 * 2^31 = 93686712 * 2^31
-.word 1752131083 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 2551686^252 * 553543649 * 2^31
-.word 29210862 // XX: zeta^348 * 2^31 = 2551686^348 * 2^31 = 29210862 * 2^31
-.word 546302226 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 2551686^348 * 553543649 * 2^31
-.word 87479803 // XX: zeta^108 * 2^31 = 2551686^108 * 2^31 = 87479803 * 2^31
-.word 1636049325 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 2551686^108 * 553543649 * 2^31
-.word 49838786 // XX: zeta^204 * 2^31 = 2551686^204 * 2^31 = 49838786 * 2^31
-.word 932086143 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 2551686^204 * 553543649 * 2^31
-.word 15801685 // XX: zeta^372 * 2^31 = 2551686^372 * 2^31 = 15801685 * 2^31
-.word 295523483 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 2551686^372 * 553543649 * 2^31
-.word 69964525 // XX: zeta^ 84 * 2^31 = 2551686^ 84 * 2^31 = 69964525 * 2^31
-.word 1308478186 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 84 * 553543649 * 2^31
-.word 9849213 // XX: zeta^228 * 2^31 = 2551686^228 * 2^31 = 9849213 * 2^31
-.word 184200212 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 2551686^228 * 553543649 * 2^31
-.word 85535112 // XX: zeta^324 * 2^31 = 2551686^324 * 2^31 = 85535112 * 2^31
-.word 1599679669 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 2551686^324 * 553543649 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_114826273_2551686_incomplete_good_bitrev, %function
-.global ntt_384_u32_114826273_2551686_incomplete_good_bitrev
-ntt_384_u32_114826273_2551686_incomplete_good_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, -114826273
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vqrdmulh.s32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vadd.s32 Q6, Q4, Q3
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[64]: Already loaded as Q1
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[64] from Q1
-vqrdmulh.s32 Q3, Q0, r8
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vmla.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[320] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q3, Q2
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(272)]
-vadd.s32 Q4, Q4, Q1
-// Release input[192] from Q1
-vstrw.u32 Q4, [r14,#(-240)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[4]: Already loaded as Q5
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[4] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[260] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[132] from Q4
-vstrw.u32 Q3, [r14,#(-480)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32
Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] 
-vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, 
[r0,#(432)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[124]: Already loaded as Q5 -// input[380]: Already 
loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r5 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[204]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[72]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 72)] -vmla.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(48)] -vadd.s32 Q1, Q1, Q0 -// Release input[264] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[72]: Already loaded as Q7 -// input[204]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[204] from Q6 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vsub.s32 Q4, Q3, Q2 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q7 -// Release input[72] from Q7 -vstrw.u32 Q3, 
[r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[300]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q6, Q6, Q5 -// Release input[300] from Q5 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[360]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[360]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[228] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(432)] -vadd.s32 Q3, Q3, Q7 -// Release input[360] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-96)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 
Q4, Q3, Q2 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q3, Q3, Q2 -// Release input[276] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[216]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(96)] -// input[216]: Already loaded as Q7 -// input[348]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[348] from Q5 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-144)] -vadd.s32 Q3, Q3, Q7 -// Release input[216] from Q7 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[312]: Already loaded as Q6 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q6, Q6, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[252]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release input[312] from Q6 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 
Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[120]: Already loaded as Q7 -// input[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vadd.s32 Q7, Q7, Q5 -// Release input[252] from Q5 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vsub.s32 Q4, Q3, Q2 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q3, Q3, Q2 -// Release input[372] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -// input[136]: Already loaded as Q6 -// input[268]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[268] from Q5 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[328]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 76)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[328]: Already loaded as Q7 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release 
input[196] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[40]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(304)] -vadd.s32 Q3, Q3, Q7 -// Release input[328] from Q7 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -// input[40]: Already loaded as Q6 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[292] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[232]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -20)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release input[40] from Q6 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -// input[232]: Already loaded as Q7 -// input[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[364] from Q5 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[280]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-80)] -vadd.s32 Q3, Q3, Q7 -// Release input[232] from Q7 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[280]: Already loaded as Q6 -// input[28]: 
Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[88]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(112)] -vadd.s32 Q3, Q3, Q6 -// Release input[280] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[88]: Already loaded as Q7 -// input[220]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release input[220] from Q5 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vsub.s32 Q4, Q3, Q2 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q3, Q3, Q2 -// Release input[340] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[184]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q7 -// Release input[88] from Q7 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-128)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(352)] -// input[184]: Already loaded as Q6 -// input[316]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q6, Q6, Q5 -// Release input[316] from Q5 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[376]: Load as Q7 
-vldrw.u32 Q7, [r14, #(4 * 124)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release input[184] from Q6 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[376]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[244] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(496)] -vadd.s32 Q3, Q3, Q7 -// Release input[376] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-32)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[332]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q3, Q3, Q2 -// Release input[260] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(32)] -// input[200]: Already loaded as Q7 -// input[332]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[68]: Load as Q2 
-vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release input[332] from Q5 -// input[320]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[296]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 44)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q7 -// Release input[200] from Q7 -vstrw.u32 Q3, [r14,#(272)] -// Release input[320] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[296]: Already loaded as Q6 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q6, Q6, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release input[296] from Q6 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[104]: Already loaded as Q7 -// input[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vadd.s32 Q7, Q7, Q5 -// Release input[236] from Q5 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vsub.s32 Q4, Q3, Q2 -// input[284]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q3, Q3, Q2 -// Release input[356] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 
Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(416)] -// input[152]: Already loaded as Q6 -// input[284]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[284] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(128)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[344]: Already loaded as Q7 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[344] from Q7 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already loaded as Q6 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vadd.s32 Q6, Q6, Q5 -// Release input[188] from Q5 -// input[176]: Load 
as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[308] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(224)] -// input[248]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] from Q5 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[288]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q7 -// Release input[248] from Q7 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r8 -// input[288]: Already loaded as Q6 -vmla.s32 Q0, Q5, r11 -vmul.u32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r11 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r6 
-vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(144)] -// Release input[288] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[240]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[304]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r8 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vmla.s32 Q6, Q4, r11 
-vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[112]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q1, Q1, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[176]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, 
[r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[368]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r8 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -28)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(464)] -// Release input[368] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release 
input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vqrdmulh.s32 Q2, Q2, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// 
input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(496)] -// Release input[376] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[104]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-24)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q0, Q0, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vmul.u32 Q2, Q3, r9 
-vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: 
Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 
-48)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 
Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q4, Q4, r8 -// input[224]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -28)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vmla.s32 
Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release 
input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// 
input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// 
input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, 
#(4 * -80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// 
input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 3741423647 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop.s b/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop.s deleted file mode 100644 index 645da99..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop.s +++ /dev/null @@ -1,3388 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies 
of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_114826273_2551686_incomplete_good_oop_twiddles -ntt_384_u32_114826273_2551686_incomplete_good_oop_twiddles: // For base multiplication -.word 28609785 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 3700025895 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 41241721 // zeta^ 64 * 2^31 = 2551686^ 64 * 2^31 = 42887728 * 2^31 -.word 2769623719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 64 * 553543649 * 2^31 -.word 86448423 // zeta^ 32 * 2^31 = 2551686^ 32 * 2^31 = 10217507 * 2^31 -.word 3906089913 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 32 * 553543649 * 2^31 -.word 183227061 // zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31 -.word 3579771883 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31 -.word 126234215 // zeta^ 16 * 2^31 = 2551686^ 16 * 2^31 = 35511129 * 2^31 -.word 2083767929 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 16 * 553543649 * 2^31 -.word 87914355 // zeta^ 80 * 2^31 = 2551686^ 80 * 2^31 = 29513246 * 2^31 -.word 3318791917 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 80 * 
553543649 * 2^31 -.word 3386929 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31 -.word 316682223 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31 -.word 43010579 // zeta^112 * 2^31 = 2551686^112 * 2^31 = 91801134 * 2^31 -.word 3532079181 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 2551686^112 * 553543649 * 2^31 -.word 64379879 // zeta^ 8 * 2^31 = 2551686^ 8 * 2^31 = 107284677 * 2^31 -.word 1382893817 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 8 * 553543649 * 2^31 -.word 34806835 // zeta^ 72 * 2^31 = 2551686^ 72 * 2^31 = 38894874 * 2^31 -.word 574730797 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 72 * 553543649 * 2^31 -.word 38994743 // zeta^ 40 * 2^31 = 2551686^ 40 * 2^31 = 42274665 * 2^31 -.word 2116881321 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 40 * 553543649 * 2^31 -.word 66134929 // zeta^104 * 2^31 = 2551686^104 * 2^31 = 83688403 * 2^31 -.word 434267791 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 2551686^104 * 553543649 * 2^31 -.word 69421183 // zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31 -.word 1557147489 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31 -.word 45651711 // zeta^ 88 * 2^31 = 2551686^ 88 * 2^31 = 87903397 * 2^31 -.word 2785063137 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 88 * 553543649 * 2^31 -.word 139403893 // zeta^ 56 * 2^31 = 2551686^ 56 * 2^31 = 72290827 * 2^31 -.word 192914475 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 56 * 553543649 * 2^31 -.word 79481515 // zeta^120 * 2^31 = 2551686^120 * 2^31 = 35185048 * 2^31 -.word 1792873141 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 2551686^120 * 553543649 * 2^31 -.word 203088573 // zeta^ 4 * 2^31 = 2551686^ 4 * 2^31 = 2551686 * 2^31 -.word 4087262435 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 4 * 553543649 * 2^31 -.word 54238839 // zeta^ 68 * 2^31 = 2551686^ 68 * 2^31 = 31842847 * 2^31 -.word 298178665 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 68 * 553543649 * 2^31 -.word 
37245341 // zeta^ 36 * 2^31 = 2551686^ 36 * 2^31 = 104977060 * 2^31 -.word 3387169283 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 36 * 553543649 * 2^31 -.word 135079565 // zeta^100 * 2^31 = 2551686^100 * 2^31 = 8890579 * 2^31 -.word 3255329555 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 2551686^100 * 553543649 * 2^31 -.word 221900801 // zeta^ 20 * 2^31 = 2551686^ 20 * 2^31 = 49422185 * 2^31 -.word 3481643039 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 20 * 553543649 * 2^31 -.word 54737445 // zeta^ 84 * 2^31 = 2551686^ 84 * 2^31 = 69964525 * 2^31 -.word 3301900923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 84 * 553543649 * 2^31 -.word 209527495 // zeta^ 52 * 2^31 = 2551686^ 52 * 2^31 = 20420695 * 2^31 -.word 1268014617 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 52 * 553543649 * 2^31 -.word 33294343 // zeta^116 * 2^31 = 2551686^116 * 2^31 = 4619010 * 2^31 -.word 55582937 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 2551686^116 * 553543649 * 2^31 -.word 224674633 // zeta^ 12 * 2^31 = 2551686^ 12 * 2^31 = 64987487 * 2^31 -.word 23100887 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 12 * 553543649 * 2^31 -.word 173107497 // zeta^ 76 * 2^31 = 2551686^ 76 * 2^31 = 57394293 * 2^31 -.word 3061126135 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 76 * 553543649 * 2^31 -.word 207049913 // zeta^ 44 * 2^31 = 2551686^ 44 * 2^31 = 80711981 * 2^31 -.word 3566869095 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 44 * 553543649 * 2^31 -.word 106889387 // zeta^108 * 2^31 = 2551686^108 * 2^31 = 87479803 * 2^31 -.word 596745397 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 2551686^108 * 553543649 * 2^31 -.word 62148987 // zeta^ 28 * 2^31 = 2551686^ 28 * 2^31 = 25357458 * 2^31 -.word 4189185509 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 28 * 553543649 * 2^31 -.word 218880525 // zeta^ 92 * 2^31 = 2551686^ 92 * 2^31 = 110972869 * 2^31 -.word 3404996499 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 92 * 553543649 * 2^31 -.word 47821729 // 
zeta^ 60 * 2^31 = 2551686^ 60 * 2^31 = 21139561 * 2^31 -.word 2376423551 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 60 * 553543649 * 2^31 -.word 79224313 // zeta^124 * 2^31 = 2551686^124 * 2^31 = 24273777 * 2^31 -.word 455588135 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 2551686^124 * 553543649 * 2^31 -.word 127458209 // zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31 -.word 3364565119 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31 -.word 201042761 // zeta^192 * 2^31 = 2551686^192 * 2^31 = 114826272 * 2^31 -.word 594941399 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 2551686^192 * 553543649 * 2^31 -.word 211604911 // zeta^160 * 2^31 = 2551686^160 * 2^31 = 1326612 * 2^31 -.word 3968649265 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 2551686^160 * 553543649 * 2^31 -.word 143204123 // zeta^224 * 2^31 = 2551686^224 * 2^31 = 104608766 * 2^31 -.word 388877381 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 2551686^224 * 553543649 * 2^31 -.word 76506413 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31 -.word 1235023987 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31 -.word 103418331 // zeta^208 * 2^31 = 2551686^208 * 2^31 = 79315144 * 2^31 -.word 2211199365 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 2551686^208 * 553543649 * 2^31 -.word 154449923 // zeta^176 * 2^31 = 2551686^176 * 2^31 = 59320057 * 2^31 -.word 3215396957 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 2551686^176 * 553543649 * 2^31 -.word 226265617 // zeta^240 * 2^31 = 2551686^240 * 2^31 = 82345196 * 2^31 -.word 3978285071 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 2551686^240 * 553543649 * 2^31 -.word 85253229 // zeta^136 * 2^31 = 2551686^136 * 2^31 = 46436470 * 2^31 -.word 3486804275 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 2551686^136 * 553543649 * 2^31 -.word 165272667 // zeta^200 * 2^31 = 2551686^200 * 2^31 = 7541596 * 2^31 -.word 2912073477 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 2551686^200 * 553543649 * 2^31 -.word 141966459 // zeta^168 * 
2^31 = 2551686^168 * 2^31 = 41413738 * 2^31 -.word 2612353765 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31 -.word 190657803 // zeta^232 * 2^31 = 2551686^232 * 2^31 = 72551608 * 2^31 -.word 2178085973 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 2551686^232 * 553543649 * 2^31 -.word 91056801 // zeta^152 * 2^31 = 2551686^152 * 2^31 = 96718378 * 2^31 -.word 1227915647 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 2551686^152 * 553543649 * 2^31 -.word 160231363 // zeta^216 * 2^31 = 2551686^216 * 2^31 = 8814981 * 2^31 -.word 2737819805 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 2551686^216 * 553543649 * 2^31 -.word 54903895 // zeta^184 * 2^31 = 2551686^184 * 2^31 = 77720494 * 2^31 -.word 1599958665 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 2551686^184 * 553543649 * 2^31 -.word 90248653 // zeta^248 * 2^31 = 2551686^248 * 2^31 = 42535446 * 2^31 -.word 4102052819 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 2551686^248 * 553543649 * 2^31 -.word 195629085 // zeta^132 * 2^31 = 2551686^132 * 2^31 = 29291161 * 2^31 -.word 505883523 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 2551686^132 * 553543649 * 2^31 -.word 26563973 // zeta^196 * 2^31 = 2551686^196 * 2^31 = 112274587 * 2^31 -.word 207704859 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 2551686^196 * 553543649 * 2^31 -.word 212660497 // zeta^164 * 2^31 = 2551686^164 * 2^31 = 18739792 * 2^31 -.word 4163127567 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 2551686^164 * 553543649 * 2^31 -.word 192407205 // zeta^228 * 2^31 = 2551686^228 * 2^31 = 9849213 * 2^31 -.word 907798011 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 2551686^228 * 553543649 * 2^31 -.word 177315463 // zeta^148 * 2^31 = 2551686^148 * 2^31 = 20542340 * 2^31 -.word 4115225177 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 2551686^148 * 553543649 * 2^31 -.word 7751745 // zeta^212 * 2^31 = 2551686^212 * 2^31 = 65404088 * 2^31 -.word 813324255 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 2551686^212 * 553543649 * 2^31 -.word 168245667 // zeta^180 * 2^31 = 2551686^180 
* 2^31 = 99024588 * 2^31 -.word 3082535613 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 2551686^180 * 553543649 * 2^31 -.word 20125051 // zeta^244 * 2^31 = 2551686^244 * 2^31 = 94405578 * 2^31 -.word 3026952677 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 2551686^244 * 553543649 * 2^31 -.word 63259137 // zeta^140 * 2^31 = 2551686^140 * 2^31 = 107233079 * 2^31 -.word 3038025247 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 2551686^140 * 553543649 * 2^31 -.word 4977913 // zeta^204 * 2^31 = 2551686^204 * 2^31 = 49838786 * 2^31 -.word 4271866407 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 2551686^204 * 553543649 * 2^31 -.word 14665747 // zeta^172 * 2^31 = 2551686^172 * 2^31 = 6767822 * 2^31 -.word 1324843597 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 2551686^172 * 553543649 * 2^31 -.word 22602633 // zeta^236 * 2^31 = 2551686^236 * 2^31 = 34114292 * 2^31 -.word 728098199 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 2551686^236 * 553543649 * 2^31 -.word 41905265 // zeta^156 * 2^31 = 2551686^156 * 2^31 = 85615411 * 2^31 -.word 3510778287 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 2551686^156 * 553543649 * 2^31 -.word 167503559 // zeta^220 * 2^31 = 2551686^220 * 2^31 = 89468815 * 2^31 -.word 105781785 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 2551686^220 * 553543649 * 2^31 -.word 146228857 // zeta^188 * 2^31 = 2551686^188 * 2^31 = 3134216 * 2^31 -.word 2374131879 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 2551686^188 * 553543649 * 2^31 -.word 181830817 // zeta^252 * 2^31 = 2551686^252 * 2^31 = 93686712 * 2^31 -.word 1918543743 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 2551686^252 * 553543649 * 2^31 -.word 188410825 // zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31 -.word 1525343575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31 -.word 102194337 // zeta^320 * 2^31 = 2551686^320 * 2^31 = 71938546 * 2^31 -.word 930402175 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 2551686^320 * 553543649 * 2^31 -.word 46425485 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 
* 2^31 -.word 715195411 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 18047635 // zeta^352 * 2^31 = 2551686^352 * 2^31 = 113499661 * 2^31 -.word 326318029 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 2551686^352 * 553543649 * 2^31 -.word 141738191 // zeta^272 * 2^31 = 2551686^272 * 2^31 = 85313027 * 2^31 -.word 976175377 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 2551686^272 * 553543649 * 2^31 -.word 153146133 // zeta^336 * 2^31 = 2551686^336 * 2^31 = 5997883 * 2^31 -.word 3059943307 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 2551686^336 * 553543649 * 2^31 -.word 186641967 // zeta^304 * 2^31 = 2551686^304 * 2^31 = 23025139 * 2^31 -.word 762888113 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 2551686^304 * 553543649 * 2^31 -.word 75202623 // zeta^368 * 2^31 = 2551686^368 * 2^31 = 55506216 * 2^31 -.word 1079570337 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 2551686^368 * 553543649 * 2^31 -.word 194845711 // zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31 -.word 3720236497 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31 -.word 144399317 // zeta^328 * 2^31 = 2551686^328 * 2^31 = 68389803 * 2^31 -.word 808163019 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 2551686^328 * 553543649 * 2^31 -.word 163517617 // zeta^296 * 2^31 = 2551686^296 * 2^31 = 31137870 * 2^31 -.word 3860699503 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 2551686^296 * 553543649 * 2^31 -.word 87686087 // zeta^360 * 2^31 = 2551686^360 * 2^31 = 73412535 * 2^31 -.word 1682613529 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 2551686^360 * 553543649 * 2^31 -.word 184000835 // zeta^280 * 2^31 = 2551686^280 * 2^31 = 26922876 * 2^31 -.word 1509904157 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 2551686^280 * 553543649 * 2^31 -.word 138595745 // zeta^344 * 2^31 = 2551686^344 * 2^31 = 18107895 * 2^31 -.word 3067051647 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 2551686^344 * 553543649 * 2^31 -.word 150171031 // zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31 -.word 
2502094153 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31 -.word 174748651 // zeta^376 * 2^31 = 2551686^376 * 2^31 = 37105779 * 2^31 -.word 2695008629 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 2551686^376 * 553543649 * 2^31 -.word 175413707 // zeta^260 * 2^31 = 2551686^260 * 2^31 = 82983426 * 2^31 -.word 3996788629 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 2551686^260 * 553543649 * 2^31 -.word 34023461 // zeta^324 * 2^31 = 2551686^324 * 2^31 = 85535112 * 2^31 -.word 3789083771 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 2551686^324 * 553543649 * 2^31 -.word 94572981 // zeta^292 * 2^31 = 2551686^292 * 2^31 = 105935694 * 2^31 -.word 1039637739 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 2551686^292 * 553543649 * 2^31 -.word 16992049 // zeta^356 * 2^31 = 2551686^356 * 2^31 = 96086481 * 2^31 -.word 131839727 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 2551686^356 * 553543649 * 2^31 -.word 174915101 // zeta^276 * 2^31 = 2551686^276 * 2^31 = 44861748 * 2^31 -.word 993066371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 2551686^276 * 553543649 * 2^31 -.word 52337083 // zeta^340 * 2^31 = 2551686^340 * 2^31 = 94283933 * 2^31 -.word 179742117 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 2551686^340 * 553543649 * 2^31 -.word 196358203 // zeta^308 * 2^31 = 2551686^308 * 2^31 = 110207263 * 2^31 -.word 4239384357 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 2551686^308 * 553543649 * 2^31 -.word 61406879 // zeta^372 * 2^31 = 2551686^372 * 2^31 = 15801685 * 2^31 -.word 1212431681 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 2551686^372 * 553543649 * 2^31 -.word 56545049 // zeta^268 * 2^31 = 2551686^268 * 2^31 = 57431980 * 2^31 -.word 1233841159 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 2551686^268 * 553543649 * 2^31 -.word 166393409 // zeta^332 * 2^31 = 2551686^332 * 2^31 = 7593194 * 2^31 -.word 1256942047 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 2551686^332 * 553543649 * 2^31 -.word 122763159 // zeta^300 * 2^31 = 2551686^300 * 2^31 = 27346470 * 2^31 -.word 3698221897 // 
zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 2551686^300 * 553543649 * 2^31 -.word 214986799 // zeta^364 * 2^31 = 2551686^364 * 2^31 = 108058451 * 2^31 -.word 2970123697 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 2551686^364 * 553543649 * 2^31 -.word 10772021 // zeta^284 * 2^31 = 2551686^284 * 2^31 = 3853404 * 2^31 -.word 889970795 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 2551686^284 * 553543649 * 2^31 -.word 187747281 // zeta^348 * 2^31 = 2551686^348 * 2^31 = 29210862 * 2^31 -.word 784189007 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 2551686^348 * 553543649 * 2^31 -.word 150428233 // zeta^316 * 2^31 = 2551686^316 * 2^31 = 90552496 * 2^31 -.word 3839379159 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 2551686^316 * 553543649 * 2^31 -.word 83423689 // zeta^380 * 2^31 = 2551686^380 * 2^31 = 111692057 * 2^31 -.word 1920835415 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 2551686^380 * 553543649 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_114826273_2551686_incomplete_good_oop_scale -ntt_384_u32_114826273_2551686_incomplete_good_oop_scale: // Constants for scaling by 1/N -.word 28609785 // 1/96 -.word 3700025895 // 1/96 twisted -.data -roots: -.word 71938545 /// zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31 -.word 1345396354 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31 -.word 42887727 /// zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31 -.word 802087275 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 
* 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31 -.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 108828390 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31 -.word 2035311100 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31 -.word 32481077 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31 -.word 607461863 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31 -.word 108828390 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31 -.word 2035311100 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31 -.word 75931399 // zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31 -.word 1420070803 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31 -.word 41413738 // zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31 -.word 774520698 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31 -.word 32481077 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31 -.word 607461863 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31 -.word 106011292 // zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31 -.word 1982625667 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31 -.word 79641225 // zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31 -.word 1489452056 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31 -.word 19 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 
2551686^ 0 * 553543649 * 2^31 -.word 103282154 // XX: zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31 -.word 1931585264 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31 -.word 108828390 // XX: zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31 -.word 2035311100 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31 -.word 32481077 // XX: zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31 -.word 607461863 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31 -.word 75931399 // XX: zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31 -.word 1420070803 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31 -.word 41413738 // XX: zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31 -.word 774520698 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31 -.word 106011292 // XX: zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31 -.word 1982625667 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31 -.word 79641225 // XX: zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31 -.word 1489452056 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31 -.word 29291161 // XX: zeta^132 * 2^31 = 2551686^132 * 2^31 = 29291161 * 2^31 -.word 547803979 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 2551686^132 * 553543649 * 2^31 -.word 104977060 // XX: zeta^ 36 * 2^31 = 2551686^ 36 * 2^31 = 104977060 * 2^31 -.word 1963283436 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 36 * 553543649 * 2^31 -.word 44861748 // XX: zeta^276 * 2^31 = 2551686^276 * 2^31 = 44861748 * 2^31 -.word 839005462 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 2551686^276 * 553543649 * 2^31 -.word 99024588 // XX: zeta^180 * 2^31 = 2551686^180 * 2^31 = 99024588 * 2^31 -.word 1851960165 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 2551686^180 * 553543649 * 2^31 -.word 64987487 // XX: zeta^ 12 * 2^31 = 2551686^ 12 * 2^31 = 64987487 * 2^31 -.word 1215397505 
/// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 12 * 553543649 * 2^31 -.word 27346470 // XX: zeta^300 * 2^31 = 2551686^300 * 2^31 = 27346470 * 2^31 -.word 511434323 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 2551686^300 * 553543649 * 2^31 -.word 85615411 // XX: zeta^156 * 2^31 = 2551686^156 * 2^31 = 85615411 * 2^31 -.word 1601181422 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 2551686^156 * 553543649 * 2^31 -.word 21139561 // XX: zeta^ 60 * 2^31 = 2551686^ 60 * 2^31 = 21139561 * 2^31 -.word 395352565 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 60 * 553543649 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_114826273_2551686_incomplete_good_oop, %function -.global ntt_384_u32_114826273_2551686_incomplete_good_oop -ntt_384_u32_114826273_2551686_incomplete_good_oop: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -114826273 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r7 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r6 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r9 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already 
loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r6 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmla.s32 Q2, Q3, r9 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r11,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r1,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[16] 
from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r11,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
28)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r11,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r1,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// 
input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r11,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r1,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// 
input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r11,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r1,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 
Q6, [r1,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r11,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r1,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(320)] -vadd.s32 Q3, Q3, 
Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r11,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r1,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[220]: 
Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r11,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r1,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, 
Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r11,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r1,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r11,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[244] 
from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r11,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r1,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r1,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r10,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r11,#(0)] -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] 
-vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// 
Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// 
output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] 
-// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 
Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 
Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 
Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// 
Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already 
loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[132]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[280]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-384)]
-// Release output[156] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(48)]
-// Release output[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[136]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-480)]
-// Release output[132] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[256]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[28]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(112)]
-// Release output[280] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release output[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(16)]
-// Release output[256] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[4]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 4)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[152]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(112)]
-// Release output[28] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[8]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 8)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(16)]
-// Release output[4] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[128]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[284]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-400)]
-// Release output[152] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(32)]
-// Release output[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[140]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-496)]
-// Release output[128] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[260]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(128)]
-// Release output[284] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release output[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 36)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[184]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(240)]
-// Release output[60] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release output[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release output[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[304]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[40]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 40)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(144)]
-// Release output[36] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[316]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release output[184] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release output[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(160)]
-// Release output[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-368)]
-// Release output[160] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[292]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 40)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[56]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 56)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(256)]
-// Release output[316] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[176]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[296]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(160)]
-// Release output[292] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[32]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 32)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(224)]
-// Release output[56] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release output[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release output[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[308]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[44]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 44)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(128)]
-// Release output[32] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release output[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(176)]
-// Release output[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[336]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[72]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 72)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-352)]
-// Release output[164] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[192]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release output[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(288)]
-// Release output[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[84]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[204]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-240)]
-// Release output[192] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[324]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[88]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 88)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(384)]
-// Release output[348] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(336)]
-// Release output[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release output[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[88]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[208]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[328]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 76)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(288)]
-// Release output[324] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[220]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(352)]
-// Release output[88] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release output[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(304)]
-// Release output[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[220]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[340]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[76]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 76)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(256)]
-// Release output[64] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[196]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-128)]
-// Release output[220] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(352)]
-// Release output[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(304)]
-// Release output[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[344]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[80]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 80)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[200]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -52)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-224)]
-// Release output[196] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[320]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[92]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(368)]
-// Release output[344] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(320)]
-// Release output[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-208)]
-// Release output[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[92]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[212]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[332]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 80)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(272)]
-// Release output[320] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[120]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 120)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(368)]
-// Release output[92] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-160)]
-// Release output[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(320)]
-// Release output[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[240]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[360]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 108)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(272)]
-// Release output[68] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[96]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 96)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[252]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 0)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(480)]
-// Release output[120] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release output[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(432)]
-// Release output[360] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[252]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[372]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 120)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[108]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 108)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(384)]
-// Release output[96] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[228]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -24)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[376]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(0)]
-// Release output[252] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(480)]
-// Release output[372] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(432)]
-// Release output[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[376]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[112]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 112)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[232]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-96)]
-// Release output[228] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[352]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 100)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[124]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release output[376] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(448)]
-// Release output[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release output[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[124]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[244]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[364]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 112)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(400)]
-// Release output[352] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[100]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 100)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[248]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(496)]
-// Release output[124] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release output[244] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(448)]
-// Release output[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[248]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[368]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[104]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 104)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(400)]
-// Release output[100] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[224]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -28)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[380]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release output[248] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release output[368] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(416)]
-// Release output[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[380]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[116]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 116)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[236]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-112)]
-// Release output[224] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[356]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 104)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-vmul.u32 Q1, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-496)]
-// Release output[380] from Q0
-vmla.s32 Q1, Q4, r9
-vstrw.u32 Q3, [r1,#(464)]
-// Release output[116] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r11,#(-64)]
-// Release output[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(416)]
-// Release output[356] from Q2
-ldrd r7, r6, [r8], #+8
-// output[132]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vmul.u32 Q1, Q0, r7
-// output[0]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 0)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vmla.s32 Q1, Q0, r9
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r11,#(-480)]
-// Release output[132] from Q0
-vadd.s32 Q2, Q2, Q1
-// output[4]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[256]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 4)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[260]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 8)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(0)]
-// Release output[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[260]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[128]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(16)]
-// Release output[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(32)]
-// Release output[260] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[12]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[264]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-496)]
-// Release output[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[268]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[136]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release output[264] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[140]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[8]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 8)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[276]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 24)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release output[136] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[276]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[144]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -108)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(32)]
-// Release output[8] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(96)]
-// Release output[276] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[148]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[16]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 16)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[20]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 20)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-432)]
-// Release output[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[20]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[272]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(64)]
-// Release output[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(80)]
-// Release output[20] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[156]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[24]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 24)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[28]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 28)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(80)]
-// Release output[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[28]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[280]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 28)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(96)]
-// Release output[24] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(112)]
-// Release output[28] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[284]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[152]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -100)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[36]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 36)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(112)]
-// Release output[280] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[36]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[288]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 36)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[292]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 40)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-400)]
-// Release output[152] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(144)]
-// Release output[36] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[292]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[160]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -92)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[164]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(144)]
-// Release output[288] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(160)]
-// Release output[292] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[164]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[32]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 32)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[300]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 48)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-368)]
-// Release output[160] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-352)]
-// Release output[164] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[300]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[168]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -84)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(128)]
-// Release output[32] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(192)]
-// Release output[300] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[172]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[40]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 40)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[44]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 44)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-336)]
-// Release output[168] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[44]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[296]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 44)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[180]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(160)]
-// Release output[40] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(176)]
-// Release output[44] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[180]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[48]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 48)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(176)]
-// Release output[296] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-288)]
-// Release output[180] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[52]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[304]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 52)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[308]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 56)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(192)]
-// Release output[48] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[308]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[176]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[60]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 60)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(208)]
-// Release output[304] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(224)]
-// Release output[308] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[60]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[316]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 64)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-304)]
-// Release output[176] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(240)]
-// Release output[60] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[316]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[184]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[188]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -64)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(256)]
-// Release output[316] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[188]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[56]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 56)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[324]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 72)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-272)]
-// Release output[184] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-256)]
-// Release output[188] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[324]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[192]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[196]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -56)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(224)]
-// Release output[56] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(288)]
-// Release output[324] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[196]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[64]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 64)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[68]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 68)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-240)]
-// Release output[192] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-224)]
-// Release output[196] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[68]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[204]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(256)]
-// Release output[64] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(272)]
-// Release output[68] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[204]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[72]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 72)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[76]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 76)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(272)]
-// Release output[320] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-192)]
-// Release output[204] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[76]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[332]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 80)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(288)]
-// Release output[72] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(304)]
-// Release output[76] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[332]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[200]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -52)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[84]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 84)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(304)]
-// Release output[328] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(320)]
-// Release output[332] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[84]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[336]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 84)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[340]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-208)]
-// Release output[200] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(336)]
-// Release output[84] from Q4
-vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 
-// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 
-vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, 
r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 3741423647 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3355 -// Instruction count: 2397 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input.s b/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input.s deleted file mode 100644 index 07ddee6..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input.s +++ /dev/null @@ -1,3075 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial 
portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input_twiddles
-ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input_twiddles: // For base multiplication
-.word 28609785 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 3700025895 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 41241721 // zeta^ 64 * 2^31 = 2551686^ 64 * 2^31 = 42887728 * 2^31
-.word 2769623719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 64 * 553543649 * 2^31
-.word 86448423 // zeta^ 32 * 2^31 = 2551686^ 32 * 2^31 = 10217507 * 2^31
-.word 3906089913 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 32 * 553543649 * 2^31
-.word 183227061 // zeta^ 96 * 2^31 = 2551686^ 96 * 2^31 = 11544119 * 2^31
-.word 3579771883 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 96 * 553543649 * 2^31
-.word 126234215 // zeta^ 16 * 2^31 = 2551686^ 16 * 2^31 = 35511129 * 2^31
-.word 2083767929 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 16 * 553543649 * 2^31
-.word 87914355 // zeta^ 80 * 2^31 = 2551686^ 80 * 2^31 = 29513246 * 2^31
-.word 3318791917 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 80 * 553543649 * 2^31
-.word 3386929 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31
-.word 316682223 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31
-.word 43010579 // zeta^112 * 2^31 = 2551686^112 * 2^31 = 91801134 * 2^31
-.word 3532079181 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 2551686^112 * 553543649 * 2^31
-.word 64379879 // zeta^ 8 * 2^31 = 2551686^ 8 * 2^31 = 107284677 * 2^31
-.word 1382893817 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 8 * 553543649 * 2^31
-.word 34806835 // zeta^ 72 * 2^31 = 2551686^ 72 * 2^31 = 38894874 * 2^31
-.word 574730797 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 72 * 553543649 * 2^31
-.word 38994743 // zeta^ 40 * 2^31 = 2551686^ 40 * 2^31 = 42274665 * 2^31
-.word 2116881321 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 40 * 553543649 * 2^31
-.word 66134929 // zeta^104 * 2^31 = 2551686^104 * 2^31 = 83688403 * 2^31
-.word 434267791 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 2551686^104 * 553543649 * 2^31
-.word 69421183 // zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31
-.word 1557147489 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31
-.word 45651711 // zeta^ 88 * 2^31 = 2551686^ 88 * 2^31 = 87903397 * 2^31
-.word 2785063137 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 88 * 553543649 * 2^31
-.word 139403893 // zeta^ 56 * 2^31 = 2551686^ 56 * 2^31 = 72290827 * 2^31
-.word 192914475 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 56 * 553543649 * 2^31
-.word 79481515 // zeta^120 * 2^31 = 2551686^120 * 2^31 = 35185048 * 2^31
-.word 1792873141 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 2551686^120 * 553543649 * 2^31
-.word 203088573 // zeta^ 4 * 2^31 = 2551686^ 4 * 2^31 = 2551686 * 2^31
-.word 4087262435 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 4 * 553543649 * 2^31
-.word 54238839 // zeta^ 68 * 2^31 = 2551686^ 68 * 2^31 = 31842847 * 2^31
-.word 298178665 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 68 * 553543649 * 2^31
-.word 37245341 // zeta^ 36 * 2^31 = 2551686^ 36 * 2^31 = 104977060 * 2^31
-.word 3387169283 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 36 * 553543649 * 2^31
-.word 135079565 // zeta^100 * 2^31 = 2551686^100 * 2^31 = 8890579 * 2^31
-.word 3255329555 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 2551686^100 * 553543649 * 2^31
-.word 221900801 // zeta^ 20 * 2^31 = 2551686^ 20 * 2^31 = 49422185 * 2^31
-.word 3481643039 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 20 * 553543649 * 2^31
-.word 54737445 // zeta^ 84 * 2^31 = 2551686^ 84 * 2^31 = 69964525 * 2^31
-.word 3301900923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 84 * 553543649 * 2^31
-.word 209527495 // zeta^ 52 * 2^31 = 2551686^ 52 * 2^31 = 20420695 * 2^31
-.word 1268014617 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 52 * 553543649 * 2^31
-.word 33294343 // zeta^116 * 2^31 = 2551686^116 * 2^31 = 4619010 * 2^31
-.word 55582937 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 2551686^116 * 553543649 * 2^31
-.word 224674633 // zeta^ 12 * 2^31 = 2551686^ 12 * 2^31 = 64987487 * 2^31
-.word 23100887 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 12 * 553543649 * 2^31
-.word 173107497 // zeta^ 76 * 2^31 = 2551686^ 76 * 2^31 = 57394293 * 2^31
-.word 3061126135 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 76 * 553543649 * 2^31
-.word 207049913 // zeta^ 44 * 2^31 = 2551686^ 44 * 2^31 = 80711981 * 2^31
-.word 3566869095 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 44 * 553543649 * 2^31
-.word 106889387 // zeta^108 * 2^31 = 2551686^108 * 2^31 = 87479803 * 2^31
-.word 596745397 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 2551686^108 * 553543649 * 2^31
-.word 62148987 // zeta^ 28 * 2^31 = 2551686^ 28 * 2^31 = 25357458 * 2^31
-.word 4189185509 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 28 * 553543649 * 2^31
-.word 218880525 // zeta^ 92 * 2^31 = 2551686^ 92 * 2^31 = 110972869 * 2^31
-.word 3404996499 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 92 * 553543649 * 2^31
-.word 47821729 // zeta^ 60 * 2^31 = 2551686^ 60 * 2^31 = 21139561 * 2^31
-.word 2376423551 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 60 * 553543649 * 2^31
-.word 79224313 // zeta^124 * 2^31 = 2551686^124 * 2^31 = 24273777 * 2^31
-.word 455588135 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 2551686^124 * 553543649 * 2^31
-.word 127458209 // zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31
-.word 3364565119 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31
-.word 201042761 // zeta^192 * 2^31 = 2551686^192 * 2^31 = 114826272 * 2^31
-.word 594941399 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 2551686^192 * 553543649 * 2^31
-.word 211604911 // zeta^160 * 2^31 = 2551686^160 * 2^31 = 1326612 * 2^31
-.word 3968649265 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 2551686^160 * 553543649 * 2^31
-.word 143204123 // zeta^224 * 2^31 = 2551686^224 * 2^31 = 104608766 * 2^31
-.word 388877381 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 2551686^224 * 553543649 * 2^31
-.word 76506413 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31
-.word 1235023987 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31
-.word 103418331 // zeta^208 * 2^31 = 2551686^208 * 2^31 = 79315144 * 2^31
-.word 2211199365 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 2551686^208 * 553543649 * 2^31
-.word 154449923 // zeta^176 * 2^31 = 2551686^176 * 2^31 = 59320057 * 2^31
-.word 3215396957 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 2551686^176 * 553543649 * 2^31
-.word 226265617 // zeta^240 * 2^31 = 2551686^240 * 2^31 = 82345196 * 2^31
-.word 3978285071 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 2551686^240 * 553543649 * 2^31
-.word 85253229 // zeta^136 * 2^31 = 2551686^136 * 2^31 = 46436470 * 2^31
-.word 3486804275 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 2551686^136 * 553543649 * 2^31
-.word 165272667 // zeta^200 * 2^31 = 2551686^200 * 2^31 = 7541596 * 2^31
-.word 2912073477 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 2551686^200 * 553543649 * 2^31
-.word 141966459 // zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31
-.word 2612353765 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31
-.word 190657803 // zeta^232 * 2^31 = 2551686^232 * 2^31 = 72551608 * 2^31
-.word 2178085973 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 2551686^232 * 553543649 * 2^31
-.word 91056801 // zeta^152 * 2^31 = 2551686^152 * 2^31 = 96718378 * 2^31
-.word 1227915647 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 2551686^152 * 553543649 * 2^31
-.word 160231363 // zeta^216 * 2^31 = 2551686^216 * 2^31 = 8814981 * 2^31
-.word 2737819805 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 2551686^216 * 553543649 * 2^31
-.word 54903895 // zeta^184 * 2^31 = 2551686^184 * 2^31 = 77720494 * 2^31
-.word 1599958665 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 2551686^184 * 553543649 * 2^31
-.word 90248653 // zeta^248 * 2^31 = 2551686^248 * 2^31 = 42535446 * 2^31
-.word 4102052819 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 2551686^248 * 553543649 * 2^31
-.word 195629085 // zeta^132 * 2^31 = 2551686^132 * 2^31 = 29291161 * 2^31
-.word 505883523 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 2551686^132 * 553543649 * 2^31
-.word 26563973 // zeta^196 * 2^31 = 2551686^196 * 2^31 = 112274587 * 2^31
-.word 207704859 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 2551686^196 * 553543649 * 2^31
-.word 212660497 // zeta^164 * 2^31 = 2551686^164 * 2^31 = 18739792 * 2^31
-.word 4163127567 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 2551686^164 * 553543649 * 2^31
-.word 192407205 // zeta^228 * 2^31 = 2551686^228 * 2^31 = 9849213 * 2^31
-.word 907798011 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 2551686^228 * 553543649 * 2^31
-.word 177315463 // zeta^148 * 2^31 = 2551686^148 * 2^31 = 20542340 * 2^31
-.word 4115225177 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 2551686^148 * 553543649 * 2^31
-.word 7751745 // zeta^212 * 2^31 = 2551686^212 * 2^31 = 65404088 * 2^31
-.word 813324255 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 2551686^212 * 553543649 * 2^31
-.word 168245667 // zeta^180 * 2^31 = 2551686^180 * 2^31 = 99024588 * 2^31
-.word 3082535613 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 2551686^180 * 553543649 * 2^31
-.word 20125051 // zeta^244 * 2^31 = 2551686^244 * 2^31 = 94405578 * 2^31
-.word 3026952677 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 2551686^244 * 553543649 * 2^31
-.word 63259137 // zeta^140 * 2^31 = 2551686^140 * 2^31 = 107233079 * 2^31
-.word 3038025247 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 2551686^140 * 553543649 * 2^31
-.word 4977913 // zeta^204 * 2^31 = 2551686^204 * 2^31 = 49838786 * 2^31
-.word 4271866407 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 2551686^204 * 553543649 * 2^31
-.word 14665747 // zeta^172 * 2^31 = 2551686^172 * 2^31 = 6767822 * 2^31
-.word 1324843597 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 2551686^172 * 553543649 * 2^31
-.word 22602633 // zeta^236 * 2^31 = 2551686^236 * 2^31 = 34114292 * 2^31
-.word 728098199 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 2551686^236 * 553543649 * 2^31
-.word 41905265 // zeta^156 * 2^31 = 2551686^156 * 2^31 = 85615411 * 2^31
-.word 3510778287 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 2551686^156 * 553543649 * 2^31
-.word 167503559 // zeta^220 * 2^31 = 2551686^220 * 2^31 = 89468815 * 2^31
-.word 105781785 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 2551686^220 * 553543649 * 2^31
-.word 146228857 // zeta^188 * 2^31 = 2551686^188 * 2^31 = 3134216 * 2^31
-.word 2374131879 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 2551686^188 * 553543649 * 2^31
-.word 181830817 // zeta^252 * 2^31 = 2551686^252 * 2^31 = 93686712 * 2^31
-.word 1918543743 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 2551686^252 * 553543649 * 2^31
-.word 188410825 // zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31
-.word 1525343575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31
-.word 102194337 // zeta^320 * 2^31 = 2551686^320 * 2^31 = 71938546 * 2^31
-.word 930402175 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 2551686^320 * 553543649 * 2^31
-.word 46425485 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31
-.word 715195411 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31
-.word 18047635 // zeta^352 * 2^31 = 2551686^352 * 2^31 = 113499661 * 2^31
-.word 326318029 // zeta^352 * f(q^(-1)
mod 2^32) * 2^31 = 2551686^352 * 553543649 * 2^31
-.word 141738191 // zeta^272 * 2^31 = 2551686^272 * 2^31 = 85313027 * 2^31
-.word 976175377 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 2551686^272 * 553543649 * 2^31
-.word 153146133 // zeta^336 * 2^31 = 2551686^336 * 2^31 = 5997883 * 2^31
-.word 3059943307 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 2551686^336 * 553543649 * 2^31
-.word 186641967 // zeta^304 * 2^31 = 2551686^304 * 2^31 = 23025139 * 2^31
-.word 762888113 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 2551686^304 * 553543649 * 2^31
-.word 75202623 // zeta^368 * 2^31 = 2551686^368 * 2^31 = 55506216 * 2^31
-.word 1079570337 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 2551686^368 * 553543649 * 2^31
-.word 194845711 // zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31
-.word 3720236497 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31
-.word 144399317 // zeta^328 * 2^31 = 2551686^328 * 2^31 = 68389803 * 2^31
-.word 808163019 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 2551686^328 * 553543649 * 2^31
-.word 163517617 // zeta^296 * 2^31 = 2551686^296 * 2^31 = 31137870 * 2^31
-.word 3860699503 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 2551686^296 * 553543649 * 2^31
-.word 87686087 // zeta^360 * 2^31 = 2551686^360 * 2^31 = 73412535 * 2^31
-.word 1682613529 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 2551686^360 * 553543649 * 2^31
-.word 184000835 // zeta^280 * 2^31 = 2551686^280 * 2^31 = 26922876 * 2^31
-.word 1509904157 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 2551686^280 * 553543649 * 2^31
-.word 138595745 // zeta^344 * 2^31 = 2551686^344 * 2^31 = 18107895 * 2^31
-.word 3067051647 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 2551686^344 * 553543649 * 2^31
-.word 150171031 // zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31
-.word 2502094153 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31
-.word 174748651 // zeta^376 * 2^31 = 2551686^376 * 2^31 = 37105779 * 2^31
-.word 2695008629 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 2551686^376 * 553543649 * 2^31
-.word 175413707 // zeta^260 * 2^31 = 2551686^260 * 2^31 = 82983426 * 2^31
-.word 3996788629 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 2551686^260 * 553543649 * 2^31
-.word 34023461 // zeta^324 * 2^31 = 2551686^324 * 2^31 = 85535112 * 2^31
-.word 3789083771 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 2551686^324 * 553543649 * 2^31
-.word 94572981 // zeta^292 * 2^31 = 2551686^292 * 2^31 = 105935694 * 2^31
-.word 1039637739 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 2551686^292 * 553543649 * 2^31
-.word 16992049 // zeta^356 * 2^31 = 2551686^356 * 2^31 = 96086481 * 2^31
-.word 131839727 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 2551686^356 * 553543649 * 2^31
-.word 174915101 // zeta^276 * 2^31 = 2551686^276 * 2^31 = 44861748 * 2^31
-.word 993066371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 2551686^276 * 553543649 * 2^31
-.word 52337083 // zeta^340 * 2^31 = 2551686^340 * 2^31 = 94283933 * 2^31
-.word 179742117 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 2551686^340 * 553543649 * 2^31
-.word 196358203 // zeta^308 * 2^31 = 2551686^308 * 2^31 = 110207263 * 2^31
-.word 4239384357 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 2551686^308 * 553543649 * 2^31
-.word 61406879 // zeta^372 * 2^31 = 2551686^372 * 2^31 = 15801685 * 2^31
-.word 1212431681 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 2551686^372 * 553543649 * 2^31
-.word 56545049 // zeta^268 * 2^31 = 2551686^268 * 2^31 = 57431980 * 2^31
-.word 1233841159 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 2551686^268 * 553543649 * 2^31
-.word 166393409 // zeta^332 * 2^31 = 2551686^332 * 2^31 = 7593194 * 2^31
-.word 1256942047 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 2551686^332 * 553543649 * 2^31
-.word 122763159 // zeta^300 * 2^31 = 2551686^300 * 2^31 = 27346470 * 2^31
-.word 3698221897 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 2551686^300 * 553543649 * 2^31
-.word 214986799 // zeta^364 * 2^31 = 2551686^364 * 2^31 = 108058451 * 2^31
-.word 2970123697 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 2551686^364 * 553543649 * 2^31
-.word 10772021 // zeta^284 * 2^31 = 2551686^284 * 2^31 = 3853404 * 2^31
-.word 889970795 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 2551686^284 * 553543649 * 2^31
-.word 187747281 // zeta^348 * 2^31 = 2551686^348 * 2^31 = 29210862 * 2^31
-.word 784189007 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 2551686^348 * 553543649 * 2^31
-.word 150428233 // zeta^316 * 2^31 = 2551686^316 * 2^31 = 90552496 * 2^31
-.word 3839379159 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 2551686^316 * 553543649 * 2^31
-.word 83423689 // zeta^380 * 2^31 = 2551686^380 * 2^31 = 111692057 * 2^31
-.word 1920835415 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 2551686^380 * 553543649 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input_scale
-ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N
-.word 28609785 // 1/96
-.word 3700025895 // 1/96 twisted
-.data
-roots:
-.word 71938545 /// zeta^256 * 2^31 = 2551686^256 * 2^31 = 71938545 * 2^31
-.word 1345396354 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 2551686^256 * 553543649 * 2^31
-.word 42887727 /// zeta^128 * 2^31 = 2551686^128 * 2^31 = 42887727 * 2^31
-.word 802087275 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 2551686^128 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31
-.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31
-.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31
-.word 103282154 // zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31
-.word 1931585264 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31
-.word 108828390 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31
-.word 32481077 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31
-.word 108828390 // zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31
-.word 2035311100 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31
-.word 75931399 // zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31
-.word 1420070803 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31
-.word 41413738 // zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31
-.word 774520698 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31
-.word 32481077 // zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31
-.word 607461863 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31
-.word 106011292 // zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31
-.word 1982625667 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31
-.word 79641225 // zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31
-.word 1489452056 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31
-.word 1 // XX: zeta^ 0 * 2^31 = 2551686^ 0 * 2^31 = 1 * 2^31
-.word 19 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 0 * 553543649 * 2^31
-.word 103282154 // XX: zeta^288 * 2^31 = 2551686^288 * 2^31 = 103282154 * 2^31
-.word 1931585264 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 2551686^288 * 553543649 * 2^31
-.word 108828390 // XX: zeta^144 * 2^31 = 2551686^144 * 2^31 = 108828390 * 2^31
-.word 2035311100 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 2551686^144 * 553543649 * 2^31
-.word 32481077 // XX: zeta^ 48 * 2^31 = 2551686^ 48 * 2^31 = 32481077 * 2^31
-.word 607461863 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 48 * 553543649 * 2^31
-.word 75931399 // XX: zeta^264 * 2^31 = 2551686^264 * 2^31 = 75931399 * 2^31
-.word 1420070803 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 2551686^264 * 553543649 * 2^31
-.word 41413738 // XX: zeta^168 * 2^31 = 2551686^168 * 2^31 = 41413738 * 2^31
-.word 774520698 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 2551686^168 * 553543649 * 2^31
-.word 106011292 // XX: zeta^ 24 * 2^31 = 2551686^ 24 * 2^31 = 106011292 * 2^31
-.word 1982625667 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 24 * 553543649 * 2^31
-.word 79641225 // XX: zeta^312 * 2^31 = 2551686^312 * 2^31 = 79641225 * 2^31
-.word 1489452056 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 2551686^312 * 553543649 * 2^31
-.word 29291161 // XX: zeta^132 * 2^31 = 2551686^132 * 2^31 = 29291161 * 2^31
-.word 547803979 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 2551686^132 * 553543649 * 2^31
-.word 104977060 // XX: zeta^ 36 * 2^31 = 2551686^ 36 * 2^31 = 104977060 * 2^31
-.word 1963283436 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 36 * 553543649 * 2^31
-.word 44861748 // XX: zeta^276 * 2^31 = 2551686^276 * 2^31 = 44861748 * 2^31
-.word 839005462 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 2551686^276 * 553543649 * 2^31
-.word 99024588 // XX: zeta^180 * 2^31 = 2551686^180 * 2^31 = 99024588 * 2^31
-.word 1851960165 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 2551686^180 * 553543649 * 2^31
-.word 64987487 // XX: zeta^ 12 * 2^31 = 2551686^ 12 * 2^31 = 64987487 * 2^31
-.word 1215397505 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 12 * 553543649 * 2^31
-.word 27346470 // XX: zeta^300 * 2^31 = 2551686^300 * 2^31 = 27346470 * 2^31
-.word 511434323 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 2551686^300 * 553543649 * 2^31
-.word 85615411 // XX: zeta^156 * 2^31 = 2551686^156 * 2^31 = 85615411 * 2^31
-.word 1601181422 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 2551686^156 * 553543649 * 2^31
-.word 21139561 // XX: zeta^ 60 * 2^31 = 2551686^ 60 * 2^31 = 21139561 * 2^31
-.word 395352565 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 2551686^ 60 * 553543649 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input, %function
-.global ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input
-ntt_384_u32_114826273_2551686_incomplete_good_oop_half_input:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 512
-add r14, r0, #512
-// Use r12 as marker for r0 + 1024
-add r12, r14, #512
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-// Use r10 as marker for r1 + 2016
-add r10, r11, #1008
-.equ modulus, -114826273
-movw r9, #:lower16:modulus
-movt r9, #:upper16:modulus
-ldr r8, roots_addr
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vmul.u32 Q2, Q0, r7
-vadd.s32 Q4, Q1, Q0
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 4)]
-vqrdmulh.s32 Q3, Q0, r6
-// input[4]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 4)]
-vsub.s32 Q5, Q1, Q0
-vmla.s32 Q2, Q3, r9
-vstrw.u32 Q4, [r1,#(0)]
-vadd.s32 Q3, Q1, Q2
-vstrw.u32 Q3, [r11,#(-496)]
-vsub.s32 Q5, Q5, Q2
-vstrw.u32 Q5, [r11,#(16)]
-// Release input[0] from Q1
-// Release input[128] from Q0
-// input[4]: Already loaded as Q7
-// input[132]: Already loaded as Q6
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q6, Q7
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[8]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 8)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-480)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1,
[r1,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(32)]
-// Release input[132] from Q6
-// Release input[4] from Q7
-// input[136]: Already loaded as Q4
-// input[8]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r7
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r6
-// input[140]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[136] from Q4
-vstrw.u32 Q2, [r11,#(48)]
-vsub.s32 Q4, Q1, Q5
-// Release input[8] from Q5
-vstrw.u32 Q4, [r11,#(-464)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(32)]
-// input[140]: Already loaded as Q6
-// input[12]: Already loaded as Q3
-vmul.u32 Q0, Q6, r7
-vadd.s32 Q2, Q3, Q6
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vqrdmulh.s32 Q1, Q6, r6
-// input[16]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q6
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(48)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-448)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(64)]
-// Release input[12] from Q3
-// Release input[140] from Q6
-// input[16]: Already loaded as Q7
-// input[144]: Already loaded as Q5
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q5, Q7
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[20]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 20)]
-vsub.s32 Q3, Q5, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-432)]
-vadd.s32 Q1, Q5, Q0
-vstrw.u32 Q1, [r1,#(64)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(80)]
-// Release input[144] from Q5
-// Release input[16] from Q7
-// input[148]: Already loaded as Q4
-// input[20]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-vmul.u32 Q1, Q0, r7
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vadd.s32 Q2, Q4, Q6
-vqrdmulh.s32 Q0, Q0, r6
-// input[152]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 24)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[148] from Q4
-vstrw.u32 Q2, [r11,#(96)]
-vsub.s32 Q4, Q1, Q6
-// Release input[20] from Q6
-vstrw.u32 Q4, [r11,#(-416)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(80)]
-// input[152]: Already loaded as Q5
-// input[24]: Already loaded as Q3
-vmul.u32 Q0, Q5, r7
-vadd.s32 Q2, Q3, Q5
-// input[156]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vqrdmulh.s32 Q1, Q5, r6
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q5
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(96)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-400)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(112)]
-// Release input[24] from Q3
-// Release input[152] from Q5
-// input[28]: Already loaded as Q7
-// input[156]: Already loaded as Q6
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q6, Q7
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[32]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 32)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-384)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(112)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(128)]
-// Release input[156] from Q6
-// Release input[28] from Q7
-// input[160]: Already loaded as Q4
-// input[32]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r7
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r6
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[160] from Q4
-vstrw.u32 Q2, [r11,#(144)]
-vsub.s32 Q4, Q1, Q5
-// Release input[32] from Q5
-vstrw.u32 Q4, [r11,#(-368)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(128)]
-// input[164]: Already loaded as Q6
-// input[36]: Already loaded as Q3
-vmul.u32 Q0, Q6, r7
-vadd.s32 Q2, Q3, Q6
-// input[168]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vqrdmulh.s32 Q1, Q6, r6
-// input[40]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 40)]
-vsub.s32 Q4, Q3, Q6
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(144)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-352)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(160)]
-// Release input[36] from Q3
-// Release input[164] from Q6
-// input[40]: Already loaded as Q7
-// input[168]: Already loaded as Q5
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q5, Q7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 44)]
-vsub.s32 Q3, Q5, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-336)]
-vadd.s32 Q1, Q5, Q0
-vstrw.u32 Q1, [r1,#(160)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(176)]
-// Release input[168] from Q5
-// Release input[40] from Q7
-// input[172]: Already loaded as Q4
-// input[44]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-vmul.u32 Q1, Q0, r7
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vadd.s32 Q2, Q4, Q6
-vqrdmulh.s32 Q0, Q0, r6
-// input[176]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[172] from Q4
-vstrw.u32 Q2, [r11,#(192)]
-vsub.s32 Q4, Q1, Q6
-// Release input[44] from Q6
-vstrw.u32 Q4, [r11,#(-320)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(176)]
-// input[176]: Already loaded as Q5
-// input[48]: Already loaded as Q3
-vmul.u32 Q0, Q5, r7
-vadd.s32 Q2, Q3, Q5
-// input[180]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmulh.s32 Q1, Q5, r6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vsub.s32 Q4, Q3, Q5
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-304)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(208)]
-// Release input[48] from Q3
-// Release input[176] from Q5
-// input[52]: Already loaded as Q7
-// input[180]: Already loaded as Q6
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q6, Q7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[56]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 56)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-288)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(208)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(224)]
-// Release input[180] from Q6
-// Release input[52]
from Q7 -// input[184]: Already loaded as Q4 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[188]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[184] from Q4 -vstrw.u32 Q2, [r11,#(240)] -vsub.s32 Q4, Q1, Q5 -// Release input[56] from Q5 -vstrw.u32 Q4, [r11,#(-272)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(224)] -// input[188]: Already loaded as Q6 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -vqrdmulh.s32 Q1, Q6, r6 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-256)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release input[60] from Q3 -// Release input[188] from Q6 -// input[64]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -vneg.s32 Q1, Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q5, r6 -vstrw.u32 Q5, [r11,#(-240)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(256)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(272)] -// Release input[64] from Q5 -// input[68]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -vneg.s32 Q1, Q3 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmulh.s32 Q2, Q3, r6 -vstrw.u32 Q3, [r11,#(288)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(272)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[68] from Q3 -// input[72]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(288)] -vstrw.u32 Q4, [r11,#(304)] -vstrw.u32 Q4, [r11,#(-208)] -// Release input[72] from Q4 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-192)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r11,#(320)] -// Release input[76] from Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(336)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(320)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[80] from Q4 -// input[84]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(336)] -vstrw.u32 Q3, [r11,#(352)] -vstrw.u32 Q3, [r11,#(-160)] -// Release input[84] from Q3 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-144)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(368)] -// Release input[88] from Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(384)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(368)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[92] from Q4 -// input[96]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(384)] -vstrw.u32 Q3, [r11,#(400)] -vstrw.u32 Q3, [r11,#(-112)] -// Release input[96] from Q3 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-96)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(400)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release input[100] from Q0 -// input[104]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(432)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(416)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[104] from Q4 -// input[108]: Already loaded as Q3 -vstrw.u32 
Q3, [r1,#(432)] -vstrw.u32 Q3, [r11,#(448)] -vstrw.u32 Q3, [r11,#(-64)] -// Release input[108] from Q3 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-48)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(448)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(464)] -// Release input[112] from Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(480)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(464)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[116] from Q4 -// input[120]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(480)] -vstrw.u32 Q3, [r11,#(496)] -vstrw.u32 Q3, [r11,#(-16)] -// Release input[120] from Q3 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(0)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(496)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[124] from Q0 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, 
[r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, 
#(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 
Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 
-vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// 
Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded 
as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 
-vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 
-vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 
-vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[132]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[280]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-384)] -// Release output[156] from Q2 
-vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(48)] -// Release output[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r6 -// output[136]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-480)] -// Release output[132] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[256]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[28]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(112)] -// Release output[280] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release output[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(16)] -// Release output[256] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[4]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 4)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[152]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(112)] -// Release output[28] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q4, Q0, Q6 
-vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r6 -// output[8]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release output[4] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[128]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[284]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-400)] -// Release output[152] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(32)] -// Release output[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r6 -// output[140]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-496)] -// Release output[128] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[260]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(128)] -// Release output[284] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release output[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 36)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[184]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(240)]
-// Release output[60] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release output[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release output[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[304]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[40]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 40)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(144)]
-// Release output[36] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[316]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release output[184] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release output[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(160)]
-// Release output[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-368)]
-// Release output[160] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[292]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 40)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[56]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 56)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(256)]
-// Release output[316] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[176]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[296]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(160)]
-// Release output[292] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[32]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 32)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(224)]
-// Release output[56] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release output[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release output[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[308]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[44]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 44)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(128)]
-// Release output[32] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release output[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(176)]
-// Release output[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[336]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[72]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 72)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-352)]
-// Release output[164] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[192]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release output[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(288)]
-// Release output[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[84]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[204]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-240)]
-// Release output[192] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[324]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[88]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 88)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(384)]
-// Release output[348] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(336)]
-// Release output[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release output[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[88]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[208]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[328]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 76)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(288)]
-// Release output[324] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[220]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(352)]
-// Release output[88] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release output[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(304)]
-// Release output[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[220]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[340]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[76]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 76)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(256)]
-// Release output[64] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[196]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-128)]
-// Release output[220] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(352)]
-// Release output[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(304)]
-// Release output[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[344]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[80]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 80)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[200]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -52)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-224)]
-// Release output[196] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[320]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[92]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(368)]
-// Release output[344] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(320)]
-// Release output[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-208)]
-// Release output[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[92]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[212]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[332]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 80)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(272)]
-// Release output[320] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[120]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 120)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(368)]
-// Release output[92] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-160)]
-// Release output[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(320)]
-// Release output[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[240]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[360]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 108)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(272)]
-// Release output[68] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[96]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 96)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[252]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 0)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(480)]
-// Release output[120] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release output[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(432)]
-// Release output[360] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[252]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[372]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 120)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[108]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 108)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(384)]
-// Release output[96] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[228]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -24)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[376]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(0)]
-// Release output[252] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(480)]
-// Release output[372] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(432)]
-// Release output[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[376]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[112]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 112)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[232]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-96)]
-// Release output[228] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[352]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 100)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[124]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release output[376] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(448)]
-// Release output[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release output[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[124]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[244]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[364]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 112)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(400)]
-// Release output[352] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[100]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 100)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[248]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(496)]
-// Release output[124] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release output[244] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(448)]
-// Release output[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[248]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[368]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[104]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 104)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(400)]
-// Release output[100] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[224]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -28)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[380]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release output[248] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release output[368] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(416)]
-// Release output[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[380]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[116]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 116)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[236]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-112)]
-// Release output[224] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[356]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 104)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-vmul.u32 Q1, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-496)]
-// Release output[380] from Q0
-vmla.s32 Q1, Q4, r9
-vstrw.u32 Q3, [r1,#(464)]
-// Release output[116] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r11,#(-64)]
-// Release output[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(416)]
-// Release output[356] from Q2
-ldrd r7, r6, [r8], #+8
-// output[132]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vmul.u32 Q1, Q0, r7
-// output[0]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 0)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vmla.s32 Q1, Q0, r9
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r11,#(-480)]
-// Release output[132] from Q0
-vadd.s32 Q2, Q2, Q1
-// output[4]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[256]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 4)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[260]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 8)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(0)]
-// Release output[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[260]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[128]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(16)]
-// Release output[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(32)]
-// Release output[260] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[12]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[264]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 12)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-496)]
-// Release output[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[268]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[136]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(48)]
-// Release output[264] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[140]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[8]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 8)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[276]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 24)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-464)]
-// Release output[136] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[276]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[144]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -108)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(32)]
-// Release output[8] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(96)]
-// Release output[276] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[148]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[16]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 16)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[20]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 20)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-432)]
-// Release output[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[20]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[272]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 20)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(64)]
-// Release output[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(80)]
-// Release output[20] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[156]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[24]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 24)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[28]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 28)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(80)]
-// Release output[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[28]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[280]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 28)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(96)]
-// Release output[24] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(112)]
-// Release output[28] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[284]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[152]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -100)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[36]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 36)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(112)]
-// Release output[280] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[36]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[288]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 36)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[292]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 40)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-400)]
-// Release output[152] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(144)]
-// Release output[36] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[292]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[160]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -92)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[164]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(144)]
-// Release output[288] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(160)]
-// Release output[292] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[164]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[32]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 32)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[300]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 48)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-368)]
-// Release output[160] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-352)]
-// Release output[164] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[300]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[168]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -84)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(128)]
-// Release output[32] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(192)]
-// Release output[300] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[172]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[40]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 40)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[44]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 44)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-336)]
-// Release output[168] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[44]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[296]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 44)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[180]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(160)]
-// Release output[40] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(176)]
-// Release output[44] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[180]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[48]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 48)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(176)]
-// Release output[296] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-288)]
-// Release output[180] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[52]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[304]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 52)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[308]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 56)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(192)]
-// Release output[48] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[308]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[176]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[60]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 60)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(208)]
-// Release output[304] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(224)]
-// Release output[308] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[60]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[316]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 64)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-304)]
-// Release output[176] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(240)]
-// Release output[60] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[316]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[184]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[188]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -64)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(256)]
-// Release output[316] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[188]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[56]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 56)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[324]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 72)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-272)]
-// Release output[184] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-256)]
-// Release output[188] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[324]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[192]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[196]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -56)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(224)]
-// Release output[56] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(288)]
-// Release output[324] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[196]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[64]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 64)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[68]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 68)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-240)]
-// Release output[192] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-224)]
-// Release output[196] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[68]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[204]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(256)]
-// Release output[64] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(272)]
-// Release output[68] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[204]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[72]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 72)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[76]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 76)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(272)]
-// Release output[320] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-192)]
-// Release output[204] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[76]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[332]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 80)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(288)]
-// Release output[72] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(304)]
-// Release output[76] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[332]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[200]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -52)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[84]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 84)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(304)]
-// Release output[328] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(320)]
-// Release output[332] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[84]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[336]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 84)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[340]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-208)]
-// Release output[200] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(336)]
-// Release output[84] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[340]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[208]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[212]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -40)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(336)]
-// Release output[336] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(352)]
-// Release output[340] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[212]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[348]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 96)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-176)]
-// Release output[208] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-160)]
-// Release output[212] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[348]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[220]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(320)]
-// Release output[80] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(384)]
-// Release output[348] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[220]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[88]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 88)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[92]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 92)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-128)]
-// Release output[220] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[92]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[344]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 92)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[228]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -24)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(352)]
-// Release output[88] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(368)]
-// Release output[92] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[228]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[96]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 96)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[100]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 100)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(368)]
-// Release output[344] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-96)]
-// Release output[228] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[100]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[352]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 100)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[356]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 104)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(384)]
-// Release output[96] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(400)]
-// Release output[100] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[356]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[224]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -28)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[108]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 108)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(400)]
-// Release output[352] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(416)]
-// Release output[356] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[108]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[360]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 108)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[364]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 112)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-112)]
-// Release output[224] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(432)]
-// Release output[108] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[364]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[232]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -20)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[236]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(432)]
-// Release output[360] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(448)]
-// Release output[364] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[236]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[104]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 104)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[372]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 120)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-80)]
-// Release output[232] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-64)]
-// Release output[236] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[372]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[240]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -12)]
-vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, 
r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 3741423647 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3042 -// Instruction count: 2201 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good.s deleted file mode 100644 index 0c7fe99..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_128919937_4666088_incomplete_good_twiddles -ntt_384_u32_128919937_4666088_incomplete_good_twiddles: // For base multiplication -.word 11080701 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 3608555395 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 230921307 // zeta^ 64 * 2^31 = 4666088^ 64 * 2^31 = 126696090 * 2^31 -.word 719940645 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 64 * 1521161857 * 2^31 -.word 102515993 // zeta^ 32 * 2^31 = 4666088^ 32 * 2^31 = 35786897 * 2^31 -.word 2350778471 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 32 * 1521161857 * 2^31 -.word 70386737 // zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31 -.word 2410982735 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31 -.word 138595355 // zeta^ 16 * 2^31 = 4666088^ 16 * 2^31 = 84055869 * 2^31 -.word 1153684581 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 16 * 1521161857 * 2^31 -.word 217112191 // zeta^ 80 * 2^31 = 4666088^ 80 * 2^31 = 70545107 * 2^31 -.word 2192308225 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 80 * 1521161857 * 2^31 -.word 149523283 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 837200173 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 101017423 // zeta^112 * 2^31 = 4666088^112 * 2^31 = 83528165 * 2^31 -.word 3097569073 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4666088^112 * 1521161857 * 2^31 -.word 99656329 // zeta^ 8 * 2^31 = 4666088^ 8 * 2^31 = 120423310 * 2^31 -.word 2554548983 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 8 * 1521161857 * 2^31 -.word 162941809 // zeta^ 72 * 2^31 = 4666088^ 72 * 2^31 = 47897664 * 2^31 -.word 1378698767 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 72 * 1521161857 * 2^31 -.word 180768417 // zeta^ 40 * 2^31 = 4666088^ 40 * 2^31 = 69713041 * 2^31 -.word 
1740999391 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 40 * 1521161857 * 2^31 -.word 197631563 // zeta^104 * 2^31 = 4666088^104 * 2^31 = 115031316 * 2^31 -.word 3468020277 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4666088^104 * 1521161857 * 2^31 -.word 87331641 // zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 2059921991 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 4770345 // zeta^ 88 * 2^31 = 4666088^ 88 * 2^31 = 124196042 * 2^31 -.word 646945623 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 88 * 1521161857 * 2^31 -.word 18419051 // zeta^ 56 * 2^31 = 4666088^ 56 * 2^31 = 80622849 * 2^31 -.word 3066039061 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 56 * 1521161857 * 2^31 -.word 64594065 // zeta^120 * 2^31 = 4666088^120 * 2^31 = 72023844 * 2^31 -.word 548315375 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4666088^120 * 1521161857 * 2^31 -.word 56313901 // zeta^ 4 * 2^31 = 4666088^ 4 * 2^31 = 4666088 * 2^31 -.word 1857930067 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 4 * 1521161857 * 2^31 -.word 100524275 // zeta^ 68 * 2^31 = 4666088^ 68 * 2^31 = 99928594 * 2^31 -.word 2037105549 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 68 * 1521161857 * 2^31 -.word 113982537 // zeta^ 36 * 2^31 = 4666088^ 36 * 2^31 = 101970253 * 2^31 -.word 765810999 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 36 * 1521161857 * 2^31 -.word 78770821 // zeta^100 * 2^31 = 4666088^100 * 2^31 = 60361599 * 2^31 -.word 2527485179 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4666088^100 * 1521161857 * 2^31 -.word 8006691 // zeta^ 20 * 2^31 = 4666088^ 20 * 2^31 = 117614805 * 2^31 -.word 2034483293 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 20 * 1521161857 * 2^31 -.word 250258415 // zeta^ 84 * 2^31 = 4666088^ 84 * 2^31 = 78048497 * 2^31 -.word 3406289553 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 84 * 1521161857 * 2^31 -.word 10390241 // zeta^ 52 * 2^31 = 4666088^ 52 * 2^31 = 49398437 * 2^31 -.word 836873887 
// zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 52 * 1521161857 * 2^31 -.word 14031383 // zeta^116 * 2^31 = 4666088^116 * 2^31 = 46189616 * 2^31 -.word 2860638313 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4666088^116 * 1521161857 * 2^31 -.word 141427479 // zeta^ 12 * 2^31 = 4666088^ 12 * 2^31 = 94340749 * 2^31 -.word 636450665 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 12 * 1521161857 * 2^31 -.word 3413487 // zeta^ 76 * 2^31 = 4666088^ 76 * 2^31 = 14874791 * 2^31 -.word 3208210577 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 76 * 1521161857 * 2^31 -.word 92531221 // zeta^ 44 * 2^31 = 4666088^ 44 * 2^31 = 8773444 * 2^31 -.word 610418027 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 44 * 1521161857 * 2^31 -.word 21375489 // zeta^108 * 2^31 = 4666088^108 * 2^31 = 75066449 * 2^31 -.word 3948789631 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4666088^108 * 1521161857 * 2^31 -.word 152464147 // zeta^ 28 * 2^31 = 4666088^ 28 * 2^31 = 80969678 * 2^31 -.word 1255106925 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 28 * 1521161857 * 2^31 -.word 177837625 // zeta^ 92 * 2^31 = 4666088^ 92 * 2^31 = 105375752 * 2^31 -.word 329213767 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 92 * 1521161857 * 2^31 -.word 50081627 // zeta^ 60 * 2^31 = 4666088^ 60 * 2^31 = 33121106 * 2^31 -.word 856772901 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 60 * 1521161857 * 2^31 -.word 57615231 // zeta^124 * 2^31 = 4666088^124 * 2^31 = 71071176 * 2^31 -.word 1545417473 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4666088^124 * 1521161857 * 2^31 -.word 90920669 // zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 1406352547 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 246759173 // zeta^192 * 2^31 = 4666088^192 * 2^31 = 128919936 * 2^31 -.word 686411899 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4666088^192 * 1521161857 * 2^31 -.word 96790681 // zeta^160 * 2^31 = 4666088^160 * 2^31 = 107269247 * 2^31 -.word 60204263 // 
zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4666088^160 * 1521161857 * 2^31 -.word 155323881 // zeta^224 * 2^31 = 4666088^224 * 2^31 = 93133040 * 2^31 -.word 1944188823 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4666088^224 * 1521161857 * 2^31 -.word 207436773 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1038623643 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 119244519 // zeta^208 * 2^31 = 4666088^208 * 2^31 = 44864068 * 2^31 -.word 3141282713 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4666088^208 * 1521161857 * 2^31 -.word 80414077 // zeta^176 * 2^31 = 4666088^176 * 2^31 = 45315884 * 2^31 -.word 2260368899 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4666088^176 * 1521161857 * 2^31 -.word 108316591 // zeta^240 * 2^31 = 4666088^240 * 2^31 = 90707656 * 2^31 -.word 3457767121 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4666088^240 * 1521161857 * 2^31 -.word 192205417 // zeta^136 * 2^31 = 4666088^136 * 2^31 = 56394291 * 2^31 -.word 3119117079 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4666088^136 * 1521161857 * 2^31 -.word 158183545 // zeta^200 * 2^31 = 4666088^200 * 2^31 = 8496627 * 2^31 -.word 1740418311 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4666088^200 * 1521161857 * 2^31 -.word 145783083 // zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 1727020885 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 77071457 // zeta^232 * 2^31 = 4666088^232 * 2^31 = 59206896 * 2^31 -.word 2553967903 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4666088^232 * 1521161857 * 2^31 -.word 46358641 // zeta^152 * 2^31 = 4666088^152 * 2^31 = 49737683 * 2^31 -.word 2881990927 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4666088^152 * 1521161857 * 2^31 -.word 170508233 // zeta^216 * 2^31 = 4666088^216 * 2^31 = 54461578 * 2^31 -.word 2235045303 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4666088^216 * 1521161857 * 2^31 -.word 175094951 // zeta^184 * 2^31 = 4666088^184 * 2^31 = 120320932 * 2^31 -.word 1777243609 
// zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4666088^184 * 1521161857 * 2^31 -.word 239420823 // zeta^248 * 2^31 = 4666088^248 * 2^31 = 48297088 * 2^31 -.word 1228928233 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4666088^248 * 1521161857 * 2^31 -.word 173130311 // zeta^132 * 2^31 = 4666088^132 * 2^31 = 95262506 * 2^31 -.word 179175481 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4666088^132 * 1521161857 * 2^31 -.word 201525973 // zeta^196 * 2^31 = 4666088^196 * 2^31 = 124253849 * 2^31 -.word 2437037227 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4666088^196 * 1521161857 * 2^31 -.word 93708221 // zeta^164 * 2^31 = 4666088^164 * 2^31 = 87311283 * 2^31 -.word 1761674179 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4666088^164 * 1521161857 * 2^31 -.word 143857337 // zeta^228 * 2^31 = 4666088^228 * 2^31 = 26949684 * 2^31 -.word 3529156295 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4666088^228 * 1521161857 * 2^31 -.word 113331787 // zeta^148 * 2^31 = 4666088^148 * 2^31 = 89353629 * 2^31 -.word 1371806261 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4666088^148 * 1521161857 * 2^31 -.word 249833183 // zeta^212 * 2^31 = 4666088^212 * 2^31 = 11305132 * 2^31 -.word 2260484001 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4666088^212 * 1521161857 * 2^31 -.word 132561079 // zeta^180 * 2^31 = 4666088^180 * 2^31 = 125711116 * 2^31 -.word 2023764425 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4666088^180 * 1521161857 * 2^31 -.word 247449633 // zeta^244 * 2^31 = 4666088^244 * 2^31 = 79521500 * 2^31 -.word 3458093407 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4666088^244 * 1521161857 * 2^31 -.word 248745819 // zeta^140 * 2^31 = 4666088^140 * 2^31 = 49453979 * 2^31 -.word 2571759909 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4666088^140 * 1521161857 * 2^31 -.word 116412395 // zeta^204 * 2^31 = 4666088^204 * 2^31 = 34579188 * 2^31 -.word 3658516629 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4666088^204 * 1521161857 * 2^31 -.word 57764205 // zeta^172 * 2^31 = 4666088^172 * 2^31 = 66293005 * 2^31 -.word 
3338371603 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4666088^172 * 1521161857 * 2^31 -.word 165308653 // zeta^236 * 2^31 = 4666088^236 * 2^31 = 120146493 * 2^31 -.word 3684549267 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4666088^236 * 1521161857 * 2^31 -.word 154293415 // zeta^156 * 2^31 = 4666088^156 * 2^31 = 24406074 * 2^31 -.word 3369074137 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4666088^156 * 1521161857 * 2^31 -.word 105375727 // zeta^220 * 2^31 = 4666088^220 * 2^31 = 47950259 * 2^31 -.word 3039860369 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4666088^220 * 1521161857 * 2^31 -.word 136453541 // zeta^188 * 2^31 = 4666088^188 * 2^31 = 37950070 * 2^31 -.word 688644571 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4666088^188 * 1521161857 * 2^31 -.word 207758247 // zeta^252 * 2^31 = 4666088^252 * 2^31 = 95798831 * 2^31 -.word 3438194393 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4666088^252 * 1521161857 * 2^31 -.word 26918567 // zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 3575026649 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 166919205 // zeta^320 * 2^31 = 4666088^320 * 2^31 = 2223848 * 2^31 -.word 2888614747 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4666088^320 * 1521161857 * 2^31 -.word 187453137 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1883984559 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 161049193 // zeta^352 * 2^31 = 4666088^352 * 2^31 = 21650690 * 2^31 -.word 4234763031 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4666088^352 * 1521161857 * 2^31 -.word 40727683 // zeta^272 * 2^31 = 4666088^272 * 2^31 = 58374830 * 2^31 -.word 2102659069 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4666088^272 * 1521161857 * 2^31 -.word 50403101 // zeta^336 * 2^31 = 4666088^336 * 2^31 = 13510762 * 2^31 -.word 3256343651 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4666088^336 * 1521161857 * 2^31 -.word 156822451 // zeta^304 * 2^31 = 4666088^304 * 2^31 = 45391772 * 2^31 -.word 
1197398221 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4666088^304 * 1521161857 * 2^31 -.word 177425797 // zeta^368 * 2^31 = 4666088^368 * 2^31 = 83604053 * 2^31 -.word 2034598395 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4666088^368 * 1521161857 * 2^31 -.word 94898065 // zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 2916268527 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 65634457 // zeta^328 * 2^31 = 4666088^328 * 2^31 = 72525646 * 2^31 -.word 1175850215 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4666088^328 * 1521161857 * 2^31 -.word 60208311 // zeta^296 * 2^31 = 4666088^296 * 2^31 = 13888621 * 2^31 -.word 826947017 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4666088^296 * 1521161857 * 2^31 -.word 112056791 // zeta^360 * 2^31 = 4666088^360 * 2^31 = 83601662 * 2^31 -.word 2567946409 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4666088^360 * 1521161857 * 2^31 -.word 253069529 // zeta^280 * 2^31 = 4666088^280 * 2^31 = 4723895 * 2^31 -.word 3648021671 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4666088^280 * 1521161857 * 2^31 -.word 211481233 // zeta^344 * 2^31 = 4666088^344 * 2^31 = 79182254 * 2^31 -.word 1412976367 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4666088^344 * 1521161857 * 2^31 -.word 193245809 // zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 3746651919 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 82744923 // zeta^376 * 2^31 = 4666088^376 * 2^31 = 8599005 * 2^31 -.word 2517723685 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4666088^376 * 1521161857 * 2^31 -.word 157315599 // zeta^260 * 2^31 = 4666088^260 * 2^31 = 28991343 * 2^31 -.word 2257861745 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4666088^260 * 1521161857 * 2^31 -.word 84709563 // zeta^324 * 2^31 = 4666088^324 * 2^31 = 33657431 * 2^31 -.word 4115791813 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4666088^324 * 1521161857 * 2^31 -.word 179069053 // zeta^292 * 2^31 = 4666088^292 * 2^31 = 68558338 * 2^31 -.word 
1767482115 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4666088^292 * 1521161857 * 2^31 -.word 164131653 // zeta^356 * 2^31 = 4666088^356 * 2^31 = 41608654 * 2^31 -.word 2533293115 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4666088^356 * 1521161857 * 2^31 -.word 7581459 // zeta^276 * 2^31 = 4666088^276 * 2^31 = 50871440 * 2^31 -.word 888677741 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4666088^276 * 1521161857 * 2^31 -.word 144508087 // zeta^340 * 2^31 = 4666088^340 * 2^31 = 39566308 * 2^31 -.word 2923161033 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4666088^340 * 1521161857 * 2^31 -.word 243808491 // zeta^308 * 2^31 = 4666088^308 * 2^31 = 82730321 * 2^31 -.word 1434328981 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4666088^308 * 1521161857 * 2^31 -.word 125278795 // zeta^372 * 2^31 = 4666088^372 * 2^31 = 3208821 * 2^31 -.word 2271202869 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4666088^372 * 1521161857 * 2^31 -.word 254426387 // zeta^268 * 2^31 = 4666088^268 * 2^31 = 114045146 * 2^31 -.word 1086756717 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4666088^268 * 1521161857 * 2^31 -.word 9094055 // zeta^332 * 2^31 = 4666088^332 * 2^31 = 79465958 * 2^31 -.word 1723207385 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4666088^332 * 1521161857 * 2^31 -.word 236464385 // zeta^300 * 2^31 = 4666088^300 * 2^31 = 53853488 * 2^31 -.word 346177663 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4666088^300 * 1521161857 * 2^31 -.word 200075669 // zeta^364 * 2^31 = 4666088^364 * 2^31 = 62626932 * 2^31 -.word 956595691 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4666088^364 * 1521161857 * 2^31 -.word 80002249 // zeta^284 * 2^31 = 4666088^284 * 2^31 = 23544185 * 2^31 -.word 3965753527 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4666088^284 * 1521161857 * 2^31 -.word 103546459 // zeta^348 * 2^31 = 4666088^348 * 2^31 = 104513863 * 2^31 -.word 925893157 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4666088^348 * 1521161857 * 2^31 -.word 200224643 // zeta^316 * 2^31 = 4666088^316 * 2^31 = 57848761 * 2^31 -.word 
2749549821 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4666088^316 * 1521161857 * 2^31 -.word 121386333 // zeta^380 * 2^31 = 4666088^380 * 2^31 = 90969867 * 2^31 -.word 3606322723 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4666088^380 * 1521161857 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_128919937_4666088_incomplete_good_scale -ntt_384_u32_128919937_4666088_incomplete_good_scale: // Constants for scaling by 1/N -.word 11080701 // 1/96 -.word 3608555395 // 1/96 twisted -.data -roots: -.word 2223847 /// zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 126696089 /// zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 115409175 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 
1922428151 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 38212281 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 115409175 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 81022273 // zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 1349628385 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 45318275 // zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 754889095 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 38212281 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 74458359 // zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 1240289998 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 56896093 // zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 947746580 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // XX: zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 115409175 // XX: zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 38212281 // XX: zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 81022273 // XX: zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 
1349628385 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 45318275 // XX: zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 754889095 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 74458359 // XX: zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 1240289998 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 56896093 // XX: zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 947746580 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 95262506 // XX: zeta^132 * 2^31 = 4666088^132 * 2^31 = 95262506 * 2^31 -.word 1586835044 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4666088^132 * 1521161857 * 2^31 -.word 101970253 // XX: zeta^ 36 * 2^31 = 4666088^ 36 * 2^31 = 101970253 * 2^31 -.word 1698569329 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 36 * 1521161857 * 2^31 -.word 50871440 // XX: zeta^276 * 2^31 = 4666088^276 * 2^31 = 50871440 * 2^31 -.word 847390932 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4666088^276 * 1521161857 * 2^31 -.word 125711116 // XX: zeta^180 * 2^31 = 4666088^180 * 2^31 = 125711116 * 2^31 -.word 2094032717 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4666088^180 * 1521161857 * 2^31 -.word 94340749 // XX: zeta^ 12 * 2^31 = 4666088^ 12 * 2^31 = 94340749 * 2^31 -.word 1571480878 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 12 * 1521161857 * 2^31 -.word 53853488 // XX: zeta^300 * 2^31 = 4666088^300 * 2^31 = 53853488 * 2^31 -.word 897064392 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4666088^300 * 1521161857 * 2^31 -.word 24406074 // XX: zeta^156 * 2^31 = 4666088^156 * 2^31 = 24406074 * 2^31 -.word 406544139 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4666088^156 * 1521161857 * 2^31 -.word 33121106 // XX: zeta^ 60 * 2^31 = 4666088^ 60 * 2^31 = 33121106 * 2^31 -.word 551714771 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 60 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: 
.word roots -.syntax unified -.type ntt_384_u32_128919937_4666088_incomplete_good, %function -.global ntt_384_u32_128919937_4666088_incomplete_good -ntt_384_u32_128919937_4666088_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, -128919937 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r8 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release 
input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 
Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[288]: Load
as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 
-vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] 
-vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vmul.u32 Q6, Q4, r7 -vsub.s32 
Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] 
from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, 
Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] 
-// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, 
Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r8 -// 
input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, 
Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] 
-vqrdmulh.s32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[164]: Load as 
Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] 
-vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release 
input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 
-vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// 
Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, 
[r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 
-vmul.u32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 2773805439 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_bitrev.s deleted file mode 100644 index 4e001a3..0000000 --- 
a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 126696089 /// zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 2223847 /// zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 14136207 // zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31 -.word 235473846 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 14136207 // zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31 -.word 235473846 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31 -.word 14136207 // zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31 -.word 235473846 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31 -.word 90707656 // zeta^240 * 2^31 = 4666088^240 * 2^31 = 90707656 * 2^31 -.word 1510962637 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4666088^240 * 1521161857 * 2^31 -.word 13510762 // zeta^336 * 2^31 = 4666088^336 * 2^31 = 13510762 * 2^31 -.word 225055497 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4666088^336 * 1521161857 * 2^31 -.word 90707656 // zeta^240 * 2^31 = 4666088^240 * 2^31 = 90707656 * 2^31 -.word 1510962637 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4666088^240 * 1521161857 * 2^31 -.word 72023844 // zeta^120 * 2^31 = 4666088^120 * 2^31 = 72023844 * 2^31 -.word 1199737068 
// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4666088^120 * 1521161857 * 2^31 -.word 54461578 // zeta^216 * 2^31 = 4666088^216 * 2^31 = 54461578 * 2^31 -.word 907193650 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4666088^216 * 1521161857 * 2^31 -.word 13510762 // zeta^336 * 2^31 = 4666088^336 * 2^31 = 13510762 * 2^31 -.word 225055497 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4666088^336 * 1521161857 * 2^31 -.word 83601662 // zeta^360 * 2^31 = 4666088^360 * 2^31 = 83601662 * 2^31 -.word 1392594553 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4666088^360 * 1521161857 * 2^31 -.word 47897664 // zeta^ 72 * 2^31 = 4666088^ 72 * 2^31 = 47897664 * 2^31 -.word 797855263 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 72 * 1521161857 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 14136207 // XX: zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31 -.word 235473846 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31 -.word 90707656 // XX: zeta^240 * 2^31 = 4666088^240 * 2^31 = 90707656 * 2^31 -.word 1510962637 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4666088^240 * 1521161857 * 2^31 -.word 13510762 // XX: zeta^336 * 2^31 = 4666088^336 * 2^31 = 13510762 * 2^31 -.word 225055497 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4666088^336 * 1521161857 * 2^31 -.word 72023844 // XX: zeta^120 * 2^31 = 4666088^120 * 2^31 = 72023844 * 2^31 -.word 1199737068 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4666088^120 * 1521161857 * 2^31 -.word 54461578 // XX: zeta^216 * 2^31 = 4666088^216 * 2^31 = 54461578 * 2^31 -.word 907193650 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4666088^216 * 1521161857 * 2^31 -.word 83601662 // XX: zeta^360 * 2^31 = 4666088^360 * 2^31 = 83601662 * 2^31 -.word 1392594553 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4666088^360 * 1521161857 * 2^31 -.word 47897664 // XX: zeta^ 72 * 2^31 = 4666088^ 72 * 2^31 = 47897664 * 2^31 -.word 797855263 
/// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 72 * 1521161857 * 2^31 -.word 95798831 // XX: zeta^252 * 2^31 = 4666088^252 * 2^31 = 95798831 * 2^31 -.word 1595768877 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4666088^252 * 1521161857 * 2^31 -.word 104513863 // XX: zeta^348 * 2^31 = 4666088^348 * 2^31 = 104513863 * 2^31 -.word 1740939509 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4666088^348 * 1521161857 * 2^31 -.word 75066449 // XX: zeta^108 * 2^31 = 4666088^108 * 2^31 = 75066449 * 2^31 -.word 1250419256 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4666088^108 * 1521161857 * 2^31 -.word 34579188 // XX: zeta^204 * 2^31 = 4666088^204 * 2^31 = 34579188 * 2^31 -.word 576002770 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4666088^204 * 1521161857 * 2^31 -.word 3208821 // XX: zeta^372 * 2^31 = 4666088^372 * 2^31 = 3208821 * 2^31 -.word 53450931 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4666088^372 * 1521161857 * 2^31 -.word 78048497 // XX: zeta^ 84 * 2^31 = 4666088^ 84 * 2^31 = 78048497 * 2^31 -.word 1300092716 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 84 * 1521161857 * 2^31 -.word 26949684 // XX: zeta^228 * 2^31 = 4666088^228 * 2^31 = 26949684 * 2^31 -.word 448914319 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4666088^228 * 1521161857 * 2^31 -.word 33657431 // XX: zeta^324 * 2^31 = 4666088^324 * 2^31 = 33657431 * 2^31 -.word 560648604 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4666088^324 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_128919937_4666088_incomplete_good_bitrev, %function -.global ntt_384_u32_128919937_4666088_incomplete_good_bitrev -ntt_384_u32_128919937_4666088_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, -128919937 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, 
[r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vqrdmulh.s32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vmla.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] 
-// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release 
input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[24]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// 
Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 
100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, 
#(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, 
Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[204]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -48)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[68]:
Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 
-48)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 
Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q4, Q4, r8 -// input[224]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -28)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vmla.s32 
Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release 
input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// 
input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// 
input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, 
#(4 * -80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// 
input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 2773805439 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop.s b/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop.s deleted file mode 100644 index a65a757..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop.s +++ /dev/null @@ -1,3388 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies 
of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_128919937_4666088_incomplete_good_oop_twiddles -ntt_384_u32_128919937_4666088_incomplete_good_oop_twiddles: // For base multiplication -.word 11080701 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 3608555395 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 230921307 // zeta^ 64 * 2^31 = 4666088^ 64 * 2^31 = 126696090 * 2^31 -.word 719940645 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 64 * 1521161857 * 2^31 -.word 102515993 // zeta^ 32 * 2^31 = 4666088^ 32 * 2^31 = 35786897 * 2^31 -.word 2350778471 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 32 * 1521161857 * 2^31 -.word 70386737 // zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31 -.word 2410982735 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31 -.word 138595355 // zeta^ 16 * 2^31 = 4666088^ 16 * 2^31 = 84055869 * 2^31 -.word 1153684581 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 16 * 1521161857 * 2^31 -.word 217112191 // zeta^ 80 * 2^31 = 4666088^ 80 * 2^31 = 70545107 * 2^31 -.word 2192308225 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 
80 * 1521161857 * 2^31 -.word 149523283 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 837200173 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 101017423 // zeta^112 * 2^31 = 4666088^112 * 2^31 = 83528165 * 2^31 -.word 3097569073 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4666088^112 * 1521161857 * 2^31 -.word 99656329 // zeta^ 8 * 2^31 = 4666088^ 8 * 2^31 = 120423310 * 2^31 -.word 2554548983 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 8 * 1521161857 * 2^31 -.word 162941809 // zeta^ 72 * 2^31 = 4666088^ 72 * 2^31 = 47897664 * 2^31 -.word 1378698767 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 72 * 1521161857 * 2^31 -.word 180768417 // zeta^ 40 * 2^31 = 4666088^ 40 * 2^31 = 69713041 * 2^31 -.word 1740999391 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 40 * 1521161857 * 2^31 -.word 197631563 // zeta^104 * 2^31 = 4666088^104 * 2^31 = 115031316 * 2^31 -.word 3468020277 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4666088^104 * 1521161857 * 2^31 -.word 87331641 // zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 2059921991 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 4770345 // zeta^ 88 * 2^31 = 4666088^ 88 * 2^31 = 124196042 * 2^31 -.word 646945623 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 88 * 1521161857 * 2^31 -.word 18419051 // zeta^ 56 * 2^31 = 4666088^ 56 * 2^31 = 80622849 * 2^31 -.word 3066039061 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 56 * 1521161857 * 2^31 -.word 64594065 // zeta^120 * 2^31 = 4666088^120 * 2^31 = 72023844 * 2^31 -.word 548315375 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4666088^120 * 1521161857 * 2^31 -.word 56313901 // zeta^ 4 * 2^31 = 4666088^ 4 * 2^31 = 4666088 * 2^31 -.word 1857930067 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 4 * 1521161857 * 2^31 -.word 100524275 // zeta^ 68 * 2^31 = 4666088^ 68 * 2^31 = 99928594 * 2^31 -.word 2037105549 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 68 * 
1521161857 * 2^31 -.word 113982537 // zeta^ 36 * 2^31 = 4666088^ 36 * 2^31 = 101970253 * 2^31 -.word 765810999 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 36 * 1521161857 * 2^31 -.word 78770821 // zeta^100 * 2^31 = 4666088^100 * 2^31 = 60361599 * 2^31 -.word 2527485179 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4666088^100 * 1521161857 * 2^31 -.word 8006691 // zeta^ 20 * 2^31 = 4666088^ 20 * 2^31 = 117614805 * 2^31 -.word 2034483293 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 20 * 1521161857 * 2^31 -.word 250258415 // zeta^ 84 * 2^31 = 4666088^ 84 * 2^31 = 78048497 * 2^31 -.word 3406289553 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 84 * 1521161857 * 2^31 -.word 10390241 // zeta^ 52 * 2^31 = 4666088^ 52 * 2^31 = 49398437 * 2^31 -.word 836873887 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 52 * 1521161857 * 2^31 -.word 14031383 // zeta^116 * 2^31 = 4666088^116 * 2^31 = 46189616 * 2^31 -.word 2860638313 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4666088^116 * 1521161857 * 2^31 -.word 141427479 // zeta^ 12 * 2^31 = 4666088^ 12 * 2^31 = 94340749 * 2^31 -.word 636450665 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 12 * 1521161857 * 2^31 -.word 3413487 // zeta^ 76 * 2^31 = 4666088^ 76 * 2^31 = 14874791 * 2^31 -.word 3208210577 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 76 * 1521161857 * 2^31 -.word 92531221 // zeta^ 44 * 2^31 = 4666088^ 44 * 2^31 = 8773444 * 2^31 -.word 610418027 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 44 * 1521161857 * 2^31 -.word 21375489 // zeta^108 * 2^31 = 4666088^108 * 2^31 = 75066449 * 2^31 -.word 3948789631 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4666088^108 * 1521161857 * 2^31 -.word 152464147 // zeta^ 28 * 2^31 = 4666088^ 28 * 2^31 = 80969678 * 2^31 -.word 1255106925 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 28 * 1521161857 * 2^31 -.word 177837625 // zeta^ 92 * 2^31 = 4666088^ 92 * 2^31 = 105375752 * 2^31 -.word 329213767 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 92 * 
1521161857 * 2^31 -.word 50081627 // zeta^ 60 * 2^31 = 4666088^ 60 * 2^31 = 33121106 * 2^31 -.word 856772901 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 60 * 1521161857 * 2^31 -.word 57615231 // zeta^124 * 2^31 = 4666088^124 * 2^31 = 71071176 * 2^31 -.word 1545417473 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4666088^124 * 1521161857 * 2^31 -.word 90920669 // zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 1406352547 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 246759173 // zeta^192 * 2^31 = 4666088^192 * 2^31 = 128919936 * 2^31 -.word 686411899 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4666088^192 * 1521161857 * 2^31 -.word 96790681 // zeta^160 * 2^31 = 4666088^160 * 2^31 = 107269247 * 2^31 -.word 60204263 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4666088^160 * 1521161857 * 2^31 -.word 155323881 // zeta^224 * 2^31 = 4666088^224 * 2^31 = 93133040 * 2^31 -.word 1944188823 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4666088^224 * 1521161857 * 2^31 -.word 207436773 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1038623643 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 119244519 // zeta^208 * 2^31 = 4666088^208 * 2^31 = 44864068 * 2^31 -.word 3141282713 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4666088^208 * 1521161857 * 2^31 -.word 80414077 // zeta^176 * 2^31 = 4666088^176 * 2^31 = 45315884 * 2^31 -.word 2260368899 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4666088^176 * 1521161857 * 2^31 -.word 108316591 // zeta^240 * 2^31 = 4666088^240 * 2^31 = 90707656 * 2^31 -.word 3457767121 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4666088^240 * 1521161857 * 2^31 -.word 192205417 // zeta^136 * 2^31 = 4666088^136 * 2^31 = 56394291 * 2^31 -.word 3119117079 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4666088^136 * 1521161857 * 2^31 -.word 158183545 // zeta^200 * 2^31 = 4666088^200 * 2^31 = 8496627 * 2^31 -.word 1740418311 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4666088^200 * 
1521161857 * 2^31 -.word 145783083 // zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 1727020885 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 77071457 // zeta^232 * 2^31 = 4666088^232 * 2^31 = 59206896 * 2^31 -.word 2553967903 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4666088^232 * 1521161857 * 2^31 -.word 46358641 // zeta^152 * 2^31 = 4666088^152 * 2^31 = 49737683 * 2^31 -.word 2881990927 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4666088^152 * 1521161857 * 2^31 -.word 170508233 // zeta^216 * 2^31 = 4666088^216 * 2^31 = 54461578 * 2^31 -.word 2235045303 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4666088^216 * 1521161857 * 2^31 -.word 175094951 // zeta^184 * 2^31 = 4666088^184 * 2^31 = 120320932 * 2^31 -.word 1777243609 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4666088^184 * 1521161857 * 2^31 -.word 239420823 // zeta^248 * 2^31 = 4666088^248 * 2^31 = 48297088 * 2^31 -.word 1228928233 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4666088^248 * 1521161857 * 2^31 -.word 173130311 // zeta^132 * 2^31 = 4666088^132 * 2^31 = 95262506 * 2^31 -.word 179175481 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4666088^132 * 1521161857 * 2^31 -.word 201525973 // zeta^196 * 2^31 = 4666088^196 * 2^31 = 124253849 * 2^31 -.word 2437037227 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4666088^196 * 1521161857 * 2^31 -.word 93708221 // zeta^164 * 2^31 = 4666088^164 * 2^31 = 87311283 * 2^31 -.word 1761674179 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4666088^164 * 1521161857 * 2^31 -.word 143857337 // zeta^228 * 2^31 = 4666088^228 * 2^31 = 26949684 * 2^31 -.word 3529156295 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4666088^228 * 1521161857 * 2^31 -.word 113331787 // zeta^148 * 2^31 = 4666088^148 * 2^31 = 89353629 * 2^31 -.word 1371806261 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4666088^148 * 1521161857 * 2^31 -.word 249833183 // zeta^212 * 2^31 = 4666088^212 * 2^31 = 11305132 * 2^31 -.word 2260484001 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4666088^212 
* 1521161857 * 2^31 -.word 132561079 // zeta^180 * 2^31 = 4666088^180 * 2^31 = 125711116 * 2^31 -.word 2023764425 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4666088^180 * 1521161857 * 2^31 -.word 247449633 // zeta^244 * 2^31 = 4666088^244 * 2^31 = 79521500 * 2^31 -.word 3458093407 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4666088^244 * 1521161857 * 2^31 -.word 248745819 // zeta^140 * 2^31 = 4666088^140 * 2^31 = 49453979 * 2^31 -.word 2571759909 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4666088^140 * 1521161857 * 2^31 -.word 116412395 // zeta^204 * 2^31 = 4666088^204 * 2^31 = 34579188 * 2^31 -.word 3658516629 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4666088^204 * 1521161857 * 2^31 -.word 57764205 // zeta^172 * 2^31 = 4666088^172 * 2^31 = 66293005 * 2^31 -.word 3338371603 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4666088^172 * 1521161857 * 2^31 -.word 165308653 // zeta^236 * 2^31 = 4666088^236 * 2^31 = 120146493 * 2^31 -.word 3684549267 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4666088^236 * 1521161857 * 2^31 -.word 154293415 // zeta^156 * 2^31 = 4666088^156 * 2^31 = 24406074 * 2^31 -.word 3369074137 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4666088^156 * 1521161857 * 2^31 -.word 105375727 // zeta^220 * 2^31 = 4666088^220 * 2^31 = 47950259 * 2^31 -.word 3039860369 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4666088^220 * 1521161857 * 2^31 -.word 136453541 // zeta^188 * 2^31 = 4666088^188 * 2^31 = 37950070 * 2^31 -.word 688644571 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4666088^188 * 1521161857 * 2^31 -.word 207758247 // zeta^252 * 2^31 = 4666088^252 * 2^31 = 95798831 * 2^31 -.word 3438194393 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4666088^252 * 1521161857 * 2^31 -.word 26918567 // zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 3575026649 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 166919205 // zeta^320 * 2^31 = 4666088^320 * 2^31 = 2223848 * 2^31 -.word 2888614747 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 
4666088^320 * 1521161857 * 2^31 -.word 187453137 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1883984559 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 161049193 // zeta^352 * 2^31 = 4666088^352 * 2^31 = 21650690 * 2^31 -.word 4234763031 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4666088^352 * 1521161857 * 2^31 -.word 40727683 // zeta^272 * 2^31 = 4666088^272 * 2^31 = 58374830 * 2^31 -.word 2102659069 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4666088^272 * 1521161857 * 2^31 -.word 50403101 // zeta^336 * 2^31 = 4666088^336 * 2^31 = 13510762 * 2^31 -.word 3256343651 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4666088^336 * 1521161857 * 2^31 -.word 156822451 // zeta^304 * 2^31 = 4666088^304 * 2^31 = 45391772 * 2^31 -.word 1197398221 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4666088^304 * 1521161857 * 2^31 -.word 177425797 // zeta^368 * 2^31 = 4666088^368 * 2^31 = 83604053 * 2^31 -.word 2034598395 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4666088^368 * 1521161857 * 2^31 -.word 94898065 // zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 2916268527 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 65634457 // zeta^328 * 2^31 = 4666088^328 * 2^31 = 72525646 * 2^31 -.word 1175850215 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4666088^328 * 1521161857 * 2^31 -.word 60208311 // zeta^296 * 2^31 = 4666088^296 * 2^31 = 13888621 * 2^31 -.word 826947017 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4666088^296 * 1521161857 * 2^31 -.word 112056791 // zeta^360 * 2^31 = 4666088^360 * 2^31 = 83601662 * 2^31 -.word 2567946409 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4666088^360 * 1521161857 * 2^31 -.word 253069529 // zeta^280 * 2^31 = 4666088^280 * 2^31 = 4723895 * 2^31 -.word 3648021671 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4666088^280 * 1521161857 * 2^31 -.word 211481233 // zeta^344 * 2^31 = 4666088^344 * 2^31 = 79182254 * 2^31 -.word 1412976367 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 
4666088^344 * 1521161857 * 2^31 -.word 193245809 // zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 3746651919 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 82744923 // zeta^376 * 2^31 = 4666088^376 * 2^31 = 8599005 * 2^31 -.word 2517723685 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4666088^376 * 1521161857 * 2^31 -.word 157315599 // zeta^260 * 2^31 = 4666088^260 * 2^31 = 28991343 * 2^31 -.word 2257861745 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4666088^260 * 1521161857 * 2^31 -.word 84709563 // zeta^324 * 2^31 = 4666088^324 * 2^31 = 33657431 * 2^31 -.word 4115791813 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4666088^324 * 1521161857 * 2^31 -.word 179069053 // zeta^292 * 2^31 = 4666088^292 * 2^31 = 68558338 * 2^31 -.word 1767482115 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4666088^292 * 1521161857 * 2^31 -.word 164131653 // zeta^356 * 2^31 = 4666088^356 * 2^31 = 41608654 * 2^31 -.word 2533293115 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4666088^356 * 1521161857 * 2^31 -.word 7581459 // zeta^276 * 2^31 = 4666088^276 * 2^31 = 50871440 * 2^31 -.word 888677741 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4666088^276 * 1521161857 * 2^31 -.word 144508087 // zeta^340 * 2^31 = 4666088^340 * 2^31 = 39566308 * 2^31 -.word 2923161033 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4666088^340 * 1521161857 * 2^31 -.word 243808491 // zeta^308 * 2^31 = 4666088^308 * 2^31 = 82730321 * 2^31 -.word 1434328981 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4666088^308 * 1521161857 * 2^31 -.word 125278795 // zeta^372 * 2^31 = 4666088^372 * 2^31 = 3208821 * 2^31 -.word 2271202869 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4666088^372 * 1521161857 * 2^31 -.word 254426387 // zeta^268 * 2^31 = 4666088^268 * 2^31 = 114045146 * 2^31 -.word 1086756717 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4666088^268 * 1521161857 * 2^31 -.word 9094055 // zeta^332 * 2^31 = 4666088^332 * 2^31 = 79465958 * 2^31 -.word 1723207385 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 
4666088^332 * 1521161857 * 2^31 -.word 236464385 // zeta^300 * 2^31 = 4666088^300 * 2^31 = 53853488 * 2^31 -.word 346177663 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4666088^300 * 1521161857 * 2^31 -.word 200075669 // zeta^364 * 2^31 = 4666088^364 * 2^31 = 62626932 * 2^31 -.word 956595691 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4666088^364 * 1521161857 * 2^31 -.word 80002249 // zeta^284 * 2^31 = 4666088^284 * 2^31 = 23544185 * 2^31 -.word 3965753527 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4666088^284 * 1521161857 * 2^31 -.word 103546459 // zeta^348 * 2^31 = 4666088^348 * 2^31 = 104513863 * 2^31 -.word 925893157 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4666088^348 * 1521161857 * 2^31 -.word 200224643 // zeta^316 * 2^31 = 4666088^316 * 2^31 = 57848761 * 2^31 -.word 2749549821 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4666088^316 * 1521161857 * 2^31 -.word 121386333 // zeta^380 * 2^31 = 4666088^380 * 2^31 = 90969867 * 2^31 -.word 3606322723 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4666088^380 * 1521161857 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_128919937_4666088_incomplete_good_oop_scale -ntt_384_u32_128919937_4666088_incomplete_good_oop_scale: // Constants for scaling by 1/N -.word 11080701 // 1/96 -.word 3608555395 // 1/96 twisted -.data -roots: -.word 2223847 /// zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 126696089 /// zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 
-.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 115409175 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 38212281 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 115409175 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 81022273 // zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 1349628385 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 45318275 // zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 754889095 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 38212281 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 74458359 // zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 1240289998 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 56896093 // zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 947746580 // zeta^312 * f(q^(-1) mod 2^32) * 
2^31 = 4666088^312 * 1521161857 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // XX: zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 115409175 // XX: zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 38212281 // XX: zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 81022273 // XX: zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 1349628385 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 45318275 // XX: zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 754889095 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 74458359 // XX: zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 1240289998 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 56896093 // XX: zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 947746580 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 95262506 // XX: zeta^132 * 2^31 = 4666088^132 * 2^31 = 95262506 * 2^31 -.word 1586835044 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4666088^132 * 1521161857 * 2^31 -.word 101970253 // XX: zeta^ 36 * 2^31 = 4666088^ 36 * 2^31 = 101970253 * 2^31 -.word 1698569329 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 36 * 1521161857 * 2^31 -.word 50871440 // XX: zeta^276 * 2^31 = 4666088^276 * 2^31 = 50871440 * 2^31 -.word 847390932 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4666088^276 * 1521161857 * 2^31 -.word 125711116 // XX: zeta^180 * 2^31 = 4666088^180 * 2^31 = 125711116 * 2^31 -.word 2094032717 /// 
zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4666088^180 * 1521161857 * 2^31 -.word 94340749 // XX: zeta^ 12 * 2^31 = 4666088^ 12 * 2^31 = 94340749 * 2^31 -.word 1571480878 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 12 * 1521161857 * 2^31 -.word 53853488 // XX: zeta^300 * 2^31 = 4666088^300 * 2^31 = 53853488 * 2^31 -.word 897064392 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4666088^300 * 1521161857 * 2^31 -.word 24406074 // XX: zeta^156 * 2^31 = 4666088^156 * 2^31 = 24406074 * 2^31 -.word 406544139 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4666088^156 * 1521161857 * 2^31 -.word 33121106 // XX: zeta^ 60 * 2^31 = 4666088^ 60 * 2^31 = 33121106 * 2^31 -.word 551714771 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 60 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_128919937_4666088_incomplete_good_oop, %function -.global ntt_384_u32_128919937_4666088_incomplete_good_oop -ntt_384_u32_128919937_4666088_incomplete_good_oop: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -128919937 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r7 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r6 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r9 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q4, 
Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r6 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmla.s32 Q2, Q3, r9 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r11,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, 
[r1,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r11,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r11,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r1,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[40] 
from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r11,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r1,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 
* 52)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r11,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r1,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// 
input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r11,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r1,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// 
input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r11,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r1,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 
Q6, [r1,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r11,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 96)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 96)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r11,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r1,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r11,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 108)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r11,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r1,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r11,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 116)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 120)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 120)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r11,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r1,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r1,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r10,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r11,#(0)]
-//////////// END OF RADIX 3 //////////////////////////
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-// output[96]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r3
-// output[192]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release output[96] from Q1
-// output[0]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// output[228]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release output[192] from Q4
-vqrdmulh.s32 Q2, Q2, r2
-vsub.s32 Q4, Q1, Q0
-// output[36]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 36)]
-vmla.s32 Q3, Q2, r9
-vstrw.u32 Q4, [r11,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release output[288] from Q0
-vstrw.u32 Q1, [r1,#(0)]
-// Release output[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r1,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r11,#(-240)]
-// output[36]: Already loaded as Q7
-// output[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r3
-// output[324]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release output[228] from Q6
-// output[132]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// output[360]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release output[324] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[168]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -84)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release output[36] from Q7
-vstrw.u32 Q3, [r11,#(-480)]
-// Release output[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(288)]
-// output[168]: Already loaded as Q6
-// output[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[72]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release output[360] from Q5
-// output[264]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[108]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release output[72] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[300]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release output[168] from Q6
-vstrw.u32 Q3, [r11,#(48)]
-// Release output[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(288)]
-// output[300]: Already loaded as Q7
-// output[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[204]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release output[108] from Q5
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[240]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release output[204] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[48]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release output[300] from Q7
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-192)]
-// output[48]: Already loaded as Q6
-// output[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[336]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release output[240] from Q5
-// output[144]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// output[372]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[336] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[180]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -72)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release output[48] from Q6
-vstrw.u32 Q3, [r11,#(-432)]
-// Release output[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(336)]
-// output[180]: Already loaded as Q7
-// output[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[84]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release output[372] from Q5
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// output[120]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[84] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[312]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release output[180] from Q7
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(336)]
-// output[312]: Already loaded as Q6
-// output[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[216]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release output[120] from Q5
-// output[24]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// output[252]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release output[216] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[60]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release output[312] from Q6
-vstrw.u32 Q3, [r1,#(96)]
-// Release output[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-144)]
-// output[60]: Already loaded as Q7
-// output[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release output[252] from Q5
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// output[352]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[348] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[160]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -92)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release output[60] from Q7
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(384)]
-// output[160]: Already loaded as Q6
-// output[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release output[352] from Q5
-// output[256]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[100]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[64] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[292]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release output[160] from Q6
-vstrw.u32 Q3, [r11,#(16)]
-// Release output[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(256)]
-// output[292]: Already loaded as Q7
-// output[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[196]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release output[100] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[232]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release output[196] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[40]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release output[292] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// output[40]: Already loaded as Q6
-// output[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release output[232] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[364]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[328] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[172]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -80)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release output[40] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(304)]
-// output[172]: Already loaded as Q7
-// output[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release output[364] from Q5
-// output[268]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[76] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[304]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release output[172] from Q7
-vstrw.u32 Q3, [r11,#(64)]
-// Release output[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(304)]
-// output[304]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[208]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[244]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release output[208] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[52]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release output[304] from Q6
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-176)]
-// output[52]: Already loaded as Q7
-// output[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[340]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[244] from Q5
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// output[376]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[340] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[184]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -68)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release output[52] from Q7
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(352)]
-// output[184]: Already loaded as Q6
-// output[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[88]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release output[376] from Q5
-// output[280]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[88] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[316]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release output[184] from Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release output[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(352)]
-// output[316]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[220]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[28]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[224]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release output[220] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[32]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 32)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[316] from Q7
-vstrw.u32 Q3, [r1,#(112)]
-// Release output[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-128)]
-// output[32]: Already loaded as Q6
-// output[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release output[224] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[356]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[320] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[164]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -88)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release output[32] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(272)]
-// output[164]: Already loaded as Q7
-// output[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release output[356] from Q5
-// output[260]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[104]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[68] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[296]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release output[164] from Q7
-vstrw.u32 Q3, [r11,#(32)]
-// Release output[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(272)]
-// output[296]: Already loaded as Q6
-// output[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[200]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release output[104] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[236]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release output[200] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[44]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release output[296] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-208)]
-// output[44]: Already loaded as Q7
-// output[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[332]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[236] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[368]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[332] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[176]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -76)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release output[44] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(320)]
-// output[176]: Already loaded as Q6
-// output[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release output[368] from Q5
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[80] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[308]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release output[176] from Q6
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(320)]
-// output[308]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[212]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[248]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release output[212] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[56]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release output[308] from Q7
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-160)]
-// output[56]: Already loaded as Q6
-// output[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[248] from Q5
-// output[152]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// output[380]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[344] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[188]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release output[56] from Q6
-vstrw.u32 Q3, [r11,#(-400)]
-// Release output[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(368)]
-// output[188]: Already loaded as Q7
-// output[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[92]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release output[380] from Q5
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// output[24]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release output[92] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[264]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[188] from Q7
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r10,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(368)]
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r7
-// output[144]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r6
-// output[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r9
-vmul.u32 Q2, Q1, r7
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r6
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r9
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r3
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r9
-// output[156]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -96)]
-vmul.u32 Q4, Q6, r5
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r4
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(96)]
-// Release output[24] from Q5
-vmla.s32 Q4, Q6, r9
-vstrw.u32 Q1, [r11,#(-432)]
-// Release output[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(48)]
-// Release output[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[12]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 12)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[132]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[280]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-384)]
-// Release output[156] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(48)]
-// Release output[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[136]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-480)]
-// Release output[132] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[256]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[28]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(112)]
-// Release output[280] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release output[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(16)]
-// Release output[256] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[4]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 4)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[152]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(112)]
-// Release output[28] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[8]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 8)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(16)]
-// Release output[4] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[128]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[284]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-400)]
-// Release output[152] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(32)]
-// Release output[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[140]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-496)]
-// Release output[128] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[260]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(128)]
-// Release output[284] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release output[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 36)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[184]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(240)]
-// Release output[60] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release output[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release output[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[304]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[40]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 40)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(144)]
-// Release output[36] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[316]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release output[184] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release output[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(160)]
-// Release output[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-368)]
-// Release output[160] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[292]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 40)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[56]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 56)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(256)]
-// Release output[316] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[176]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[296]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(160)]
-// Release output[292] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[32]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 32)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(224)]
-// Release output[56] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release output[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release output[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[308]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[44]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 44)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(128)]
-// Release output[32] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release output[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(176)]
-// Release output[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[336]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[72]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 72)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-352)]
-// Release output[164] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[192]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release output[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(288)]
-// Release output[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[84]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[204]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-240)]
-// Release output[192] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[324]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, 
Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release 
output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 
-vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 
Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] 
from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// 
output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 
-vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, 
#(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: 
Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 
-vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 
88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, 
[r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// 
Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release 
output[368] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(0)]
-// Release output[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[124]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[376]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 124)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[380]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -124)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(480)]
-// Release output[120] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(496)]
-// Release output[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[248]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q3, Q3, r6
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(496)]
-// Release output[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r10,#(-496)]
-// Release output[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-16)]
-// Release output[248] from Q1
-.equ modulus_inv, 2773805439
-movw r14, #:lower16:modulus_inv
-movt r14, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3355
-// Instruction count: 2397
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input.s b/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input.s
deleted file mode 100644
index c821bdd..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,3075 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input_twiddles
-ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input_twiddles: // For base multiplication
-.word 11080701 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31
-.word 3608555395 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31
-.word 230921307 // zeta^ 64 * 2^31 = 4666088^ 64 * 2^31 = 126696090 * 2^31
-.word 719940645 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 64 * 1521161857 * 2^31
-.word 102515993 // zeta^ 32 * 2^31 = 4666088^ 32 * 2^31 = 35786897 * 2^31
-.word 2350778471 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 32 * 1521161857 * 2^31
-.word 70386737 // zeta^ 96 * 2^31 = 4666088^ 96 * 2^31 = 14136207 * 2^31
-.word 2410982735 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 96 * 1521161857 * 2^31
-.word 138595355 // zeta^ 16 * 2^31 = 4666088^ 16 * 2^31 = 84055869 * 2^31
-.word 1153684581 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 16 * 1521161857 * 2^31
-.word 217112191 // zeta^ 80 * 2^31 = 4666088^ 80 * 2^31 = 70545107 * 2^31
-.word 2192308225 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 80 * 1521161857 * 2^31
-.word 149523283 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31
-.word 837200173 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31
-.word 101017423 // zeta^112 * 2^31 = 4666088^112 * 2^31 = 83528165 * 2^31
-.word 3097569073 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4666088^112 * 1521161857 * 2^31
-.word 99656329 // zeta^ 8 * 2^31 = 4666088^ 8 * 2^31 = 120423310 * 2^31
-.word 2554548983 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 8 * 1521161857 * 2^31
-.word 162941809 // zeta^ 72 * 2^31 = 4666088^ 72 * 2^31 = 47897664 * 2^31
-.word 1378698767 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 72 * 1521161857 * 2^31
-.word 180768417 // zeta^ 40 * 2^31 = 4666088^ 40 * 2^31 = 69713041 * 2^31
-.word 1740999391 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 40 * 1521161857 * 2^31
-.word 197631563 // zeta^104 * 2^31 = 4666088^104 * 2^31 = 115031316 * 2^31
-.word 3468020277 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4666088^104 * 1521161857 * 2^31
-.word 87331641 // zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31
-.word 2059921991 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31
-.word 4770345 // zeta^ 88 * 2^31 = 4666088^ 88 * 2^31 = 124196042 * 2^31
-.word 646945623 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 88 * 1521161857 * 2^31
-.word 18419051 // zeta^ 56 * 2^31 = 4666088^ 56 * 2^31 = 80622849 * 2^31
-.word 3066039061 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 56 * 1521161857 * 2^31
-.word 64594065 // zeta^120 * 2^31 = 4666088^120 * 2^31 = 72023844 * 2^31
-.word 548315375 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4666088^120 * 1521161857 * 2^31
-.word 56313901 // zeta^ 4 * 2^31 = 4666088^ 4 * 2^31 = 4666088 * 2^31
-.word 1857930067 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 4 * 1521161857 * 2^31
-.word 100524275 // zeta^ 68 * 2^31 = 4666088^ 68 * 2^31 = 99928594 * 2^31
-.word 2037105549 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 68 * 1521161857 * 2^31
-.word 113982537 // zeta^ 36 *
2^31 = 4666088^ 36 * 2^31 = 101970253 * 2^31 -.word 765810999 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 36 * 1521161857 * 2^31 -.word 78770821 // zeta^100 * 2^31 = 4666088^100 * 2^31 = 60361599 * 2^31 -.word 2527485179 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4666088^100 * 1521161857 * 2^31 -.word 8006691 // zeta^ 20 * 2^31 = 4666088^ 20 * 2^31 = 117614805 * 2^31 -.word 2034483293 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 20 * 1521161857 * 2^31 -.word 250258415 // zeta^ 84 * 2^31 = 4666088^ 84 * 2^31 = 78048497 * 2^31 -.word 3406289553 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 84 * 1521161857 * 2^31 -.word 10390241 // zeta^ 52 * 2^31 = 4666088^ 52 * 2^31 = 49398437 * 2^31 -.word 836873887 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 52 * 1521161857 * 2^31 -.word 14031383 // zeta^116 * 2^31 = 4666088^116 * 2^31 = 46189616 * 2^31 -.word 2860638313 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4666088^116 * 1521161857 * 2^31 -.word 141427479 // zeta^ 12 * 2^31 = 4666088^ 12 * 2^31 = 94340749 * 2^31 -.word 636450665 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 12 * 1521161857 * 2^31 -.word 3413487 // zeta^ 76 * 2^31 = 4666088^ 76 * 2^31 = 14874791 * 2^31 -.word 3208210577 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 76 * 1521161857 * 2^31 -.word 92531221 // zeta^ 44 * 2^31 = 4666088^ 44 * 2^31 = 8773444 * 2^31 -.word 610418027 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 44 * 1521161857 * 2^31 -.word 21375489 // zeta^108 * 2^31 = 4666088^108 * 2^31 = 75066449 * 2^31 -.word 3948789631 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4666088^108 * 1521161857 * 2^31 -.word 152464147 // zeta^ 28 * 2^31 = 4666088^ 28 * 2^31 = 80969678 * 2^31 -.word 1255106925 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 28 * 1521161857 * 2^31 -.word 177837625 // zeta^ 92 * 2^31 = 4666088^ 92 * 2^31 = 105375752 * 2^31 -.word 329213767 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 92 * 1521161857 * 2^31 -.word 50081627 // zeta^ 60 * 2^31 = 
4666088^ 60 * 2^31 = 33121106 * 2^31 -.word 856772901 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 60 * 1521161857 * 2^31 -.word 57615231 // zeta^124 * 2^31 = 4666088^124 * 2^31 = 71071176 * 2^31 -.word 1545417473 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4666088^124 * 1521161857 * 2^31 -.word 90920669 // zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 1406352547 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 246759173 // zeta^192 * 2^31 = 4666088^192 * 2^31 = 128919936 * 2^31 -.word 686411899 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4666088^192 * 1521161857 * 2^31 -.word 96790681 // zeta^160 * 2^31 = 4666088^160 * 2^31 = 107269247 * 2^31 -.word 60204263 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4666088^160 * 1521161857 * 2^31 -.word 155323881 // zeta^224 * 2^31 = 4666088^224 * 2^31 = 93133040 * 2^31 -.word 1944188823 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4666088^224 * 1521161857 * 2^31 -.word 207436773 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1038623643 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 119244519 // zeta^208 * 2^31 = 4666088^208 * 2^31 = 44864068 * 2^31 -.word 3141282713 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4666088^208 * 1521161857 * 2^31 -.word 80414077 // zeta^176 * 2^31 = 4666088^176 * 2^31 = 45315884 * 2^31 -.word 2260368899 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4666088^176 * 1521161857 * 2^31 -.word 108316591 // zeta^240 * 2^31 = 4666088^240 * 2^31 = 90707656 * 2^31 -.word 3457767121 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4666088^240 * 1521161857 * 2^31 -.word 192205417 // zeta^136 * 2^31 = 4666088^136 * 2^31 = 56394291 * 2^31 -.word 3119117079 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4666088^136 * 1521161857 * 2^31 -.word 158183545 // zeta^200 * 2^31 = 4666088^200 * 2^31 = 8496627 * 2^31 -.word 1740418311 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4666088^200 * 1521161857 * 2^31 -.word 145783083 // zeta^168 * 2^31 = 
4666088^168 * 2^31 = 45318275 * 2^31 -.word 1727020885 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 77071457 // zeta^232 * 2^31 = 4666088^232 * 2^31 = 59206896 * 2^31 -.word 2553967903 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4666088^232 * 1521161857 * 2^31 -.word 46358641 // zeta^152 * 2^31 = 4666088^152 * 2^31 = 49737683 * 2^31 -.word 2881990927 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4666088^152 * 1521161857 * 2^31 -.word 170508233 // zeta^216 * 2^31 = 4666088^216 * 2^31 = 54461578 * 2^31 -.word 2235045303 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4666088^216 * 1521161857 * 2^31 -.word 175094951 // zeta^184 * 2^31 = 4666088^184 * 2^31 = 120320932 * 2^31 -.word 1777243609 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4666088^184 * 1521161857 * 2^31 -.word 239420823 // zeta^248 * 2^31 = 4666088^248 * 2^31 = 48297088 * 2^31 -.word 1228928233 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4666088^248 * 1521161857 * 2^31 -.word 173130311 // zeta^132 * 2^31 = 4666088^132 * 2^31 = 95262506 * 2^31 -.word 179175481 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4666088^132 * 1521161857 * 2^31 -.word 201525973 // zeta^196 * 2^31 = 4666088^196 * 2^31 = 124253849 * 2^31 -.word 2437037227 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4666088^196 * 1521161857 * 2^31 -.word 93708221 // zeta^164 * 2^31 = 4666088^164 * 2^31 = 87311283 * 2^31 -.word 1761674179 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4666088^164 * 1521161857 * 2^31 -.word 143857337 // zeta^228 * 2^31 = 4666088^228 * 2^31 = 26949684 * 2^31 -.word 3529156295 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4666088^228 * 1521161857 * 2^31 -.word 113331787 // zeta^148 * 2^31 = 4666088^148 * 2^31 = 89353629 * 2^31 -.word 1371806261 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4666088^148 * 1521161857 * 2^31 -.word 249833183 // zeta^212 * 2^31 = 4666088^212 * 2^31 = 11305132 * 2^31 -.word 2260484001 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4666088^212 * 1521161857 * 2^31 -.word 132561079 // zeta^180 * 2^31 
= 4666088^180 * 2^31 = 125711116 * 2^31 -.word 2023764425 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4666088^180 * 1521161857 * 2^31 -.word 247449633 // zeta^244 * 2^31 = 4666088^244 * 2^31 = 79521500 * 2^31 -.word 3458093407 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4666088^244 * 1521161857 * 2^31 -.word 248745819 // zeta^140 * 2^31 = 4666088^140 * 2^31 = 49453979 * 2^31 -.word 2571759909 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4666088^140 * 1521161857 * 2^31 -.word 116412395 // zeta^204 * 2^31 = 4666088^204 * 2^31 = 34579188 * 2^31 -.word 3658516629 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4666088^204 * 1521161857 * 2^31 -.word 57764205 // zeta^172 * 2^31 = 4666088^172 * 2^31 = 66293005 * 2^31 -.word 3338371603 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4666088^172 * 1521161857 * 2^31 -.word 165308653 // zeta^236 * 2^31 = 4666088^236 * 2^31 = 120146493 * 2^31 -.word 3684549267 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4666088^236 * 1521161857 * 2^31 -.word 154293415 // zeta^156 * 2^31 = 4666088^156 * 2^31 = 24406074 * 2^31 -.word 3369074137 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4666088^156 * 1521161857 * 2^31 -.word 105375727 // zeta^220 * 2^31 = 4666088^220 * 2^31 = 47950259 * 2^31 -.word 3039860369 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4666088^220 * 1521161857 * 2^31 -.word 136453541 // zeta^188 * 2^31 = 4666088^188 * 2^31 = 37950070 * 2^31 -.word 688644571 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4666088^188 * 1521161857 * 2^31 -.word 207758247 // zeta^252 * 2^31 = 4666088^252 * 2^31 = 95798831 * 2^31 -.word 3438194393 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4666088^252 * 1521161857 * 2^31 -.word 26918567 // zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 3575026649 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 166919205 // zeta^320 * 2^31 = 4666088^320 * 2^31 = 2223848 * 2^31 -.word 2888614747 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4666088^320 * 1521161857 * 2^31 -.word 187453137 // zeta^288 * 
2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1883984559 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 161049193 // zeta^352 * 2^31 = 4666088^352 * 2^31 = 21650690 * 2^31 -.word 4234763031 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4666088^352 * 1521161857 * 2^31 -.word 40727683 // zeta^272 * 2^31 = 4666088^272 * 2^31 = 58374830 * 2^31 -.word 2102659069 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4666088^272 * 1521161857 * 2^31 -.word 50403101 // zeta^336 * 2^31 = 4666088^336 * 2^31 = 13510762 * 2^31 -.word 3256343651 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4666088^336 * 1521161857 * 2^31 -.word 156822451 // zeta^304 * 2^31 = 4666088^304 * 2^31 = 45391772 * 2^31 -.word 1197398221 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4666088^304 * 1521161857 * 2^31 -.word 177425797 // zeta^368 * 2^31 = 4666088^368 * 2^31 = 83604053 * 2^31 -.word 2034598395 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4666088^368 * 1521161857 * 2^31 -.word 94898065 // zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 2916268527 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 65634457 // zeta^328 * 2^31 = 4666088^328 * 2^31 = 72525646 * 2^31 -.word 1175850215 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4666088^328 * 1521161857 * 2^31 -.word 60208311 // zeta^296 * 2^31 = 4666088^296 * 2^31 = 13888621 * 2^31 -.word 826947017 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4666088^296 * 1521161857 * 2^31 -.word 112056791 // zeta^360 * 2^31 = 4666088^360 * 2^31 = 83601662 * 2^31 -.word 2567946409 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4666088^360 * 1521161857 * 2^31 -.word 253069529 // zeta^280 * 2^31 = 4666088^280 * 2^31 = 4723895 * 2^31 -.word 3648021671 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4666088^280 * 1521161857 * 2^31 -.word 211481233 // zeta^344 * 2^31 = 4666088^344 * 2^31 = 79182254 * 2^31 -.word 1412976367 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4666088^344 * 1521161857 * 2^31 -.word 193245809 // zeta^312 * 
2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 3746651919 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 82744923 // zeta^376 * 2^31 = 4666088^376 * 2^31 = 8599005 * 2^31 -.word 2517723685 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4666088^376 * 1521161857 * 2^31 -.word 157315599 // zeta^260 * 2^31 = 4666088^260 * 2^31 = 28991343 * 2^31 -.word 2257861745 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4666088^260 * 1521161857 * 2^31 -.word 84709563 // zeta^324 * 2^31 = 4666088^324 * 2^31 = 33657431 * 2^31 -.word 4115791813 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4666088^324 * 1521161857 * 2^31 -.word 179069053 // zeta^292 * 2^31 = 4666088^292 * 2^31 = 68558338 * 2^31 -.word 1767482115 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4666088^292 * 1521161857 * 2^31 -.word 164131653 // zeta^356 * 2^31 = 4666088^356 * 2^31 = 41608654 * 2^31 -.word 2533293115 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4666088^356 * 1521161857 * 2^31 -.word 7581459 // zeta^276 * 2^31 = 4666088^276 * 2^31 = 50871440 * 2^31 -.word 888677741 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4666088^276 * 1521161857 * 2^31 -.word 144508087 // zeta^340 * 2^31 = 4666088^340 * 2^31 = 39566308 * 2^31 -.word 2923161033 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4666088^340 * 1521161857 * 2^31 -.word 243808491 // zeta^308 * 2^31 = 4666088^308 * 2^31 = 82730321 * 2^31 -.word 1434328981 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4666088^308 * 1521161857 * 2^31 -.word 125278795 // zeta^372 * 2^31 = 4666088^372 * 2^31 = 3208821 * 2^31 -.word 2271202869 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4666088^372 * 1521161857 * 2^31 -.word 254426387 // zeta^268 * 2^31 = 4666088^268 * 2^31 = 114045146 * 2^31 -.word 1086756717 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4666088^268 * 1521161857 * 2^31 -.word 9094055 // zeta^332 * 2^31 = 4666088^332 * 2^31 = 79465958 * 2^31 -.word 1723207385 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4666088^332 * 1521161857 * 2^31 -.word 236464385 // zeta^300 * 
2^31 = 4666088^300 * 2^31 = 53853488 * 2^31 -.word 346177663 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4666088^300 * 1521161857 * 2^31 -.word 200075669 // zeta^364 * 2^31 = 4666088^364 * 2^31 = 62626932 * 2^31 -.word 956595691 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4666088^364 * 1521161857 * 2^31 -.word 80002249 // zeta^284 * 2^31 = 4666088^284 * 2^31 = 23544185 * 2^31 -.word 3965753527 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4666088^284 * 1521161857 * 2^31 -.word 103546459 // zeta^348 * 2^31 = 4666088^348 * 2^31 = 104513863 * 2^31 -.word 925893157 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4666088^348 * 1521161857 * 2^31 -.word 200224643 // zeta^316 * 2^31 = 4666088^316 * 2^31 = 57848761 * 2^31 -.word 2749549821 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4666088^316 * 1521161857 * 2^31 -.word 121386333 // zeta^380 * 2^31 = 4666088^380 * 2^31 = 90969867 * 2^31 -.word 3606322723 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4666088^380 * 1521161857 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input_scale -ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 11080701 // 1/96 -.word 3608555395 // 1/96 twisted -.data -roots: -.word 2223847 /// zeta^256 * 2^31 = 4666088^256 * 2^31 = 2223847 * 2^31 -.word 37043728 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4666088^256 * 1521161857 * 2^31 -.word 126696089 /// zeta^128 * 2^31 = 4666088^128 * 2^31 = 126696089 * 2^31 -.word 2110439903 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4666088^128 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 
2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 114783730 // zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 115409175 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 38212281 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 115409175 // zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 81022273 // zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 1349628385 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 45318275 // zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 754889095 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 38212281 // zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 74458359 // zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 1240289998 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 56896093 // zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 947746580 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 1 
// XX: zeta^ 0 * 2^31 = 4666088^ 0 * 2^31 = 1 * 2^31 -.word 17 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 0 * 1521161857 * 2^31 -.word 114783730 // XX: zeta^288 * 2^31 = 4666088^288 * 2^31 = 114783730 * 2^31 -.word 1912009802 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4666088^288 * 1521161857 * 2^31 -.word 115409175 // XX: zeta^144 * 2^31 = 4666088^144 * 2^31 = 115409175 * 2^31 -.word 1922428151 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4666088^144 * 1521161857 * 2^31 -.word 38212281 // XX: zeta^ 48 * 2^31 = 4666088^ 48 * 2^31 = 38212281 * 2^31 -.word 636521011 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 48 * 1521161857 * 2^31 -.word 81022273 // XX: zeta^264 * 2^31 = 4666088^264 * 2^31 = 81022273 * 2^31 -.word 1349628385 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4666088^264 * 1521161857 * 2^31 -.word 45318275 // XX: zeta^168 * 2^31 = 4666088^168 * 2^31 = 45318275 * 2^31 -.word 754889095 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4666088^168 * 1521161857 * 2^31 -.word 74458359 // XX: zeta^ 24 * 2^31 = 4666088^ 24 * 2^31 = 74458359 * 2^31 -.word 1240289998 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 24 * 1521161857 * 2^31 -.word 56896093 // XX: zeta^312 * 2^31 = 4666088^312 * 2^31 = 56896093 * 2^31 -.word 947746580 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4666088^312 * 1521161857 * 2^31 -.word 95262506 // XX: zeta^132 * 2^31 = 4666088^132 * 2^31 = 95262506 * 2^31 -.word 1586835044 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4666088^132 * 1521161857 * 2^31 -.word 101970253 // XX: zeta^ 36 * 2^31 = 4666088^ 36 * 2^31 = 101970253 * 2^31 -.word 1698569329 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 36 * 1521161857 * 2^31 -.word 50871440 // XX: zeta^276 * 2^31 = 4666088^276 * 2^31 = 50871440 * 2^31 -.word 847390932 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4666088^276 * 1521161857 * 2^31 -.word 125711116 // XX: zeta^180 * 2^31 = 4666088^180 * 2^31 = 125711116 * 2^31 -.word 2094032717 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 
4666088^180 * 1521161857 * 2^31 -.word 94340749 // XX: zeta^ 12 * 2^31 = 4666088^ 12 * 2^31 = 94340749 * 2^31 -.word 1571480878 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 12 * 1521161857 * 2^31 -.word 53853488 // XX: zeta^300 * 2^31 = 4666088^300 * 2^31 = 53853488 * 2^31 -.word 897064392 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4666088^300 * 1521161857 * 2^31 -.word 24406074 // XX: zeta^156 * 2^31 = 4666088^156 * 2^31 = 24406074 * 2^31 -.word 406544139 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4666088^156 * 1521161857 * 2^31 -.word 33121106 // XX: zeta^ 60 * 2^31 = 4666088^ 60 * 2^31 = 33121106 * 2^31 -.word 551714771 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4666088^ 60 * 1521161857 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input, %function -.global ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input -ntt_384_u32_128919937_4666088_incomplete_good_oop_half_input: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -128919937 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q0 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r6 -// input[4]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 4)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r9 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r11,#(-496)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(16)] -// Release input[0] from Q1 -// Release input[128] from Q0 -// input[4]: Already loaded as Q7 -// 
input[132]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmulh.s32 Q1, Q7, r6 -// input[8]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 8)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -// Release input[132] from Q6 -// Release input[4] from Q7 -// input[136]: Already loaded as Q4 -// input[8]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[140]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[136] from Q4 -vstrw.u32 Q2, [r11,#(48)] -vsub.s32 Q4, Q1, Q5 -// Release input[8] from Q5 -vstrw.u32 Q4, [r11,#(-464)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(32)] -// input[140]: Already loaded as Q6 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vqrdmulh.s32 Q1, Q6, r6 -// input[16]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-448)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release input[12] from Q3 -// Release input[140] from Q6 -// input[16]: Already loaded as Q7 -// input[144]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmulh.s32 Q1, Q7, r6 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(64)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(80)] -// Release input[144] from Q5 -// Release input[16] from Q7 -// input[148]: Already loaded as Q4 -// input[20]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 
-vmul.u32 Q1, Q0, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[148] from Q4 -vstrw.u32 Q2, [r11,#(96)] -vsub.s32 Q4, Q1, Q6 -// Release input[20] from Q6 -vstrw.u32 Q4, [r11,#(-416)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(80)] -// input[152]: Already loaded as Q5 -// input[24]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[156]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmulh.s32 Q1, Q5, r6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-400)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(112)] -// Release input[24] from Q3 -// Release input[152] from Q5 -// input[28]: Already loaded as Q7 -// input[156]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmulh.s32 Q1, Q7, r6 -// input[32]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 32)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(112)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release input[156] from Q6 -// Release input[28] from Q7 -// input[160]: Already loaded as Q4 -// input[32]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[160] from Q4 -vstrw.u32 Q2, [r11,#(144)] -vsub.s32 Q4, Q1, Q5 -// Release input[32] from Q5 -vstrw.u32 Q4, [r11,#(-368)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(128)] -// input[164]: Already loaded as Q6 -// input[36]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// 
input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vqrdmulh.s32 Q1, Q6, r6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-352)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(160)] -// Release input[36] from Q3 -// Release input[164] from Q6 -// input[40]: Already loaded as Q7 -// input[168]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmulh.s32 Q1, Q7, r6 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 44)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(176)] -// Release input[168] from Q5 -// Release input[40] from Q7 -// input[172]: Already loaded as Q4 -// input[44]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[172] from Q4 -vstrw.u32 Q2, [r11,#(192)] -vsub.s32 Q4, Q1, Q6 -// Release input[44] from Q6 -vstrw.u32 Q4, [r11,#(-320)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(176)] -// input[176]: Already loaded as Q5 -// input[48]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 52)] -vqrdmulh.s32 Q1, Q5, r6 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-304)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(208)] -// Release input[48] from Q3 -// Release input[176] from Q5 -// input[52]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[184]: Load as Q4 -vldrw.u32 Q4, 
[r14, #(4 * 56)] -vqrdmulh.s32 Q1, Q7, r6 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(208)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(224)] -// Release input[180] from Q6 -// Release input[52] from Q7 -// input[184]: Already loaded as Q4 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[188]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[184] from Q4 -vstrw.u32 Q2, [r11,#(240)] -vsub.s32 Q4, Q1, Q5 -// Release input[56] from Q5 -vstrw.u32 Q4, [r11,#(-272)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(224)] -// input[188]: Already loaded as Q6 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -vqrdmulh.s32 Q1, Q6, r6 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-256)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release input[60] from Q3 -// Release input[188] from Q6 -// input[64]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -vneg.s32 Q1, Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q5, r6 -vstrw.u32 Q5, [r11,#(-240)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(256)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(272)] -// Release input[64] from Q5 -// input[68]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -vneg.s32 Q1, Q3 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmulh.s32 Q2, Q3, r6 -vstrw.u32 Q3, [r11,#(288)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(272)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[68] from Q3 -// input[72]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(288)] -vstrw.u32 Q4, [r11,#(304)] -vstrw.u32 Q4, [r11,#(-208)] 
-// Release input[72] from Q4 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-192)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(320)] -// Release input[76] from Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(336)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(320)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[80] from Q4 -// input[84]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(336)] -vstrw.u32 Q3, [r11,#(352)] -vstrw.u32 Q3, [r11,#(-160)] -// Release input[84] from Q3 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-144)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(368)] -// Release input[88] from Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(384)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(368)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[92] from Q4 -// input[96]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(384)] -vstrw.u32 Q3, [r11,#(400)] -vstrw.u32 Q3, [r11,#(-112)] -// Release input[96] from Q3 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-96)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(400)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release input[100] from Q0 -// input[104]: Already loaded as Q4 
-vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(432)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(416)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[104] from Q4 -// input[108]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(432)] -vstrw.u32 Q3, [r11,#(448)] -vstrw.u32 Q3, [r11,#(-64)] -// Release input[108] from Q3 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-48)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(448)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(464)] -// Release input[112] from Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(480)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(464)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[116] from Q4 -// input[120]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(480)] -vstrw.u32 Q3, [r11,#(496)] -vstrw.u32 Q3, [r11,#(-16)] -// Release input[120] from Q3 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(0)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(496)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[124] from Q0 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// 
output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, 
#(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, 
[r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release 
output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// 
output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 
Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, 
Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, 
[r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 
-vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[132]: Load as Q1 
-vldrw.u32 Q1, [r11, #(4 * -120)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[280]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-384)] -// Release output[156] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(48)] -// Release output[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r6 -// output[136]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-480)] -// Release output[132] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[256]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[28]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(112)] -// Release output[280] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release output[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(16)] -// Release output[256] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[4]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 4)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 
-vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[152]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(112)] -// Release output[28] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r6 -// output[8]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release output[4] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[128]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[284]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-400)] -// Release output[152] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(32)] -// Release output[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r6 -// output[140]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-496)] -// Release output[128] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[260]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] 
-vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(128)] -// Release output[284] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release output[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[48]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r6 -// output[168]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(32)] -// Release output[260] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[60]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(192)] -// Release output[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release output[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[180]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r6 -// output[300]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release output[288] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[36]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 36)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[184]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vmul.u32 Q6, Q4, r5 
-vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(240)] -// Release output[60] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release output[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release output[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[304]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r6 -// output[40]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 40)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(144)] -// Release output[36] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[316]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release output[184] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release output[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(160)] -// Release output[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-368)] -// Release output[160] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[292]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[56]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 56)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// 
Release output[316] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[176]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r6 -// output[296]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release output[292] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[32]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 32)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(224)] -// Release output[56] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release output[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(176)] -// Release output[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[308]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r6 -// output[44]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 44)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(128)] -// Release output[32] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release 
output[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(176)] -// Release output[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[336]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r6 -// output[72]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 72)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-352)] -// Release output[164] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[192]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release output[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(288)] -// Release output[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[84]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r6 -// output[204]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-240)] -// Release output[192] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[324]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 
-vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, 
Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: 
Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 
112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 
104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 
-vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from 
Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// 
output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: 
Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 
Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 
-vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 
-44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, 
Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 
-vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, 
[r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 2773805439 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3042 -// Instruction count: 2201 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good.s deleted file mode 100644 index e32f3e8..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_33556993_15047299_incomplete_good_twiddles -ntt_384_u32_33556993_15047299_incomplete_good_twiddles: // For base multiplication -.word 11579973 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 1431437243 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 49092333 // zeta^ 64 * 2^31 = 15047299^ 64 * 2^31 = 8518432 * 2^31 -.word 3982764819 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 64 * 375649793 * 2^31 -.word 42761787 // zeta^ 32 * 2^31 = 15047299^ 32 * 2^31 = 13841461 * 2^31 -.word 425054149 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 32 * 375649793 * 2^31 -.word 34538439 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31 -.word 5947961 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31 -.word 66309139 // zeta^ 16 * 2^31 = 15047299^ 16 * 2^31 = 940305 * 2^31 -.word 681112045 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 16 * 375649793 * 2^31 -.word 28356919 // zeta^ 80 * 2^31 = 15047299^ 80 * 2^31 = 4200632 * 2^31 -.word 4055856329 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 80 * 375649793 * 2^31 -.word 59288659 // zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 3771109805 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 7716537 // zeta^112 * 2^31 = 15047299^112 * 2^31 = 24511972 * 2^31 -.word 851016519 
// zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 15047299^112 * 375649793 * 2^31 -.word 46836875 // zeta^ 8 * 2^31 = 15047299^ 8 * 2^31 = 24111745 * 2^31 -.word 2410070389 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 8 * 375649793 * 2^31 -.word 27581675 // zeta^ 72 * 2^31 = 15047299^ 72 * 2^31 = 26823146 * 2^31 -.word 4046475541 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 72 * 375649793 * 2^31 -.word 9436047 // zeta^ 40 * 2^31 = 15047299^ 40 * 2^31 = 33038085 * 2^31 -.word 292002417 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 40 * 375649793 * 2^31 -.word 17776663 // zeta^104 * 2^31 = 15047299^104 * 2^31 = 12390669 * 2^31 -.word 2490738153 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 15047299^104 * 375649793 * 2^31 -.word 11879829 // zeta^ 24 * 2^31 = 15047299^ 24 * 2^31 = 14745691 * 2^31 -.word 399412331 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 24 * 375649793 * 2^31 -.word 60844951 // zeta^ 88 * 2^31 = 15047299^ 88 * 2^31 = 32562828 * 2^31 -.word 1066891881 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 88 * 375649793 * 2^31 -.word 24769191 // zeta^ 56 * 2^31 = 15047299^ 56 * 2^31 = 20448273 * 2^31 -.word 2663682905 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 56 * 375649793 * 2^31 -.word 8635069 // zeta^120 * 2^31 = 15047299^120 * 2^31 = 20044445 * 2^31 -.word 3577978691 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 15047299^120 * 375649793 * 2^31 -.word 16277701 // zeta^ 4 * 2^31 = 15047299^ 4 * 2^31 = 22098973 * 2^31 -.word 4158345531 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 4 * 375649793 * 2^31 -.word 7436455 // zeta^ 68 * 2^31 = 15047299^ 68 * 2^31 = 8970055 * 2^31 -.word 4077456729 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 68 * 375649793 * 2^31 -.word 9212309 // zeta^ 36 * 2^31 = 15047299^ 36 * 2^31 = 14626653 * 2^31 -.word 3505340523 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 36 * 375649793 * 2^31 -.word 23812275 // zeta^100 * 2^31 = 15047299^100 * 2^31 = 7111893 * 2^31 -.word 673162573 // zeta^100 
* f(q^(-1) mod 2^32) * 2^31 = 15047299^100 * 375649793 * 2^31 -.word 55105631 // zeta^ 20 * 2^31 = 15047299^ 20 * 2^31 = 9575431 * 2^31 -.word 638508449 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 20 * 375649793 * 2^31 -.word 63845407 // zeta^ 84 * 2^31 = 15047299^ 84 * 2^31 = 3819232 * 2^31 -.word 69140961 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 84 * 375649793 * 2^31 -.word 45155211 // zeta^ 52 * 2^31 = 15047299^ 52 * 2^31 = 13583150 * 2^31 -.word 2468768373 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 52 * 375649793 * 2^31 -.word 31892597 // zeta^116 * 2^31 = 15047299^116 * 2^31 = 10311346 * 2^31 -.word 3033656715 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 15047299^116 * 375649793 * 2^31 -.word 44632483 // zeta^ 12 * 2^31 = 15047299^ 12 * 2^31 = 21289485 * 2^31 -.word 3523957853 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 12 * 375649793 * 2^31 -.word 20599243 // zeta^ 76 * 2^31 = 15047299^ 76 * 2^31 = 33421816 * 2^31 -.word 3769343029 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 76 * 375649793 * 2^31 -.word 34994515 // zeta^ 44 * 2^31 = 15047299^ 44 * 2^31 = 30222420 * 2^31 -.word 1396393133 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 44 * 375649793 * 2^31 -.word 50418895 // zeta^108 * 2^31 = 15047299^108 * 2^31 = 23642097 * 2^31 -.word 527614257 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 15047299^108 * 375649793 * 2^31 -.word 26517879 // zeta^ 28 * 2^31 = 15047299^ 28 * 2^31 = 17233810 * 2^31 -.word 2151548041 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 28 * 375649793 * 2^31 -.word 5031613 // zeta^ 92 * 2^31 = 15047299^ 92 * 2^31 = 6280499 * 2^31 -.word 530750275 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 92 * 375649793 * 2^31 -.word 67003163 // zeta^ 60 * 2^31 = 15047299^ 60 * 2^31 = 16204162 * 2^31 -.word 3813976805 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 60 * 375649793 * 2^31 -.word 20694533 // zeta^124 * 2^31 = 15047299^124 * 2^31 = 12410931 * 2^31 -.word 2358012923 // zeta^124 * 
f(q^(-1) mod 2^32) * 2^31 = 15047299^124 * 375649793 * 2^31 -.word 3955367 // zeta^128 * 2^31 = 15047299^128 * 2^31 = 8518431 * 2^31 -.word 2551327577 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 15047299^128 * 375649793 * 2^31 -.word 55534013 // zeta^192 * 2^31 = 15047299^192 * 2^31 = 33556992 * 2^31 -.word 2863530051 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 15047299^192 * 375649793 * 2^31 -.word 25333645 // zeta^160 * 2^31 = 15047299^160 * 2^31 = 2013241 * 2^31 -.word 3875861107 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 15047299^160 * 375649793 * 2^31 -.word 24352199 // zeta^224 * 2^31 = 15047299^224 * 2^31 = 19715532 * 2^31 -.word 3869913145 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 15047299^224 * 375649793 * 2^31 -.word 62718759 // zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 3374744281 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 804847 // zeta^208 * 2^31 = 15047299^208 * 2^31 = 32616688 * 2^31 -.word 3613855249 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 15047299^208 * 375649793 * 2^31 -.word 49098857 // zeta^176 * 2^31 = 15047299^176 * 2^31 = 9932396 * 2^31 -.word 1374874007 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 15047299^176 * 375649793 * 2^31 -.word 7825327 // zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31 -.word 523857489 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31 -.word 14301793 // zeta^136 * 2^31 = 15047299^136 * 2^31 = 2711401 * 2^31 -.word 1636405151 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 15047299^136 * 375649793 * 2^31 -.word 20277111 // zeta^200 * 2^31 = 15047299^200 * 2^31 = 9445248 * 2^31 -.word 1884896905 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 15047299^200 * 375649793 * 2^31 -.word 41897609 // zeta^168 * 2^31 = 15047299^168 * 2^31 = 12909577 * 2^31 -.word 2198735735 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 15047299^168 * 375649793 * 2^31 -.word 57677939 // zeta^232 * 2^31 = 15047299^232 * 2^31 = 518908 * 2^31 -.word 4002964877 // zeta^232 * f(q^(-1) 
mod 2^32) * 2^31 = 15047299^232 * 375649793 * 2^31 -.word 15408129 // zeta^152 * 2^31 = 15047299^152 * 2^31 = 17817137 * 2^31 -.word 667479551 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 15047299^152 * 375649793 * 2^31 -.word 55234157 // zeta^216 * 2^31 = 15047299^216 * 2^31 = 18811302 * 2^31 -.word 3895554963 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 15047299^216 * 375649793 * 2^31 -.word 17422871 // zeta^184 * 2^31 = 15047299^184 * 2^31 = 33153165 * 2^31 -.word 914295785 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 15047299^184 * 375649793 * 2^31 -.word 42344795 // zeta^248 * 2^31 = 15047299^248 * 2^31 = 13108720 * 2^31 -.word 1631284389 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 15047299^248 * 375649793 * 2^31 -.word 24715747 // zeta^132 * 2^31 = 15047299^132 * 2^31 = 20428075 * 2^31 -.word 4214078493 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 15047299^132 * 375649793 * 2^31 -.word 50836285 // zeta^196 * 2^31 = 15047299^196 * 2^31 = 11458020 * 2^31 -.word 136621763 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 15047299^196 * 375649793 * 2^31 -.word 48156959 // zeta^164 * 2^31 = 15047299^164 * 2^31 = 26042233 * 2^31 -.word 1462789345 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 15047299^164 * 375649793 * 2^31 -.word 57901677 // zeta^228 * 2^31 = 15047299^228 * 2^31 = 18930340 * 2^31 -.word 789626771 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 15047299^228 * 375649793 * 2^31 -.word 42296769 // zeta^148 * 2^31 = 15047299^148 * 2^31 = 27800794 * 2^31 -.word 3725599807 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 15047299^148 * 375649793 * 2^31 -.word 12008355 // zeta^212 * 2^31 = 15047299^212 * 2^31 = 23981562 * 2^31 -.word 3656458845 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 15047299^212 * 375649793 * 2^31 -.word 20294379 // zeta^180 * 2^31 = 15047299^180 * 2^31 = 30285189 * 2^31 -.word 564888341 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 15047299^180 * 375649793 * 2^31 -.word 21958775 // zeta^244 * 2^31 = 15047299^244 * 2^31 = 19973843 * 2^31 -.word 1826198921 // zeta^244 * f(q^(-1) 
mod 2^32) * 2^31 = 15047299^244 * 375649793 * 2^31 -.word 9523753 // zeta^140 * 2^31 = 15047299^140 * 2^31 = 12132331 * 2^31 -.word 245385175 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 15047299^140 * 375649793 * 2^31 -.word 22481503 // zeta^204 * 2^31 = 15047299^204 * 2^31 = 12267508 * 2^31 -.word 771009441 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 15047299^204 * 375649793 * 2^31 -.word 48981373 // zeta^172 * 2^31 = 15047299^172 * 2^31 = 26976670 * 2^31 -.word 3426188419 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 15047299^172 * 375649793 * 2^31 -.word 32119471 // zeta^236 * 2^31 = 15047299^236 * 2^31 = 3334573 * 2^31 -.word 2898574161 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 15047299^236 * 375649793 * 2^31 -.word 12070727 // zeta^156 * 2^31 = 15047299^156 * 2^31 = 22603682 * 2^31 -.word 2674169529 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 15047299^156 * 375649793 * 2^31 -.word 40596107 // zeta^220 * 2^31 = 15047299^220 * 2^31 = 16323183 * 2^31 -.word 2143419253 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 15047299^220 * 375649793 * 2^31 -.word 54362349 // zeta^188 * 2^31 = 15047299^188 * 2^31 = 29763762 * 2^31 -.word 2839003411 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 15047299^188 * 375649793 * 2^31 -.word 110823 // zeta^252 * 2^31 = 15047299^252 * 2^31 = 17352831 * 2^31 -.word 480990489 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 15047299^252 * 375649793 * 2^31 -.word 18021653 // zeta^256 * 2^31 = 15047299^256 * 2^31 = 25038561 * 2^31 -.word 312202475 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 15047299^256 * 375649793 * 2^31 -.word 63158619 // zeta^320 * 2^31 = 15047299^320 * 2^31 = 25038562 * 2^31 -.word 1743639717 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 15047299^320 * 375649793 * 2^31 -.word 32575547 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 4289019333 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 41780341 // zeta^352 * 2^31 = 15047299^352 * 2^31 = 31543752 * 2^31 -.word 419106187 // zeta^352 * f(q^(-1) mod 
2^32) * 2^31 = 15047299^352 * 375649793 * 2^31 -.word 38757067 // zeta^272 * 2^31 = 15047299^272 * 2^31 = 29356361 * 2^31 -.word 239110965 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 15047299^272 * 375649793 * 2^31 -.word 4395227 // zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31 -.word 920223013 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31 -.word 59397449 // zeta^304 * 2^31 = 15047299^304 * 2^31 = 9045021 * 2^31 -.word 3443950775 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 15047299^304 * 375649793 * 2^31 -.word 18015129 // zeta^368 * 2^31 = 15047299^368 * 2^31 = 23624597 * 2^31 -.word 2920093287 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 15047299^368 * 375649793 * 2^31 -.word 39532311 // zeta^264 * 2^31 = 15047299^264 * 2^31 = 6733847 * 2^31 -.word 248491753 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 15047299^264 * 375649793 * 2^31 -.word 52812193 // zeta^328 * 2^31 = 15047299^328 * 2^31 = 30845592 * 2^31 -.word 2658562143 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 15047299^328 * 375649793 * 2^31 -.word 49337323 // zeta^296 * 2^31 = 15047299^296 * 2^31 = 21166324 * 2^31 -.word 1804229141 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 15047299^296 * 375649793 * 2^31 -.word 25216377 // zeta^360 * 2^31 = 15047299^360 * 2^31 = 20647416 * 2^31 -.word 2096231559 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 15047299^360 * 375649793 * 2^31 -.word 6269035 // zeta^280 * 2^31 = 15047299^280 * 2^31 = 994165 * 2^31 -.word 3228075413 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 15047299^280 * 375649793 * 2^31 -.word 51705857 // zeta^344 * 2^31 = 15047299^344 * 2^31 = 15739856 * 2^31 -.word 3627487743 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 15047299^344 * 375649793 * 2^31 -.word 58478917 // zeta^312 * 2^31 = 15047299^312 * 2^31 = 13512548 * 2^31 -.word 716988603 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 15047299^312 * 375649793 * 2^31 -.word 49691115 // zeta^376 * 2^31 = 15047299^376 * 2^31 = 403828 * 2^31 -.word 3380671509 // zeta^376 * f(q^(-1) mod 2^32) * 
2^31 = 15047299^376 * 375649793 * 2^31 -.word 59677531 // zeta^260 * 2^31 = 15047299^260 * 2^31 = 24586938 * 2^31 -.word 217510565 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 15047299^260 * 375649793 * 2^31 -.word 42398239 // zeta^324 * 2^31 = 15047299^324 * 2^31 = 13128918 * 2^31 -.word 80888801 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 15047299^324 * 375649793 * 2^31 -.word 43301711 // zeta^292 * 2^31 = 15047299^292 * 2^31 = 26445100 * 2^31 -.word 3621804721 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 15047299^292 * 375649793 * 2^31 -.word 18957027 // zeta^356 * 2^31 = 15047299^356 * 2^31 = 7514760 * 2^31 -.word 2832177949 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 15047299^356 * 375649793 * 2^31 -.word 3268579 // zeta^276 * 2^31 = 15047299^276 * 2^31 = 29737761 * 2^31 -.word 4225826333 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 15047299^276 * 375649793 * 2^31 -.word 24817217 // zeta^340 * 2^31 = 15047299^340 * 2^31 = 5756199 * 2^31 -.word 569367487 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 15047299^340 * 375649793 * 2^31 -.word 35221389 // zeta^308 * 2^31 = 15047299^308 * 2^31 = 23245647 * 2^31 -.word 1261310579 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 15047299^308 * 375649793 * 2^31 -.word 46819607 // zeta^372 * 2^31 = 15047299^372 * 2^31 = 3271804 * 2^31 -.word 3730078953 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 15047299^372 * 375649793 * 2^31 -.word 46514743 // zeta^268 * 2^31 = 15047299^268 * 2^31 = 135177 * 2^31 -.word 525624265 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 15047299^268 * 375649793 * 2^31 -.word 57590233 // zeta^332 * 2^31 = 15047299^332 * 2^31 = 21424662 * 2^31 -.word 4049582119 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 15047299^332 * 375649793 * 2^31 -.word 16695091 // zeta^300 * 2^31 = 15047299^300 * 2^31 = 9914896 * 2^31 -.word 3767353037 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 15047299^300 * 375649793 * 2^31 -.word 18132613 // zeta^364 * 2^31 = 15047299^364 * 2^31 = 6580323 * 2^31 -.word 868778875 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 
15047299^364 * 375649793 * 2^31
-.word 62082373 // zeta^284 * 2^31 = 15047299^284 * 2^31 = 27276494 * 2^31
-.word 3764217019 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 15047299^284 * 375649793 * 2^31
-.word 55043259 // zeta^348 * 2^31 = 15047299^348 * 2^31 = 10953311 * 2^31
-.word 1620797765 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 15047299^348 * 375649793 * 2^31
-.word 46419453 // zeta^316 * 2^31 = 15047299^316 * 2^31 = 21146062 * 2^31
-.word 1936954371 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 15047299^316 * 375649793 * 2^31
-.word 12751637 // zeta^380 * 2^31 = 15047299^380 * 2^31 = 3793231 * 2^31
-.word 1455963883 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 15047299^380 * 375649793 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_384_u32_33556993_15047299_incomplete_good_scale
-ntt_384_u32_33556993_15047299_incomplete_good_scale: // Constants for scaling by 1/N
-.word 11579973 // 1/96
-.word 1431437243 // 1/96 twisted
-.data
-roots:
-.word 66384763 /// zeta^256 * 2^31 = 15047299^256 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 15047299^256 * 375649793 * 2^31
-.word 893127 /// zeta^128 * 2^31 = 15047299^128 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 15047299^128 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 *
2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 29095681 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 14476917 // zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 43317805 // zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 14476917 // zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 18598075 // zeta^264 * 2^31 = 15047299^264 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 15047299^264 * 375649793 * 2^31 -.word 4885007 // zeta^168 * 2^31 = 15047299^168 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 15047299^168 * 375649793 * 2^31 -.word 43317805 // zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 64683161 // zeta^ 24 * 2^31 = 15047299^ 24 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 24 * 375649793 * 2^31 -.word 34427601 // zeta^312 * 2^31 = 15047299^312 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 15047299^312 * 375649793 * 2^31 -.word 33393089 // XX: zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 29095681 // XX: zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 
-.word 3280343807 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 14476917 // XX: zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 2356128651 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 43317805 // XX: zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 933021651 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 18598075 // XX: zeta^264 * 2^31 = 15047299^264 * 2^31 = 6733847 * 2^31 -.word 2578416965 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 15047299^264 * 375649793 * 2^31 -.word 4885007 // XX: zeta^168 * 2^31 = 15047299^168 * 2^31 = 12909577 * 2^31 -.word 2973633521 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 15047299^168 * 375649793 * 2^31 -.word 64683161 // XX: zeta^ 24 * 2^31 = 15047299^ 24 * 2^31 = 14745691 * 2^31 -.word 3091135847 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 24 * 375649793 * 2^31 -.word 34427601 // XX: zeta^312 * 2^31 = 15047299^312 * 2^31 = 13512548 * 2^31 -.word 864737071 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 15047299^312 * 375649793 * 2^31 -.word 39999747 // XX: zeta^132 * 2^31 = 15047299^132 * 2^31 = 20428075 * 2^31 -.word 3454780669 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 15047299^132 * 375649793 * 2^31 -.word 45317587 // XX: zeta^ 36 * 2^31 = 15047299^ 36 * 2^31 = 14626653 * 2^31 -.word 3083517997 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 36 * 375649793 * 2^31 -.word 48811299 // XX: zeta^276 * 2^31 = 15047299^276 * 2^31 = 29737761 * 2^31 -.word 4050555101 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 15047299^276 * 375649793 * 2^31 -.word 54571669 // XX: zeta^180 * 2^31 = 15047299^180 * 2^31 = 30285189 * 2^31 -.word 4085587819 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 15047299^180 * 375649793 * 2^31 -.word 59281651 // XX: zeta^ 12 * 2^31 = 15047299^ 12 * 2^31 = 21289485 * 2^31 -.word 3509906701 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 12 * 375649793 * 2^31 -.word 40500013 // 
XX: zeta^300 * 2^31 = 15047299^300 * 2^31 = 9914896 * 2^31
-.word 634504915 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 15047299^300 * 375649793 * 2^31
-.word 25917637 // XX: zeta^156 * 2^31 = 15047299^156 * 2^31 = 22603682 * 2^31
-.word 1446525243 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 15047299^156 * 375649793 * 2^31
-.word 8356523 // XX: zeta^ 60 * 2^31 = 15047299^ 60 * 2^31 = 16204162 * 2^31
-.word 1036987221 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 60 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_33556993_15047299_incomplete_good, %function
-.global ntt_384_u32_33556993_15047299_incomplete_good
-ntt_384_u32_33556993_15047299_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, 33556993
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vmul.u32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vmul.u32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] 
from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] 
-vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, 
#(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vmul.u32 Q2, Q0, r8 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(272)] -vadd.s32 Q3, Q3, 
Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// 
input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as 
Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vmul.u32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vmul.u32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vmul.u32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vmul.u32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vmul.u32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vmul.u32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vmul.u32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vqrdmlah.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[28]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[152]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[284]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q0, Q0, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vmul.u32 Q1, Q1, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vmul.u32 Q0, Q0, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q1, Q1, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[348]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release
input[348] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmul.u32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] 
-// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, 
[r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmul.u32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, 
[r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vmul.u32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// 
input[248]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmul.u32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, 
[r14, #(4 * -120)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vmul.u32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q3, Q3, r8 
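The stanzas in this hunk all lean on one gadget: a vqrdmulh/vmul/vqrdmlah triple that multiplies a coefficient vector by a precomputed twiddle modulo q = 33556993. As a rough model (an illustration only, not the repository's code: the actual instructions use a rounding variant with 2^31-scaled constants, which is why the twiddle comments carry a 2^31 factor), the underlying arithmetic is classical Montgomery multiplication:

```python
# Python model of the modular arithmetic behind the vqrdmulh/vmul/vqrdmlah
# triples in the deleted assembly. q is taken from the file's own twiddle
# comments; this is a sketch of the idea, not a cycle-accurate simulation.

Q = 33556993                # modulus q from the twiddle comments
R = 1 << 32                 # Montgomery radix matching the 32-bit lanes
QINV = pow(Q, -1, R)        # 375649793, the factor named in the comments
NEG_QINV = R - QINV         # 3919317503, the file's `.equ modulus_inv`

def redc(t):
    """Montgomery reduction: t * R^-1 mod Q, for 0 <= t < Q * R."""
    m = (t * NEG_QINV) % R  # chosen so that t + m*Q is divisible by R
    r = (t + m * Q) >> 32   # exact division by R; result < 2*Q
    return r - Q if r >= Q else r

def mul_by_twiddle(a, zeta):
    """Multiply by a twiddle held in Montgomery form, as the tables do."""
    zeta_mont = (zeta * R) % Q
    return redc(a * zeta_mont)
```

The constants 375649793 and 3919317503 appearing in the twiddle comments and in the `.equ modulus_inv` later in the file are exactly `QINV` and `R - QINV` here.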
-// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vmul.u32 Q4, Q4, r8 -// input[284]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, 
[r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vmul.u32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 
Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] 
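Each repeating block in this part of the hunk is one radix-2 butterfly: load x and y, form t = zeta*y with the Montgomery triple, then write back x + t and x - t with a vadd.s32/vsub.s32 pair. A minimal Python sketch of that layer structure (plain modular arithmetic stands in for the Montgomery gadget; q is taken from the twiddle comments):

```python
# Radix-2 butterfly layer as performed by the stanzas above: one twiddle
# multiplication followed by a vsub.s32/vadd.s32 write-back pair.
# Illustrative sketch, not the repository's code.

Q = 33556993  # modulus from the file's twiddle comments

def butterfly(x, y, zeta):
    """Return (x + zeta*y, x - zeta*y) mod Q, the pair the assembly
    stores back to the two coefficient slots."""
    t = (zeta * y) % Q  # the vqrdmulh/vmul/vqrdmlah product in the assembly
    return (x + t) % Q, (x - t) % Q

def ntt_layer(coeffs, zeta, dist):
    """Apply one layer of distance-`dist` butterflies in place, mirroring
    the paired vldrw/vstrw accesses at offsets `dist` words apart."""
    for start in range(0, len(coeffs), 2 * dist):
        for i in range(start, start + dist):
            coeffs[i], coeffs[i + dist] = butterfly(
                coeffs[i], coeffs[i + dist], zeta)
    return coeffs
```

In the real code every butterfly block consumes its own twiddle pair via `ldrd r9, r8, [r10], #+8`; the single `zeta` argument here is a simplification.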
-// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vmul.u32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vmul.u32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] 
-// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release 
input[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[124]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[376]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 124)]
-vmul.u32 Q4, Q4, r8
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[380]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vmul.u32 Q3, Q3, r8
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(496)]
-// Release input[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-.equ modulus_inv, 3919317503
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3350
-// Instruction count: 2395
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
deleted file mode 100644
index ebd349d..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,3182 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 893127 /// zeta^128 * 2^31 = 15047299^128 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 15047299^128 * 375649793 * 2^31
-.word 66384763 /// zeta^256 * 2^31 = 15047299^256 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 15047299^256 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 38018305 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 23796181 // zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31
-.word 52637069 // zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31
-.word 23796181 // zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31
-.word 32686385 // zeta^120 * 2^31 = 15047299^120 * 2^31 = 20044445 * 2^31
-.word 3430230223 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 15047299^120 * 375649793 * 2^31
-.word 2430825 // zeta^216 * 2^31 = 15047299^216 * 2^31 = 18811302 * 2^31
-.word 1203831447 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 15047299^216 * 375649793 * 2^31
-.word 52637069 // zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31
-.word 62228979 // zeta^360 * 2^31 = 15047299^360 * 2^31 = 20647416 * 2^31
-.word 1321333773 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 15047299^360 * 375649793 * 2^31
-.word 48515911 // zeta^ 72 * 2^31 = 15047299^ 72 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 72 * 375649793 * 2^31
-.word 33393089 // XX: zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 38018305 // XX: zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 23796181 // XX: zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31
-.word 3361945643 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31
-.word 52637069 // XX: zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31
-.word 1938838643 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31
-.word 32686385 // XX: zeta^120 * 2^31 = 15047299^120 * 2^31 = 20044445 * 2^31
-.word 3430230223 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 15047299^120 * 375649793 * 2^31
-.word 2430825 // XX: zeta^216 * 2^31 = 15047299^216 * 2^31 = 18811302 * 2^31
-.word 1203831447 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 15047299^216 * 375649793 * 2^31
-.word 62228979 // XX: zeta^360 * 2^31 = 15047299^360 * 2^31 = 20647416 * 2^31
-.word 1321333773 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 15047299^360 * 375649793 * 2^31
-.word 48515911 // XX: zeta^ 72 * 2^31 = 15047299^ 72 * 2^31 = 26823146 * 2^31
-.word 1716550329 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 72 * 375649793 * 2^31
-.word 58757463 // XX: zeta^252 * 2^31 = 15047299^252 * 2^31 = 17352831 * 2^31
-.word 3257980073 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 15047299^252 * 375649793 * 2^31
-.word 41196349 // XX: zeta^348 * 2^31 = 15047299^348 * 2^31 = 10953311 * 2^31
-.word 2848442051 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 15047299^348 * 375649793 * 2^31
-.word 26613973 // XX: zeta^108 * 2^31 = 15047299^108 * 2^31 = 23642097 * 2^31
-.word 3660462379 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 15047299^108 * 375649793 * 2^31
-.word 7832335 // XX: zeta^204 * 2^31 = 15047299^204 * 2^31 = 12267508 * 2^31
-.word 785060593 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 15047299^204 * 375649793 * 2^31
-.word 12542317 // XX: zeta^372 * 2^31 = 15047299^372 * 2^31 = 3271804 * 2^31
-.word 209379475 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 15047299^372 * 375649793 * 2^31
-.word 18302687 // XX: zeta^ 84 * 2^31 = 15047299^ 84 * 2^31 = 3819232 * 2^31
-.word 244412193 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 84 * 375649793 * 2^31
-.word 21796399 // XX: zeta^228 * 2^31 = 15047299^228 * 2^31 = 18930340 * 2^31
-.word 1211449297 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 15047299^228 * 375649793 * 2^31 -.word 27114239 // XX: zeta^324 * 2^31 = 15047299^324 * 2^31 = 13128918 * 2^31 -.word 840186625 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 15047299^324 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_33556993_15047299_incomplete_good_bitrev, %function -.global ntt_384_u32_33556993_15047299_incomplete_good_bitrev -ntt_384_u32_33556993_15047299_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 33556993 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vmul.u32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 
32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 
Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// 
Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// 
input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded 
as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, 
Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// 
input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] 
-vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release 
input[44] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, 
[r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[204]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, 
Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[72]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(48)] -vadd.s32 Q1, Q1, Q0 -// Release input[264] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[72]: Already loaded as Q7 -// input[204]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[204] from Q6 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vsub.s32 Q4, Q3, Q2 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q7 -// Release input[72] from Q7 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[300]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q6, Q6, Q5 -// Release input[300] from Q5 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[360]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[360]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, 
Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[228] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(432)] -vadd.s32 Q3, Q3, Q7 -// Release input[360] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-96)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q3, Q3, Q2 -// Release input[276] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[216]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(96)] -// input[216]: Already loaded as Q7 -// input[348]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[348] from Q5 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] 
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vmul.u32 Q0, Q0, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q1, Q1, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vmul.u32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vmul.u32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vmul.u32 Q2, Q2, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vqrdmlah.s32 Q1, Q4, r11
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-ldrd r9, r8, [r10], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vmul.u32 Q0, Q0, r8
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vadd.s32 Q2, Q2, Q1
-// input[64]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vmul.u32 Q3, Q3, r8
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[320]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q4, Q4, r8
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[96]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vmul.u32 Q3, Q3, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[352]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vmul.u32 Q4, Q4, r8
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[224]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vmul.u32 Q3, Q3, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[336]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vmul.u32 Q4, Q4, r8
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[208]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vmul.u32 Q3, Q3, r8
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[80]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vmul.u32 Q4, Q4, r8
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[240]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vmul.u32 Q3, Q3, r8
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[112]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vmul.u32 Q4, Q4, r8
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(448)]
-// Release
input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vmul.u32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 
-vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// 
input[120]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 
-vqrdmulh.s32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// 
input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[12]: Load 
as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] 
-vmul.u32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vmul.u32 Q4, Q4, r8 -// 
input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 3919317503 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good.s deleted file mode 100644 index 6275846..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_45387457_923104_incomplete_good_twiddles -ntt_384_u32_45387457_923104_incomplete_good_twiddles: // For base multiplication -.word 69606647 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 685157961 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 62904337 // zeta^ 64 * 2^31 = 923104^ 64 * 2^31 = 18186381 * 2^31 -.word 1812533935 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 64 * 450429249 * 2^31 -.word 48768409 // zeta^ 32 * 2^31 = 923104^ 32 * 2^31 = 16376451 * 2^31 -.word 4063746855 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 32 * 450429249 * 2^31 -.word 30855129 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 2025087207 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 5368717 // zeta^ 16 * 2^31 = 923104^ 16 * 2^31 = 6955156 * 2^31 -.word 2630001715 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 16 * 450429249 * 2^31 -.word 35344777 // zeta^ 80 * 2^31 = 923104^ 80 * 2^31 = 38478475 * 2^31 -.word 1419625271 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 80 * 450429249 * 2^31 -.word 34054097 // zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 3259472623 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 35946385 // zeta^112 * 2^31 = 923104^112 * 2^31 = 16261595 * 2^31 -.word 1951599407 // zeta^112 * f(q^(-1) mod 
2^32) * 2^31 = 923104^112 * 450429249 * 2^31 -.word 54446789 // zeta^ 8 * 2^31 = 923104^ 8 * 2^31 = 16877098 * 2^31 -.word 189185787 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 8 * 450429249 * 2^31 -.word 39834949 // zeta^ 72 * 2^31 = 923104^ 72 * 2^31 = 21015440 * 2^31 -.word 1012438139 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 72 * 450429249 * 2^31 -.word 46558923 // zeta^ 40 * 2^31 = 923104^ 40 * 2^31 = 3630241 * 2^31 -.word 4246475637 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 40 * 450429249 * 2^31 -.word 81626031 // zeta^104 * 2^31 = 923104^104 * 2^31 = 33283422 * 2^31 -.word 2162614673 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 923104^104 * 450429249 * 2^31 -.word 66297913 // zeta^ 24 * 2^31 = 923104^ 24 * 2^31 = 38013065 * 2^31 -.word 1899174023 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 24 * 450429249 * 2^31 -.word 39269057 // zeta^ 88 * 2^31 = 923104^ 88 * 2^31 = 33248211 * 2^31 -.word 1896897535 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 88 * 450429249 * 2^31 -.word 90210255 // zeta^ 56 * 2^31 = 923104^ 56 * 2^31 = 31693324 * 2^31 -.word 3850943857 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 56 * 450429249 * 2^31 -.word 80761913 // zeta^120 * 2^31 = 923104^120 * 2^31 = 20563366 * 2^31 -.word 3363136647 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 923104^120 * 450429249 * 2^31 -.word 13759071 // zeta^ 4 * 2^31 = 923104^ 4 * 2^31 = 923104 * 2^31 -.word 3761772257 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 4 * 450429249 * 2^31 -.word 30402329 // zeta^ 68 * 2^31 = 923104^ 68 * 2^31 = 8451464 * 2^31 -.word 1277049255 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 68 * 450429249 * 2^31 -.word 83384231 // zeta^ 36 * 2^31 = 923104^ 36 * 2^31 = 12508371 * 2^31 -.word 2181765017 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 36 * 450429249 * 2^31 -.word 2847179 // zeta^100 * 2^31 = 923104^100 * 2^31 = 20823894 * 2^31 -.word 3061010549 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 923104^100 * 450429249 * 2^31 -.word 
73095195 // zeta^ 20 * 2^31 = 923104^ 20 * 2^31 = 4206832 * 2^31 -.word 479430181 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 20 * 450429249 * 2^31 -.word 86175901 // zeta^ 84 * 2^31 = 923104^ 84 * 2^31 = 375141 * 2^31 -.word 2820360995 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 84 * 450429249 * 2^31 -.word 75051431 // zeta^ 52 * 2^31 = 923104^ 52 * 2^31 = 37944787 * 2^31 -.word 1467596185 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 52 * 450429249 * 2^31 -.word 72003281 // zeta^116 * 2^31 = 923104^116 * 2^31 = 13574899 * 2^31 -.word 892455919 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 923104^116 * 450429249 * 2^31 -.word 21266821 // zeta^ 12 * 2^31 = 923104^ 12 * 2^31 = 26669485 * 2^31 -.word 492607547 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 12 * 450429249 * 2^31 -.word 17786721 // zeta^ 76 * 2^31 = 923104^ 76 * 2^31 = 20629734 * 2^31 -.word 813064031 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 76 * 450429249 * 2^31 -.word 28787439 // zeta^ 44 * 2^31 = 923104^ 44 * 2^31 = 43262840 * 2^31 -.word 3600683601 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 44 * 450429249 * 2^31 -.word 9793529 // zeta^108 * 2^31 = 923104^108 * 2^31 = 19489792 * 2^31 -.word 277715143 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 923104^108 * 450429249 * 2^31 -.word 11700093 // zeta^ 28 * 2^31 = 923104^ 28 * 2^31 = 16210463 * 2^31 -.word 2502892611 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 28 * 450429249 * 2^31 -.word 50248023 // zeta^ 92 * 2^31 = 923104^ 92 * 2^31 = 13494060 * 2^31 -.word 1306171881 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 92 * 450429249 * 2^31 -.word 35962109 // zeta^ 60 * 2^31 = 923104^ 60 * 2^31 = 24024980 * 2^31 -.word 1803159235 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 60 * 450429249 * 2^31 -.word 68955489 // zeta^124 * 2^31 = 923104^124 * 2^31 = 1591696 * 2^31 -.word 2272401759 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 923104^124 * 450429249 * 2^31 -.word 38685147 // zeta^128 * 2^31 = 923104^128 * 2^31 = 
18186380 * 2^31 -.word 1127375973 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 923104^128 * 450429249 * 2^31 -.word 21168267 // zeta^192 * 2^31 = 923104^192 * 2^31 = 45387456 * 2^31 -.word 3609809333 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 923104^192 * 450429249 * 2^31 -.word 27474177 // zeta^160 * 2^31 = 923104^160 * 2^31 = 43749424 * 2^31 -.word 2256307647 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 923104^160 * 450429249 * 2^31 -.word 42006505 // zeta^224 * 2^31 = 923104^224 * 2^31 = 29011006 * 2^31 -.word 231220439 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 923104^224 * 450429249 * 2^31 -.word 75363517 // zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3084590851 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 85406197 // zeta^208 * 2^31 = 923104^208 * 2^31 = 38432301 * 2^31 -.word 1664965579 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 923104^208 * 450429249 * 2^31 -.word 47279745 // zeta^176 * 2^31 = 923104^176 * 2^31 = 21308336 * 2^31 -.word 2987094079 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 923104^176 * 450429249 * 2^31 -.word 56720817 // zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 1035494671 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 30775617 // zeta^136 * 2^31 = 923104^136 * 2^31 = 4138342 * 2^31 -.word 823252351 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 923104^136 * 450429249 * 2^31 -.word 36328125 // zeta^200 * 2^31 = 923104^200 * 2^31 = 28510359 * 2^31 -.word 4105781507 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 923104^200 * 450429249 * 2^31 -.word 80454565 // zeta^168 * 2^31 = 923104^168 * 2^31 = 29653181 * 2^31 -.word 2211106331 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 923104^168 * 450429249 * 2^31 -.word 44215991 // zeta^232 * 2^31 = 923104^232 * 2^31 = 41757216 * 2^31 -.word 48491657 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 923104^232 * 450429249 * 2^31 -.word 18358601 // zeta^152 * 2^31 = 923104^152 * 2^31 = 40622603 * 2^31 -.word 4292690807 // zeta^152 * 
f(q^(-1) mod 2^32) * 2^31 = 923104^152 * 450429249 * 2^31 -.word 24477001 // zeta^216 * 2^31 = 923104^216 * 2^31 = 7374392 * 2^31 -.word 2395793271 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 923104^216 * 450429249 * 2^31 -.word 35939115 // zeta^184 * 2^31 = 923104^184 * 2^31 = 34257499 * 2^31 -.word 3807160085 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 923104^184 * 450429249 * 2^31 -.word 564659 // zeta^248 * 2^31 = 923104^248 * 2^31 = 13694133 * 2^31 -.word 444023437 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 923104^248 * 450429249 * 2^31 -.word 62030715 // zeta^132 * 2^31 = 923104^132 * 2^31 = 7528360 * 2^31 -.word 1810244293 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 923104^132 * 450429249 * 2^31 -.word 77015843 // zeta^196 * 2^31 = 923104^196 * 2^31 = 44464353 * 2^31 -.word 533195037 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 923104^196 * 450429249 * 2^31 -.word 55625319 // zeta^164 * 2^31 = 923104^164 * 2^31 = 8315523 * 2^31 -.word 879245529 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 923104^164 * 450429249 * 2^31 -.word 7390683 // zeta^228 * 2^31 = 923104^228 * 2^31 = 32879086 * 2^31 -.word 2113202277 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 923104^228 * 450429249 * 2^31 -.word 58468163 // zeta^148 * 2^31 = 923104^148 * 2^31 = 41555766 * 2^31 -.word 2340930813 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 923104^148 * 450429249 * 2^31 -.word 17679719 // zeta^212 * 2^31 = 923104^212 * 2^31 = 41180625 * 2^31 -.word 3815537113 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 923104^212 * 450429249 * 2^31 -.word 42339307 // zeta^180 * 2^31 = 923104^180 * 2^31 = 21017569 * 2^31 -.word 3719827029 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 923104^180 * 450429249 * 2^31 -.word 15723483 // zeta^244 * 2^31 = 923104^244 * 2^31 = 7442670 * 2^31 -.word 2827371109 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 923104^244 * 450429249 * 2^31 -.word 41907357 // zeta^140 * 2^31 = 923104^140 * 2^31 = 39347706 * 2^31 -.word 320456483 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 923104^140 * 450429249 * 
2^31 -.word 69508093 // zeta^204 * 2^31 = 923104^204 * 2^31 = 18717972 * 2^31 -.word 3802359747 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 923104^204 * 450429249 * 2^31 -.word 26393547 // zeta^172 * 2^31 = 923104^172 * 2^31 = 21614409 * 2^31 -.word 971998837 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 923104^172 * 450429249 * 2^31 -.word 61987475 // zeta^236 * 2^31 = 923104^236 * 2^31 = 2124617 * 2^31 -.word 694283693 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 923104^236 * 450429249 * 2^31 -.word 83935387 // zeta^156 * 2^31 = 923104^156 * 2^31 = 42671054 * 2^31 -.word 3098246565 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 923104^156 * 450429249 * 2^31 -.word 79074821 // zeta^220 * 2^31 = 923104^220 * 2^31 = 29176994 * 2^31 -.word 1792074683 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 923104^220 * 450429249 * 2^31 -.word 78380837 // zeta^188 * 2^31 = 923104^188 * 2^31 = 22954173 * 2^31 -.word 469242523 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 923104^188 * 450429249 * 2^31 -.word 54812805 // zeta^252 * 2^31 = 923104^252 * 2^31 = 21362477 * 2^31 -.word 2491808059 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 923104^252 * 450429249 * 2^31 -.word 27870577 // zeta^256 * 2^31 = 923104^256 * 2^31 = 27201076 * 2^31 -.word 2482433359 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 923104^256 * 450429249 * 2^31 -.word 52089767 // zeta^320 * 2^31 = 923104^320 * 2^31 = 27201077 * 2^31 -.word 3167591321 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 923104^320 * 450429249 * 2^31 -.word 59919785 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 2269880087 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 63300737 // zeta^352 * 2^31 = 923104^352 * 2^31 = 1638033 * 2^31 -.word 2038659647 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 923104^352 * 450429249 * 2^31 -.word 55430137 // zeta^272 * 2^31 = 923104^272 * 2^31 = 6908982 * 2^31 -.word 2875342023 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 923104^272 * 450429249 * 2^31 -.word 15411397 // zeta^336 * 2^31 = 
923104^336 * 2^31 = 13864138 * 2^31 -.word 1210376443 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 54828529 // zeta^304 * 2^31 = 923104^304 * 2^31 = 29125862 * 2^31 -.word 2343367887 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 923104^304 * 450429249 * 2^31 -.word 43495169 // zeta^368 * 2^31 = 923104^368 * 2^31 = 24079121 * 2^31 -.word 1307873215 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 923104^368 * 450429249 * 2^31 -.word 50939965 // zeta^264 * 2^31 = 923104^264 * 2^31 = 24372017 * 2^31 -.word 3282529155 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 923104^264 * 450429249 * 2^31 -.word 59999297 // zeta^328 * 2^31 = 923104^328 * 2^31 = 41249115 * 2^31 -.word 3471714943 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 923104^328 * 450429249 * 2^31 -.word 9148883 // zeta^296 * 2^31 = 923104^296 * 2^31 = 12104035 * 2^31 -.word 2132352621 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 923104^296 * 450429249 * 2^31 -.word 10320349 // zeta^360 * 2^31 = 923104^360 * 2^31 = 15734276 * 2^31 -.word 2083860963 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 923104^360 * 450429249 * 2^31 -.word 51505857 // zeta^280 * 2^31 = 923104^280 * 2^31 = 12139246 * 2^31 -.word 2398069759 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 923104^280 * 450429249 * 2^31 -.word 72416313 // zeta^344 * 2^31 = 923104^344 * 2^31 = 4764854 * 2^31 -.word 2276487 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 923104^344 * 450429249 * 2^31 -.word 10013001 // zeta^312 * 2^31 = 923104^312 * 2^31 = 24824091 * 2^31 -.word 931830647 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 923104^312 * 450429249 * 2^31 -.word 54835799 // zeta^376 * 2^31 = 923104^376 * 2^31 = 11129958 * 2^31 -.word 487807209 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 923104^376 * 450429249 * 2^31 -.word 60372585 // zeta^260 * 2^31 = 923104^260 * 2^31 = 36935993 * 2^31 -.word 3017918039 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 923104^260 * 450429249 * 2^31 -.word 28744199 // zeta^324 * 2^31 = 923104^324 * 2^31 = 37859097 * 2^31 -.word 
2484723001 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 923104^324 * 450429249 * 2^31 -.word 87927735 // zeta^292 * 2^31 = 923104^292 * 2^31 = 24563563 * 2^31 -.word 1233956745 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 923104^292 * 450429249 * 2^31 -.word 35149595 // zeta^356 * 2^31 = 923104^356 * 2^31 = 37071934 * 2^31 -.word 3415721765 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 923104^356 * 450429249 * 2^31 -.word 4599013 // zeta^276 * 2^31 = 923104^276 * 2^31 = 45012316 * 2^31 -.word 1474606299 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 923104^276 * 450429249 * 2^31 -.word 32306751 // zeta^340 * 2^31 = 923104^340 * 2^31 = 3831691 * 2^31 -.word 1954036481 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 923104^340 * 450429249 * 2^31 -.word 18771633 // zeta^308 * 2^31 = 923104^308 * 2^31 = 31812558 * 2^31 -.word 3402511375 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 923104^308 * 450429249 * 2^31 -.word 48435607 // zeta^372 * 2^31 = 923104^372 * 2^31 = 24369888 * 2^31 -.word 575140265 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 923104^372 * 450429249 * 2^31 -.word 72988193 // zeta^268 * 2^31 = 923104^268 * 2^31 = 24757723 * 2^31 -.word 3481903263 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 923104^268 * 450429249 * 2^31 -.word 48867557 // zeta^332 * 2^31 = 923104^332 * 2^31 = 6039751 * 2^31 -.word 3974510811 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 923104^332 * 450429249 * 2^31 -.word 80981385 // zeta^300 * 2^31 = 923104^300 * 2^31 = 25897665 * 2^31 -.word 4017252151 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 923104^300 * 450429249 * 2^31 -.word 64381367 // zeta^364 * 2^31 = 923104^364 * 2^31 = 23773048 * 2^31 -.word 3322968457 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 923104^364 * 450429249 * 2^31 -.word 40526891 // zeta^284 * 2^31 = 923104^284 * 2^31 = 31893397 * 2^31 -.word 2988795413 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 923104^284 * 450429249 * 2^31 -.word 6839527 // zeta^348 * 2^31 = 923104^348 * 2^31 = 2716403 * 2^31 -.word 1196720729 // zeta^348 * f(q^(-1) mod 2^32) * 
2^31 = 923104^348 * 450429249 * 2^31
-.word 21819425 // zeta^316 * 2^31 = 923104^316 * 2^31 = 43795761 * 2^31
-.word 2022565535 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 923104^316 * 450429249 * 2^31
-.word 12394077 // zeta^380 * 2^31 = 923104^380 * 2^31 = 22433284 * 2^31
-.word 3825724771 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 923104^380 * 450429249 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_384_u32_45387457_923104_incomplete_good_scale
-ntt_384_u32_45387457_923104_incomplete_good_scale: // Constants for scaling by 1/N
-.word 69606647 // 1/96
-.word 685157961 // 1/96 twisted
-.data
-roots:
-.word 22090505 /// zeta^256 * 2^31 = 923104^256 * 2^31 = 27201076 * 2^31
-.word 1287004599 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 923104^256 * 450429249 * 2^31
-.word 9023783 /// zeta^128 * 2^31 = 923104^128 * 2^31 = 18186380 * 2^31
-.word 860479001 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 923104^128 * 450429249 * 2^31
-.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31
-.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31
-.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31
-.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31
-.word 78782351 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31
-.word 3597626801 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31
-.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31
-.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31
-.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31
-.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31
-.word 78782351 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31
-.word 3597626801 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31
-.word 78782351 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31
-.word 3597626801 // zeta^288 * f(q^(-1)
mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 88323005 // zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 84188761 // zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 88323005 // zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 16804439 // zeta^264 * 2^31 = 923104^264 * 2^31 = 24372017 * 2^31 -.word 3300632809 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 923104^264 * 450429249 * 2^31 -.word 19157039 // zeta^168 * 2^31 = 923104^168 * 2^31 = 29653181 * 2^31 -.word 3550508305 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 923104^168 * 450429249 * 2^31 -.word 84188761 // zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 65804887 // zeta^ 24 * 2^31 = 923104^ 24 * 2^31 = 38013065 * 2^31 -.word 3946051817 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 24 * 450429249 * 2^31 -.word 82969997 // zeta^312 * 2^31 = 923104^312 * 2^31 = 24824091 * 2^31 -.word 3322022451 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 923104^312 * 450429249 * 2^31 -.word 14273169 // XX: zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 78782351 // XX: zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 3597626801 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 88323005 // XX: zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3638992899 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 84188761 // XX: zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 1908699751 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 
450429249 * 2^31 -.word 16804439 // XX: zeta^264 * 2^31 = 923104^264 * 2^31 = 24372017 * 2^31 -.word 3300632809 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 923104^264 * 450429249 * 2^31 -.word 19157039 // XX: zeta^168 * 2^31 = 923104^168 * 2^31 = 29653181 * 2^31 -.word 3550508305 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 923104^168 * 450429249 * 2^31 -.word 65804887 // XX: zeta^ 24 * 2^31 = 923104^ 24 * 2^31 = 38013065 * 2^31 -.word 3946051817 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 24 * 450429249 * 2^31 -.word 82969997 // XX: zeta^312 * 2^31 = 923104^312 * 2^31 = 24824091 * 2^31 -.word 3322022451 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 923104^312 * 450429249 * 2^31 -.word 66361593 // XX: zeta^132 * 2^31 = 923104^132 * 2^31 = 7528360 * 2^31 -.word 356200391 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 923104^132 * 450429249 * 2^31 -.word 80165521 // XX: zeta^ 36 * 2^31 = 923104^ 36 * 2^31 = 12508371 * 2^31 -.word 2739310639 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 36 * 450429249 * 2^31 -.word 88960289 // XX: zeta^276 * 2^31 = 923104^276 * 2^31 = 45012316 * 2^31 -.word 2129734047 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 923104^276 * 450429249 * 2^31 -.word 6563629 // XX: zeta^180 * 2^31 = 923104^180 * 2^31 = 21017569 * 2^31 -.word 3141918867 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 923104^180 * 450429249 * 2^31 -.word 482773 // XX: zeta^ 12 * 2^31 = 923104^ 12 * 2^31 = 26669485 * 2^31 -.word 3409336299 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 12 * 450429249 * 2^31 -.word 35973319 // XX: zeta^300 * 2^31 = 923104^300 * 2^31 = 25897665 * 2^31 -.word 3372818041 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 923104^300 * 450429249 * 2^31 -.word 11401659 // XX: zeta^156 * 2^31 = 923104^156 * 2^31 = 42671054 * 2^31 -.word 2018958469 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 923104^156 * 450429249 * 2^31 -.word 59173881 // XX: zeta^ 60 * 2^31 = 923104^ 60 * 2^31 = 24024980 * 2^31 -.word 1136729287 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 
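The twiddle tables in this deleted generated file encode powers of zeta = 923104 modulo q = 45387457 for a 384-point NTT built via Good's trick (384 = 3 * 128). A minimal sketch, assuming — as the symbol names and comments indicate — that zeta has order 384 mod q, checks the structure the tables rely on:

```python
# Sanity sketch (assumption: zeta = 923104 has order 384 mod q = 45387457,
# as the symbol ntt_384_u32_45387457_923104 and the twiddle comments suggest).
q = 45387457
zeta = 923104

assert pow(zeta, 384, q) == 1          # order divides 384

w = pow(zeta, 128, q)                  # a primitive cube root of unity
assert (1 + w + w * w) % q == 0        # identity used by the radix-3 layer

# The comments above list zeta^128 = 18186380 and zeta^256 = 27201076;
# these are the two nontrivial cube roots: note 18186380 + 27201076 + 1 == q.
assert 18186380 + 27201076 + 1 == q
assert w in (18186380, 27201076)
```

The last two checks tie the abstract identity back to the concrete values in the table comments.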
923104^ 60 * 450429249 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_45387457_923104_incomplete_good, %function
-.global ntt_384_u32_45387457_923104_incomplete_good
-ntt_384_u32_45387457_923104_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, 45387457
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vmul.u32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vmul.u32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmlah.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, [r14,#(-480)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0,
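The first layer above combines input[0], input[128] and input[256] using a single modular multiplication per butterfly. Read back from the vsub/vadd pattern, this is a 3-point DFT exploiting 1 + w + w^2 = 0 for the cube root w = zeta^128; a plain-arithmetic sketch (the Montgomery representation used by the assembly is elided):

```python
# Radix-3 butterfly with one multiplication, mirroring the vsub/vqrdmulh/
# vadd pattern above (Montgomery details omitted; plain mod-q arithmetic).
q = 45387457
zeta = 923104
w = pow(zeta, 128, q)    # cube root of unity
w2 = pow(zeta, 256, q)   # w^2, the constant actually multiplied by

def radix3(a0, a1, a2):
    t = (a2 - a1) % q        # vsub
    m = (w2 * t) % q         # the single multiplication
    s = (a2 + a1) % q        # vadd
    u = (a0 - a1) % q        # vsub
    x1 = (u + m) % q         # = a0 + w*a1 + w2*a2   (uses 1 + w + w2 = 0)
    x2 = (u - m - t) % q     # = a0 + w2*a1 + w*a2
    x0 = (s + a0) % q        # = a0 + a1 + a2
    return x0, x1, x2

def dft3(a0, a1, a2):        # naive 3-point DFT for comparison
    return ((a0 + a1 + a2) % q,
            (a0 + w * a1 + w2 * a2) % q,
            (a0 + w2 * a1 + w * a2) % q)

assert radix3(3, 141, 59265) == dft3(3, 141, 59265)
```

The trade of two multiplications for extra additions is what makes this layer cheap on MVE.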
Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release 
input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 
-vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: 
Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vmul.u32 Q2, Q0, r8 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 
-// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] 
-vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, 
Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 
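Each butterfly above performs its modular multiplication with the three-instruction sequence vqrdmulh.s32 / vmul.u32 / vqrdmlah.s32. A rough Python emulation of that Montgomery multiplication follows; saturation is ignored, and the twisted constant is computed here as -b * q^(-1) mod 2^32, an illustrative convention that may differ from the exact precomputation baked into the tables above:

```python
# Emulation of the vqrdmulh/vmul/vqrdmlah Montgomery multiply (per lane,
# no saturation). Assumption: twisted constant b_tw = -b * q^(-1) mod 2^32.
q = 45387457

def sqrdmulh(a, b):
    # Rounding doubling high half, as vqrdmulh.s32 computes per lane.
    return (2 * a * b + (1 << 31)) >> 32

def montmul(a, b):
    b_tw = (-b * pow(q, -1, 1 << 32)) % (1 << 32)
    t = sqrdmulh(a, b)                  # vqrdmulh.s32 Qd, Qa, r_root
    u = (a * b_tw) % (1 << 32)          # vmul.u32 keeps the low 32 bits
    if u >= 1 << 31:                    # reinterpret lane as signed
        u -= 1 << 32
    return t + sqrdmulh(u, q)           # vqrdmlah.s32 accumulate with q

# The low halves cancel, so the result r satisfies r * 2^31 = a * b (mod q).
for a, b in [(1234, 56789), (-40000, 45387456), (2**30, 923104)]:
    r = montmul(a, b)
    assert (r * (1 << 31) - a * b) % q == 0
```

This is why each root is stored as a pair of words in the tables: the plain constant feeds vqrdmulh, the twisted one feeds vmul, and vqrdmlah folds in the modulus held in r11.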
-// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, 
[r14,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release input[96] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release input[192] from Q4 -vmul.u32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[36]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 36)] -vqrdmlah.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release input[288] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, 
Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[28]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[152]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[284]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q0, Q0, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vmul.u32 Q1, Q1, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vmul.u32 Q0, Q0, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q1, Q1, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[348]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release input[348] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[88]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q0, Q0, r8
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(352)]
-// Release input[88] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[220]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[344]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q2, Q2, r8
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[92]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(368)]
-// Release input[344] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[92]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vmul.u32 Q0, Q0, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(368)]
-// Release input[92] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmul.u32 Q1, Q1, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[372]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 120)]
-vmul.u32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(384)]
-// Release input[96] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(480)]
-// Release input[372] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q0, Q0, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[352]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[244]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -8)]
-vmul.u32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(400)]
-// Release input[352] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[100]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-32)]
-// Release input[244] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q2, Q2, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(400)]
-// Release input[100] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32
Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[260]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vmul.u32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 
-104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vmul.u32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] 
-// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release 
input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vmul.u32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 
-vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, 
[r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release 
input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vmul.u32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 
-vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vmul.u32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, 
[r10], #+8 -// input[372]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded 
as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 3844538047 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s deleted file mode 100644 index 9484257..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 9023783 /// zeta^128 * 2^31 = 923104^128 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 923104^128 * 450429249 * 2^31 -.word 22090505 /// zeta^256 * 2^31 = 923104^256 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 923104^256 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 11992563 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 6586153 // zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 2451909 // 
zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 6586153 // zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 7804917 // zeta^120 * 2^31 = 923104^120 * 2^31 = 20563366 * 2^31 -.word 972944843 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 923104^120 * 450429249 * 2^31 -.word 24970027 // zeta^216 * 2^31 = 923104^216 * 2^31 = 7374392 * 2^31 -.word 348915477 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 923104^216 * 450429249 * 2^31 -.word 2451909 // zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 71617875 // zeta^360 * 2^31 = 923104^360 * 2^31 = 15734276 * 2^31 -.word 744458989 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 923104^360 * 450429249 * 2^31 -.word 73970475 // zeta^ 72 * 2^31 = 923104^ 72 * 2^31 = 21015440 * 2^31 -.word 994334485 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 72 * 450429249 * 2^31 -.word 14273169 // XX: zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 11992563 // XX: zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 6586153 // XX: zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 2386267543 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 2451909 // XX: zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 655974395 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 7804917 // XX: zeta^120 * 2^31 = 923104^120 * 2^31 = 20563366 * 2^31 -.word 972944843 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 923104^120 * 450429249 * 2^31 -.word 24970027 // XX: zeta^216 * 2^31 = 923104^216 * 2^31 = 
7374392 * 2^31 -.word 348915477 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 923104^216 * 450429249 * 2^31 -.word 71617875 // XX: zeta^360 * 2^31 = 923104^360 * 2^31 = 15734276 * 2^31 -.word 744458989 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 923104^360 * 450429249 * 2^31 -.word 73970475 // XX: zeta^ 72 * 2^31 = 923104^ 72 * 2^31 = 21015440 * 2^31 -.word 994334485 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 72 * 450429249 * 2^31 -.word 31601033 // XX: zeta^252 * 2^31 = 923104^252 * 2^31 = 21362477 * 2^31 -.word 3158238007 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 923104^252 * 450429249 * 2^31 -.word 79373255 // XX: zeta^348 * 2^31 = 923104^348 * 2^31 = 2716403 * 2^31 -.word 2276008825 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 923104^348 * 450429249 * 2^31 -.word 54801595 // XX: zeta^108 * 2^31 = 923104^108 * 2^31 = 19489792 * 2^31 -.word 922149253 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 923104^108 * 450429249 * 2^31 -.word 90292141 // XX: zeta^204 * 2^31 = 923104^204 * 2^31 = 18717972 * 2^31 -.word 885630995 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 923104^204 * 450429249 * 2^31 -.word 84211285 // XX: zeta^372 * 2^31 = 923104^372 * 2^31 = 24369888 * 2^31 -.word 1153048427 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 923104^372 * 450429249 * 2^31 -.word 1814625 // XX: zeta^ 84 * 2^31 = 923104^ 84 * 2^31 = 375141 * 2^31 -.word 2165233247 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 84 * 450429249 * 2^31 -.word 10609393 // XX: zeta^228 * 2^31 = 923104^228 * 2^31 = 32879086 * 2^31 -.word 1555656655 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 923104^228 * 450429249 * 2^31 -.word 24413321 // XX: zeta^324 * 2^31 = 923104^324 * 2^31 = 37859097 * 2^31 -.word 3938766903 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 923104^324 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_45387457_923104_incomplete_good_bitrev, %function -.global ntt_384_u32_45387457_923104_incomplete_good_bitrev 
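The deleted `roots` table above stores each twiddle factor as a pair of words: first `(zeta^k * 2^31) mod q` in canonical form, then a twisted copy (premultiplied by the inverse of q mod 2^32) consumed by the `vqrdmlah` correction step. As a minimal sketch of how the first word of each pair is derived — q, the root 923104, and the sample constants are taken from the deleted file's own comments, and the helper name `twiddle` is hypothetical:

```python
# Sketch only: reproduces the first .word of each roots-table pair in the
# deleted ntt_384_u32_45387457_923104 source. Constants come from the file's
# own comments; the helper name `twiddle` is made up for illustration.
q = 45387457    # modulus of this 384-point NTT
root = 923104   # zeta, the root of unity named in the file

def twiddle(k):
    """(zeta^k * 2^31) mod q -- the canonical-form twiddle constant."""
    return (pow(root, k, q) << 31) % q

# zeta^0 = 1, so the k = 0 entry is just 2^31 mod q.
assert twiddle(0) == 14273169
# The file's comment for k = 128 states zeta^128 = 18186380, giving:
assert (18186380 << 31) % q == 9023783
```

The second word of each pair is not modeled here: its exact form depends on the sign convention of `modulus_inv` used by the `vqrdmulh`/`vqrdmlah` rounding trick, which the deleted file does not spell out.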
-ntt_384_u32_45387457_923104_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 45387457 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vmul.u32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vmul.u32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vmul.u32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vmul.u32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vmul.u32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vmul.u32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vmul.u32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vmul.u32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vmul.u32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vmul.u32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vmul.u32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[4]: Already loaded as Q5
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[4] from Q5
-vmul.u32 Q2, Q0, r8
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[260] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[132] from Q4
-vstrw.u32 Q3, [r14,#(-480)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vmul.u32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vmul.u32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vmul.u32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vmul.u32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vmul.u32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vmul.u32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vmul.u32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vmul.u32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r5
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[204]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -48)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vmul.u32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vqrdmlah.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[160]: Load as
Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmul.u32 Q1, Q1, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[176]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vmul.u32 Q2, Q2, r8 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 
Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[368]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q0, Q0, r8 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -28)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(464)] -// Release input[368] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r11 
-vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q2, Q2, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vqrdmulh.s32 
Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(496)] -// Release input[376] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, 
Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q0, Q0, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q1, Q1, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 
-vqrdmlah.s32 Q2, Q3, r11 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q0, Q0, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, 
#(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q2, Q2, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, 
r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 
-vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 
-vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, 
Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, 
Q1, Q0 -// input[320]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q4, Q4, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// 
input[208]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vmul.u32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded 
as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// 
input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 
-vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 
36)] -vmul.u32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[372]: 
Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[332]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, 
[r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 
Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 3844538047 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good.s b/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good.s deleted file mode 100644 index 81acbd2..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_384_u32_88299073_4883425_incomplete_good_twiddles -ntt_384_u32_88299073_4883425_incomplete_good_twiddles: // For base multiplication -.word 75231281 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 3951395343 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 15452769 // zeta^ 64 * 2^31 = 4883425^ 64 * 2^31 = 85764717 * 2^31 -.word 2033538015 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 64 * 2066201025 * 2^31 -.word 19987225 // zeta^ 32 * 2^31 = 4883425^ 32 * 2^31 = 19144749 * 2^31 -.word 1892589863 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 32 * 2066201025 * 2^31 -.word 50503029 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 2741681611 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 81982457 // zeta^ 16 * 2^31 = 4883425^ 16 * 2^31 = 76960665 * 2^31 -.word 2158501959 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 16 * 2066201025 * 2^31 -.word 20023469 // zeta^ 80 * 2^31 = 4883425^ 80 * 2^31 = 41822566 * 2^31 -.word 1552412819 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 80 * 2066201025 * 2^31 -.word 55876839 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 1939982041 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 43619891 // zeta^112 * 2^31 = 4883425^112 * 2^31 = 44400103 * 2^31 -.word 2850416781 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4883425^112 * 2066201025 * 2^31 -.word 172662323 // zeta^ 8 * 2^31 = 4883425^ 8 * 2^31 = 26094785 * 2^31 -.word 3064389773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 8 * 2066201025 * 2^31 -.word 71853543 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 4036378073 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 23697415 // zeta^ 40 * 2^31 = 4883425^ 40 * 2^31 = 55309930 * 2^31 -.word 443962297 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 40 * 2066201025 * 2^31 -.word 76499159 // zeta^104 
* 2^31 = 4883425^104 * 2^31 = 78628712 * 2^31 -.word 2660611817 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4883425^104 * 2066201025 * 2^31 -.word 56990949 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 337656411 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 120013125 // zeta^ 88 * 2^31 = 4883425^ 88 * 2^31 = 20668553 * 2^31 -.word 616164859 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 88 * 2066201025 * 2^31 -.word 28856125 // zeta^ 56 * 2^31 = 4883425^ 56 * 2^31 = 41675533 * 2^31 -.word 561917443 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 56 * 2066201025 * 2^31 -.word 159401217 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 642203967 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 12190033 // zeta^ 4 * 2^31 = 4883425^ 4 * 2^31 = 4883425 * 2^31 -.word 3933894895 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 4 * 2066201025 * 2^31 -.word 108088419 // zeta^ 68 * 2^31 = 4883425^ 68 * 2^31 = 13818672 * 2^31 -.word 273473117 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 68 * 2066201025 * 2^31 -.word 142353279 // zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 2003400257 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 143392463 // zeta^100 * 2^31 = 4883425^100 * 2^31 = 35160276 * 2^31 -.word 482889457 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4883425^100 * 2066201025 * 2^31 -.word 119167385 // zeta^ 20 * 2^31 = 4883425^ 20 * 2^31 = 52712221 * 2^31 -.word 1897128615 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 20 * 2066201025 * 2^31 -.word 9268541 // zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 1847889923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 80397967 // zeta^ 52 * 2^31 = 4883425^ 52 * 2^31 = 81877099 * 2^31 -.word 3839489841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 52 * 2066201025 * 2^31 -.word 16520015 // zeta^116 * 2^31 = 
4883425^116 * 2^31 = 18306165 * 2^31 -.word 838359665 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4883425^116 * 2066201025 * 2^31 -.word 115982427 // zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 3605477477 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 55226367 // zeta^ 76 * 2^31 = 4883425^ 76 * 2^31 = 50302558 * 2^31 -.word 917047745 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 76 * 2066201025 * 2^31 -.word 136968867 // zeta^ 44 * 2^31 = 4883425^ 44 * 2^31 = 63650411 * 2^31 -.word 40189981 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 44 * 2066201025 * 2^31 -.word 68313423 // zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 3720973425 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 117342749 // zeta^ 28 * 2^31 = 4883425^ 28 * 2^31 = 32879858 * 2^31 -.word 212726563 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 28 * 2066201025 * 2^31 -.word 64009947 // zeta^ 92 * 2^31 = 4883425^ 92 * 2^31 = 70872893 * 2^31 -.word 925164005 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 92 * 2066201025 * 2^31 -.word 55029279 // zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1315460001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.word 99141453 // zeta^124 * 2^31 = 4883425^124 * 2^31 = 15592642 * 2^31 -.word 4156561907 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4883425^124 * 2066201025 * 2^31 -.word 28520561 // zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2377109967 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 101366865 // zeta^192 * 2^31 = 4883425^192 * 2^31 = 88299072 * 2^31 -.word 343571951 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4883425^192 * 2066201025 * 2^31 -.word 118814877 // zeta^160 * 2^31 = 4883425^160 * 2^31 = 5579523 * 2^31 -.word 849091747 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4883425^160 * 2066201025 * 2^31 -.word 156610921 // zeta^224 * 2^31 = 
4883425^224 * 2^31 = 69154324 * 2^31 -.word 2402377431 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4883425^224 * 2066201025 * 2^31 -.word 26340085 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 3688878155 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 94615689 // zeta^208 * 2^31 = 4883425^208 * 2^31 = 11338408 * 2^31 -.word 2136465335 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4883425^208 * 2066201025 * 2^31 -.word 76042125 // zeta^176 * 2^31 = 4883425^176 * 2^31 = 22220342 * 2^31 -.word 910434739 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4883425^176 * 2066201025 * 2^31 -.word 120721307 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 2354985253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 164088439 // zeta^136 * 2^31 = 4883425^136 * 2^31 = 32274711 * 2^31 -.word 971988297 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4883425^136 * 2066201025 * 2^31 -.word 3935823 // zeta^200 * 2^31 = 4883425^200 * 2^31 = 62204288 * 2^31 -.word 1230577521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4883425^200 * 2066201025 * 2^31 -.word 141100817 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 2216649519 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 152900731 // zeta^232 * 2^31 = 4883425^232 * 2^31 = 32989143 * 2^31 -.word 3851004997 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4883425^232 * 2066201025 * 2^31 -.word 151321249 // zeta^152 * 2^31 = 4883425^152 * 2^31 = 11170776 * 2^31 -.word 278508447 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4883425^152 * 2066201025 * 2^31 -.word 119607197 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 3957310883 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 42246019 // zeta^184 * 2^31 = 4883425^184 * 2^31 = 23363129 * 2^31 -.word 80286525 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4883425^184 * 2066201025 * 2^31 -.word 147742021 // zeta^248 * 2^31 = 
4883425^248 * 2^31 = 46623540 * 2^31 -.word 3733049851 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4883425^248 * 2066201025 * 2^31 -.word 7599313 // zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 634545519 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 164408113 // zeta^196 * 2^31 = 4883425^196 * 2^31 = 83415648 * 2^31 -.word 361072399 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4883425^196 * 2066201025 * 2^31 -.word 89338257 // zeta^164 * 2^31 = 4883425^164 * 2^31 = 30758081 * 2^31 -.word 2774456495 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4883425^164 * 2066201025 * 2^31 -.word 34244867 // zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2291567037 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 154998375 // zeta^148 * 2^31 = 4883425^148 * 2^31 = 54723088 * 2^31 -.word 4245728601 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4883425^148 * 2066201025 * 2^31 -.word 57430761 // zeta^212 * 2^31 = 4883425^212 * 2^31 = 35586852 * 2^31 -.word 2397838679 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4883425^212 * 2066201025 * 2^31 -.word 24421121 // zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 1293837119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 96200179 // zeta^244 * 2^31 = 4883425^244 * 2^31 = 6421974 * 2^31 -.word 455477453 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4883425^244 * 2066201025 * 2^31 -.word 27543013 // zeta^140 * 2^31 = 4883425^140 * 2^31 = 22531438 * 2^31 -.word 1606537563 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4883425^140 * 2066201025 * 2^31 -.word 60615719 // zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 689489817 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 19643629 // zeta^172 * 2^31 = 4883425^172 * 2^31 = 5400389 * 2^31 -.word 3680783443 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4883425^172 * 2066201025 * 2^31 -.word 39629279 // zeta^236 * 2^31 = 4883425^236 * 
2^31 = 24648662 * 2^31 -.word 4254777313 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4883425^236 * 2066201025 * 2^31 -.word 34966271 // zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 712437441 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 59255397 // zeta^220 * 2^31 = 4883425^220 * 2^31 = 55419215 * 2^31 -.word 4082240731 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4883425^220 * 2066201025 * 2^31 -.word 132411247 // zeta^188 * 2^31 = 4883425^188 * 2^31 = 61321868 * 2^31 -.word 2841101905 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4883425^188 * 2066201025 * 2^31 -.word 121568867 // zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 2979507293 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 161145377 // zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 2261429279 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 148077585 // zeta^320 * 2^31 = 4883425^320 * 2^31 = 2534357 * 2^31 -.word 1917857327 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4883425^320 * 2066201025 * 2^31 -.word 126095117 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1553285683 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 57783269 // zeta^352 * 2^31 = 4883425^352 * 2^31 = 82719550 * 2^31 -.word 3445875547 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4883425^352 * 2066201025 * 2^31 -.word 156574677 // zeta^272 * 2^31 = 4883425^272 * 2^31 = 46476507 * 2^31 -.word 2742554475 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4883425^272 * 2066201025 * 2^31 -.word 150258061 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 606089139 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 132978255 // zeta^304 * 2^31 = 4883425^304 * 2^31 = 43898970 * 2^31 -.word 1444550513 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4883425^304 * 2066201025 * 2^31 -.word 100556021 // zeta^368 * 2^31 = 4883425^368 * 
2^31 = 66078731 * 2^31 -.word 3384532555 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4883425^368 * 2066201025 * 2^31 -.word 104744603 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 258589221 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 12509707 // zeta^328 * 2^31 = 4883425^328 * 2^31 = 56024362 * 2^31 -.word 3322978997 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4883425^328 * 2066201025 * 2^31 -.word 100098987 // zeta^296 * 2^31 = 4883425^296 * 2^31 = 9670361 * 2^31 -.word 1634355477 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4883425^296 * 2066201025 * 2^31 -.word 35497329 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 2078317775 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 56585021 // zeta^280 * 2^31 = 4883425^280 * 2^31 = 67630520 * 2^31 -.word 3678802435 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4883425^280 * 2066201025 * 2^31 -.word 25276897 // zeta^344 * 2^31 = 4883425^344 * 2^31 = 77128297 * 2^31 -.word 4016458847 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4883425^344 * 2066201025 * 2^31 -.word 17196929 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 3652763327 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 134352127 // zeta^376 * 2^31 = 4883425^376 * 2^31 = 64935944 * 2^31 -.word 4214680769 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4883425^376 * 2066201025 * 2^31 -.word 68509727 // zeta^260 * 2^31 = 4883425^260 * 2^31 = 74480401 * 2^31 -.word 4021494177 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4883425^260 * 2066201025 * 2^31 -.word 168998833 // zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 3660421775 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.word 33205683 // zeta^292 * 2^31 = 4883425^292 * 2^31 = 53138797 * 2^31 -.word 3812077837 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4883425^292 * 2066201025 * 2^31 -.word 87259889 // zeta^356 * 2^31 = 4883425^356 * 2^31 
= 57540992 * 2^31 -.word 1520510799 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4883425^356 * 2066201025 * 2^31 -.word 167329605 // zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 2447077371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 21599771 // zeta^340 * 2^31 = 4883425^340 * 2^31 = 33575985 * 2^31 -.word 49238693 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4883425^340 * 2066201025 * 2^31 -.word 160078131 // zeta^308 * 2^31 = 4883425^308 * 2^31 = 69992908 * 2^31 -.word 3456607629 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4883425^308 * 2066201025 * 2^31 -.word 152177025 // zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 3001130175 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 121371779 // zeta^268 * 2^31 = 4883425^268 * 2^31 = 37996515 * 2^31 -.word 3377919549 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4883425^268 * 2066201025 * 2^31 -.word 149055133 // zeta^332 * 2^31 = 4883425^332 * 2^31 = 65767635 * 2^31 -.word 2688429731 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4883425^332 * 2066201025 * 2^31 -.word 108284723 // zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 573993869 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 156954517 // zeta^364 * 2^31 = 4883425^364 * 2^31 = 82898684 * 2^31 -.word 614183851 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4883425^364 * 2066201025 * 2^31 -.word 112588199 // zeta^284 * 2^31 = 4883425^284 * 2^31 = 17426180 * 2^31 -.word 3369803289 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4883425^284 * 2066201025 * 2^31 -.word 141631875 // zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 3582529853 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 77456693 // zeta^316 * 2^31 = 4883425^316 * 2^31 = 72706431 * 2^31 -.word 138405387 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4883425^316 * 2066201025 * 2^31 -.word 44186899 // zeta^380 * 2^31 = 4883425^380 * 2^31 = 
26977205 * 2^31 -.word 1453865389 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4883425^380 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_88299073_4883425_incomplete_good_scale -ntt_384_u32_88299073_4883425_incomplete_good_scale: // Constants for scaling by 1/N -.word 75231281 // 1/96 -.word 3951395343 // 1/96 twisted -.data -roots: -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 
2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 29929577 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 9497777 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // XX: zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // XX: zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // XX: zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 29929577 // XX: zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // XX: zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 
567126034 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 9497777 // XX: zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // XX: zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 8935247 // XX: zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 217310286 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 4402195 // XX: zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 107063885 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 69162837 // XX: zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 1682079511 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 24728139 // XX: zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 601402397 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 27771120 // XX: zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 675409425 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 19248273 // XX: zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 468128941 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 37993035 // XX: zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 924012208 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 42569847 // XX: zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1035322877 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good, %function -.global ntt_384_u32_88299073_4883425_incomplete_good 
-ntt_384_u32_88299073_4883425_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, -88299073
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vqrdmulh.s32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vqrdmulh.s32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmla.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, [r14,#(-480)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r8
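An aside outside the patch proper: every butterfly above multiplies by a fixed twiddle via the `vmul.u32`/`vqrdmulh.s32`/`vmla.s32` triple. The `vqrdmulh.s32` step is the only non-obvious one; per 32-bit lane it returns the saturated, rounded, doubled high half of the signed product. A minimal Python model of that per-lane behavior (the function name is mine, not from the patch):

```python
# Sketch, not part of the patch: per-lane model of the MVE VQRDMULH.S32
# instruction used throughout this kernel.
def vqrdmulh_s32(a: int, b: int) -> int:
    """Saturating rounding doubling multiply, returning the high half."""
    t = 2 * a * b + (1 << 31)   # double the product and add the rounding bias
    r = t >> 32                 # keep the high 32 bits (arithmetic shift)
    # saturate to int32; only a = b = -2^31 can actually overflow
    return max(-(1 << 31), min((1 << 31) - 1, r))
```

For example, `vqrdmulh_s32(1 << 30, 1 << 30)` yields `1 << 29`, and the single saturating case `a = b = -2^31` clamps to `2^31 - 1`. Combined with the `vmul` low product and the `vmla` by `r11 = -modulus`, this triple realizes a Montgomery-style multiplication by the precomputed twiddle pair without any explicit division.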
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[332] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[176]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -76)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release input[44] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(320)] -// input[176]: Already loaded as Q6 -// input[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release input[368] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[80] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release input[176] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(320)] -// input[308]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[248]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[308] from Q7 -vstrw.u32 Q3, [r0,#(80)] -// Release 
input[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already loaded as Q6 -// input[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release input[248] from Q5 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[344] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(368)] -// input[188]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] from Q5 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release input[92] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[264]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release input[188] from Q7 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(368)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r8 -// input[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r11 -vmul.u32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 
-vqrdmulh.s32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r11 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(96)] -// Release input[24] from Q5 -vmla.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(48)] -// Release input[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[280]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(48)] -// Release input[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r8 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, 
[r14, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 
Q5, Q2, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r8 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[60]: Load as 
Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 
-vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] 
-// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from 
Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] 
-vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, 
r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, 
Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q3, r11 
-vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 
-vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 
-vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: 
Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[200]: Load as Q1 
-vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] 
-vqrdmulh.s32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r8 -// 
input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] 
-vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 2228766271 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s b/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s deleted file mode 100644 index e0ae9fe..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// 
SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 24724272 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 66119312 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 35138099 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 66119312 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 65038662 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 1581777230 
// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 78801296 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 1916492312 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 35138099 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 64980291 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 1580357614 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 58369496 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 1419579322 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 24724272 // XX: zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 66119312 // XX: zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 1608059253 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 35138099 // XX: zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 854578542 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 65038662 // XX: zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 1581777230 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 78801296 // XX: zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 1916492312 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 64980291 // XX: zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 1580357614 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 58369496 // XX: zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 
1419579322 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 45729226 // XX: zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 1112160771 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 50306038 // XX: zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 1223471440 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 69050800 // XX: zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 1679354707 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 60527953 // XX: zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 1472074223 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 63570934 // XX: zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 1546081251 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 19136236 // XX: zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 465404137 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 83896878 // XX: zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2040419763 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 79363826 // XX: zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 1930173362 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good_bitrev, %function -.global ntt_384_u32_88299073_4883425_incomplete_good_bitrev -ntt_384_u32_88299073_4883425_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, -88299073 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd 
r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vqrdmulh.s32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vmla.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, 
[r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already 
loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// 
Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[24]: Load 
as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, 
Q7
-// Release input[248] from Q7
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[4]: Already loaded as Q5
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[4] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[260] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[132] from Q4
-vstrw.u32 Q3, [r14,#(-480)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[204]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -48)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vmul.u32 Q0, Q1,
r9 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(496)] -// Release input[376] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[104]: Load 
as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 
* -24)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q0, Q0, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vmul.u32 Q2, Q3, r9 
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[188]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-vmul.u32 Q1, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vmla.s32 Q1, Q4, r11
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-ldrd r9, r8, [r10], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vmul.u32 Q1, Q0, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vmla.s32 Q1, Q0, r11
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vadd.s32 Q2, Q2, Q1
-// input[64]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[320]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[96]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[352]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[224]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[336]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[208]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[80]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[240]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[112]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[368]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[72]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-304)]
-// Release input[176] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[328]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[200]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[360]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[232]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-336)]
-// Release input[168] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[104]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[216]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(176)]
-// Release input[296] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[88]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[280]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 28)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[344]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(112)]
-// Release input[280] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[120]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[376]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(496)]
-// Release input[376] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[248]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[324]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[196]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -56)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(224)]
-// Release input[56] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[196]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-224)]
-// Release input[196] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[68]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[228]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[100]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[356]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[164]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -88)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(160)]
-// Release input[292] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[84]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-352)]
-// Release input[164] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[340]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[148]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -104)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(96)]
-// Release input[276] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[212]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[372]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 120)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-416)]
-// Release input[148] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[372]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(480)]
-// Release input[372] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[244]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[116]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 116)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[116]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[308]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 56)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(208)]
-// Release input[52] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[204]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[12]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 12)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 76)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(224)]
-// Release input[308] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[76]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[332]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[108]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[300]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 48)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[364]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(192)]
-// Release input[300] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[236]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[44]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 44)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[348]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(176)]
-// Release input[44] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[220]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[92]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[252]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q3, Q3, r8
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[124]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q3, Q3, r8
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-.equ modulus_inv, 2228766271
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3150
-// Instruction count: 2196
\ No newline at end of file
diff --git a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop.s b/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop.s
deleted file mode 100644
index 43841df..0000000
--- a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop.s
+++ /dev/null
@@ -1,3388 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of
the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_twiddles -ntt_384_u32_88299073_4883425_incomplete_good_oop_twiddles: // For base multiplication -.word 75231281 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 3951395343 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 15452769 // zeta^ 64 * 2^31 = 4883425^ 64 * 2^31 = 85764717 * 2^31 -.word 2033538015 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 64 * 2066201025 * 2^31 -.word 19987225 // zeta^ 32 * 2^31 = 4883425^ 32 * 2^31 = 19144749 * 2^31 -.word 1892589863 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 32 * 2066201025 * 2^31 -.word 50503029 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 2741681611 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 81982457 // zeta^ 16 * 2^31 = 4883425^ 16 * 2^31 = 76960665 * 2^31 -.word 2158501959 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 16 * 2066201025 * 2^31 -.word 20023469 // zeta^ 80 * 2^31 = 4883425^ 80 * 2^31 = 41822566 * 2^31 -.word 1552412819 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 80 * 
2066201025 * 2^31 -.word 55876839 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 1939982041 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 43619891 // zeta^112 * 2^31 = 4883425^112 * 2^31 = 44400103 * 2^31 -.word 2850416781 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4883425^112 * 2066201025 * 2^31 -.word 172662323 // zeta^ 8 * 2^31 = 4883425^ 8 * 2^31 = 26094785 * 2^31 -.word 3064389773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 8 * 2066201025 * 2^31 -.word 71853543 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 4036378073 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 23697415 // zeta^ 40 * 2^31 = 4883425^ 40 * 2^31 = 55309930 * 2^31 -.word 443962297 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 40 * 2066201025 * 2^31 -.word 76499159 // zeta^104 * 2^31 = 4883425^104 * 2^31 = 78628712 * 2^31 -.word 2660611817 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4883425^104 * 2066201025 * 2^31 -.word 56990949 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 337656411 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 120013125 // zeta^ 88 * 2^31 = 4883425^ 88 * 2^31 = 20668553 * 2^31 -.word 616164859 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 88 * 2066201025 * 2^31 -.word 28856125 // zeta^ 56 * 2^31 = 4883425^ 56 * 2^31 = 41675533 * 2^31 -.word 561917443 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 56 * 2066201025 * 2^31 -.word 159401217 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 642203967 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 12190033 // zeta^ 4 * 2^31 = 4883425^ 4 * 2^31 = 4883425 * 2^31 -.word 3933894895 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 4 * 2066201025 * 2^31 -.word 108088419 // zeta^ 68 * 2^31 = 4883425^ 68 * 2^31 = 13818672 * 2^31 -.word 273473117 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 68 * 2066201025 * 2^31 
-.word 142353279 // zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 2003400257 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 143392463 // zeta^100 * 2^31 = 4883425^100 * 2^31 = 35160276 * 2^31 -.word 482889457 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4883425^100 * 2066201025 * 2^31 -.word 119167385 // zeta^ 20 * 2^31 = 4883425^ 20 * 2^31 = 52712221 * 2^31 -.word 1897128615 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 20 * 2066201025 * 2^31 -.word 9268541 // zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 1847889923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 80397967 // zeta^ 52 * 2^31 = 4883425^ 52 * 2^31 = 81877099 * 2^31 -.word 3839489841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 52 * 2066201025 * 2^31 -.word 16520015 // zeta^116 * 2^31 = 4883425^116 * 2^31 = 18306165 * 2^31 -.word 838359665 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4883425^116 * 2066201025 * 2^31 -.word 115982427 // zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 3605477477 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 55226367 // zeta^ 76 * 2^31 = 4883425^ 76 * 2^31 = 50302558 * 2^31 -.word 917047745 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 76 * 2066201025 * 2^31 -.word 136968867 // zeta^ 44 * 2^31 = 4883425^ 44 * 2^31 = 63650411 * 2^31 -.word 40189981 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 44 * 2066201025 * 2^31 -.word 68313423 // zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 3720973425 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 117342749 // zeta^ 28 * 2^31 = 4883425^ 28 * 2^31 = 32879858 * 2^31 -.word 212726563 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 28 * 2066201025 * 2^31 -.word 64009947 // zeta^ 92 * 2^31 = 4883425^ 92 * 2^31 = 70872893 * 2^31 -.word 925164005 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 92 * 2066201025 * 2^31 -.word 
55029279 // zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1315460001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.word 99141453 // zeta^124 * 2^31 = 4883425^124 * 2^31 = 15592642 * 2^31 -.word 4156561907 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4883425^124 * 2066201025 * 2^31 -.word 28520561 // zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2377109967 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 101366865 // zeta^192 * 2^31 = 4883425^192 * 2^31 = 88299072 * 2^31 -.word 343571951 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4883425^192 * 2066201025 * 2^31 -.word 118814877 // zeta^160 * 2^31 = 4883425^160 * 2^31 = 5579523 * 2^31 -.word 849091747 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4883425^160 * 2066201025 * 2^31 -.word 156610921 // zeta^224 * 2^31 = 4883425^224 * 2^31 = 69154324 * 2^31 -.word 2402377431 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4883425^224 * 2066201025 * 2^31 -.word 26340085 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 3688878155 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 94615689 // zeta^208 * 2^31 = 4883425^208 * 2^31 = 11338408 * 2^31 -.word 2136465335 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4883425^208 * 2066201025 * 2^31 -.word 76042125 // zeta^176 * 2^31 = 4883425^176 * 2^31 = 22220342 * 2^31 -.word 910434739 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4883425^176 * 2066201025 * 2^31 -.word 120721307 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 2354985253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 164088439 // zeta^136 * 2^31 = 4883425^136 * 2^31 = 32274711 * 2^31 -.word 971988297 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4883425^136 * 2066201025 * 2^31 -.word 3935823 // zeta^200 * 2^31 = 4883425^200 * 2^31 = 62204288 * 2^31 -.word 1230577521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4883425^200 * 2066201025 * 2^31 -.word 
141100817 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 2216649519 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 152900731 // zeta^232 * 2^31 = 4883425^232 * 2^31 = 32989143 * 2^31 -.word 3851004997 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4883425^232 * 2066201025 * 2^31 -.word 151321249 // zeta^152 * 2^31 = 4883425^152 * 2^31 = 11170776 * 2^31 -.word 278508447 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4883425^152 * 2066201025 * 2^31 -.word 119607197 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 3957310883 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 42246019 // zeta^184 * 2^31 = 4883425^184 * 2^31 = 23363129 * 2^31 -.word 80286525 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4883425^184 * 2066201025 * 2^31 -.word 147742021 // zeta^248 * 2^31 = 4883425^248 * 2^31 = 46623540 * 2^31 -.word 3733049851 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4883425^248 * 2066201025 * 2^31 -.word 7599313 // zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 634545519 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 164408113 // zeta^196 * 2^31 = 4883425^196 * 2^31 = 83415648 * 2^31 -.word 361072399 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4883425^196 * 2066201025 * 2^31 -.word 89338257 // zeta^164 * 2^31 = 4883425^164 * 2^31 = 30758081 * 2^31 -.word 2774456495 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4883425^164 * 2066201025 * 2^31 -.word 34244867 // zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2291567037 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 154998375 // zeta^148 * 2^31 = 4883425^148 * 2^31 = 54723088 * 2^31 -.word 4245728601 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4883425^148 * 2066201025 * 2^31 -.word 57430761 // zeta^212 * 2^31 = 4883425^212 * 2^31 = 35586852 * 2^31 -.word 2397838679 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4883425^212 * 2066201025 * 2^31 -.word 
24421121 // zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 1293837119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 96200179 // zeta^244 * 2^31 = 4883425^244 * 2^31 = 6421974 * 2^31 -.word 455477453 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4883425^244 * 2066201025 * 2^31 -.word 27543013 // zeta^140 * 2^31 = 4883425^140 * 2^31 = 22531438 * 2^31 -.word 1606537563 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4883425^140 * 2066201025 * 2^31 -.word 60615719 // zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 689489817 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 19643629 // zeta^172 * 2^31 = 4883425^172 * 2^31 = 5400389 * 2^31 -.word 3680783443 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4883425^172 * 2066201025 * 2^31 -.word 39629279 // zeta^236 * 2^31 = 4883425^236 * 2^31 = 24648662 * 2^31 -.word 4254777313 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4883425^236 * 2066201025 * 2^31 -.word 34966271 // zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 712437441 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 59255397 // zeta^220 * 2^31 = 4883425^220 * 2^31 = 55419215 * 2^31 -.word 4082240731 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4883425^220 * 2066201025 * 2^31 -.word 132411247 // zeta^188 * 2^31 = 4883425^188 * 2^31 = 61321868 * 2^31 -.word 2841101905 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4883425^188 * 2066201025 * 2^31 -.word 121568867 // zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 2979507293 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 161145377 // zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 2261429279 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 148077585 // zeta^320 * 2^31 = 4883425^320 * 2^31 = 2534357 * 2^31 -.word 1917857327 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4883425^320 * 2066201025 * 2^31 -.word 126095117 
// zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1553285683 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 57783269 // zeta^352 * 2^31 = 4883425^352 * 2^31 = 82719550 * 2^31 -.word 3445875547 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4883425^352 * 2066201025 * 2^31 -.word 156574677 // zeta^272 * 2^31 = 4883425^272 * 2^31 = 46476507 * 2^31 -.word 2742554475 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4883425^272 * 2066201025 * 2^31 -.word 150258061 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 606089139 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 132978255 // zeta^304 * 2^31 = 4883425^304 * 2^31 = 43898970 * 2^31 -.word 1444550513 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4883425^304 * 2066201025 * 2^31 -.word 100556021 // zeta^368 * 2^31 = 4883425^368 * 2^31 = 66078731 * 2^31 -.word 3384532555 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4883425^368 * 2066201025 * 2^31 -.word 104744603 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 258589221 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 12509707 // zeta^328 * 2^31 = 4883425^328 * 2^31 = 56024362 * 2^31 -.word 3322978997 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4883425^328 * 2066201025 * 2^31 -.word 100098987 // zeta^296 * 2^31 = 4883425^296 * 2^31 = 9670361 * 2^31 -.word 1634355477 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4883425^296 * 2066201025 * 2^31 -.word 35497329 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 2078317775 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 56585021 // zeta^280 * 2^31 = 4883425^280 * 2^31 = 67630520 * 2^31 -.word 3678802435 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4883425^280 * 2066201025 * 2^31 -.word 25276897 // zeta^344 * 2^31 = 4883425^344 * 2^31 = 77128297 * 2^31 -.word 4016458847 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4883425^344 * 2066201025 * 2^31 -.word 17196929 // 
zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 3652763327 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 134352127 // zeta^376 * 2^31 = 4883425^376 * 2^31 = 64935944 * 2^31 -.word 4214680769 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4883425^376 * 2066201025 * 2^31 -.word 68509727 // zeta^260 * 2^31 = 4883425^260 * 2^31 = 74480401 * 2^31 -.word 4021494177 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4883425^260 * 2066201025 * 2^31 -.word 168998833 // zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 3660421775 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.word 33205683 // zeta^292 * 2^31 = 4883425^292 * 2^31 = 53138797 * 2^31 -.word 3812077837 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4883425^292 * 2066201025 * 2^31 -.word 87259889 // zeta^356 * 2^31 = 4883425^356 * 2^31 = 57540992 * 2^31 -.word 1520510799 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4883425^356 * 2066201025 * 2^31 -.word 167329605 // zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 2447077371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 21599771 // zeta^340 * 2^31 = 4883425^340 * 2^31 = 33575985 * 2^31 -.word 49238693 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4883425^340 * 2066201025 * 2^31 -.word 160078131 // zeta^308 * 2^31 = 4883425^308 * 2^31 = 69992908 * 2^31 -.word 3456607629 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4883425^308 * 2066201025 * 2^31 -.word 152177025 // zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 3001130175 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 121371779 // zeta^268 * 2^31 = 4883425^268 * 2^31 = 37996515 * 2^31 -.word 3377919549 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4883425^268 * 2066201025 * 2^31 -.word 149055133 // zeta^332 * 2^31 = 4883425^332 * 2^31 = 65767635 * 2^31 -.word 2688429731 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4883425^332 * 2066201025 * 2^31 -.word 108284723 // 
zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 573993869 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 156954517 // zeta^364 * 2^31 = 4883425^364 * 2^31 = 82898684 * 2^31 -.word 614183851 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4883425^364 * 2066201025 * 2^31 -.word 112588199 // zeta^284 * 2^31 = 4883425^284 * 2^31 = 17426180 * 2^31 -.word 3369803289 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4883425^284 * 2066201025 * 2^31 -.word 141631875 // zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 3582529853 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 77456693 // zeta^316 * 2^31 = 4883425^316 * 2^31 = 72706431 * 2^31 -.word 138405387 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4883425^316 * 2066201025 * 2^31 -.word 44186899 // zeta^380 * 2^31 = 4883425^380 * 2^31 = 26977205 * 2^31 -.word 1453865389 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4883425^380 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_scale -ntt_384_u32_88299073_4883425_incomplete_good_oop_scale: // Constants for scaling by 1/N -.word 75231281 // 1/96 -.word 3951395343 // 1/96 twisted -.data -roots: -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 
4883425^288 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 29929577 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 9497777 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 
4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // XX: zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // XX: zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // XX: zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 29929577 // XX: zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // XX: zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 9497777 // XX: zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // XX: zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 8935247 // XX: zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 217310286 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 4402195 // XX: zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 107063885 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 69162837 // XX: zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 1682079511 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 24728139 // XX: zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 601402397 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 27771120 // 
XX: zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 675409425 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 19248273 // XX: zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 468128941 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 37993035 // XX: zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 924012208 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 42569847 // XX: zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1035322877 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good_oop, %function -.global ntt_384_u32_88299073_4883425_incomplete_good_oop -ntt_384_u32_88299073_4883425_incomplete_good_oop: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -88299073 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r7 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r6 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r9 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 
-vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r6 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmla.s32 Q2, Q3, r9 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r11,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r1,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r11,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from 
Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r11,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 36)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 36)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r11,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r1,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 44)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r11,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 48)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r11,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r1,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 52)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 56)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r11,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 56)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 60)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 60)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r11,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r1,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 64)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 68)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r11,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 68)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 72)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 72)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r11,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r1,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r11,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 80)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 84)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 84)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r11,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r1,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 92)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r11,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 96)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 96)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r11,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r1,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r11,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 108)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r11,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r1,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r1,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r11,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 116)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 120)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 120)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r1,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r11,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r1,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r7
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r6
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vmla.s32 Q1, Q2, r9
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r1,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r10,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r11,#(0)]
-//////////// END OF RADIX 3 //////////////////////////
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-// output[96]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r3
-// output[192]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release output[96] from Q1
-// output[0]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// output[228]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release output[192] from Q4
-vqrdmulh.s32 Q2, Q2, r2
-vsub.s32 Q4, Q1, Q0
-// output[36]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 36)]
-vmla.s32 Q3, Q2, r9
-vstrw.u32 Q4, [r11,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release output[288] from Q0
-vstrw.u32 Q1, [r1,#(0)]
-// Release output[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r1,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r11,#(-240)]
-// output[36]: Already loaded as Q7
-// output[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r3
-// output[324]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release output[228] from Q6
-// output[132]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// output[360]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release output[324] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[168]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -84)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release output[36] from Q7
-vstrw.u32 Q3, [r11,#(-480)]
-// Release output[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(288)]
-// output[168]: Already loaded as Q6
-// output[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[72]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release output[360] from Q5
-// output[264]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[108]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release output[72] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[300]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release output[168] from Q6
-vstrw.u32 Q3, [r11,#(48)]
-// Release output[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(288)]
-// output[300]: Already loaded as Q7
-// output[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[204]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release output[108] from Q5
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[240]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release output[204] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[48]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release output[300] from Q7
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-192)]
-// output[48]: Already loaded as Q6
-// output[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[336]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release output[240] from Q5
-// output[144]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// output[372]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[336] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[180]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -72)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release output[48] from Q6
-vstrw.u32 Q3, [r11,#(-432)]
-// Release output[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(336)]
-// output[180]: Already loaded as Q7
-// output[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[84]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release output[372] from Q5
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// output[120]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[84] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[312]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release output[180] from Q7
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(336)]
-// output[312]: Already loaded as Q6
-// output[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[216]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release output[120] from Q5
-// output[24]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// output[252]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release output[216] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[60]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release output[312] from Q6
-vstrw.u32 Q3, [r1,#(96)]
-// Release output[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-144)]
-// output[60]: Already loaded as Q7
-// output[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release output[252] from Q5
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// output[352]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[348] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[160]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -92)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release output[60] from Q7
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(384)]
-// output[160]: Already loaded as Q6
-// output[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release output[352] from Q5
-// output[256]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[100]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[64] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[292]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release output[160] from Q6
-vstrw.u32 Q3, [r11,#(16)]
-// Release output[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(256)]
-// output[292]: Already loaded as Q7
-// output[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[196]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release output[100] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[232]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release output[196] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[40]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release output[292] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// output[40]: Already loaded as Q6
-// output[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release output[232] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[364]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[328] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[172]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -80)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release output[40] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(304)]
-// output[172]: Already loaded as Q7
-// output[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release output[364] from Q5
-// output[268]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[76] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[304]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release output[172] from Q7
-vstrw.u32 Q3, [r11,#(64)]
-// Release output[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(304)]
-// output[304]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[208]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[244]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release output[208] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[52]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release output[304] from Q6
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-176)]
-// output[52]: Already loaded as Q7
-// output[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[340]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[244] from Q5
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// output[376]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[340] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[184]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -68)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release output[52] from Q7
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(352)]
-// output[184]: Already loaded as Q6
-// output[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[88]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release output[376] from Q5
-// output[280]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[88] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[316]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release output[184] from Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release output[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(352)]
-// output[316]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[220]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[28]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[224]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release output[220] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[32]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 32)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[316] from Q7
-vstrw.u32 Q3, [r1,#(112)]
-// Release output[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-128)]
-// output[32]: Already loaded as Q6
-// output[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release output[224] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[356]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[320] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[164]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -88)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release output[32] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(272)]
-// output[164]: Already loaded as Q7
-// output[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release output[356] from Q5
-// output[260]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[104]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[68] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[296]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release output[164] from Q7
-vstrw.u32 Q3, [r11,#(32)]
-// Release output[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(272)]
-// output[296]: Already loaded as Q6
-// output[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[200]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release output[104] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[236]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release output[200] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[44]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release output[296] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-208)]
-// output[44]: Already loaded as Q7
-// output[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[332]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[236] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[368]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[332] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[176]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -76)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release output[44] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(320)]
-// output[176]: Already loaded as Q6
-// output[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release output[368] from Q5
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[80] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[308]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release output[176] from Q6
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(320)]
-// output[308]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[212]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[248]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release output[212] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[56]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release output[308] from Q7
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-160)]
-// output[56]: Already loaded as Q6
-// output[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[248] from Q5
-// output[152]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// output[380]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[344] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[188]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release output[56] from Q6
-vstrw.u32 Q3, [r11,#(-400)]
-// Release output[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(368)]
-// output[188]: Already loaded as Q7
-// output[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[92]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release output[380] from Q5
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// output[24]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release output[92] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[264]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[188] from Q7
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r10,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(368)]
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r7
-// output[144]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r6
-// output[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r9
-vmul.u32 Q2, Q1, r7
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r6
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r9
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r3
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r9
-// output[156]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -96)]
-vmul.u32 Q4, Q6, r5
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r4
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(96)]
-// Release output[24] from Q5
-vmla.s32 Q4, Q6, r9
-vstrw.u32 Q1, [r11,#(-432)]
-// Release output[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(48)]
-// Release output[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[12]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 12)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[132]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[280]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-384)]
-// Release output[156] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(48)]
-// Release output[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[136]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-480)]
-// Release output[132] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[256]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[28]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(112)]
-// Release output[280] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release output[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(16)]
-// Release output[256] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[4]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 4)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[152]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(112)]
-// Release output[28] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[8]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 8)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(16)]
-// Release output[4] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[128]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[284]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-400)]
-// Release output[152] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(32)]
-// Release output[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[140]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-496)]
-// Release output[128] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[260]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(128)]
-// Release output[284] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release output[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 
36)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[184]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(240)] -// Release output[60] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release output[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release output[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[304]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r6 -// output[40]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 40)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(144)] -// Release output[36] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[316]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release output[184] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release output[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(160)] -// Release output[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-368)] -// Release output[160] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[292]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 
-vmla.s32 Q5, Q1, r9 -// output[56]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 56)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// Release output[316] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[176]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r6 -// output[296]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release output[292] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[32]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 32)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(224)] -// Release output[56] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release output[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(176)] -// Release output[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[308]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r6 -// output[44]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 44)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(128)] -// Release output[32] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vmul.u32 Q6, Q4, 
r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release output[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(176)] -// Release output[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[336]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r6 -// output[72]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 72)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-352)] -// Release output[164] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[192]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release output[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(288)] -// Release output[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[84]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r6 -// output[204]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-240)] -// Release output[192] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[324]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 
-vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release 
output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from 
Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 
-vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// 
output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] 
-vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load 
as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, 
#(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] 
-vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// 
Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] 
from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 
-vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release 
output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] 
from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 
Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 2228766271 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3355 -// Instruction count: 2397 \ No newline at end of file diff --git a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s b/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s deleted file mode 100644 index d40cc4c..0000000 --- a/tests/ntt_384/auto/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s +++ /dev/null @@ -1,3075 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission 
notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_twiddles -ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_twiddles: // For base multiplication -.word 75231281 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 3951395343 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 15452769 // zeta^ 64 * 2^31 = 4883425^ 64 * 2^31 = 85764717 * 2^31 -.word 2033538015 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 64 * 2066201025 * 2^31 -.word 19987225 // zeta^ 32 * 2^31 = 4883425^ 32 * 2^31 = 19144749 * 2^31 -.word 1892589863 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 32 * 2066201025 * 2^31 -.word 50503029 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 2741681611 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 81982457 // zeta^ 16 * 2^31 = 4883425^ 16 * 2^31 = 76960665 * 2^31 -.word 2158501959 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 16 * 2066201025 * 2^31 -.word 20023469 // zeta^ 80 * 2^31 = 4883425^ 80 * 2^31 = 41822566 * 2^31 -.word 1552412819 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 80 * 2066201025 * 2^31 -.word 55876839 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 1939982041 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 
48 * 2066201025 * 2^31 -.word 43619891 // zeta^112 * 2^31 = 4883425^112 * 2^31 = 44400103 * 2^31 -.word 2850416781 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4883425^112 * 2066201025 * 2^31 -.word 172662323 // zeta^ 8 * 2^31 = 4883425^ 8 * 2^31 = 26094785 * 2^31 -.word 3064389773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 8 * 2066201025 * 2^31 -.word 71853543 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 4036378073 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 23697415 // zeta^ 40 * 2^31 = 4883425^ 40 * 2^31 = 55309930 * 2^31 -.word 443962297 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 40 * 2066201025 * 2^31 -.word 76499159 // zeta^104 * 2^31 = 4883425^104 * 2^31 = 78628712 * 2^31 -.word 2660611817 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4883425^104 * 2066201025 * 2^31 -.word 56990949 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 337656411 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 120013125 // zeta^ 88 * 2^31 = 4883425^ 88 * 2^31 = 20668553 * 2^31 -.word 616164859 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 88 * 2066201025 * 2^31 -.word 28856125 // zeta^ 56 * 2^31 = 4883425^ 56 * 2^31 = 41675533 * 2^31 -.word 561917443 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 56 * 2066201025 * 2^31 -.word 159401217 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 642203967 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 12190033 // zeta^ 4 * 2^31 = 4883425^ 4 * 2^31 = 4883425 * 2^31 -.word 3933894895 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 4 * 2066201025 * 2^31 -.word 108088419 // zeta^ 68 * 2^31 = 4883425^ 68 * 2^31 = 13818672 * 2^31 -.word 273473117 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 68 * 2066201025 * 2^31 -.word 142353279 // zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 2003400257 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 
2^31 -.word 143392463 // zeta^100 * 2^31 = 4883425^100 * 2^31 = 35160276 * 2^31 -.word 482889457 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4883425^100 * 2066201025 * 2^31 -.word 119167385 // zeta^ 20 * 2^31 = 4883425^ 20 * 2^31 = 52712221 * 2^31 -.word 1897128615 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 20 * 2066201025 * 2^31 -.word 9268541 // zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 1847889923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 80397967 // zeta^ 52 * 2^31 = 4883425^ 52 * 2^31 = 81877099 * 2^31 -.word 3839489841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 52 * 2066201025 * 2^31 -.word 16520015 // zeta^116 * 2^31 = 4883425^116 * 2^31 = 18306165 * 2^31 -.word 838359665 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4883425^116 * 2066201025 * 2^31 -.word 115982427 // zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 3605477477 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 55226367 // zeta^ 76 * 2^31 = 4883425^ 76 * 2^31 = 50302558 * 2^31 -.word 917047745 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 76 * 2066201025 * 2^31 -.word 136968867 // zeta^ 44 * 2^31 = 4883425^ 44 * 2^31 = 63650411 * 2^31 -.word 40189981 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 44 * 2066201025 * 2^31 -.word 68313423 // zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 3720973425 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 117342749 // zeta^ 28 * 2^31 = 4883425^ 28 * 2^31 = 32879858 * 2^31 -.word 212726563 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 28 * 2066201025 * 2^31 -.word 64009947 // zeta^ 92 * 2^31 = 4883425^ 92 * 2^31 = 70872893 * 2^31 -.word 925164005 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 92 * 2066201025 * 2^31 -.word 55029279 // zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1315460001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 
-.word 99141453 // zeta^124 * 2^31 = 4883425^124 * 2^31 = 15592642 * 2^31 -.word 4156561907 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4883425^124 * 2066201025 * 2^31 -.word 28520561 // zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2377109967 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 101366865 // zeta^192 * 2^31 = 4883425^192 * 2^31 = 88299072 * 2^31 -.word 343571951 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4883425^192 * 2066201025 * 2^31 -.word 118814877 // zeta^160 * 2^31 = 4883425^160 * 2^31 = 5579523 * 2^31 -.word 849091747 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4883425^160 * 2066201025 * 2^31 -.word 156610921 // zeta^224 * 2^31 = 4883425^224 * 2^31 = 69154324 * 2^31 -.word 2402377431 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4883425^224 * 2066201025 * 2^31 -.word 26340085 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 3688878155 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 94615689 // zeta^208 * 2^31 = 4883425^208 * 2^31 = 11338408 * 2^31 -.word 2136465335 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4883425^208 * 2066201025 * 2^31 -.word 76042125 // zeta^176 * 2^31 = 4883425^176 * 2^31 = 22220342 * 2^31 -.word 910434739 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4883425^176 * 2066201025 * 2^31 -.word 120721307 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 2354985253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 164088439 // zeta^136 * 2^31 = 4883425^136 * 2^31 = 32274711 * 2^31 -.word 971988297 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4883425^136 * 2066201025 * 2^31 -.word 3935823 // zeta^200 * 2^31 = 4883425^200 * 2^31 = 62204288 * 2^31 -.word 1230577521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4883425^200 * 2066201025 * 2^31 -.word 141100817 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 2216649519 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 
152900731 // zeta^232 * 2^31 = 4883425^232 * 2^31 = 32989143 * 2^31 -.word 3851004997 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4883425^232 * 2066201025 * 2^31 -.word 151321249 // zeta^152 * 2^31 = 4883425^152 * 2^31 = 11170776 * 2^31 -.word 278508447 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4883425^152 * 2066201025 * 2^31 -.word 119607197 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 3957310883 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 42246019 // zeta^184 * 2^31 = 4883425^184 * 2^31 = 23363129 * 2^31 -.word 80286525 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4883425^184 * 2066201025 * 2^31 -.word 147742021 // zeta^248 * 2^31 = 4883425^248 * 2^31 = 46623540 * 2^31 -.word 3733049851 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4883425^248 * 2066201025 * 2^31 -.word 7599313 // zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 634545519 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 164408113 // zeta^196 * 2^31 = 4883425^196 * 2^31 = 83415648 * 2^31 -.word 361072399 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4883425^196 * 2066201025 * 2^31 -.word 89338257 // zeta^164 * 2^31 = 4883425^164 * 2^31 = 30758081 * 2^31 -.word 2774456495 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4883425^164 * 2066201025 * 2^31 -.word 34244867 // zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2291567037 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 154998375 // zeta^148 * 2^31 = 4883425^148 * 2^31 = 54723088 * 2^31 -.word 4245728601 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4883425^148 * 2066201025 * 2^31 -.word 57430761 // zeta^212 * 2^31 = 4883425^212 * 2^31 = 35586852 * 2^31 -.word 2397838679 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4883425^212 * 2066201025 * 2^31 -.word 24421121 // zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 1293837119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 96200179 
// zeta^244 * 2^31 = 4883425^244 * 2^31 = 6421974 * 2^31 -.word 455477453 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4883425^244 * 2066201025 * 2^31 -.word 27543013 // zeta^140 * 2^31 = 4883425^140 * 2^31 = 22531438 * 2^31 -.word 1606537563 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4883425^140 * 2066201025 * 2^31 -.word 60615719 // zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 689489817 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 19643629 // zeta^172 * 2^31 = 4883425^172 * 2^31 = 5400389 * 2^31 -.word 3680783443 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4883425^172 * 2066201025 * 2^31 -.word 39629279 // zeta^236 * 2^31 = 4883425^236 * 2^31 = 24648662 * 2^31 -.word 4254777313 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4883425^236 * 2066201025 * 2^31 -.word 34966271 // zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 712437441 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 59255397 // zeta^220 * 2^31 = 4883425^220 * 2^31 = 55419215 * 2^31 -.word 4082240731 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4883425^220 * 2066201025 * 2^31 -.word 132411247 // zeta^188 * 2^31 = 4883425^188 * 2^31 = 61321868 * 2^31 -.word 2841101905 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4883425^188 * 2066201025 * 2^31 -.word 121568867 // zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 2979507293 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 161145377 // zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 2261429279 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 148077585 // zeta^320 * 2^31 = 4883425^320 * 2^31 = 2534357 * 2^31 -.word 1917857327 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4883425^320 * 2066201025 * 2^31 -.word 126095117 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1553285683 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 57783269 // 
zeta^352 * 2^31 = 4883425^352 * 2^31 = 82719550 * 2^31 -.word 3445875547 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4883425^352 * 2066201025 * 2^31 -.word 156574677 // zeta^272 * 2^31 = 4883425^272 * 2^31 = 46476507 * 2^31 -.word 2742554475 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4883425^272 * 2066201025 * 2^31 -.word 150258061 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 606089139 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 132978255 // zeta^304 * 2^31 = 4883425^304 * 2^31 = 43898970 * 2^31 -.word 1444550513 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4883425^304 * 2066201025 * 2^31 -.word 100556021 // zeta^368 * 2^31 = 4883425^368 * 2^31 = 66078731 * 2^31 -.word 3384532555 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4883425^368 * 2066201025 * 2^31 -.word 104744603 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 258589221 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 12509707 // zeta^328 * 2^31 = 4883425^328 * 2^31 = 56024362 * 2^31 -.word 3322978997 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4883425^328 * 2066201025 * 2^31 -.word 100098987 // zeta^296 * 2^31 = 4883425^296 * 2^31 = 9670361 * 2^31 -.word 1634355477 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4883425^296 * 2066201025 * 2^31 -.word 35497329 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 2078317775 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 56585021 // zeta^280 * 2^31 = 4883425^280 * 2^31 = 67630520 * 2^31 -.word 3678802435 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4883425^280 * 2066201025 * 2^31 -.word 25276897 // zeta^344 * 2^31 = 4883425^344 * 2^31 = 77128297 * 2^31 -.word 4016458847 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4883425^344 * 2066201025 * 2^31 -.word 17196929 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 3652763327 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 134352127 // 
zeta^376 * 2^31 = 4883425^376 * 2^31 = 64935944 * 2^31 -.word 4214680769 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4883425^376 * 2066201025 * 2^31 -.word 68509727 // zeta^260 * 2^31 = 4883425^260 * 2^31 = 74480401 * 2^31 -.word 4021494177 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4883425^260 * 2066201025 * 2^31 -.word 168998833 // zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 3660421775 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.word 33205683 // zeta^292 * 2^31 = 4883425^292 * 2^31 = 53138797 * 2^31 -.word 3812077837 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4883425^292 * 2066201025 * 2^31 -.word 87259889 // zeta^356 * 2^31 = 4883425^356 * 2^31 = 57540992 * 2^31 -.word 1520510799 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4883425^356 * 2066201025 * 2^31 -.word 167329605 // zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 2447077371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 21599771 // zeta^340 * 2^31 = 4883425^340 * 2^31 = 33575985 * 2^31 -.word 49238693 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4883425^340 * 2066201025 * 2^31 -.word 160078131 // zeta^308 * 2^31 = 4883425^308 * 2^31 = 69992908 * 2^31 -.word 3456607629 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4883425^308 * 2066201025 * 2^31 -.word 152177025 // zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 3001130175 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 121371779 // zeta^268 * 2^31 = 4883425^268 * 2^31 = 37996515 * 2^31 -.word 3377919549 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4883425^268 * 2066201025 * 2^31 -.word 149055133 // zeta^332 * 2^31 = 4883425^332 * 2^31 = 65767635 * 2^31 -.word 2688429731 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4883425^332 * 2066201025 * 2^31 -.word 108284723 // zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 573993869 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 156954517 // 
zeta^364 * 2^31 = 4883425^364 * 2^31 = 82898684 * 2^31 -.word 614183851 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4883425^364 * 2066201025 * 2^31 -.word 112588199 // zeta^284 * 2^31 = 4883425^284 * 2^31 = 17426180 * 2^31 -.word 3369803289 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4883425^284 * 2066201025 * 2^31 -.word 141631875 // zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 3582529853 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 77456693 // zeta^316 * 2^31 = 4883425^316 * 2^31 = 72706431 * 2^31 -.word 138405387 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4883425^316 * 2066201025 * 2^31 -.word 44186899 // zeta^380 * 2^31 = 4883425^380 * 2^31 = 26977205 * 2^31 -.word 1453865389 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4883425^380 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_scale -ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 75231281 // 1/96 -.word 3951395343 // 1/96 twisted -.data -roots: -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 
4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 29929577 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 9497777 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // XX: zeta^288 
* 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // XX: zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // XX: zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 29929577 // XX: zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // XX: zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 9497777 // XX: zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // XX: zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 8935247 // XX: zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 217310286 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 4402195 // XX: zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 107063885 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 69162837 // XX: zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 1682079511 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 24728139 // XX: zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 601402397 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 27771120 // XX: zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 675409425 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 
2^31 -.word 19248273 // XX: zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 468128941 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 37993035 // XX: zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 924012208 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 42569847 // XX: zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1035322877 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input, %function -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input -ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -88299073 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q0 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r6 -// input[4]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 4)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r9 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r11,#(-496)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(16)] -// Release input[0] from Q1 -// Release input[128] from Q0 -// input[4]: Already loaded as Q7 -// input[132]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmulh.s32 Q1, Q7, r6 -// input[8]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 
8)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -// Release input[132] from Q6 -// Release input[4] from Q7 -// input[136]: Already loaded as Q4 -// input[8]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[140]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[136] from Q4 -vstrw.u32 Q2, [r11,#(48)] -vsub.s32 Q4, Q1, Q5 -// Release input[8] from Q5 -vstrw.u32 Q4, [r11,#(-464)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(32)] -// input[140]: Already loaded as Q6 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vqrdmulh.s32 Q1, Q6, r6 -// input[16]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-448)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release input[12] from Q3 -// Release input[140] from Q6 -// input[16]: Already loaded as Q7 -// input[144]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmulh.s32 Q1, Q7, r6 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(64)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(80)] -// Release input[144] from Q5 -// Release input[16] from Q7 -// input[148]: Already loaded as Q4 -// input[20]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vmla.s32 Q1, Q0, r9 
-vneg.s32 Q0, Q4 -// Release input[148] from Q4 -vstrw.u32 Q2, [r11,#(96)] -vsub.s32 Q4, Q1, Q6 -// Release input[20] from Q6 -vstrw.u32 Q4, [r11,#(-416)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(80)] -// input[152]: Already loaded as Q5 -// input[24]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[156]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmulh.s32 Q1, Q5, r6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-400)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(112)] -// Release input[24] from Q3 -// Release input[152] from Q5 -// input[28]: Already loaded as Q7 -// input[156]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmulh.s32 Q1, Q7, r6 -// input[32]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 32)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(112)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release input[156] from Q6 -// Release input[28] from Q7 -// input[160]: Already loaded as Q4 -// input[32]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[160] from Q4 -vstrw.u32 Q2, [r11,#(144)] -vsub.s32 Q4, Q1, Q5 -// Release input[32] from Q5 -vstrw.u32 Q4, [r11,#(-368)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(128)] -// input[164]: Already loaded as Q6 -// input[36]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vqrdmulh.s32 Q1, Q6, r6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(144)] 
-vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-352)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(160)] -// Release input[36] from Q3 -// Release input[164] from Q6 -// input[40]: Already loaded as Q7 -// input[168]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmulh.s32 Q1, Q7, r6 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 44)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(176)] -// Release input[168] from Q5 -// Release input[40] from Q7 -// input[172]: Already loaded as Q4 -// input[44]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[172] from Q4 -vstrw.u32 Q2, [r11,#(192)] -vsub.s32 Q4, Q1, Q6 -// Release input[44] from Q6 -vstrw.u32 Q4, [r11,#(-320)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(176)] -// input[176]: Already loaded as Q5 -// input[48]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 52)] -vqrdmulh.s32 Q1, Q5, r6 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-304)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(208)] -// Release input[48] from Q3 -// Release input[176] from Q5 -// input[52]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[184]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmulh.s32 Q1, Q7, r6 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, 
[r1,#(208)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(224)] -// Release input[180] from Q6 -// Release input[52] from Q7 -// input[184]: Already loaded as Q4 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[188]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[184] from Q4 -vstrw.u32 Q2, [r11,#(240)] -vsub.s32 Q4, Q1, Q5 -// Release input[56] from Q5 -vstrw.u32 Q4, [r11,#(-272)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(224)] -// input[188]: Already loaded as Q6 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -vqrdmulh.s32 Q1, Q6, r6 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-256)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release input[60] from Q3 -// Release input[188] from Q6 -// input[64]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -vneg.s32 Q1, Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q5, r6 -vstrw.u32 Q5, [r11,#(-240)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(256)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(272)] -// Release input[64] from Q5 -// input[68]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -vneg.s32 Q1, Q3 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmulh.s32 Q2, Q3, r6 -vstrw.u32 Q3, [r11,#(288)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(272)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[68] from Q3 -// input[72]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(288)] -vstrw.u32 Q4, [r11,#(304)] -vstrw.u32 Q4, [r11,#(-208)] -// Release input[72] from Q4 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q3, Q0, 
r6 -vstrw.u32 Q0, [r11,#(-192)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(320)] -// Release input[76] from Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(336)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(320)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[80] from Q4 -// input[84]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(336)] -vstrw.u32 Q3, [r11,#(352)] -vstrw.u32 Q3, [r11,#(-160)] -// Release input[84] from Q3 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-144)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(368)] -// Release input[88] from Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(384)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(368)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[92] from Q4 -// input[96]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(384)] -vstrw.u32 Q3, [r11,#(400)] -vstrw.u32 Q3, [r11,#(-112)] -// Release input[96] from Q3 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-96)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(400)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release input[100] from Q0 -// input[104]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(432)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(416)] -vsub.s32 
Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[104] from Q4 -// input[108]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(432)] -vstrw.u32 Q3, [r11,#(448)] -vstrw.u32 Q3, [r11,#(-64)] -// Release input[108] from Q3 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-48)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(448)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(464)] -// Release input[112] from Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(480)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(464)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[116] from Q4 -// input[120]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(480)] -vstrw.u32 Q3, [r11,#(496)] -vstrw.u32 Q3, [r11,#(-16)] -// Release input[120] from Q3 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(0)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(496)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[124] from Q0 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] 
-vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, 
[r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] 
-vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release 
output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 
-vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 
Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from 
Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// 
output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, 
#(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[132]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[280]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vmul.u32 
Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-384)] -// Release output[156] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(48)] -// Release output[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r6 -// output[136]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-480)] -// Release output[132] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[256]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[28]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(112)] -// Release output[280] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release output[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(16)] -// Release output[256] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[4]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 4)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[152]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, 
[r1,#(112)] -// Release output[28] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r6 -// output[8]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release output[4] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[128]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[284]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-400)] -// Release output[152] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(32)] -// Release output[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r6 -// output[140]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-496)] -// Release output[128] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[260]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(128)] -// Release output[284] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(80)] -// Release 
output[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release output[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[48]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r6 -// output[168]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(32)] -// Release output[260] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[60]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(192)] -// Release output[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release output[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[180]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r6 -// output[300]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release output[288] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[36]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 36)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[184]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(240)] -// Release output[60] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release output[180] from Q3 
-vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release output[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[304]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r6 -// output[40]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 40)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(144)] -// Release output[36] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[316]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release output[184] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release output[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(160)] -// Release output[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-368)] -// Release output[160] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[292]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[56]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 56)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// Release output[316] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q0, Q0, 
Q6 -// output[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[176]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r6 -// output[296]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release output[292] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[32]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 32)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(224)] -// Release output[56] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release output[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(176)] -// Release output[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[308]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r6 -// output[44]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 44)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(128)] -// Release output[32] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release output[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(176)] -// Release output[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[216]: 
Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[336]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r6 -// output[72]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 72)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-352)] -// Release output[164] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[192]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release output[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(288)] -// Release output[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[84]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r6 -// output[204]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-240)] -// Release output[192] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[324]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 
-44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] 
-vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 
-vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, 
Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load 
as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, 
Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load 
as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 
-vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, 
[r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release 
output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 
-vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] 
-// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] 
from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd 
r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already 
loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 2228766271 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3042 -// Instruction count: 2201 \ No newline at end of file diff --git a/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_complete.s b/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_complete.s deleted file mode 100644 index a521cf6..0000000 --- a/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_complete.s +++ /dev/null @@ -1,5720 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 
2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 
40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 
21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 
= 32337348 * 2^31 -.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31 -.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 = 1138528 * 2^31 
-.word 72843559 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 = 8729293 * 2^31 -.word 
2705987941 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31 -.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31 -.word 7271765 // zeta^ 4 * 2^31 = 21224105^ 4 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 4 * 71292929 * 2^31 -.word 9232849 // zeta^132 * 2^31 = 21224105^132 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 21224105^132 * 71292929 * 2^31 -.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31 -.word 5061807 // zeta^ 68 * 2^31 = 21224105^ 68 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^ 68 * 
f(q^(-1) mod 2^32) * 2^31 = 21224105^ 68 * 71292929 * 2^31 -.word 12062383 // zeta^196 * 2^31 = 21224105^196 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 21224105^196 * 71292929 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31 -.word 26674607 // zeta^ 36 * 2^31 = 21224105^ 36 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 36 * 71292929 * 2^31 -.word 6369225 // zeta^164 * 2^31 = 21224105^164 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 21224105^164 * 71292929 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31 -.word 13877423 // zeta^100 * 2^31 = 21224105^100 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 21224105^100 * 71292929 * 2^31 -.word 52182971 // zeta^228 * 2^31 = 21224105^228 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 21224105^228 * 71292929 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 26766019 // zeta^ 20 * 2^31 = 21224105^ 20 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 20 * 71292929 * 2^31 -.word 3049295 // zeta^148 * 2^31 = 21224105^148 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 21224105^148 * 71292929 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31 -.word 27572075 // zeta^ 84 * 2^31 = 21224105^ 84 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^ 84 * f(q^(-1) mod 2^32) * 
2^31 = 21224105^ 84 * 71292929 * 2^31 -.word 62852605 // zeta^212 * 2^31 = 21224105^212 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 21224105^212 * 71292929 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 -.word 41037815 // zeta^ 52 * 2^31 = 21224105^ 52 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 52 * 71292929 * 2^31 -.word 16612991 // zeta^180 * 2^31 = 21224105^180 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 21224105^180 * 71292929 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31 -.word 32973157 // zeta^116 * 2^31 = 21224105^116 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 21224105^116 * 71292929 * 2^31 -.word 36139229 // zeta^244 * 2^31 = 21224105^244 * 2^31 = 32576304 * 2^31 -.word 2084247331 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 21224105^244 * 71292929 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 61506475 // zeta^ 12 * 2^31 = 21224105^ 12 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 12 * 71292929 * 2^31 -.word 55340015 // zeta^140 * 2^31 = 21224105^140 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 21224105^140 * 71292929 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31 -.word 12255067 // zeta^ 76 * 2^31 = 21224105^ 76 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 
21224105^ 76 * 71292929 * 2^31 -.word 39251459 // zeta^204 * 2^31 = 21224105^204 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 21224105^204 * 71292929 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 13565749 // zeta^ 44 * 2^31 = 21224105^ 44 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 44 * 71292929 * 2^31 -.word 36826073 // zeta^172 * 2^31 = 21224105^172 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 21224105^172 * 71292929 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31 -.word 34487347 // zeta^108 * 2^31 = 21224105^108 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 21224105^108 * 71292929 * 2^31 -.word 61222515 // zeta^236 * 2^31 = 21224105^236 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 21224105^236 * 71292929 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 62959157 // zeta^ 28 * 2^31 = 21224105^ 28 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 28 * 71292929 * 2^31 -.word 51158985 // zeta^156 * 2^31 = 21224105^156 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 21224105^156 * 71292929 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31 -.word 59122583 // zeta^ 92 * 2^31 = 21224105^ 92 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 92 * 
71292929 * 2^31 -.word 12915351 // zeta^220 * 2^31 = 21224105^220 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 21224105^220 * 71292929 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 32364195 // zeta^ 60 * 2^31 = 21224105^ 60 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 60 * 71292929 * 2^31 -.word 17635297 // zeta^188 * 2^31 = 21224105^188 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 21224105^188 * 71292929 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31 -.word 38891533 // zeta^124 * 2^31 = 21224105^124 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 21224105^124 * 71292929 * 2^31 -.word 24452961 // zeta^252 * 2^31 = 21224105^252 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 21224105^252 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 -.word 2147483711 // zeta^ 0 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 387574637 // zeta^128 * (q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 1034331227 // zeta^ 64 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 260443775 // zeta^192 * (q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 -.word 2147483711 
// zeta^ 0 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 1034331227 // zeta^ 64 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 225927717 // zeta^ 32 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 1061213519 // zeta^ 96 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 -.word 225927717 // zeta^ 32 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 2867950541 // zeta^160 * (q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 1061213519 // zeta^ 96 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 1485640269 // zeta^224 * (q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 -.word 510244013 // zeta^ 16 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 2068958813 // zeta^ 80 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 3798698919 // zeta^ 48 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 647594733 // zeta^112 * (q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 -.word 56312659 // zeta^ 16 * 
2^31 = 21224105^ 16 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 -.word 510244013 // zeta^ 16 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 2908863877 // zeta^144 * (q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 2068958813 // zeta^ 80 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 470281299 // zeta^208 * (q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 -.word 2011732375 // zeta^ 8 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31 -.word 72843559 // zeta^ 72 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31 -.word 974853061 // zeta^ 40 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 3736072529 // zeta^104 * (q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 -.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 -.word 3798698919 // zeta^ 48 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 3539370451 // zeta^176 * (q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 647594733 // zeta^112 * (q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 2984342153 // zeta^240 * (q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 
24 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 -.word 3670574183 // zeta^ 24 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 2705987941 // zeta^ 88 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 113764189 // zeta^ 56 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 390962787 // zeta^120 * (q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 -.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 -.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 -.word 2011732375 // zeta^ 8 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31 -.word 3751646015 // zeta^136 * (q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31 -.word 72843559 // zeta^ 72 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31 -.word 3932493349 // zeta^200 * (q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31 -.word 7271765 // zeta^ 4 * 2^31 = 21224105^ 4 * 2^31 -.word 5061807 // zeta^ 68 * 2^31 = 21224105^ 68 * 2^31 -.word 26674607 // zeta^ 36 * 2^31 = 21224105^ 36 * 2^31 -.word 13877423 // zeta^100 * 2^31 = 21224105^100 * 2^31 -.word 2896581291 // zeta^ 4 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 4 * 71292929 * 2^31 -.word 695081809 // zeta^ 68 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 68 * 71292929 * 2^31 -.word 1871467089 // zeta^ 36 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 36 * 71292929 * 2^31 -.word 763860817 // zeta^100 * (q^(-1) mod 2^32) * 2^31 = 21224105^100 * 71292929 * 2^31 -.word 9232849 // zeta^132 * 2^31 = 21224105^132 * 2^31 -.word 12062383 
// zeta^196 * 2^31 = 21224105^196 * 2^31 -.word 6369225 // zeta^164 * 2^31 = 21224105^164 * 2^31 -.word 52182971 // zeta^228 * 2^31 = 21224105^228 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 -.word 974853061 // zeta^ 40 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 4056128829 // zeta^168 * (q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31 -.word 3736072529 // zeta^104 * (q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 -.word 1697242247 // zeta^232 * (q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31 -.word 26766019 // zeta^ 20 * 2^31 = 21224105^ 20 * 2^31 -.word 27572075 // zeta^ 84 * 2^31 = 21224105^ 84 * 2^31 -.word 41037815 // zeta^ 52 * 2^31 = 21224105^ 52 * 2^31 -.word 32973157 // zeta^116 * 2^31 = 21224105^116 * 2^31 -.word 307629373 // zeta^ 20 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 20 * 71292929 * 2^31 -.word 876871829 // zeta^ 84 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 84 * 71292929 * 2^31 -.word 4216088585 // zeta^ 52 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 52 * 71292929 * 2^31 -.word 2919278235 // zeta^116 * (q^(-1) mod 2^32) * 2^31 = 21224105^116 * 71292929 * 2^31 -.word 3049295 // zeta^148 * 2^31 = 21224105^148 * 2^31 -.word 62852605 // zeta^212 * 2^31 = 21224105^212 * 2^31 -.word 16612991 // zeta^180 * 2^31 = 21224105^180 * 2^31 -.word 36139229 // zeta^244 * 2^31 = 21224105^244 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 -.word 3670574183 // zeta^ 24 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 1317277063 // zeta^152 * (q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31 -.word 2705987941 // zeta^ 88 * 
(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 3756689085 // zeta^216 * (q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31 -.word 61506475 // zeta^ 12 * 2^31 = 21224105^ 12 * 2^31 -.word 12255067 // zeta^ 76 * 2^31 = 21224105^ 76 * 2^31 -.word 13565749 // zeta^ 44 * 2^31 = 21224105^ 44 * 2^31 -.word 34487347 // zeta^108 * 2^31 = 21224105^108 * 2^31 -.word 170406997 // zeta^ 12 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 12 * 71292929 * 2^31 -.word 4100667557 // zeta^ 76 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 76 * 71292929 * 2^31 -.word 3050391755 // zeta^ 44 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 44 * 71292929 * 2^31 -.word 1587133389 // zeta^108 * (q^(-1) mod 2^32) * 2^31 = 21224105^108 * 71292929 * 2^31 -.word 55340015 // zeta^140 * 2^31 = 21224105^140 * 2^31 -.word 39251459 // zeta^204 * 2^31 = 21224105^204 * 2^31 -.word 36826073 // zeta^172 * 2^31 = 21224105^172 * 2^31 -.word 61222515 // zeta^236 * 2^31 = 21224105^236 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 -.word 113764189 // zeta^ 56 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 587045533 // zeta^184 * (q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31 -.word 390962787 // zeta^120 * (q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 901308915 // zeta^248 * (q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31 -.word 62959157 // zeta^ 28 * 2^31 = 21224105^ 28 * 2^31 -.word 59122583 // zeta^ 92 * 2^31 = 21224105^ 92 * 2^31 -.word 32364195 // zeta^ 60 * 2^31 = 21224105^ 60 * 2^31 -.word 38891533 // zeta^124 * 2^31 = 21224105^124 * 2^31 -.word 3057097163 // zeta^ 28 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 28 * 71292929 * 2^31 -.word 2957603945 // zeta^ 92 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 92 * 71292929 * 2^31 -.word 4074479965 // zeta^ 60 * (q^(-1) mod 
2^32) * 2^31 = 21224105^ 60 * 71292929 * 2^31 -.word 2146473971 // zeta^124 * (q^(-1) mod 2^32) * 2^31 = 21224105^124 * 71292929 * 2^31 -.word 51158985 // zeta^156 * 2^31 = 21224105^156 * 2^31 -.word 12915351 // zeta^220 * 2^31 = 21224105^220 * 2^31 -.word 17635297 // zeta^188 * 2^31 = 21224105^188 * 2^31 -.word 24452961 // zeta^252 * 2^31 = 21224105^252 * 2^31 -.word 7271765 // zeta^ 4 * 2^31 = 21224105^ 4 * 2^31 -.word 9232849 // zeta^132 * 2^31 = 21224105^132 * 2^31 -.word 5061807 // zeta^ 68 * 2^31 = 21224105^ 68 * 2^31 -.word 12062383 // zeta^196 * 2^31 = 21224105^196 * 2^31 -.word 2896581291 // zeta^ 4 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 4 * 71292929 * 2^31 -.word 1249625647 // zeta^132 * (q^(-1) mod 2^32) * 2^31 = 21224105^132 * 71292929 * 2^31 -.word 695081809 // zeta^ 68 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 68 * 71292929 * 2^31 -.word 1507019089 // zeta^196 * (q^(-1) mod 2^32) * 2^31 = 21224105^196 * 71292929 * 2^31 -.word 34173151 // zeta^ 2 * 2^31 = 21224105^ 2 * 2^31 -.word 40902341 // zeta^ 66 * 2^31 = 21224105^ 66 * 2^31 -.word 13754549 // zeta^ 34 * 2^31 = 21224105^ 34 * 2^31 -.word 5773819 // zeta^ 98 * 2^31 = 21224105^ 98 * 2^31 -.word 3285804833 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 2 * 71292929 * 2^31 -.word 3172903227 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 66 * 71292929 * 2^31 -.word 3372902219 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 34 * 71292929 * 2^31 -.word 1492590085 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 98 * 71292929 * 2^31 -.word 6702715 // zeta^130 * 2^31 = 21224105^130 * 2^31 -.word 11747093 // zeta^194 * 2^31 = 21224105^194 * 2^31 -.word 48295871 // zeta^162 * 2^31 = 21224105^162 * 2^31 -.word 40968961 // zeta^226 * 2^31 = 21224105^226 * 2^31 -.word 26674607 // zeta^ 36 * 2^31 = 21224105^ 36 * 2^31 -.word 6369225 // zeta^164 * 2^31 = 21224105^164 * 2^31 -.word 13877423 // zeta^100 * 2^31 = 21224105^100 * 2^31 -.word 52182971 // zeta^228 * 2^31 = 21224105^228 * 2^31 -.word 1871467089 
// zeta^ 36 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 36 * 71292929 * 2^31 -.word 416692279 // zeta^164 * (q^(-1) mod 2^32) * 2^31 = 21224105^164 * 71292929 * 2^31 -.word 763860817 // zeta^100 * (q^(-1) mod 2^32) * 2^31 = 21224105^100 * 71292929 * 2^31 -.word 2350446661 // zeta^228 * (q^(-1) mod 2^32) * 2^31 = 21224105^228 * 71292929 * 2^31 -.word 64146459 // zeta^ 18 * 2^31 = 21224105^ 18 * 2^31 -.word 47277573 // zeta^ 82 * 2^31 = 21224105^ 82 * 2^31 -.word 378215 // zeta^ 50 * 2^31 = 21224105^ 50 * 2^31 -.word 50433925 // zeta^114 * 2^31 = 21224105^114 * 2^31 -.word 2035379173 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 18 * 71292929 * 2^31 -.word 534733307 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 82 * 71292929 * 2^31 -.word 4044509849 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 50 * 71292929 * 2^31 -.word 1177237627 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 21224105^114 * 71292929 * 2^31 -.word 469857 // zeta^146 * 2^31 = 21224105^146 * 2^31 -.word 23147541 // zeta^210 * 2^31 = 21224105^210 * 2^31 -.word 22747623 // zeta^178 * 2^31 = 21224105^178 * 2^31 -.word 12737503 // zeta^242 * 2^31 = 21224105^242 * 2^31 -.word 26766019 // zeta^ 20 * 2^31 = 21224105^ 20 * 2^31 -.word 3049295 // zeta^148 * 2^31 = 21224105^148 * 2^31 -.word 27572075 // zeta^ 84 * 2^31 = 21224105^ 84 * 2^31 -.word 62852605 // zeta^212 * 2^31 = 21224105^212 * 2^31 -.word 307629373 // zeta^ 20 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 20 * 71292929 * 2^31 -.word 892719281 // zeta^148 * (q^(-1) mod 2^32) * 2^31 = 21224105^148 * 71292929 * 2^31 -.word 876871829 // zeta^ 84 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 84 * 71292929 * 2^31 -.word 1664121347 // zeta^212 * (q^(-1) mod 2^32) * 2^31 = 21224105^212 * 71292929 * 2^31 -.word 20257187 // zeta^ 10 * 2^31 = 21224105^ 10 * 2^31 -.word 27954337 // zeta^ 74 * 2^31 = 21224105^ 74 * 2^31 -.word 13368597 // zeta^ 42 * 2^31 = 21224105^ 42 * 2^31 -.word 38893665 // zeta^106 * 2^31 = 21224105^106 * 2^31 -.word 1443651165 // zeta^ 10 * 
(q^(-1) mod 2^32) * 2^31 = 21224105^ 10 * 71292929 * 2^31 -.word 4161706847 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 74 * 71292929 * 2^31 -.word 1165970155 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 42 * 71292929 * 2^31 -.word 473804703 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 21224105^106 * 71292929 * 2^31 -.word 61186369 // zeta^138 * 2^31 = 21224105^138 * 2^31 -.word 65344259 // zeta^202 * 2^31 = 21224105^202 * 2^31 -.word 46956055 // zeta^170 * 2^31 = 21224105^170 * 2^31 -.word 50639193 // zeta^234 * 2^31 = 21224105^234 * 2^31 -.word 41037815 // zeta^ 52 * 2^31 = 21224105^ 52 * 2^31 -.word 16612991 // zeta^180 * 2^31 = 21224105^180 * 2^31 -.word 32973157 // zeta^116 * 2^31 = 21224105^116 * 2^31 -.word 36139229 // zeta^244 * 2^31 = 21224105^244 * 2^31 -.word 4216088585 // zeta^ 52 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 52 * 71292929 * 2^31 -.word 4278606209 // zeta^180 * (q^(-1) mod 2^32) * 2^31 = 21224105^180 * 71292929 * 2^31 -.word 2919278235 // zeta^116 * (q^(-1) mod 2^32) * 2^31 = 21224105^116 * 71292929 * 2^31 -.word 2084247331 // zeta^244 * (q^(-1) mod 2^32) * 2^31 = 21224105^244 * 71292929 * 2^31 -.word 18563127 // zeta^ 26 * 2^31 = 21224105^ 26 * 2^31 -.word 6808477 // zeta^ 90 * 2^31 = 21224105^ 90 * 2^31 -.word 49494815 // zeta^ 58 * 2^31 = 21224105^ 58 * 2^31 -.word 7177603 // zeta^122 * 2^31 = 21224105^122 * 2^31 -.word 1462589385 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 26 * 71292929 * 2^31 -.word 3756565603 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 90 * 71292929 * 2^31 -.word 3129580769 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 58 * 71292929 * 2^31 -.word 2947478141 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 21224105^122 * 71292929 * 2^31 -.word 13659269 // zeta^154 * 2^31 = 21224105^154 * 2^31 -.word 25156895 // zeta^218 * 2^31 = 21224105^218 * 2^31 -.word 40639053 // zeta^186 * 2^31 = 21224105^186 * 2^31 -.word 1950087 // zeta^250 * 2^31 = 21224105^250 * 2^31 -.word 61506475 // zeta^ 12 * 2^31 = 
21224105^ 12 * 2^31 -.word 55340015 // zeta^140 * 2^31 = 21224105^140 * 2^31 -.word 12255067 // zeta^ 76 * 2^31 = 21224105^ 76 * 2^31 -.word 39251459 // zeta^204 * 2^31 = 21224105^204 * 2^31 -.word 170406997 // zeta^ 12 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 12 * 71292929 * 2^31 -.word 902884369 // zeta^140 * (q^(-1) mod 2^32) * 2^31 = 21224105^140 * 71292929 * 2^31 -.word 4100667557 // zeta^ 76 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 76 * 71292929 * 2^31 -.word 102337021 // zeta^204 * (q^(-1) mod 2^32) * 2^31 = 21224105^204 * 71292929 * 2^31 -.word 60705671 // zeta^ 6 * 2^31 = 21224105^ 6 * 2^31 -.word 23867373 // zeta^ 70 * 2^31 = 21224105^ 70 * 2^31 -.word 39782807 // zeta^ 38 * 2^31 = 21224105^ 38 * 2^31 -.word 29369949 // zeta^102 * 2^31 = 21224105^102 * 2^31 -.word 3127823481 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 6 * 71292929 * 2^31 -.word 919656467 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 70 * 71292929 * 2^31 -.word 358649449 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 38 * 71292929 * 2^31 -.word 4177420707 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 21224105^102 * 71292929 * 2^31 -.word 58406951 // zeta^134 * 2^31 = 21224105^134 * 2^31 -.word 26669715 // zeta^198 * 2^31 = 21224105^198 * 2^31 -.word 17705221 // zeta^166 * 2^31 = 21224105^166 * 2^31 -.word 49812459 // zeta^230 * 2^31 = 21224105^230 * 2^31 -.word 13565749 // zeta^ 44 * 2^31 = 21224105^ 44 * 2^31 -.word 36826073 // zeta^172 * 2^31 = 21224105^172 * 2^31 -.word 34487347 // zeta^108 * 2^31 = 21224105^108 * 2^31 -.word 61222515 // zeta^236 * 2^31 = 21224105^236 * 2^31 -.word 3050391755 // zeta^ 44 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 44 * 71292929 * 2^31 -.word 1885862951 // zeta^172 * (q^(-1) mod 2^32) * 2^31 = 21224105^172 * 71292929 * 2^31 -.word 1587133389 // zeta^108 * (q^(-1) mod 2^32) * 2^31 = 21224105^108 * 71292929 * 2^31 -.word 2329659789 // zeta^236 * (q^(-1) mod 2^32) * 2^31 = 21224105^236 * 71292929 * 2^31 -.word 4594083 // zeta^ 22 * 2^31 = 21224105^ 22 * 
2^31 -.word 65137097 // zeta^ 86 * 2^31 = 21224105^ 86 * 2^31 -.word 38253055 // zeta^ 54 * 2^31 = 21224105^ 54 * 2^31 -.word 29082479 // zeta^118 * 2^31 = 21224105^118 * 2^31 -.word 4277886557 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 22 * 71292929 * 2^31 -.word 2992995895 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 86 * 71292929 * 2^31 -.word 3049793025 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 54 * 71292929 * 2^31 -.word 3171783825 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 21224105^118 * 71292929 * 2^31 -.word 7758757 // zeta^150 * 2^31 = 21224105^150 * 2^31 -.word 29507409 // zeta^214 * 2^31 = 21224105^214 * 2^31 -.word 39394541 // zeta^182 * 2^31 = 21224105^182 * 2^31 -.word 44583105 // zeta^246 * 2^31 = 21224105^246 * 2^31 -.word 62959157 // zeta^ 28 * 2^31 = 21224105^ 28 * 2^31 -.word 51158985 // zeta^156 * 2^31 = 21224105^156 * 2^31 -.word 59122583 // zeta^ 92 * 2^31 = 21224105^ 92 * 2^31 -.word 12915351 // zeta^220 * 2^31 = 21224105^220 * 2^31 -.word 3057097163 // zeta^ 28 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 28 * 71292929 * 2^31 -.word 3752511543 // zeta^156 * (q^(-1) mod 2^32) * 2^31 = 21224105^156 * 71292929 * 2^31 -.word 2957603945 // zeta^ 92 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 92 * 71292929 * 2^31 -.word 3361899881 // zeta^220 * (q^(-1) mod 2^32) * 2^31 = 21224105^220 * 71292929 * 2^31 -.word 30585257 // zeta^ 14 * 2^31 = 21224105^ 14 * 2^31 -.word 40572935 // zeta^ 78 * 2^31 = 21224105^ 78 * 2^31 -.word 39625501 // zeta^ 46 * 2^31 = 21224105^ 46 * 2^31 -.word 25272919 // zeta^110 * 2^31 = 21224105^110 * 2^31 -.word 3685725783 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 14 * 71292929 * 2^31 -.word 2610298873 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 78 * 71292929 * 2^31 -.word 1004528867 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 46 * 71292929 * 2^31 -.word 1310455209 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 21224105^110 * 71292929 * 2^31 -.word 15268201 // zeta^142 * 2^31 = 21224105^142 * 2^31 
-.word 55301277 // zeta^206 * 2^31 = 21224105^206 * 2^31 -.word 5900879 // zeta^174 * 2^31 = 21224105^174 * 2^31 -.word 54885097 // zeta^238 * 2^31 = 21224105^238 * 2^31 -.word 32364195 // zeta^ 60 * 2^31 = 21224105^ 60 * 2^31 -.word 17635297 // zeta^188 * 2^31 = 21224105^188 * 2^31 -.word 38891533 // zeta^124 * 2^31 = 21224105^124 * 2^31 -.word 24452961 // zeta^252 * 2^31 = 21224105^252 * 2^31 -.word 4074479965 // zeta^ 60 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 60 * 71292929 * 2^31 -.word 2389577759 // zeta^188 * (q^(-1) mod 2^32) * 2^31 = 21224105^188 * 71292929 * 2^31 -.word 2146473971 // zeta^124 * (q^(-1) mod 2^32) * 2^31 = 21224105^124 * 71292929 * 2^31 -.word 4013033631 // zeta^252 * (q^(-1) mod 2^32) * 2^31 = 21224105^252 * 71292929 * 2^31 -.word 37675113 // zeta^ 30 * 2^31 = 21224105^ 30 * 2^31 -.word 8442215 // zeta^ 94 * 2^31 = 21224105^ 94 * 2^31 -.word 36750327 // zeta^ 62 * 2^31 = 21224105^ 62 * 2^31 -.word 30669833 // zeta^126 * 2^31 = 21224105^126 * 2^31 -.word 311527319 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 30 * 71292929 * 2^31 -.word 712459929 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 94 * 71292929 * 2^31 -.word 3266171913 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 62 * 71292929 * 2^31 -.word 4149046263 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 21224105^126 * 71292929 * 2^31 -.word 35767195 // zeta^158 * 2^31 = 21224105^158 * 2^31 -.word 45014229 // zeta^222 * 2^31 = 21224105^222 * 2^31 -.word 35947815 // zeta^190 * 2^31 = 21224105^190 * 2^31 -.word 20303881 // zeta^254 * 2^31 = 21224105^254 * 2^31 -.word 34173151 // zeta^ 2 * 2^31 = 21224105^ 2 * 2^31 -.word 6702715 // zeta^130 * 2^31 = 21224105^130 * 2^31 -.word 40902341 // zeta^ 66 * 2^31 = 21224105^ 66 * 2^31 -.word 11747093 // zeta^194 * 2^31 = 21224105^194 * 2^31 -.word 3285804833 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 2 * 71292929 * 2^31 -.word 1876750725 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 21224105^130 * 71292929 * 2^31 -.word 3172903227 // 
zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 66 * 71292929 * 2^31 -.word 3890743531 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 21224105^194 * 71292929 * 2^31 -.word 57364657 // zeta^ 1 * 2^31 = 21224105^ 1 * 2^31 -.word 65863923 // zeta^ 65 * 2^31 = 21224105^ 65 * 2^31 -.word 38999497 // zeta^ 33 * 2^31 = 21224105^ 33 * 2^31 -.word 39314409 // zeta^ 97 * 2^31 = 21224105^ 97 * 2^31 -.word 3505411919 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 1 * 71292929 * 2^31 -.word 4219008781 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 65 * 71292929 * 2^31 -.word 1658081847 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 33 * 71292929 * 2^31 -.word 453280791 // zeta^ 97 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 97 * 71292929 * 2^31 -.word 17742663 // zeta^129 * 2^31 = 21224105^129 * 2^31 -.word 45275813 // zeta^193 * 2^31 = 21224105^193 * 2^31 -.word 64102957 // zeta^161 * 2^31 = 21224105^161 * 2^31 -.word 25400553 // zeta^225 * 2^31 = 21224105^225 * 2^31 -.word 13754549 // zeta^ 34 * 2^31 = 21224105^ 34 * 2^31 -.word 48295871 // zeta^162 * 2^31 = 21224105^162 * 2^31 -.word 5773819 // zeta^ 98 * 2^31 = 21224105^ 98 * 2^31 -.word 40968961 // zeta^226 * 2^31 = 21224105^226 * 2^31 -.word 3372902219 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 34 * 71292929 * 2^31 -.word 919922753 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 21224105^162 * 71292929 * 2^31 -.word 1492590085 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 98 * 71292929 * 2^31 -.word 3871802623 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 21224105^226 * 71292929 * 2^31 -.word 8546383 // zeta^ 17 * 2^31 = 21224105^ 17 * 2^31 -.word 46173583 // zeta^ 81 * 2^31 = 21224105^ 81 * 2^31 -.word 66816363 // zeta^ 49 * 2^31 = 21224105^ 49 * 2^31 -.word 45664163 // zeta^113 * 2^31 = 21224105^113 * 2^31 -.word 269086641 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 17 * 71292929 * 2^31 -.word 1939720817 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 81 * 71292929 * 2^31 -.word 1119694485 // zeta^ 49 * 
(q^(-1) mod 2^32) * 2^31 = 21224105^ 49 * 71292929 * 2^31 -.word 1740157021 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 21224105^113 * 71292929 * 2^31 -.word 47863765 // zeta^145 * 2^31 = 21224105^145 * 2^31 -.word 7553119 // zeta^209 * 2^31 = 21224105^209 * 2^31 -.word 9938685 // zeta^177 * 2^31 = 21224105^177 * 2^31 -.word 29035899 // zeta^241 * 2^31 = 21224105^241 * 2^31 -.word 64146459 // zeta^ 18 * 2^31 = 21224105^ 18 * 2^31 -.word 469857 // zeta^146 * 2^31 = 21224105^146 * 2^31 -.word 47277573 // zeta^ 82 * 2^31 = 21224105^ 82 * 2^31 -.word 23147541 // zeta^210 * 2^31 = 21224105^210 * 2^31 -.word 2035379173 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 18 * 71292929 * 2^31 -.word 3263167647 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 21224105^146 * 71292929 * 2^31 -.word 534733307 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 82 * 71292929 * 2^31 -.word 3582071787 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 21224105^210 * 71292929 * 2^31 -.word 4853311 // zeta^ 9 * 2^31 = 21224105^ 9 * 2^31 -.word 19847287 // zeta^ 73 * 2^31 = 21224105^ 73 * 2^31 -.word 47886143 // zeta^ 41 * 2^31 = 21224105^ 41 * 2^31 -.word 18266849 // zeta^105 * 2^31 = 21224105^105 * 2^31 -.word 103795137 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 9 * 71292929 * 2^31 -.word 1457766281 // zeta^ 73 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 73 * 71292929 * 2^31 -.word 1556555969 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 41 * 71292929 * 2^31 -.word 1339845919 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 21224105^105 * 71292929 * 2^31 -.word 44071181 // zeta^137 * 2^31 = 21224105^137 * 2^31 -.word 56515763 // zeta^201 * 2^31 = 21224105^201 * 2^31 -.word 16559399 // zeta^169 * 2^31 = 21224105^169 * 2^31 -.word 20020769 // zeta^233 * 2^31 = 21224105^233 * 2^31 -.word 378215 // zeta^ 50 * 2^31 = 21224105^ 50 * 2^31 -.word 22747623 // zeta^178 * 2^31 = 21224105^178 * 2^31 -.word 50433925 // zeta^114 * 2^31 = 21224105^114 * 2^31 -.word 12737503 // zeta^242 * 2^31 = 21224105^242 * 2^31 
-.word 4044509849 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 50 * 71292929 * 2^31 -.word 619773465 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 21224105^178 * 71292929 * 2^31 -.word 1177237627 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 21224105^114 * 71292929 * 2^31 -.word 3923278881 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 21224105^242 * 71292929 * 2^31 -.word 66706479 // zeta^ 25 * 2^31 = 21224105^ 25 * 2^31 -.word 65464889 // zeta^ 89 * 2^31 = 21224105^ 89 * 2^31 -.word 31632431 // zeta^ 57 * 2^31 = 21224105^ 57 * 2^31 -.word 16224217 // zeta^121 * 2^31 = 21224105^121 * 2^31 -.word 1051556817 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 25 * 71292929 * 2^31 -.word 2658270663 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 89 * 71292929 * 2^31 -.word 2705632209 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 57 * 71292929 * 2^31 -.word 1396856871 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 21224105^121 * 71292929 * 2^31 -.word 20007053 // zeta^153 * 2^31 = 21224105^153 * 2^31 -.word 9706713 // zeta^217 * 2^31 = 21224105^217 * 2^31 -.word 28622733 // zeta^185 * 2^31 = 21224105^185 * 2^31 -.word 47899901 // zeta^249 * 2^31 = 21224105^249 * 2^31 -.word 20257187 // zeta^ 10 * 2^31 = 21224105^ 10 * 2^31 -.word 61186369 // zeta^138 * 2^31 = 21224105^138 * 2^31 -.word 27954337 // zeta^ 74 * 2^31 = 21224105^ 74 * 2^31 -.word 65344259 // zeta^202 * 2^31 = 21224105^202 * 2^31 -.word 1443651165 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 10 * 71292929 * 2^31 -.word 2303493823 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 21224105^138 * 71292929 * 2^31 -.word 4161706847 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 74 * 71292929 * 2^31 -.word 4199769341 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 21224105^202 * 71292929 * 2^31 -.word 27282801 // zeta^ 5 * 2^31 = 21224105^ 5 * 2^31 -.word 61894293 // zeta^ 69 * 2^31 = 21224105^ 69 * 2^31 -.word 56460987 // zeta^ 37 * 2^31 = 21224105^ 37 * 2^31 -.word 37053313 // zeta^101 * 2^31 = 21224105^101 * 2^31 -.word 
3929627279 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 5 * 71292929 * 2^31 -.word 2488719723 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 69 * 71292929 * 2^31 -.word 4277121349 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 37 * 71292929 * 2^31 -.word 1897317503 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 21224105^101 * 71292929 * 2^31 -.word 4482895 // zeta^133 * 2^31 = 21224105^133 * 2^31 -.word 15492289 // zeta^197 * 2^31 = 21224105^197 * 2^31 -.word 50954585 // zeta^165 * 2^31 = 21224105^165 * 2^31 -.word 51397001 // zeta^229 * 2^31 = 21224105^229 * 2^31 -.word 13368597 // zeta^ 42 * 2^31 = 21224105^ 42 * 2^31 -.word 46956055 // zeta^170 * 2^31 = 21224105^170 * 2^31 -.word 38893665 // zeta^106 * 2^31 = 21224105^106 * 2^31 -.word 50639193 // zeta^234 * 2^31 = 21224105^234 * 2^31 -.word 1165970155 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 42 * 71292929 * 2^31 -.word 254220777 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 21224105^170 * 71292929 * 2^31 -.word 473804703 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 21224105^106 * 71292929 * 2^31 -.word 4268832423 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 21224105^234 * 71292929 * 2^31 -.word 21989155 // zeta^ 21 * 2^31 = 21224105^ 21 * 2^31 -.word 59599627 // zeta^ 85 * 2^31 = 21224105^ 85 * 2^31 -.word 49109585 // zeta^ 53 * 2^31 = 21224105^ 53 * 2^31 -.word 31901721 // zeta^117 * 2^31 = 21224105^117 * 2^31 -.word 386789597 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 21 * 71292929 * 2^31 -.word 644631797 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 85 * 71292929 * 2^31 -.word 988761519 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 53 * 71292929 * 2^31 -.word 2736594919 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 21224105^117 * 71292929 * 2^31 -.word 27458873 // zeta^149 * 2^31 = 21224105^149 * 2^31 -.word 19221631 // zeta^213 * 2^31 = 21224105^213 * 2^31 -.word 49552591 // zeta^181 * 2^31 = 21224105^181 * 2^31 -.word 64086513 // zeta^245 * 2^31 = 21224105^245 * 2^31 -.word 18563127 // 
zeta^ 26 * 2^31 = 21224105^ 26 * 2^31 -.word 13659269 // zeta^154 * 2^31 = 21224105^154 * 2^31 -.word 6808477 // zeta^ 90 * 2^31 = 21224105^ 90 * 2^31 -.word 25156895 // zeta^218 * 2^31 = 21224105^218 * 2^31 -.word 1462589385 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 26 * 71292929 * 2^31 -.word 1524915067 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 21224105^154 * 71292929 * 2^31 -.word 3756565603 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 90 * 71292929 * 2^31 -.word 894237409 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 21224105^218 * 71292929 * 2^31 -.word 65803619 // zeta^ 13 * 2^31 = 21224105^ 13 * 2^31 -.word 41181789 // zeta^ 77 * 2^31 = 21224105^ 77 * 2^31 -.word 28235729 // zeta^ 45 * 2^31 = 21224105^ 45 * 2^31 -.word 57735669 // zeta^109 * 2^31 = 21224105^109 * 2^31 -.word 4205535901 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 13 * 71292929 * 2^31 -.word 564798883 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 77 * 71292929 * 2^31 -.word 399101999 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 45 * 71292929 * 2^31 -.word 1381846539 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 21224105^109 * 71292929 * 2^31 -.word 59515337 // zeta^141 * 2^31 = 21224105^141 * 2^31 -.word 23737507 // zeta^205 * 2^31 = 21224105^205 * 2^31 -.word 38742465 // zeta^173 * 2^31 = 21224105^173 * 2^31 -.word 7373007 // zeta^237 * 2^31 = 21224105^237 * 2^31 -.word 49494815 // zeta^ 58 * 2^31 = 21224105^ 58 * 2^31 -.word 40639053 // zeta^186 * 2^31 = 21224105^186 * 2^31 -.word 7177603 // zeta^122 * 2^31 = 21224105^122 * 2^31 -.word 1950087 // zeta^250 * 2^31 = 21224105^250 * 2^31 -.word 3129580769 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 58 * 71292929 * 2^31 -.word 443542963 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 21224105^186 * 71292929 * 2^31 -.word 2947478141 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 21224105^122 * 71292929 * 2^31 -.word 677336697 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 21224105^250 * 71292929 * 2^31 -.word 36649543 // zeta^ 29 * 2^31 
= 21224105^ 29 * 2^31 -.word 16801927 // zeta^ 93 * 2^31 = 21224105^ 93 * 2^31 -.word 39975475 // zeta^ 61 * 2^31 = 21224105^ 61 * 2^31 -.word 34708039 // zeta^125 * 2^31 = 21224105^125 * 2^31 -.word 2972442041 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 29 * 71292929 * 2^31 -.word 3495212921 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 93 * 71292929 * 2^31 -.word 4092984781 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 61 * 71292929 * 2^31 -.word 273251769 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 21224105^125 * 71292929 * 2^31 -.word 57911317 // zeta^157 * 2^31 = 21224105^157 * 2^31 -.word 55856649 // zeta^221 * 2^31 = 21224105^221 * 2^31 -.word 32224601 // zeta^189 * 2^31 = 21224105^189 * 2^31 -.word 66666709 // zeta^253 * 2^31 = 21224105^253 * 2^31 -.word 60705671 // zeta^ 6 * 2^31 = 21224105^ 6 * 2^31 -.word 58406951 // zeta^134 * 2^31 = 21224105^134 * 2^31 -.word 23867373 // zeta^ 70 * 2^31 = 21224105^ 70 * 2^31 -.word 26669715 // zeta^198 * 2^31 = 21224105^198 * 2^31 -.word 3127823481 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 6 * 71292929 * 2^31 -.word 2542460889 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 21224105^134 * 71292929 * 2^31 -.word 919656467 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 70 * 71292929 * 2^31 -.word 2744124781 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 21224105^198 * 71292929 * 2^31 -.word 23814037 // zeta^ 3 * 2^31 = 21224105^ 3 * 2^31 -.word 18856687 // zeta^ 67 * 2^31 = 21224105^ 67 * 2^31 -.word 54338297 // zeta^ 35 * 2^31 = 21224105^ 35 * 2^31 -.word 56618763 // zeta^ 99 * 2^31 = 21224105^ 99 * 2^31 -.word 2353260651 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 3 * 71292929 * 2^31 -.word 2085985553 // zeta^ 67 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 67 * 71292929 * 2^31 -.word 3891905799 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 35 * 71292929 * 2^31 -.word 188336373 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 99 * 71292929 * 2^31 -.word 26618141 // zeta^131 * 2^31 = 21224105^131 * 
2^31 -.word 56282849 // zeta^195 * 2^31 = 21224105^195 * 2^31 -.word 53722505 // zeta^163 * 2^31 = 21224105^163 * 2^31 -.word 23316989 // zeta^227 * 2^31 = 21224105^227 * 2^31 -.word 39782807 // zeta^ 38 * 2^31 = 21224105^ 38 * 2^31 -.word 17705221 // zeta^166 * 2^31 = 21224105^166 * 2^31 -.word 29369949 // zeta^102 * 2^31 = 21224105^102 * 2^31 -.word 49812459 // zeta^230 * 2^31 = 21224105^230 * 2^31 -.word 358649449 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 38 * 71292929 * 2^31 -.word 3759841019 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 21224105^166 * 71292929 * 2^31 -.word 4177420707 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 21224105^102 * 71292929 * 2^31 -.word 426026005 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 21224105^230 * 71292929 * 2^31 -.word 42286697 // zeta^ 19 * 2^31 = 21224105^ 19 * 2^31 -.word 17252607 // zeta^ 83 * 2^31 = 21224105^ 83 * 2^31 -.word 14807241 // zeta^ 51 * 2^31 = 21224105^ 51 * 2^31 -.word 39617057 // zeta^115 * 2^31 = 21224105^115 * 2^31 -.word 2432379287 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 19 * 71292929 * 2^31 -.word 3848312577 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 83 * 71292929 * 2^31 -.word 4135417655 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 51 * 71292929 * 2^31 -.word 1706599903 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 21224105^115 * 71292929 * 2^31 -.word 28566647 // zeta^147 * 2^31 = 21224105^147 * 2^31 -.word 57621535 // zeta^211 * 2^31 = 21224105^211 * 2^31 -.word 57635731 // zeta^179 * 2^31 = 21224105^179 * 2^31 -.word 7226843 // zeta^243 * 2^31 = 21224105^243 * 2^31 -.word 4594083 // zeta^ 22 * 2^31 = 21224105^ 22 * 2^31 -.word 7758757 // zeta^150 * 2^31 = 21224105^150 * 2^31 -.word 65137097 // zeta^ 86 * 2^31 = 21224105^ 86 * 2^31 -.word 29507409 // zeta^214 * 2^31 = 21224105^214 * 2^31 -.word 4277886557 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 22 * 71292929 * 2^31 -.word 31155291 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 21224105^150 * 71292929 * 2^31 -.word 2992995895 
// zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 86 * 71292929 * 2^31 -.word 1071802543 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 21224105^214 * 71292929 * 2^31 -.word 26740275 // zeta^ 11 * 2^31 = 21224105^ 11 * 2^31 -.word 42796923 // zeta^ 75 * 2^31 = 21224105^ 75 * 2^31 -.word 27010987 // zeta^ 43 * 2^31 = 21224105^ 43 * 2^31 -.word 39320695 // zeta^107 * 2^31 = 21224105^107 * 2^31 -.word 1721758157 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 11 * 71292929 * 2^31 -.word 1004417157 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 75 * 71292929 * 2^31 -.word 3453390933 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 43 * 71292929 * 2^31 -.word 3277495177 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 21224105^107 * 71292929 * 2^31 -.word 62169111 // zeta^139 * 2^31 = 21224105^139 * 2^31 -.word 56278767 // zeta^203 * 2^31 = 21224105^203 * 2^31 -.word 51999501 // zeta^171 * 2^31 = 21224105^171 * 2^31 -.word 41776143 // zeta^235 * 2^31 = 21224105^235 * 2^31 -.word 38253055 // zeta^ 54 * 2^31 = 21224105^ 54 * 2^31 -.word 39394541 // zeta^182 * 2^31 = 21224105^182 * 2^31 -.word 29082479 // zeta^118 * 2^31 = 21224105^118 * 2^31 -.word 44583105 // zeta^246 * 2^31 = 21224105^246 * 2^31 -.word 3049793025 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 54 * 71292929 * 2^31 -.word 4209765139 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 21224105^182 * 71292929 * 2^31 -.word 3171783825 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 21224105^118 * 71292929 * 2^31 -.word 343269183 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 21224105^246 * 71292929 * 2^31 -.word 67046997 // zeta^ 27 * 2^31 = 21224105^ 27 * 2^31 -.word 59339603 // zeta^ 91 * 2^31 = 21224105^ 91 * 2^31 -.word 47267443 // zeta^ 59 * 2^31 = 21224105^ 59 * 2^31 -.word 30867557 // zeta^123 * 2^31 = 21224105^123 * 2^31 -.word 3976083883 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 27 * 71292929 * 2^31 -.word 1438352557 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 91 * 71292929 * 2^31 -.word 1177598349 // zeta^ 
59 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 59 * 71292929 * 2^31 -.word 3908618139 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 21224105^123 * 71292929 * 2^31 -.word 26032801 // zeta^155 * 2^31 = 21224105^155 * 2^31 -.word 61420673 // zeta^219 * 2^31 = 21224105^219 * 2^31 -.word 14848525 // zeta^187 * 2^31 = 21224105^187 * 2^31 -.word 51582797 // zeta^251 * 2^31 = 21224105^251 * 2^31 -.word 30585257 // zeta^ 14 * 2^31 = 21224105^ 14 * 2^31 -.word 15268201 // zeta^142 * 2^31 = 21224105^142 * 2^31 -.word 40572935 // zeta^ 78 * 2^31 = 21224105^ 78 * 2^31 -.word 55301277 // zeta^206 * 2^31 = 21224105^206 * 2^31 -.word 3685725783 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 14 * 71292929 * 2^31 -.word 1741647511 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 21224105^142 * 71292929 * 2^31 -.word 2610298873 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 78 * 71292929 * 2^31 -.word 984396643 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 21224105^206 * 71292929 * 2^31 -.word 62478247 // zeta^ 7 * 2^31 = 21224105^ 7 * 2^31 -.word 13974447 // zeta^ 71 * 2^31 = 21224105^ 71 * 2^31 -.word 14999777 // zeta^ 39 * 2^31 = 21224105^ 39 * 2^31 -.word 59134963 // zeta^103 * 2^31 = 21224105^103 * 2^31 -.word 1815658585 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 7 * 71292929 * 2^31 -.word 2831031377 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 71 * 71292929 * 2^31 -.word 100550431 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 39 * 71292929 * 2^31 -.word 819438605 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 21224105^103 * 71292929 * 2^31 -.word 53918527 // zeta^135 * 2^31 = 21224105^135 * 2^31 -.word 10108243 // zeta^199 * 2^31 = 21224105^199 * 2^31 -.word 10961253 // zeta^167 * 2^31 = 21224105^167 * 2^31 -.word 23786629 // zeta^231 * 2^31 = 21224105^231 * 2^31 -.word 39625501 // zeta^ 46 * 2^31 = 21224105^ 46 * 2^31 -.word 5900879 // zeta^174 * 2^31 = 21224105^174 * 2^31 -.word 25272919 // zeta^110 * 2^31 = 21224105^110 * 2^31 -.word 54885097 // zeta^238 * 2^31 = 21224105^238 * 
2^31 -.word 1004528867 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 46 * 71292929 * 2^31 -.word 1099058609 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 21224105^174 * 71292929 * 2^31 -.word 1310455209 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 21224105^110 * 71292929 * 2^31 -.word 2041507095 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 21224105^238 * 71292929 * 2^31 -.word 59164407 // zeta^ 23 * 2^31 = 21224105^ 23 * 2^31 -.word 66143065 // zeta^ 87 * 2^31 = 21224105^ 87 * 2^31 -.word 43155485 // zeta^ 55 * 2^31 = 21224105^ 55 * 2^31 -.word 17669861 // zeta^119 * 2^31 = 21224105^119 * 2^31 -.word 1909444873 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 23 * 71292929 * 2^31 -.word 1951704231 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 87 * 71292929 * 2^31 -.word 1714554851 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 55 * 71292929 * 2^31 -.word 3532007707 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 21224105^119 * 71292929 * 2^31 -.word 24091995 // zeta^151 * 2^31 = 21224105^151 * 2^31 -.word 16101757 // zeta^215 * 2^31 = 21224105^215 * 2^31 -.word 13774769 // zeta^183 * 2^31 = 21224105^183 * 2^31 -.word 36746905 // zeta^247 * 2^31 = 21224105^247 * 2^31 -.word 37675113 // zeta^ 30 * 2^31 = 21224105^ 30 * 2^31 -.word 35767195 // zeta^158 * 2^31 = 21224105^158 * 2^31 -.word 8442215 // zeta^ 94 * 2^31 = 21224105^ 94 * 2^31 -.word 45014229 // zeta^222 * 2^31 = 21224105^222 * 2^31 -.word 311527319 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 30 * 71292929 * 2^31 -.word 4054742117 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 21224105^158 * 71292929 * 2^31 -.word 712459929 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 94 * 71292929 * 2^31 -.word 3331484459 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 21224105^222 * 71292929 * 2^31 -.word 36697917 // zeta^ 15 * 2^31 = 21224105^ 15 * 2^31 -.word 58452265 // zeta^ 79 * 2^31 = 21224105^ 79 * 2^31 -.word 13961957 // zeta^ 47 * 2^31 = 21224105^ 47 * 2^31 -.word 61179875 // zeta^111 * 2^31 = 21224105^111 * 2^31 
-.word 3107033283 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 15 * 71292929 * 2^31 -.word 1790082775 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 79 * 71292929 * 2^31 -.word 4221484315 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 47 * 71292929 * 2^31 -.word 1423306781 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 21224105^111 * 71292929 * 2^31 -.word 6463229 // zeta^143 * 2^31 = 21224105^143 * 2^31 -.word 13236309 // zeta^207 * 2^31 = 21224105^207 * 2^31 -.word 4183205 // zeta^175 * 2^31 = 21224105^175 * 2^31 -.word 45952127 // zeta^239 * 2^31 = 21224105^239 * 2^31 -.word 36750327 // zeta^ 62 * 2^31 = 21224105^ 62 * 2^31 -.word 35947815 // zeta^190 * 2^31 = 21224105^190 * 2^31 -.word 30669833 // zeta^126 * 2^31 = 21224105^126 * 2^31 -.word 20303881 // zeta^254 * 2^31 = 21224105^254 * 2^31 -.word 3266171913 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 62 * 71292929 * 2^31 -.word 3437859545 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 21224105^190 * 71292929 * 2^31 -.word 4149046263 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 21224105^126 * 71292929 * 2^31 -.word 1091278839 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 21224105^254 * 71292929 * 2^31 -.word 18340771 // zeta^ 31 * 2^31 = 21224105^ 31 * 2^31 -.word 29457983 // zeta^ 95 * 2^31 = 21224105^ 95 * 2^31 -.word 11263143 // zeta^ 63 * 2^31 = 21224105^ 63 * 2^31 -.word 47890357 // zeta^127 * 2^31 = 21224105^127 * 2^31 -.word 1148820573 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 31 * 71292929 * 2^31 -.word 2922928577 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 95 * 71292929 * 2^31 -.word 336477017 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 21224105^ 63 * 71292929 * 2^31 -.word 1775863883 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 21224105^127 * 71292929 * 2^31 -.word 64730193 // zeta^159 * 2^31 = 21224105^159 * 2^31 -.word 58025703 // zeta^223 * 2^31 = 21224105^223 * 2^31 -.word 7013271 // zeta^191 * 2^31 = 21224105^191 * 2^31 -.word 34564147 // zeta^255 * 2^31 = 21224105^255 * 2^31 -.text 
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_512_u32_33564673_21224105, %function
-.global ntt_512_u32_33564673_21224105
-ntt_512_u32_33564673_21224105:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, 33564673
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q0, Q0, r8
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vqrdmulh.s32 Q4, Q2, r9
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r8
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r11
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r11
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q3, r7
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r11
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vmul.u32 Q4, Q4, r8
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r9
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r8
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r11
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q3, r7
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r6
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r11
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q0, Q0, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r8
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r8
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r8
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r8
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r8
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[384]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -120)]
-vmul.u32 Q2, Q2, r8
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[452]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[452]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[456]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r8
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[396]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -108)]
-vmul.u32 Q2, Q2, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[268]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[464]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(64)]
-// Release input[268] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[468]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[468]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[404]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -100)]
-vmul.u32 Q0, Q0, r8
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[276]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-400)]
-// Release input[404] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r8
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(96)]
-// Release input[276] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vmul.u32 Q1, Q1, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-368)]
-// Release input[412] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[480]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q0, Q0, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-96)]
-// Release input[480] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[488]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-64)]
-// Release input[488] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q0, Q0, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-// Release input[296] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[300]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-48)]
-// Release input[492] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[432]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[368]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(192)]
-// Release input[300] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[500]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-288)]
-// Release input[432] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(464)]
-// Release input[368] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[500]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[436]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -68)]
-vmul.u32 Q1, Q1, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[504]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-16)]
-// Release input[500] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-272)]
-// Release input[436] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[504]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q0, Q0, r8
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(0)]
-// Release input[504] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(496)]
-// Release input[376] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[444]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -60)]
-vmul.u32 Q2, Q2, r8
-// input[380]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -124)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-240)]
-// Release input[444] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-496)]
-// Release input[380] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmul.u32 Q1, Q1, r8
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q0, Q0, r8
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q2, Q2, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[8]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q1, Q1, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[112]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q0, Q0, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// 
input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q2, Q2, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q1, Q1, r8 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(272)] -// Release input[68] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 
Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[176]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q2, Q2, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, 
Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q1, Q1, r8 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r8 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q2, Q2, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[240]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[240]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q1, Q1, r8 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-48)] -// Release input[240] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q0, Q0, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[200]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// 
Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q1, Q1, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-208)] -// Release input[200] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[304]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vmul.u32 Q0, Q0, r8 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r14,#(208)] -// Release input[304] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(80)] -// Release input[272] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[308]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmul.u32 Q2, Q2, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[260]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[312]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q1, Q1, r8 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(32)] -// Release input[260] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmlah.s32 Q6, Q4, r11 
-vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmul.u32 Q0, Q0, r8 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[368]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q2, Q2, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[320]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[372]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vqrdmlah.s32 Q6, 
Q4, r11 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmul.u32 Q1, Q1, r8 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(272)] -// Release input[320] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(480)] -// Release input[372] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q0, Q0, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(288)] -// Release input[324] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[328]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from 
Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q2, Q2, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(304)] -// Release input[328] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[332]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[432]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[432]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q1, Q1, r8 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(320)] -// Release input[332] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[384]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-288)] -// Release input[432] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-352)] -// Release 
input[416] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-416)] -// Release input[400] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[436]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q0, Q0, r8 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-480)] -// Release input[384] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[388]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[440]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[440]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q2, Q2, r8 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-464)] -// Release input[388] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[392]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[444]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-256)] -// Release input[440] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 
Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[444]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q1, Q1, r8 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-448)] -// Release input[392] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[396]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[496]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-240)] -// Release input[444] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[496]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vmul.u32 Q0, Q0, r8 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-432)] -// Release input[396] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[448]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[500]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-32)] -// Release input[496] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q4, 
Q1, Q6 -vstrw.u32 Q4, [r12,#(-160)] -// Release input[464] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[500]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q2, Q2, r8 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[452]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[504]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-16)] -// Release input[500] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[504]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q1, Q1, r8 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[456]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[508]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(0)] -// Release input[504] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from 
Q4 -vadd.s32 Q2, Q2, Q6 -// input[508]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q0, Q0, r8 -// input[476]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -28)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-192)] -// Release input[456] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(16)] -// Release input[508] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-112)] -// Release input[476] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[12]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q2, Q2, r8 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q0, 
Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[28]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[16]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[44]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[44]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q0, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(64)]
-// Release input[16] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(176)]
-// Release input[44] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q2, Q2, r8
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[76]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q1, Q1, r8
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(192)]
-// Release input[48] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[92]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[92]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q0, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[80]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(368)]
-// Release input[92] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[108]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q2, Q2, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(320)]
-// Release input[80] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q1, Q1, r8
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(384)]
-// Release input[96] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[112]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[140]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[140]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q0, Q0, r8
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(448)]
-// Release input[112] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-448)]
-// Release input[140] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[156]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q2, Q2, r8
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[172]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[172]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q1, Q1, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(-432)]
-// Release input[144] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-320)]
-// Release input[172] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q0, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[204]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q2, Q2, r8
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(-304)]
-// Release input[176] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[220]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q1, Q1, r8
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[236]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[236]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q0, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-64)]
-// Release input[236] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vmul.u32 Q2, Q2, r8
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[268]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vmul.u32 Q1, Q1, r8
-// input[260]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(32)]
-// Release input[260] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[284]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[272]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[300]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[300]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q2, Q2, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(80)]
-// Release input[272] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(192)]
-// Release input[300] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vmul.u32 Q1, Q1, r8
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[332]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[332]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q0, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(320)]
-// Release input[332] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[348]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q2, Q2, r8
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[336]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 84)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[364]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release input[348] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[364]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q1, Q1, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(336)]
-// Release input[336] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[352]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(448)]
-// Release input[364] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[380]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q0, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(400)]
-// Release input[352] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[368]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[396]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(480)]
-// Release input[372] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[396]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q2, Q2, r8
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(464)]
-// Release input[368] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[412]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[412]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q1, Q1, r8
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[400]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[428]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-368)]
-// Release input[412] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[428]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q0, Q0, r8
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r12,#(-416)]
-// Release input[400] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[416]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[444]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-304)]
-// Release input[428] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[444]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[440]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -64)]
-vmul.u32 Q2, Q2, r8
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r12,#(-352)]
-// Release input[416] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-240)]
-// Release input[444] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-256)]
-// Release input[440] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[460]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q1, Q1, r8
-// input[452]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -52)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r12,#(-288)]
-// Release input[432] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[476]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-192)]
-// Release input[456] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-208)]
-// Release input[452] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[476]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q0, Q0, r8
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[492]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-112)]
-// Release input[476] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[492]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q2, Q2, r8
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[508]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-48)]
-// Release input[492] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[508]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[504]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 0)]
-vmul.u32 Q1, Q1, r8
-// input[500]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -4)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r12,#(-96)]
-// Release input[480] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-vqrdmulh.s32 Q0, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(16)]
-// Release input[508] from Q1
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q3, [r12,#(0)]
-// Release input[504] from Q3
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r12,#(-16)]
-// Release input[500] from Q4
-vadd.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-.equ modulus_inv, 4223674367
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vldrw.s32 Q5, [r10], #80
-vqrdmulh.s32 Q4, Q3, Q5
-vldrw.s32 Q6, [r10, #-64]
-vmul.u32 Q3, Q3, Q6
-vqrdmlah.s32 Q4, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q5, Q2, Q5
-vsub.s32 Q7, Q1, Q4
-vmul.u32 Q2, Q2, Q6
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q5
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q5
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q5, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q6, Q5, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q7, Q5
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q7, Q7, Q6
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q5, Q7, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q7, Q4, Q5
-vstrw.s32 Q7, [r0, #-80]
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r11
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r10], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r10, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r11 -vldrw.s32 Q3, [r10, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r11 -vldrw.s32 Q2, [r10, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r11 -vldrw.s32 Q6, [r10, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r9 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r11 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
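Each unrolled butterfly above repeats the same fixed-point modular-multiplication idiom: `vqrdmulh` takes the rounded doubling high half of the product with a twiddle factor, `vmul.u32` multiplies by a precomputed companion twiddle, and `vqrdmlah` with `r11` holding the modulus folds in a correction that makes the low halves cancel exactly. The following Python sketch (not part of the patch; it models the instruction semantics for illustration, assuming q = 33564673 and the twiddle encoding of the `ntt_512` tables deleted later in this patch, with `twiddle`/`montmul` being names of my own) shows why the sequence returns a representative congruent to the exact modular product:

```python
# Model of the vqrdmulh / vmul.u32 / vqrdmlah sequence used in the
# unrolled butterflies above (r11 = q). Saturation is ignored, which is
# safe for the operand ranges modeled here.
Q = 33564673                 # modulus of the ntt_512 tables in this patch
QINV = pow(Q, -1, 2**32)     # q^(-1) mod 2^32

def to_signed(x, bits=32):
    """Reinterpret the low `bits` bits of x as a signed integer."""
    x %= 1 << bits
    return x - (1 << bits) if x >= 1 << (bits - 1) else x

def vqrdmulh(a, b):
    """Rounded doubling high half: round(2*a*b / 2^32)."""
    return (2 * a * b + (1 << 31)) >> 32

def twiddle(value):
    """Twiddle pair as in the roots table: stored = value * 2^31 mod q,
    companion chosen so that companion * q == -stored (mod 2^32)."""
    stored = value * 2**31 % Q
    companion = (-stored * QINV) % 2**32
    return stored, companion

def montmul(a, tw):
    """Compute a representative of a * value mod q."""
    stored, companion = tw
    hi = vqrdmulh(a, stored)        # vqrdmulh.s32 Qd, Qa, Qtw
    lo = to_signed(a * companion)   # vmul.u32  Qa, Qa, Qtw'
    return hi + vqrdmulh(lo, Q)     # vqrdmlah.s32 Qd, Qa, r11

# Because lo * Q == -a * stored (mod 2^32), the two rounded halves sum to
# the exact quotient (2*a*stored + 2*lo*Q) / 2^32, which is congruent to
# a * value mod q.
```

The companion-word trick is what lets the sequence stay exact with only three instructions per lane: the `vmul.u32` result exists solely to cancel the low 32 bits that `vqrdmulh` discarded.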
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r10], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r10, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r11
-vldrw.s32 Q3, [r10, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r11
-vldrw.s32 Q2, [r10, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r11
-vldrw.s32 Q6, [r10, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r9
-vqrdmulh.s32 Q6, Q4, Q6
-vmul.u32 Q4, Q4, Q7
-vqrdmlah.s32 Q6, Q4, r11
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-16]
-vadd.s32 Q5, Q5, Q6
-vstrw.s32 Q5, [r0, #-32]
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 5688
-// Instruction count: 4819
\ No newline at end of file
diff --git a/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete.s b/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete.s
deleted file mode 100644
index 5ce5008..0000000
--- a/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete.s
+++ /dev/null
@@ -1,3992 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31
-.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31
-.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31
-.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31
-.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31
-.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31
-.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31
-.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31
-.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31
-.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31
-.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31
-.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31
-.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31
-.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31
-.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31
-.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31
-.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31
-.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31
-.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31
-.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31
-.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31
-.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31
-.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31
-.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31
-.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31
-.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31
-.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31
-.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31
-.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31
-.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31
-.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31
-.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31
-.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31
-.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31
-.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31
-.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31
-.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31
-.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31
-.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31
-.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31
-.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31
-.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31
-.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31
-.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31
-.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31
-.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31
-.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31
-.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31
-.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31
-.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31
-.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31
-.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31
-.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31
-.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31
-.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31
-.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31
-.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31
-.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31
-.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31
-.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31
-.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31
-.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31
-.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31
-.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31
-.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31
-.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31
-.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31
-.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31
-.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31
-.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31
-.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31
-.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31
-.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31
-.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31
-.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31
-.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31
-.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31
-.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31
-.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31
-.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31
-.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31
-.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31
-.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31
-.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31
-.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31
-.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31
-.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31
-.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31
-.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31
-.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31
-.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31
-.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 = 31442912 * 2^31
-.word 2011732375 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31
-.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 = 25072687 * 2^31
-.word 3751646015 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31
-.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31
-.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31
-.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 = 1138528 * 2^31
-.word 72843559 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31
-.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 = 27899289 * 2^31
-.word 3932493349 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31
-.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31
-.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31
-.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 = 15236728 * 2^31
-.word 974853061 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31
-.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 = 29831683 * 2^31
-.word 4056128829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31
-.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31
-.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31
-.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 = 24829277 * 2^31
-.word 3736072529 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31
-.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 = 26527504 * 2^31
-.word 1697242247 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31
-.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31
-.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31
-.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 = 23805553 * 2^31
-.word 3670574183 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31
-.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 = 20588736 * 2^31
-.word 1317277063 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31
-.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31
-.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31
-.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 = 8729293 * 2^31
-.word 2705987941 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31
-.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 = 25151509 * 2^31
-.word 3756689085 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31
-.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31
-.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31
-.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 = 1778108 * 2^31
-.word 113764189 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31
-.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 = 9175386 * 2^31
-.word 587045533 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31
-.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31
-.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31
-.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 = 6110658 * 2^31
-.word 390962787 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31
-.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 = 14087250 * 2^31
-.word 901308915 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31
-.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 = 31442912 * 2^31
-.word 2011732375 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31
-.word 7271765 // zeta^ 4 * 2^31 = 21224105^ 4 * 2^31 = 11708223 * 2^31
-.word 2896581291 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 4 * 71292929 * 2^31
-.word 9232849 // zeta^132 * 2^31 = 21224105^132 * 2^31 = 19531360 * 2^31
-.word 1249625647 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 21224105^132 * 71292929 * 2^31
-.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 =
25072687 * 2^31 -.word 3751646015 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31 -.word 5061807 // zeta^ 68 * 2^31 = 21224105^ 68 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 68 * 71292929 * 2^31 -.word 12062383 // zeta^196 * 2^31 = 21224105^196 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 21224105^196 * 71292929 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31 -.word 26674607 // zeta^ 36 * 2^31 = 21224105^ 36 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 36 * 71292929 * 2^31 -.word 6369225 // zeta^164 * 2^31 = 21224105^164 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 21224105^164 * 71292929 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31 -.word 13877423 // zeta^100 * 2^31 = 21224105^100 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 21224105^100 * 71292929 * 2^31 -.word 52182971 // zeta^228 * 2^31 = 21224105^228 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 21224105^228 * 71292929 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 26766019 // zeta^ 20 * 2^31 = 21224105^ 20 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 20 * 71292929 * 2^31 -.word 3049295 // zeta^148 * 2^31 = 21224105^148 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 21224105^148 * 71292929 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 = 29831683 * 2^31 -.word 
4056128829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31 -.word 27572075 // zeta^ 84 * 2^31 = 21224105^ 84 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 84 * 71292929 * 2^31 -.word 62852605 // zeta^212 * 2^31 = 21224105^212 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 21224105^212 * 71292929 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 -.word 41037815 // zeta^ 52 * 2^31 = 21224105^ 52 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 52 * 71292929 * 2^31 -.word 16612991 // zeta^180 * 2^31 = 21224105^180 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 21224105^180 * 71292929 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31 -.word 32973157 // zeta^116 * 2^31 = 21224105^116 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 21224105^116 * 71292929 * 2^31 -.word 36139229 // zeta^244 * 2^31 = 21224105^244 * 2^31 = 32576304 * 2^31 -.word 2084247331 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 21224105^244 * 71292929 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 61506475 // zeta^ 12 * 2^31 = 21224105^ 12 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 12 * 71292929 * 2^31 -.word 55340015 // zeta^140 * 2^31 = 21224105^140 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 21224105^140 * 71292929 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 = 20588736 * 2^31 -.word 1317277063 
// zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31 -.word 12255067 // zeta^ 76 * 2^31 = 21224105^ 76 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 76 * 71292929 * 2^31 -.word 39251459 // zeta^204 * 2^31 = 21224105^204 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 21224105^204 * 71292929 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 13565749 // zeta^ 44 * 2^31 = 21224105^ 44 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 44 * 71292929 * 2^31 -.word 36826073 // zeta^172 * 2^31 = 21224105^172 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 21224105^172 * 71292929 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31 -.word 34487347 // zeta^108 * 2^31 = 21224105^108 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 21224105^108 * 71292929 * 2^31 -.word 61222515 // zeta^236 * 2^31 = 21224105^236 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 21224105^236 * 71292929 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 62959157 // zeta^ 28 * 2^31 = 21224105^ 28 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 28 * 71292929 * 2^31 -.word 51158985 // zeta^156 * 2^31 = 21224105^156 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 21224105^156 * 71292929 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^184 * 
f(q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31 -.word 59122583 // zeta^ 92 * 2^31 = 21224105^ 92 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 92 * 71292929 * 2^31 -.word 12915351 // zeta^220 * 2^31 = 21224105^220 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 21224105^220 * 71292929 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 32364195 // zeta^ 60 * 2^31 = 21224105^ 60 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 60 * 71292929 * 2^31 -.word 17635297 // zeta^188 * 2^31 = 21224105^188 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 21224105^188 * 71292929 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31 -.word 38891533 // zeta^124 * 2^31 = 21224105^124 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 21224105^124 * 71292929 * 2^31 -.word 24452961 // zeta^252 * 2^31 = 21224105^252 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 21224105^252 * 71292929 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_512_u32_33564673_21224105_incomplete, %function -.global ntt_512_u32_33564673_21224105_incomplete -ntt_512_u32_33564673_21224105_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 33564673 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[192]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r11 -vqrdmulh.s32 Q4, Q2, r9 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r8 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r11 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r11 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r7 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r9 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r8 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r11 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r7 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r8 
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q0, Q0, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r8
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r8
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r8
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r8
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r8
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[384]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -120)]
-vmul.u32 Q2, Q2, r8
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[452]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[452]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[456]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r8
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[396]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -108)]
-vmul.u32 Q2, Q2, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[268]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[464]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(64)]
-// Release input[268] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[468]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[468]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[404]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -100)]
-vmul.u32 Q0, Q0, r8
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[276]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-400)]
-// Release input[404] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r8
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(96)]
-// Release input[276] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vmul.u32 Q1, Q1, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-368)]
-// Release input[412] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[480]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q0, Q0, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-96)]
-// Release input[480] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[488]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-64)]
-// Release input[488] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4,
Q2, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[492]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q0, Q0, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-48)] -// Release input[492] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q2, Q2, r8 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from 
Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q1, Q1, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[504]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[504]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(0)] -// Release input[504] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 
-vqrdmulh.s32 Q0, Q2, r9 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r8 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[48]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q1, Q1, r8 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[52]: Already loaded as Q0 
-vqrdmulh.s32 Q1, Q0, r9 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q0, Q0, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q1, Q1, r8 -// 
input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[112]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q0, Q0, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q2, Q2, r8 -// input[84]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q1, Q1, r8 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(272)] -// Release input[68] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q0, Q0, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -// 
Release input[72] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[176]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q2, Q2, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q1, Q1, r8 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, 
[r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r8 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q2, Q2, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 
-vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[240]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[240]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q1, Q1, r8 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-48)] -// Release input[240] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q0, Q0, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-240)] -// 
Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[200]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q1, Q1, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-208)] -// Release input[200] from Q0 -vqrdmulh.s32 Q0, Q3, r9 
-vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[304]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vmul.u32 Q0, Q0, r8 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(80)] -// Release input[272] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[308]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmul.u32 Q2, Q2, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vqrdmulh.s32 Q1, Q3, 
r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[260]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[312]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q1, Q1, r8 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(32)] -// Release input[260] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmul.u32 Q0, Q0, r8 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 
-vqrdmlah.s32 Q2, Q3, r11 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[368]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q2, Q2, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[320]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[372]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmul.u32 Q1, Q1, r8 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(272)] -// Release input[320] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 
Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(480)] -// Release input[372] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q0, Q0, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(288)] -// Release input[324] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[328]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q2, Q2, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(304)] -// Release input[328] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[332]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[432]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[432]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q1, Q1, r8 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(320)] -// Release input[332] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[384]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-288)] -// Release input[432] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-416)] -// Release input[400] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[436]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q0, Q0, r8 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-480)] -// Release input[384] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// 
input[388]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[440]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[440]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q2, Q2, r8 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-464)] -// Release input[388] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[392]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[444]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-256)] -// Release input[440] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[444]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q1, Q1, r8 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-448)] -// Release input[392] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[396]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 
-108)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[496]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-240)] -// Release input[444] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[496]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vmul.u32 Q0, Q0, r8 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-432)] -// Release input[396] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[448]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[500]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-32)] -// Release input[496] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-160)] -// Release input[464] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[500]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q2, Q2, r8 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[452]: Load as Q0 
-vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[504]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-16)] -// Release input[500] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[504]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q1, Q1, r8 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[456]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[508]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(0)] -// Release input[504] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[508]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q0, Q0, r8 -// input[476]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -28)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-192)] -// Release input[456] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r5 
-vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(16)] -// Release input[508] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-112)] -// Release input[476] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[12]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q2, Q2, r8 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[28]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[16]: Load as Q2 -vldrw.u32 Q2, 
[r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q0, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(64)] -// Release input[16] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q2, Q2, r8 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, 
r11 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[76]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q1, Q1, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q0, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, 
r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[80]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[108]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(320)] -// Release input[80] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 
-vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q0, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[156]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q2, Q2, r8 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-104)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[172]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[172]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-320)] -// Release input[172] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[184]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q0, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[204]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q2, Q2, r8 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, 
[r10], #+8 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q1, Q1, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[236]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q0, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] 
from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q2, Q2, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[268]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vmul.u32 Q1, Q1, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] 
from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[284]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q0, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[272]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[300]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[300]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q2, Q2, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(80)] -// Release input[272] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(192)] -// Release input[300] from Q2 
-vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[316]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[312]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 60)] -vmul.u32 Q1, Q1, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[332]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(240)] -// Release input[312] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[332]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmul.u32 Q0, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(320)] -// Release input[332] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[348]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q2, Q2, r8 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[336]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[364]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[364]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q1, Q1, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(336)] -// Release input[336] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, 
[r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(448)] -// Release input[364] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[376]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 124)] -vmul.u32 Q0, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[368]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[396]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -108)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(496)] -// Release input[376] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[396]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q2, Q2, r8 -// input[388]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -116)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(464)] -// Release input[368] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[384]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, 
Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[412]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-432)] -// Release input[396] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-464)] -// Release input[388] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[412]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q1, Q1, r8 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-480)] -// Release input[384] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[400]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[428]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-368)] -// Release input[412] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[428]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q0, Q0, r8 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-416)] -// Release input[400] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[416]: Load 
as Q1 -vldrw.u32 Q1, [r12, #(4 * -88)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[444]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-304)] -// Release input[428] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[444]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q2, Q2, r8 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r12,#(-352)] -// Release input[416] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[432]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-240)] -// Release input[444] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[460]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[456]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -48)] -vmul.u32 Q1, Q1, r8 -// input[452]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -52)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-288)] -// Release input[432] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, 
Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[448]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[476]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-208)] -// Release input[452] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[472]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -32)] -vmul.u32 Q0, Q0, r8 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r12,#(-224)] -// Release input[448] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[464]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -40)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[492]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[492]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q2, Q2, r8 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q0, Q2, r11 
-vstrw.u32 Q1, [r12,#(-160)] -// Release input[464] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[480]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -24)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[508]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-48)] -// Release input[492] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[508]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vmul.u32 Q1, Q1, r8 -// input[500]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -4)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r12,#(-96)] -// Release input[480] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -vqrdmulh.s32 Q0, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(16)] -// Release input[508] from Q1 -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r12,#(-16)] -// Release input[500] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -.equ modulus_inv, 4223674367 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3960 -// Instruction 
count: 3091 \ No newline at end of file diff --git a/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete_double.s b/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete_double.s deleted file mode 100644 index e6a600c..0000000 --- a/tests/ntt_512/auto/ntt_512_u32_33564673_21224105_incomplete_double.s +++ /dev/null @@ -1,4571 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
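The deleted file below interleaves a precomputed twiddle table with long runs of `vqrdmulh`/`vmul`/`vqrdmlah` butterfly sequences. As a hedged, high-level sketch of what each butterfly computes (plain modular integers, not the fixed-point MVE sequence; `q` and `zeta` are taken from the file's own `.equ modulus` directive and twiddle comments, and the helper names are illustrative, not from the source):

```python
# High-level model of a Cooley-Tukey NTT butterfly as performed by the
# deleted assembly. Arithmetic is ordinary modular integers; the real code
# implements the multiplication with a vqrdmulh/vmul/vqrdmlah sequence
# against the precomputed twiddle pairs in the roots table.
q = 33564673      # modulus, from the .equ directive in the file
zeta = 21224105   # root of unity named in the twiddle-table comments

def ct_butterfly(a, b, w):
    """Forward (Cooley-Tukey) butterfly: (a, b) -> (a + w*b, a - w*b) mod q."""
    t = (w * b) % q
    return (a + t) % q, (a - t) % q

def gs_butterfly(x, y, w_inv):
    """Inverse (Gentleman-Sande) butterfly undoing ct_butterfly."""
    inv2 = (q + 1) // 2                      # 2^-1 mod q (q is odd)
    return ((x + y) * inv2) % q, (((x - y) * inv2) % q) * w_inv % q

# Round-trip check with an arbitrary twiddle power.
w = pow(zeta, 64, q)
w_inv = pow(w, -1, q)                        # modular inverse (Python 3.8+)
a, b = 123456, 7654321
x, y = ct_butterfly(a, b, w)
assert gs_butterfly(x, y, w_inv) == (a, b)
```

The point of the precomputed pairs in the `roots` table (one scaled value plus one "twisted" companion per twiddle) is exactly to let the assembly evaluate `(w * b) % q` with two multiplies and one multiply-accumulate instead of an explicit division.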
-/// - -.data -roots: -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 
21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 
2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 32909249 // zeta^ 0 * 2^31 = 21224105^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 0 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 35458195 // zeta^128 * 2^31 = 21224105^128 * 2^31 = 6057702 * 2^31 -.word 387574637 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 21224105^128 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 44770213 // zeta^ 64 * 2^31 = 21224105^ 64 * 2^31 = 16166358 * 2^31 -.word 
1034331227 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 64 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 3545473 // zeta^192 * 2^31 = 21224105^192 * 2^31 = 4070676 * 2^31 -.word 260443775 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 21224105^192 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 20108763 // zeta^ 32 * 2^31 = 21224105^ 32 * 2^31 = 3531198 * 2^31 -.word 225927717 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 32 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 16155699 // zeta^160 * 2^31 = 21224105^160 * 2^31 = 11260731 * 2^31 -.word 2867950541 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 21224105^160 * 71292929 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 23777969 // zeta^ 96 * 2^31 = 21224105^ 96 * 2^31 = 16586522 * 2^31 -.word 1061213519 // 
zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 96 * 71292929 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 43443635 // zeta^224 * 2^31 = 21224105^224 * 2^31 = 23220214 * 2^31 -.word 1485640269 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 21224105^224 * 71292929 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 56312659 // zeta^ 16 * 2^31 = 21224105^ 16 * 2^31 = 7974996 * 2^31 -.word 510244013 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 16 * 71292929 * 2^31 -.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31 -.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31 -.word 50428539 // zeta^144 * 2^31 = 21224105^144 * 2^31 = 11900197 * 2^31 -.word 2908863877 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 21224105^144 * 71292929 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 71292929 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31 -.word 40872355 // zeta^ 80 * 2^31 = 21224105^ 80 * 2^31 = 32337348 * 2^31 -.word 2068958813 // zeta^ 80 * 
f(q^(-1) mod 2^32) * 2^31 = 21224105^ 80 * 71292929 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31 -.word 17505197 // zeta^208 * 2^31 = 21224105^208 * 2^31 = 7350388 * 2^31 -.word 470281299 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 21224105^208 * 71292929 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31 -.word 29514841 // zeta^ 48 * 2^31 = 21224105^ 48 * 2^31 = 25808113 * 2^31 -.word 3798698919 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 48 * 71292929 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31 -.word 46171693 // zeta^176 * 2^31 = 21224105^176 * 2^31 = 21754869 * 2^31 -.word 3539370451 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 21224105^176 * 71292929 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31 -.word 49378579 // zeta^112 * 2^31 = 21224105^112 * 2^31 = 10121756 * 2^31 -.word 647594733 // zeta^112 * f(q^(-1) mod 
2^32) * 2^31 = 21224105^112 * 71292929 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31 -.word 37299575 // zeta^240 * 2^31 = 21224105^240 * 2^31 = 13079905 * 2^31 -.word 2984342153 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 21224105^240 * 71292929 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31 -.word 35114601 // zeta^ 8 * 2^31 = 21224105^ 8 * 2^31 = 31442912 * 2^31 -.word 2011732375 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 8 * 71292929 * 2^31 -.word 7271765 // zeta^ 4 * 2^31 = 21224105^ 4 * 2^31 = 11708223 * 2^31 -.word 2896581291 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 4 * 71292929 * 2^31 -.word 9232849 // zeta^132 * 2^31 = 21224105^132 * 2^31 = 19531360 * 2^31 -.word 1249625647 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 21224105^132 * 71292929 * 2^31 -.word 56661185 // zeta^136 * 2^31 = 21224105^136 * 2^31 = 25072687 * 2^31 -.word 3751646015 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 21224105^136 * 71292929 * 2^31 -.word 5061807 // zeta^ 68 * 2^31 = 21224105^ 68 * 2^31 = 10863968 * 2^31 -.word 695081809 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 68 * 71292929 * 2^31 -.word 12062383 // zeta^196 * 2^31 = 21224105^196 * 2^31 = 23554360 * 2^31 -.word 1507019089 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 21224105^196 * 71292929 * 2^31 -.word 24798937 // zeta^ 72 * 2^31 = 21224105^ 72 * 2^31 = 1138528 * 2^31 -.word 72843559 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 72 * 
71292929 * 2^31 -.word 26674607 // zeta^ 36 * 2^31 = 21224105^ 36 * 2^31 = 29250598 * 2^31 -.word 1871467089 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 36 * 71292929 * 2^31 -.word 6369225 // zeta^164 * 2^31 = 21224105^164 * 2^31 = 6512804 * 2^31 -.word 416692279 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 21224105^164 * 71292929 * 2^31 -.word 2433499 // zeta^200 * 2^31 = 21224105^200 * 2^31 = 27899289 * 2^31 -.word 3932493349 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 21224105^200 * 71292929 * 2^31 -.word 13877423 // zeta^100 * 2^31 = 21224105^100 * 2^31 = 11938968 * 2^31 -.word 763860817 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 21224105^100 * 71292929 * 2^31 -.word 52182971 // zeta^228 * 2^31 = 21224105^228 * 2^31 = 3172265 * 2^31 -.word 2350446661 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 21224105^228 * 71292929 * 2^31 -.word 13509691 // zeta^ 40 * 2^31 = 21224105^ 40 * 2^31 = 15236728 * 2^31 -.word 974853061 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 40 * 71292929 * 2^31 -.word 26766019 // zeta^ 20 * 2^31 = 21224105^ 20 * 2^31 = 4808176 * 2^31 -.word 307629373 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 20 * 71292929 * 2^31 -.word 3049295 // zeta^148 * 2^31 = 21224105^148 * 2^31 = 13952996 * 2^31 -.word 892719281 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 21224105^148 * 71292929 * 2^31 -.word 61528771 // zeta^168 * 2^31 = 21224105^168 * 2^31 = 29831683 * 2^31 -.word 4056128829 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 21224105^168 * 71292929 * 2^31 -.word 27572075 // zeta^ 84 * 2^31 = 21224105^ 84 * 2^31 = 13705304 * 2^31 -.word 876871829 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 84 * 71292929 * 2^31 -.word 62852605 // zeta^212 * 2^31 = 21224105^212 * 2^31 = 26009832 * 2^31 -.word 1664121347 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 21224105^212 * 71292929 * 2^31 -.word 26961583 // zeta^104 * 2^31 = 21224105^104 * 2^31 = 24829277 * 2^31 -.word 3736072529 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 21224105^104 * 71292929 * 2^31 
-.word 41037815 // zeta^ 52 * 2^31 = 21224105^ 52 * 2^31 = 32331817 * 2^31 -.word 4216088585 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 52 * 71292929 * 2^31 -.word 16612991 // zeta^180 * 2^31 = 21224105^180 * 2^31 = 33308953 * 2^31 -.word 4278606209 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 21224105^180 * 71292929 * 2^31 -.word 39914361 // zeta^232 * 2^31 = 21224105^232 * 2^31 = 26527504 * 2^31 -.word 1697242247 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 21224105^232 * 71292929 * 2^31 -.word 32973157 // zeta^116 * 2^31 = 21224105^116 * 2^31 = 12062971 * 2^31 -.word 2919278235 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 21224105^116 * 71292929 * 2^31 -.word 36139229 // zeta^244 * 2^31 = 21224105^244 * 2^31 = 32576304 * 2^31 -.word 2084247331 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 21224105^244 * 71292929 * 2^31 -.word 42427289 // zeta^ 24 * 2^31 = 21224105^ 24 * 2^31 = 23805553 * 2^31 -.word 3670574183 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 24 * 71292929 * 2^31 -.word 61506475 // zeta^ 12 * 2^31 = 21224105^ 12 * 2^31 = 2663422 * 2^31 -.word 170406997 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 12 * 71292929 * 2^31 -.word 55340015 // zeta^140 * 2^31 = 21224105^140 * 2^31 = 14111874 * 2^31 -.word 902884369 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 21224105^140 * 71292929 * 2^31 -.word 22993529 // zeta^152 * 2^31 = 21224105^152 * 2^31 = 20588736 * 2^31 -.word 1317277063 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 21224105^152 * 71292929 * 2^31 -.word 12255067 // zeta^ 76 * 2^31 = 21224105^ 76 * 2^31 = 30527813 * 2^31 -.word 4100667557 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 76 * 71292929 * 2^31 -.word 39251459 // zeta^204 * 2^31 = 21224105^204 * 2^31 = 1599504 * 2^31 -.word 102337021 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 21224105^204 * 71292929 * 2^31 -.word 12459675 // zeta^ 88 * 2^31 = 21224105^ 88 * 2^31 = 8729293 * 2^31 -.word 2705987941 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 88 * 71292929 * 2^31 -.word 
13565749 // zeta^ 44 * 2^31 = 21224105^ 44 * 2^31 = 14112245 * 2^31 -.word 3050391755 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 44 * 71292929 * 2^31 -.word 36826073 // zeta^172 * 2^31 = 21224105^172 * 2^31 = 29475602 * 2^31 -.word 1885862951 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 21224105^172 * 71292929 * 2^31 -.word 17297731 // zeta^216 * 2^31 = 21224105^216 * 2^31 = 25151509 * 2^31 -.word 3756689085 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 21224105^216 * 71292929 * 2^31 -.word 34487347 // zeta^108 * 2^31 = 21224105^108 * 2^31 = 24806528 * 2^31 -.word 1587133389 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 21224105^108 * 71292929 * 2^31 -.word 61222515 // zeta^236 * 2^31 = 21224105^236 * 2^31 = 2847371 * 2^31 -.word 2329659789 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 21224105^236 * 71292929 * 2^31 -.word 51482787 // zeta^ 56 * 2^31 = 21224105^ 56 * 2^31 = 1778108 * 2^31 -.word 113764189 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 56 * 71292929 * 2^31 -.word 62959157 // zeta^ 28 * 2^31 = 21224105^ 28 * 2^31 = 14217049 * 2^31 -.word 3057097163 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 28 * 71292929 * 2^31 -.word 51158985 // zeta^156 * 2^31 = 21224105^156 * 2^31 = 25086215 * 2^31 -.word 3752511543 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 21224105^156 * 71292929 * 2^31 -.word 47832419 // zeta^184 * 2^31 = 21224105^184 * 2^31 = 9175386 * 2^31 -.word 587045533 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 21224105^184 * 71292929 * 2^31 -.word 59122583 // zeta^ 92 * 2^31 = 21224105^ 92 * 2^31 = 12661993 * 2^31 -.word 2957603945 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 92 * 71292929 * 2^31 -.word 12915351 // zeta^220 * 2^31 = 21224105^220 * 2^31 = 18981045 * 2^31 -.word 3361899881 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 21224105^220 * 71292929 * 2^31 -.word 32696733 // zeta^120 * 2^31 = 21224105^120 * 2^31 = 6110658 * 2^31 -.word 390962787 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 21224105^120 * 71292929 * 2^31 -.word 32364195 // 
zeta^ 60 * 2^31 = 21224105^ 60 * 2^31 = 30118507 * 2^31 -.word 4074479965 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 21224105^ 60 * 71292929 * 2^31 -.word 17635297 // zeta^188 * 2^31 = 21224105^188 * 2^31 = 3783875 * 2^31 -.word 2389577759 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 21224105^188 * 71292929 * 2^31 -.word 16328205 // zeta^248 * 2^31 = 21224105^248 * 2^31 = 14087250 * 2^31 -.word 901308915 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 21224105^248 * 71292929 * 2^31 -.word 38891533 // zeta^124 * 2^31 = 21224105^124 * 2^31 = 33548892 * 2^31 -.word 2146473971 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 21224105^124 * 71292929 * 2^31 -.word 24452961 // zeta^252 * 2^31 = 21224105^252 * 2^31 = 29158115 * 2^31 -.word 4013033631 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 21224105^252 * 71292929 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_512_u32_33564673_21224105_incomplete_double, %function -.global ntt_512_u32_33564673_21224105_incomplete_double -ntt_512_u32_33564673_21224105_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 33564673 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r11 -vqrdmulh.s32 Q4, Q2, r9 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r8 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r11 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r11 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-56)] -vqrdmulh.s32 Q6, Q3, r7 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r9 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r8 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r11 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r7 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r6 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 
Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmul.u32 Q0, Q0, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 
-vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r8 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] 
from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] 
from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 
-vqrdmulh.s32 Q2, Q1, r9 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q1, Q1, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q0, Q0, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, 
#(4 * -72)] -vmul.u32 Q2, Q2, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q1, Q1, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmul.u32 Q0, Q0, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, 
[r0, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[448]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[448]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vmul.u32 Q2, Q2, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[452]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-224)] -// Release input[448] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[452]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[388]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -116)] -vmul.u32 Q1, Q1, r8 -// input[324]: Load as Q4 -vldrw.u32 
Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[456]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-208)] -// Release input[452] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-464)] -// Release input[388] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[456]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[460]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-192)] -// Release input[456] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[460]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[396]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -108)] -vmul.u32 Q2, Q2, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, 
[r14,#(48)] -// Release input[264] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[268]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[464]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-432)] -// Release input[396] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[464]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vmul.u32 Q1, Q1, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(64)] -// Release input[268] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[468]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-160)] -// Release input[464] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[468]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q0, Q0, r8 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vqrdmulh.s32 Q2, 
Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[276]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[472]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-144)] -// Release input[468] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[472]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q2, Q2, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(96)] -// Release input[276] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[280]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[476]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-128)] -// Release input[472] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[476]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q1, Q1, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, 
Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[480]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-112)] -// Release input[476] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[480]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q0, Q0, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[484]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-96)] -// Release input[480] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[484]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q2, Q2, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[292]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[488]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-80)] -// Release input[484] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[488]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[296]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[492]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-64)] -// Release input[488] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[492]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q0, Q0, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r5 
-vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-48)] -// Release input[492] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q2, Q2, r8 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q1, Q1, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 
-vqrdmlah.s32 Q5, Q1, r11 -// input[504]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[504]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(0)] -// Release input[504] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r8 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[48]: Load as Q1 -vldrw.u32 Q1, 
[r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[48]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q1, Q1, r8 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r9 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q0, Q0, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r11 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] 
-vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r9 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q0, Q2, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r11 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r9 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q1, Q1, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q2, Q1, r11 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r11 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r11 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, 
[r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[112]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q0, Q0, r8
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(448)]
-// Release input[112] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q2, Q2, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q1, Q1, r8
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r0,#(272)]
-// Release input[68] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q0, Q0, r8
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q2, Q2, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r8
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q0, Q0, r8
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q2, Q2, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[140]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q1, Q1, r8
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(-448)]
-// Release input[140] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q0, Q0, r8
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q2, Q2, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q1, Q1, r8
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vmul.u32 Q0, Q0, r8
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q2, Q2, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[260]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q1, Q1, r8
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(32)]
-// Release input[260] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q0, Q0, r8
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q2, Q2, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[372]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r8
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(480)]
-// Release input[372] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r8
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(288)]
-// Release input[324] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[380]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q2, Q2, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r14,#(304)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[332]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q1, Q1, r8
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r14,#(320)]
-// Release input[332] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[384]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[436]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q0, Q0, r8
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r12,#(-480)]
-// Release input[384] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[388]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q2, Q2, r8
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r12,#(-464)]
-// Release input[388] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[444]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[444]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q1, Q1, r8
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[396]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-240)]
-// Release input[444] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[496]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q0, Q0, r8
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-32)]
-// Release input[496] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[500]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q2, Q2, r8
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r9
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q1, Q1, r8
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q2, Q1, r11
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r11
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r11
-// input[508]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[508]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r9
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q0, Q0, r8
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r11
-// input[460]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r11
-// input[12]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(16)]
-// Release input[508] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[12]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r9
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vmul.u32 Q2, Q2, r8
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmlah.s32 Q0, Q2, r11
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vstrw.u32 Q2, [r1,#(96)]
-vqrdmulh.s32 Q7, Q2, r5
-vadd.s32 Q3, Q3, Q5
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(64)]
-vqrdmlah.s32 Q7, Q2, r11
-// Release input[12] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(112)]
-vqrdmulh.s32 Q7, Q3, r5
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q4, [r1,#(32)]
-vqrdmlah.s32 Q7, Q3, r11
-vstrw.u32 Q7, [r1,#(80)]
-// Release input[8] from Q3
-vqrdmulh.s32 Q7, Q4, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q4, Q4, r6
-vstrw.u32 Q0, [r1,#(0)]!
-vqrdmlah.s32 Q7, Q4, r11
-vneg.s32 Q7, Q7
-// Release input[4] from Q4
-vqrdmulh.s32 Q2, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q2, [r1,#(16)]
-// Release input[0] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[28]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmlah.s32 Q0, Q1, r11
-vqrdmulh.s32 Q4, Q2, r9
-vsub.s32 Q1, Q3, Q0
-vmul.u32 Q2, Q2, r8
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q4, Q2, r11
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q2, Q0, Q4
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q1, r11
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q3, r7
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q3, Q3, r6
-vstrw.u32 Q1, [r1,#(224)]
-vqrdmulh.s32 Q7, Q1, r5
-vadd.s32 Q2, Q2, Q5
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(192)]
-vqrdmlah.s32 Q7, Q1, r11
-// Release input[28] from Q1
-vqrdmlah.s32 Q6, Q3, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q2, r5
-vsub.s32 Q3, Q0, Q6
-vmul.u32 Q2, Q2, r4
-vstrw.u32 Q3, [r1,#(160)]
-vqrdmlah.s32 Q7, Q2, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q7, Q3, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q3, Q3, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q3, r11
-vneg.s32 Q7, Q7
-// Release input[20] from Q3
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[16] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[44]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vmul.u32 Q4, Q4, r8
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q4, r11
-vqrdmulh.s32 Q3, Q1, r9
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r11
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r11
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r11
-// Release input[44] from Q4
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[36] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[32] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vmul.u32 Q3, Q3, r8
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vqrdmlah.s32 Q0, Q3, r11
-vqrdmulh.s32 Q4, Q1, r9
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r11
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r11
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r11
-// Release input[60] from Q3
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[56] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[52] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[48] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[76]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vmul.u32 Q4, Q4, r8
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r11
-vqrdmulh.s32 Q3, Q1, r9
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r11
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r11
-// Release input[76] from Q4
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[68] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[64] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[92]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[88]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 88)]
-vmul.u32 Q3, Q3, r8
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q3, r11
-vqrdmulh.s32 Q4, Q1, r9
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r11
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r11
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r11
-// Release input[92] from Q3
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[88] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[84] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[80] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[108]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 104)]
-vmul.u32 Q4, Q4, r8
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q4, r11
-vqrdmulh.s32 Q3, Q1, r9
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r11
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r11
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r11
-// Release input[108] from Q4
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[104] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[100] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[96] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[124]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vmul.u32 Q3, Q3, r8
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q3, r11
-vqrdmulh.s32 Q4, Q1, r9
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r11
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r11
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r11
-// Release input[124] from Q3
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[116] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[112] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[140]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r8 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[140] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[136] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[132] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[128] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r8 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[156] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[152] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[148] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[144] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q4, Q4, r8 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[172] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[168] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[164] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[160] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[180] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[176] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[204] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[196] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[192] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[220] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[216] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[208] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[236]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[236] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[224] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[252] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[248] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[244] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[240] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[268] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[264] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[260] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[256] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[284]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[272]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[284] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[280] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[276] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[272] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[300]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[316]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[300] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[296] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[292] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[288] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[316]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[316] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[312] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[308] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[304] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[320]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[332] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[328] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[324] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[320] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[336]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[348] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[344] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[340] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[336] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[352]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[364] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[360] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[356] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[352] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[380] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[376] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[372] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[368] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[396]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[392]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -vqrdmlah.s32 Q0, Q4, r11 -vqrdmulh.s32 Q3, Q1, r9 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r11 -// input[384]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q4, r5 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r4 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r11 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r11 -// Release input[396] from Q4 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[392] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11 -vneg.s32 Q7, Q7 -// Release input[388] from Q2 -vqrdmulh.s32 Q1, Q0, r7 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r6 -ldrd r9, r8, [r10], #+8 -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q1, [r1,#(16)] -// Release input[384] from Q0 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[412]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[408]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -96)] -vmul.u32 Q3, Q3, r8 -// input[404]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q3, r11 -vqrdmulh.s32 Q4, Q1, r9 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r11 -// input[400]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q3, r5 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r4 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r11 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmulh.s32 Q6, Q2, r7 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r5 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r4 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r11 -// Release input[412] from Q3 -vqrdmlah.s32 Q6, Q2, r11 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r5 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r4 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r11 -vstrw.u32 Q7, [r1,#(208)] -// Release input[408] from Q1 -vqrdmulh.s32 Q7, Q2, r7 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r6 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[404] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[400] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[428]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[424]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -80)]
-vmul.u32 Q4, Q4, r8
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-vqrdmlah.s32 Q0, Q4, r11
-vqrdmulh.s32 Q3, Q1, r9
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r11
-// input[416]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r11
-// input[444]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r11
-// Release input[428] from Q4
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[424] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[420] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[416] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[444]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vmul.u32 Q3, Q3, r8
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vqrdmlah.s32 Q0, Q3, r11
-vqrdmulh.s32 Q4, Q1, r9
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r11
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r11
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r11
-// Release input[444] from Q3
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[440] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[436] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[432] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[460]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vmul.u32 Q4, Q4, r8
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-vqrdmlah.s32 Q0, Q4, r11
-vqrdmulh.s32 Q3, Q1, r9
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r11
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r11
-// input[476]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r11
-// Release input[460] from Q4
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[456] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[452] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[448] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[476]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[472]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -32)]
-vmul.u32 Q3, Q3, r8
-// input[468]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -36)]
-vqrdmlah.s32 Q0, Q3, r11
-vqrdmulh.s32 Q4, Q1, r9
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r11
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r11
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r11
-// Release input[476] from Q3
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[472] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[468] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[464] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[492]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vmul.u32 Q4, Q4, r8
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmlah.s32 Q0, Q4, r11
-vqrdmulh.s32 Q3, Q1, r9
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r11
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q5, Q4, r5
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r4
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r11
-// input[508]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q2, r7
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r11
-// Release input[492] from Q4
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r5
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r11
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[488] from Q1
-vqrdmulh.s32 Q7, Q2, r7
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r11
-vneg.s32 Q7, Q7
-// Release input[484] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[480] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[508]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vmul.u32 Q3, Q3, r8
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmlah.s32 Q0, Q3, r11
-vqrdmulh.s32 Q4, Q1, r9
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r11
-// input[496]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -8)]
-vqrdmulh.s32 Q5, Q3, r5
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r4
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r11
-vqrdmulh.s32 Q4, Q2, r7
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q6, Q3, r5
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r4
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q6, Q3, r11
-// Release input[508] from Q3
-vqrdmlah.s32 Q4, Q2, r11
-vneg.s32 Q6, Q6
-vstrw.u32 Q6, [r1,#(240)]
-vqrdmulh.s32 Q6, Q1, r5
-vsub.s32 Q2, Q0, Q4
-vmul.u32 Q1, Q1, r4
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q6, Q1, r11
-vstrw.u32 Q6, [r1,#(208)]
-// Release input[504] from Q1
-vqrdmulh.s32 Q6, Q2, r7
-vadd.s32 Q0, Q0, Q4
-vmul.u32 Q2, Q2, r6
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q6, Q2, r11
-vneg.s32 Q6, Q6
-// Release input[500] from Q2
-vqrdmulh.s32 Q1, Q0, r7
-vstrw.u32 Q6, [r1,#(48)]
-vmul.u32 Q0, Q0, r6
-ldrd r9, r8, [r10], #+8
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[496] from Q0
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-.equ modulus_inv, 4223674367
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 4539
-// Instruction count: 3670
\ No newline at end of file
diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete.s
deleted file mode 100644
index a8de1ee..0000000
--- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete.s
+++ /dev/null
@@ -1,7499 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31
-.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31
-.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31
-.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31
-.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31
-.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31
-.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31
-.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31
-.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31
-.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31
-.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31
-.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31
-.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31
-.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31
-.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31
-.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31
-.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31
-.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31
-.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31
-.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31
-.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31
-.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31
-.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31
-.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31
-.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31
-.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31
-.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31
-.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31
-.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31
-.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31
-.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31
-.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31
-.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31
-.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31
-.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31
-.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31
-.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31
-.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31
-.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31
-.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31
-.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31
-.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31
-.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31
-.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31
-.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31
-.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31
-.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31
-.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31
-.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31
-.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31
-.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31
-.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31
-.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31
-.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31
-.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31
-.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31
-.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31
-.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31
-.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31
-.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31
-.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31
-.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31
-.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31
-.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31
-.word 40872659 // zeta^ 12 * 2^31 = 299353^ 12 * 2^31 = 32984098 * 2^31
-.word 2110821165 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31
-.word 5033605 // zeta^204 * 2^31 = 299353^204 * 2^31 = 26691971 * 2^31
-.word 3855639419 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 299353^204 * 375649793 * 2^31
-.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31
-.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31
-.word 50479773 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31
-.word 2420367203 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31
-.word 58797193 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31
-.word 3703057783 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31
-.word 59392861 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31
-.word 348348067 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31
-.word 9383201 // zeta^252 * 2^31 = 299353^252 * 2^31 = 8471290 * 2^31
-.word 542121183 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 299353^252 * 375649793 * 2^31
-.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 63329695 // zeta^156 * 2^31 = 299353^156 * 2^31 = 8247799 * 2^31
-.word 2675302497 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 299353^156 * 375649793 * 2^31
-.word 57130935 // zeta^348 * 2^31 = 299353^348 * 2^31 = 28470806 * 2^31
-.word 1821992521 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31
-.word 65797823 // zeta^ 36 * 2^31 = 299353^ 36 * 2^31 = 18723698 * 2^31
-.word 1198225217 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31
-.word 10391631 // zeta^228 * 2^31 = 299353^228 * 2^31 = 2138810 * 2^31
-.word 136873393 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 299353^228 * 375649793 * 2^31
-.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 31719253 // zeta^132 * 2^31 = 299353^132 * 2^31 = 23825509 * 2^31
-.word 3672199851 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 299353^132 * 375649793 * 2^31
-.word 12271567 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31
-.word 2565264945 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31
-.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31
-.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31
-.word 21111903 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31
-.word 890081697 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31
-.word 12778219 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31
-.word 1732129557 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31
-.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 35733845 // zeta^180 * 2^31 = 299353^180 * 2^31 = 31254932 * 2^31
-.word 2000162987 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 299353^180 * 375649793 * 2^31
-.word 6014597 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31
-.word 2607901563 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31
-.word 65310821 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31
-.word 3561709979 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31
-.word 22138503 // zeta^ 4 * 2^31 = 299353^ 4 * 2^31 = 27792935 * 2^31
-.word 3926095737 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 4 * 375649793 * 2^31
-.word 33080685 // zeta^196 * 2^31 = 299353^196 * 2^31 = 14985834 * 2^31
-.word 959020179 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 299353^196 * 375649793 * 2^31
-.word 1555569 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31
-.word 2602610063 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31
-.word 2867655 // zeta^100 * 2^31 = 299353^100 * 2^31 = 27701331 * 2^31
-.word 3920233529 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 299353^100 * 375649793 * 2^31
-.word 16116991 // zeta^292 * 2^31 = 299353^292 * 2^31 = 7520866 * 2^31
-.word 481298689 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 299353^292 * 375649793 * 2^31
-.word 6082985 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31
-.word 869255255 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31
-.word 62987623 // zeta^ 52 * 2^31 = 299353^ 52 * 2^31 = 12887930 * 2^31
-.word 824764569 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 52 * 375649793 * 2^31
-.word 21603065 // zeta^244 * 2^31 = 299353^244 * 2^31 = 20924057 * 2^31
-.word 3486521095 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 299353^244 * 375649793 * 2^31
-.word 9182701 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31
-.word 1779115027 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31
-.word 52075375 // zeta^148 * 2^31 = 299353^148 * 2^31 = 18248795 * 2^31
-.word 3315317393 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 299353^148 * 375649793 * 2^31
-.word 41362929 // zeta^340 * 2^31 = 299353^340 * 2^31 = 7570258 * 2^31
-.word 484459535 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 299353^340 * 375649793 * 2^31
-.word 36605521 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31
-.word 1102879663 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31
-.word 32011901 // zeta^ 28 * 2^31 = 299353^ 28 * 2^31 = 3117724 * 2^31
-.word 199519107 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 28 * 375649793 * 2^31
-.word 794339 // zeta^220 * 2^31 = 299353^220 * 2^31 = 26868479 * 2^31
-.word 3866935069 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 299353^220 * 375649793 * 2^31
-.word 57238029 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31
-.word 2941722611 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31
-.word 56716901 // zeta^124 * 2^31 = 299353^124 * 2^31 = 32895965 * 2^31
-.word 4252664731 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 299353^124 * 375649793 * 2^31
-.word 37067083 // zeta^316 * 2^31 = 299353^316 * 2^31 = 17429125 * 2^31
-.word 3262862517 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 299353^316 * 375649793 * 2^31
-.word 41992621 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31
-.word 2138832979 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31
-.word 37641785 // zeta^ 76 * 2^31 = 299353^ 76 * 2^31 = 14988263 * 2^31
-.word 3106659271 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 76 * 375649793 * 2^31
-.word 9036599 // zeta^268 * 2^31 = 299353^268 * 2^31 = 26964245 * 2^31
-.word 3873063625 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 299353^268 * 375649793 * 2^31
-.word 53062965 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31
-.word 1726375115 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31
-.word 33892281 // zeta^172 * 2^31 = 299353^172 * 2^31 = 18683355 * 2^31
-.word 3343127111 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 299353^172 * 375649793 * 2^31
-.word 27392067 // zeta^364 * 2^31 = 299353^364 * 2^31 = 5739597 * 2^31
-.word 2514789821 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 299353^364 * 375649793 * 2^31
-.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31
-.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31
-.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31
-.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31
-.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31
-.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31
-.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31
-.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31
-.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31
-.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31
-.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31
-.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31
-.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31
-.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31
-.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31
-.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31
-.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31
-.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31
-.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31
-.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31
-.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31
-.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31
-.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31
-.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31
-.word 42676979 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31
-.word 2760264461 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31
-.word 27097661 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31
-.word 659875779 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31
-.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31
-.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31
-.word 4639589 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31
-.word 2721523355 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31
-.word 56908961 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31
-.word 3814059359 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31
-.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31
-.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31
-.word 7108001 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31
-.word 1934087263 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31
-.word 16267963 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31
-.word 2923893573 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 299353^280 * 375649793 * 2^31
-.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31
-.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31
-.word 28966165 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31
-.word 2549404907 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31
-.word 15324513 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31
-.word 1904735391 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31
-.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31
-.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31
-.word 65310821 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31
-.word 3561709979 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31
-.word 1555569 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31
-.word 2602610063 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31
-.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31
-.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31
-.word 6082985 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31
-.word 869255255 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31
-.word 9182701 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31
-.word 1779115027 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31
-.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31
-.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31
-.word 36605521 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31
-.word 1102879663 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31
-.word 57238029 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31
-.word 2941722611 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31
-.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31
-.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31
-.word 41992621 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31
-.word 2138832979 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31
-.word 53062965 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31
-.word 1726375115 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31
-.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31
-.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31
-.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31
-.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31
-.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31
-.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31
-.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31
-.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31
-.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31
-.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31
-.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31
-.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31
-.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31
-.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31
-.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31
-.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31
-.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31
-.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31
-.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31
-.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31
-.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31
-.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31
-.word 42676979 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31
-.word 2760264461 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31
-.word 5740163 // zeta^ 20 * 2^31 = 299353^ 20 * 2^31 = 24739198 * 2^31
-.word 1583187837 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 20 * 375649793 * 2^31
-.word 28917839 // zeta^212 * 2^31 = 299353^212 * 2^31 = 21478846 * 2^31
-.word 1374541233 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 299353^212 * 375649793 * 2^31
-.word 27097661 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31
-.word 659875779 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31 -.word 49145461 // zeta^116 * 2^31 = 299353^116 * 2^31 = 13729478 * 2^31 -.word 878619531 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 299353^116 * 375649793 * 2^31 -.word 6303215 // zeta^308 * 2^31 = 299353^308 * 2^31 = 18367002 * 2^31 -.word 1175398417 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 299353^308 * 375649793 * 2^31 -.word 4639589 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31 -.word 2721523355 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31 -.word 54366111 // zeta^ 68 * 2^31 = 299353^ 68 * 2^31 = 8457503 * 2^31 -.word 2688722529 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 68 * 375649793 * 2^31 -.word 43137743 // zeta^260 * 2^31 = 299353^260 * 2^31 = 29589567 * 2^31 -.word 4041071409 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 299353^260 * 375649793 * 2^31 -.word 56908961 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 3814059359 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 48357821 // zeta^164 * 2^31 = 299353^164 * 2^31 = 26244564 * 2^31 -.word 1679523907 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 299353^164 * 375649793 * 2^31 -.word 41080969 // zeta^356 * 2^31 = 299353^356 * 2^31 = 7994472 * 2^31 -.word 511607159 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 299353^356 * 375649793 * 2^31 -.word 7108001 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31 -.word 1934087263 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31 -.word 8652081 // zeta^ 44 * 2^31 = 299353^ 44 * 2^31 = 27932647 * 2^31 -.word 3935036623 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 44 * 375649793 * 2^31 -.word 44314847 // zeta^236 * 2^31 = 299353^236 * 2^31 = 10003728 * 2^31 -.word 640189729 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 299353^236 * 375649793 * 2^31 -.word 16267963 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 2923893573 // zeta^280 * f(q^(-1) mod 2^32) * 
2^31 = 299353^280 * 375649793 * 2^31 -.word 16352265 // zeta^140 * 2^31 = 299353^140 * 2^31 = 26391350 * 2^31 -.word 1688917495 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 299353^140 * 375649793 * 2^31 -.word 948813 // zeta^332 * 2^31 = 299353^332 * 2^31 = 11703708 * 2^31 -.word 748980147 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 299353^332 * 375649793 * 2^31 -.word 28966165 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31 -.word 2549404907 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31 -.word 44334383 // zeta^ 92 * 2^31 = 299353^ 92 * 2^31 = 31954666 * 2^31 -.word 2044942545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 92 * 375649793 * 2^31 -.word 64874787 // zeta^284 * 2^31 = 299353^284 * 2^31 = 5130075 * 2^31 -.word 2475783389 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 299353^284 * 375649793 * 2^31 -.word 15324513 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 1904735391 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 62902951 // zeta^188 * 2^31 = 299353^188 * 2^31 = 22872479 * 2^31 -.word 3611210585 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 299353^188 * 375649793 * 2^31 -.word 53337279 // zeta^380 * 2^31 = 299353^380 * 2^31 = 9132318 * 2^31 -.word 584423745 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 299353^380 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_768_u32_33556993_299353_incomplete, %function -.global ntt_768_u32_33556993_299353_incomplete -ntt_768_u32_33556993_299353_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -// Use r11 as marker for r0 + 3024 -add r11, r12, #1008 -.equ modulus, 33556993 -movw r10, #:lower16:modulus -movt r10, #:upper16:modulus -ldr r9, roots_addr -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// 
input[512]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 8)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r8 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r7 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r10 -vsub.s32 Q4, Q0, Q1 -// Release input[512] from Q1 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vadd.s32 Q6, Q4, Q3 -// input[516]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 12)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r12,#(32)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[260]: Already loaded as Q1 -// input[516]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r8 -vadd.s32 Q4, Q1, Q7 -// Release input[260] from Q1 -vmul.u32 Q3, Q0, r7 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q3, r10 -vsub.s32 Q3, Q1, Q7 -// Release input[516] from Q7 -// input[264]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 12)] -vadd.s32 Q6, Q3, Q2 -// input[520]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(32)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r12,#(48)] -vadd.s32 Q4, Q4, Q1 -// Release input[4] from Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[264]: Already loaded as Q5 -// input[520]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[264] from Q5 -vmul.u32 Q2, Q0, r7 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[520] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[524]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(48)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(64)] -vadd.s32 Q3, Q3, Q4 -// Release input[8] from Q4 -vstrw.u32 Q3, [r0,#(32)] -// input[268]: Already loaded as Q5 -// input[524]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 
Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r7 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[524] from Q7 -// input[272]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[528]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[272]: Already loaded as Q5 -// input[528]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[272] from Q5 -vmul.u32 Q2, Q0, r7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[528] from Q7 -// input[276]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[532]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(96)] -vadd.s32 Q3, Q3, Q4 -// Release input[16] from Q4 -vstrw.u32 Q3, [r0,#(64)] -// input[276]: Already loaded as Q5 -// input[532]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[276] from Q5 -vmul.u32 Q2, Q0, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[532] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[536]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(96)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(112)] -vadd.s32 Q3, Q3, Q4 -// Release input[20] from Q4 -vstrw.u32 Q3, [r0,#(80)] -// input[280]: Already loaded as Q5 -// input[536]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, 
Q0, r7 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[536] from Q7 -// input[284]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[540]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[284]: Already loaded as Q5 -// input[540]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[284] from Q5 -vmul.u32 Q2, Q0, r7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[540] from Q7 -// input[288]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[544]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(144)] -vadd.s32 Q3, Q3, Q4 -// Release input[28] from Q4 -vstrw.u32 Q3, [r0,#(112)] -// input[288]: Already loaded as Q5 -// input[544]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[288] from Q5 -vmul.u32 Q2, Q0, r7 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[544] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[548]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(144)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(160)] -vadd.s32 Q3, Q3, Q4 -// Release input[32] from Q4 -vstrw.u32 Q3, [r0,#(128)] -// input[292]: Already loaded as Q5 -// input[548]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] 
-vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[548] from Q7 -// input[296]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[552]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[296]: Already loaded as Q5 -// input[552]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[296] from Q5 -vmul.u32 Q2, Q0, r7 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[552] from Q7 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[556]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(192)] -vadd.s32 Q3, Q3, Q4 -// Release input[40] from Q4 -vstrw.u32 Q3, [r0,#(160)] -// input[300]: Already loaded as Q5 -// input[556]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[300] from Q5 -vmul.u32 Q2, Q0, r7 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[556] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[560]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(192)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(208)] -vadd.s32 Q3, Q3, Q4 -// Release input[44] from Q4 -vstrw.u32 Q3, [r0,#(176)] -// input[304]: Already loaded as Q5 -// input[560]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r7 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[560] 
from Q7 -// input[308]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// input[564]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[308]: Already loaded as Q5 -// input[564]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[308] from Q5 -vmul.u32 Q2, Q0, r7 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[564] from Q7 -// input[312]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[568]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(240)] -vadd.s32 Q3, Q3, Q4 -// Release input[52] from Q4 -vstrw.u32 Q3, [r0,#(208)] -// input[312]: Already loaded as Q5 -// input[568]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[312] from Q5 -vmul.u32 Q2, Q0, r7 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[568] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[572]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(240)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(256)] -vadd.s32 Q3, Q3, Q4 -// Release input[56] from Q4 -vstrw.u32 Q3, [r0,#(224)] -// input[316]: Already loaded as Q5 -// input[572]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[572] from Q7 -// input[320]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] 
-vadd.s32 Q6, Q2, Q1 -// input[576]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[320]: Already loaded as Q5 -// input[576]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[320] from Q5 -vmul.u32 Q2, Q0, r7 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[576] from Q7 -// input[324]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[580]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(288)] -vadd.s32 Q3, Q3, Q4 -// Release input[64] from Q4 -vstrw.u32 Q3, [r0,#(256)] -// input[324]: Already loaded as Q5 -// input[580]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[324] from Q5 -vmul.u32 Q2, Q0, r7 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[580] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[584]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(288)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(304)] -vadd.s32 Q3, Q3, Q4 -// Release input[68] from Q4 -vstrw.u32 Q3, [r0,#(272)] -// input[328]: Already loaded as Q5 -// input[584]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r7 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[584] from Q7 -// input[332]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// input[588]: Load as Q7 -vldrw.u32 Q7, [r12, 
#(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[332]: Already loaded as Q5 -// input[588]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[332] from Q5 -vmul.u32 Q2, Q0, r7 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[588] from Q7 -// input[336]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[592]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(336)] -vadd.s32 Q3, Q3, Q4 -// Release input[76] from Q4 -vstrw.u32 Q3, [r0,#(304)] -// input[336]: Already loaded as Q5 -// input[592]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[336] from Q5 -vmul.u32 Q2, Q0, r7 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[592] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[596]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(336)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(352)] -vadd.s32 Q3, Q3, Q4 -// Release input[80] from Q4 -vstrw.u32 Q3, [r0,#(320)] -// input[340]: Already loaded as Q5 -// input[596]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[596] from Q7 -// input[344]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[600]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 
Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[344]: Already loaded as Q5 -// input[600]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[344] from Q5 -vmul.u32 Q2, Q0, r7 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[600] from Q7 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[604]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(384)] -vadd.s32 Q3, Q3, Q4 -// Release input[88] from Q4 -vstrw.u32 Q3, [r0,#(352)] -// input[348]: Already loaded as Q5 -// input[604]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[348] from Q5 -vmul.u32 Q2, Q0, r7 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[604] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[608]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(384)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(400)] -vadd.s32 Q3, Q3, Q4 -// Release input[92] from Q4 -vstrw.u32 Q3, [r0,#(368)] -// input[352]: Already loaded as Q5 -// input[608]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r7 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[608] from Q7 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[612]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(416)] -vadd.s32 Q3, Q3, Q4 -// 
Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[356]: Already loaded as Q5 -// input[612]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[356] from Q5 -vmul.u32 Q2, Q0, r7 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[612] from Q7 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[616]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(432)] -vadd.s32 Q3, Q3, Q4 -// Release input[100] from Q4 -vstrw.u32 Q3, [r0,#(400)] -// input[360]: Already loaded as Q5 -// input[616]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[360] from Q5 -vmul.u32 Q2, Q0, r7 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[616] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[620]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(432)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(448)] -vadd.s32 Q3, Q3, Q4 -// Release input[104] from Q4 -vstrw.u32 Q3, [r0,#(416)] -// input[364]: Already loaded as Q5 -// input[620]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r7 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[620] from Q7 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q6, Q2, Q1 -// input[624]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] 
-// input[368]: Already loaded as Q5 -// input[624]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[368] from Q5 -vmul.u32 Q2, Q0, r7 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[624] from Q7 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[628]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(480)] -vadd.s32 Q3, Q3, Q4 -// Release input[112] from Q4 -vstrw.u32 Q3, [r0,#(448)] -// input[372]: Already loaded as Q5 -// input[628]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[372] from Q5 -vmul.u32 Q2, Q0, r7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[628] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[632]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(480)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(496)] -vadd.s32 Q3, Q3, Q4 -// Release input[116] from Q4 -vstrw.u32 Q3, [r0,#(464)] -// input[376]: Already loaded as Q5 -// input[632]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r7 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[632] from Q7 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q6, Q2, Q1 -// input[636]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[380]: Already loaded as Q5 -// 
input[636]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[380] from Q5 -vmul.u32 Q2, Q0, r7 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[636] from Q7 -// input[384]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -120)] -vadd.s32 Q6, Q2, Q1 -// input[640]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q3, Q3, Q4 -// Release input[124] from Q4 -vstrw.u32 Q3, [r0,#(496)] -// input[384]: Already loaded as Q5 -// input[640]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[384] from Q5 -vmul.u32 Q2, Q0, r7 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -124)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[640] from Q7 -// input[388]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[644]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-480)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-464)] -vadd.s32 Q3, Q3, Q4 -// Release input[128] from Q4 -vstrw.u32 Q3, [r14,#(-496)] -// input[388]: Already loaded as Q5 -// input[644]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[388] from Q5 -vmul.u32 Q2, Q0, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[644] from Q7 -// input[392]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -112)] -vadd.s32 Q6, Q2, Q1 -// input[648]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[392]: Already loaded as Q5 -// input[648]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[392] from Q5 -vmul.u32 Q2, Q0, r7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[648] from Q7 -// input[396]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -108)] -vadd.s32 Q6, Q2, Q1 -// input[652]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q3, Q3, Q4 -// Release input[136] from Q4 -vstrw.u32 Q3, [r14,#(-464)] -// input[396]: Already loaded as Q5 -// input[652]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[396] from Q5 -vmul.u32 Q2, Q0, r7 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[652] from Q7 -// input[400]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[656]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-432)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q4 -// Release input[140] from Q4 -vstrw.u32 Q3, [r14,#(-448)] -// input[400]: Already loaded as Q5 -// input[656]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[400] from Q5 -vmul.u32 Q2, Q0, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[656] from Q7 -// input[404]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -100)] -vadd.s32 Q6, Q2, Q1 -// input[660]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[404]: Already loaded as Q5 -// input[660]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[404] from Q5
-vmul.u32 Q2, Q0, r7
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[660] from Q7
-// input[408]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -96)]
-vadd.s32 Q6, Q2, Q1
-// input[664]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-384)]
-vadd.s32 Q3, Q3, Q4
-// Release input[148] from Q4
-vstrw.u32 Q3, [r14,#(-416)]
-// input[408]: Already loaded as Q5
-// input[664]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[408] from Q5
-vmul.u32 Q2, Q0, r7
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[664] from Q7
-// input[412]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[668]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-384)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[152] from Q4
-vstrw.u32 Q3, [r14,#(-400)]
-// input[412]: Already loaded as Q5
-// input[668]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[412] from Q5
-vmul.u32 Q2, Q0, r7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[668] from Q7
-// input[416]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -88)]
-vadd.s32 Q6, Q2, Q1
-// input[672]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -84)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[416]: Already loaded as Q5
-// input[672]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[416] from Q5
-vmul.u32 Q2, Q0, r7
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[672] from Q7
-// input[420]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -84)]
-vadd.s32 Q6, Q2, Q1
-// input[676]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-336)]
-vadd.s32 Q3, Q3, Q4
-// Release input[160] from Q4
-vstrw.u32 Q3, [r14,#(-368)]
-// input[420]: Already loaded as Q5
-// input[676]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[420] from Q5
-vmul.u32 Q2, Q0, r7
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[676] from Q7
-// input[424]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[680]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-336)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[164] from Q4
-vstrw.u32 Q3, [r14,#(-352)]
-// input[424]: Already loaded as Q5
-// input[680]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[424] from Q5
-vmul.u32 Q2, Q0, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[680] from Q7
-// input[428]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -76)]
-vadd.s32 Q6, Q2, Q1
-// input[684]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -72)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[428]: Already loaded as Q5
-// input[684]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[428] from Q5
-vmul.u32 Q2, Q0, r7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[684] from Q7
-// input[432]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -72)]
-vadd.s32 Q6, Q2, Q1
-// input[688]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-288)]
-vadd.s32 Q3, Q3, Q4
-// Release input[172] from Q4
-vstrw.u32 Q3, [r14,#(-320)]
-// input[432]: Already loaded as Q5
-// input[688]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[432] from Q5
-vmul.u32 Q2, Q0, r7
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[688] from Q7
-// input[436]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[692]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-288)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[176] from Q4
-vstrw.u32 Q3, [r14,#(-304)]
-// input[436]: Already loaded as Q5
-// input[692]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[436] from Q5
-vmul.u32 Q2, Q0, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[692] from Q7
-// input[440]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -64)]
-vadd.s32 Q6, Q2, Q1
-// input[696]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -60)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[440]: Already loaded as Q5
-// input[696]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[440] from Q5
-vmul.u32 Q2, Q0, r7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[696] from Q7
-// input[444]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -60)]
-vadd.s32 Q6, Q2, Q1
-// input[700]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-240)]
-vadd.s32 Q3, Q3, Q4
-// Release input[184] from Q4
-vstrw.u32 Q3, [r14,#(-272)]
-// input[444]: Already loaded as Q5
-// input[700]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[444] from Q5
-vmul.u32 Q2, Q0, r7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[700] from Q7
-// input[448]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[704]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-240)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[188] from Q4
-vstrw.u32 Q3, [r14,#(-256)]
-// input[448]: Already loaded as Q5
-// input[704]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[448] from Q5
-vmul.u32 Q2, Q0, r7
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[704] from Q7
-// input[452]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -52)]
-vadd.s32 Q6, Q2, Q1
-// input[708]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -48)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[452]: Already loaded as Q5
-// input[708]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[452] from Q5
-vmul.u32 Q2, Q0, r7
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[708] from Q7
-// input[456]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -48)]
-vadd.s32 Q6, Q2, Q1
-// input[712]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-192)]
-vadd.s32 Q3, Q3, Q4
-// Release input[196] from Q4
-vstrw.u32 Q3, [r14,#(-224)]
-// input[456]: Already loaded as Q5
-// input[712]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[456] from Q5
-vmul.u32 Q2, Q0, r7
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[712] from Q7
-// input[460]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[716]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-192)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[200] from Q4
-vstrw.u32 Q3, [r14,#(-208)]
-// input[460]: Already loaded as Q5
-// input[716]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[460] from Q5
-vmul.u32 Q2, Q0, r7
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[716] from Q7
-// input[464]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -40)]
-vadd.s32 Q6, Q2, Q1
-// input[720]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -36)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[464]: Already loaded as Q5
-// input[720]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[464] from Q5
-vmul.u32 Q2, Q0, r7
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[720] from Q7
-// input[468]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -36)]
-vadd.s32 Q6, Q2, Q1
-// input[724]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-144)]
-vadd.s32 Q3, Q3, Q4
-// Release input[208] from Q4
-vstrw.u32 Q3, [r14,#(-176)]
-// input[468]: Already loaded as Q5
-// input[724]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[468] from Q5
-vmul.u32 Q2, Q0, r7
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[724] from Q7
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[728]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-144)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[212] from Q4
-vstrw.u32 Q3, [r14,#(-160)]
-// input[472]: Already loaded as Q5
-// input[728]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[472] from Q5
-vmul.u32 Q2, Q0, r7
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[728] from Q7
-// input[476]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -28)]
-vadd.s32 Q6, Q2, Q1
-// input[732]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -24)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[476]: Already loaded as Q5
-// input[732]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[476] from Q5
-vmul.u32 Q2, Q0, r7
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[732] from Q7
-// input[480]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -24)]
-vadd.s32 Q6, Q2, Q1
-// input[736]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-96)]
-vadd.s32 Q3, Q3, Q4
-// Release input[220] from Q4
-vstrw.u32 Q3, [r14,#(-128)]
-// input[480]: Already loaded as Q5
-// input[736]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[480] from Q5
-vmul.u32 Q2, Q0, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[736] from Q7
-// input[484]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[740]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-96)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[224] from Q4
-vstrw.u32 Q3, [r14,#(-112)]
-// input[484]: Already loaded as Q5
-// input[740]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[484] from Q5
-vmul.u32 Q2, Q0, r7
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[740] from Q7
-// input[488]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -16)]
-vadd.s32 Q6, Q2, Q1
-// input[744]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -12)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[488]: Already loaded as Q5
-// input[744]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[488] from Q5
-vmul.u32 Q2, Q0, r7
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[744] from Q7
-// input[492]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -12)]
-vadd.s32 Q6, Q2, Q1
-// input[748]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-48)]
-vadd.s32 Q3, Q3, Q4
-// Release input[232] from Q4
-vstrw.u32 Q3, [r14,#(-80)]
-// input[492]: Already loaded as Q5
-// input[748]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[492] from Q5
-vmul.u32 Q2, Q0, r7
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[748] from Q7
-// input[496]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[752]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-48)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[236] from Q4
-vstrw.u32 Q3, [r14,#(-64)]
-// input[496]: Already loaded as Q5
-// input[752]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[496] from Q5
-vmul.u32 Q2, Q0, r7
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[752] from Q7
-// input[500]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -4)]
-vadd.s32 Q6, Q2, Q1
-// input[756]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 0)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[500]: Already loaded as Q5
-// input[756]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[500] from Q5
-vmul.u32 Q2, Q0, r7
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[756] from Q7
-// input[504]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 0)]
-vadd.s32 Q6, Q2, Q1
-// input[760]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(0)]
-vadd.s32 Q3, Q3, Q4
-// Release input[244] from Q4
-vstrw.u32 Q3, [r14,#(-32)]
-// input[504]: Already loaded as Q5
-// input[760]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[504] from Q5
-vmul.u32 Q2, Q0, r7
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[760] from Q7
-// input[508]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[764]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(0)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[248] from Q4
-vstrw.u32 Q3, [r14,#(-16)]
-// input[508]: Already loaded as Q5
-// input[764]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[508] from Q5
-vmul.u32 Q2, Q0, r7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[764] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r12,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r8
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q0, Q0, r7
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r10
-vqrdmulh.s32 Q4, Q2, r8
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r10
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r10
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q3, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r10
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vmul.u32 Q4, Q4, r7
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r10
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r10
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r10
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q3, r6
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r10
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r7
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q0, Q0, r7
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r7
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r7
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r7
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r7
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r7
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r7
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[384]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -120)]
-vmul.u32 Q2, Q2, r7
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[452]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[452]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r7
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[456]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r7
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[396]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -108)]
-vmul.u32 Q2, Q2, r7
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[268]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[464]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(64)]
-// Release input[268] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[468]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[468]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[404]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -100)]
-vmul.u32 Q0, Q0, r7
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[276]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-400)]
-// Release input[404] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(96)]
-// Release input[276] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vmul.u32 Q1, Q1, r7
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-368)]
-// Release input[412] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[480]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-96)]
-// Release input[480] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3,
Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[488]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-80)] -// Release input[484] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[488]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[296]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[492]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-64)] -// Release input[488] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[492]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// 
input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-48)] -// Release input[492] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q2, Q2, r7 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q1, Q1, r7 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] 
-vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[504]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[504]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(0)] -// Release input[504] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 
-vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[704]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[704]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q1, Q1, r7 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[708]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-208)] -// Release input[704] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[708]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q0, Q0, r7 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[516]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 
Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[712]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-192)] -// Release input[708] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[712]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r7 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(48)] -// Release input[516] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[520]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-176)] -// Release input[712] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q1, Q1, r7 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(64)] -// Release input[520] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// 
input[720]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[720]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q0, Q0, r7 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-144)] -// Release input[720] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[724]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[532]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] 
-vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-128)] -// Release input[724] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[728]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r7 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(112)] -// Release input[532] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[536]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-112)] -// Release input[728] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q0, Q0, r7 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(128)] -// Release input[536] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[540]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[736]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, 
Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[736]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(144)] -// Release input[540] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[740]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-80)] -// Release input[736] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[740]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(160)] -// Release input[544] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[744]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-64)] -// 
Release input[740] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[744]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(176)] -// Release input[548] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-48)] -// Release input[744] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r7 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(192)] -// Release input[552] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[556]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[752]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 
Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[752]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q1, Q1, r7 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(208)] -// Release input[556] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[756]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-16)] -// Release input[752] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[756]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[564]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[760]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(0)] -// Release input[756] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q1, 
Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[760]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(240)] -// Release input[564] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[568]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(16)] -// Release input[760] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q1, Q1, r7 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(256)] -// Release input[568] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 
-vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[48]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q2, Q2, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: 
Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[112]: Already loaded as Q2 -vqrdmulh.s32 
Q0, Q2, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q2, Q2, r7 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[116]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q1, Q1, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(464)] -// Release input[116] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[120]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r7 
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(480)]
-// Release input[120] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[124]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q2, Q2, r7
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[176]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[180]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-304)]
-// Release input[176] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[180]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-288)]
-// Release input[180] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[136]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[188]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(-464)]
-// Release input[136] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q0, Q0, r7
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q2, Q2, r7
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q1, Q1, r7
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q0, Q0, r7
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[304]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[308]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q1, Q1, r7
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[312]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(224)]
-// Release input[308] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[312]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q0, Q0, r7
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(240)]
-// Release input[312] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q2, Q2, r7
-// input[284]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 32)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[268]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[368]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(128)]
-// Release input[284] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[368]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q1, Q1, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(64)]
-// Release input[268] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[372]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(464)]
-// Release input[368] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[372]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q0, Q0, r7
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[376]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(480)]
-// Release input[372] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[376]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q2, Q2, r7
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[380]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(496)]
-// Release input[376] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[380]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q1, Q1, r7
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-496)]
-// Release input[380] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[432]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-288)]
-// Release input[432] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[436]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[388]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-272)]
-// Release input[436] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[440]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-464)]
-// Release input[388] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[392]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[444]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-256)]
-// Release input[440] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[444]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q0, Q0, r7
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-448)]
-// Release input[392] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[396]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-240)]
-// Release input[444] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q2, Q2, r7
-// input[464]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r12,#(-432)]
-// Release input[396] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[500]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-160)]
-// Release input[464] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[500]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[484]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -20)]
-vmul.u32 Q1, Q1, r7
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[504]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-16)]
-// Release input[500] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-80)]
-// Release input[484] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[504]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[488]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -16)]
-vmul.u32 Q0, Q0, r7
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-208)]
-// Release input[452] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(0)]
-// Release input[504] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-64)]
-// Release input[488] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q2, Q2, r7
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -28)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r12,#(-192)]
-// Release input[456] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[460]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[560]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-112)]
-// Release input[476] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[560]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q1, Q1, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-176)]
-// Release input[460] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[564]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(224)]
-// Release input[560] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[564]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q0, Q0, r7
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[568]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(240)]
-// Release input[564] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[568]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q2, Q2, r7
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r12,#(48)]
-// Release input[516] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[520]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[572]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(256)]
-// Release input[568] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(128)]
-// Release input[536] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[572]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q1, Q1, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(64)]
-// Release input[520] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[524]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[624]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(272)]
-// Release input[572] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[624]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q0, Q0, r7
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(80)]
-// Release input[524] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[576]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[628]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(480)]
-// Release input[624] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[628]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q2, Q2, r7
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r12,#(288)]
-// Release input[576] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[580]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 76)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[632]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(496)]
-// Release input[628] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[632]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q1, Q1, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(304)]
-// Release input[580] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[584]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[636]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-496)]
-// Release input[632] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[636]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q0, Q0, r7
-// input[604]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(320)]
-// Release input[584] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[588]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 84)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[688]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-480)]
-// Release input[636] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(400)]
-// Release input[604] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[688]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r12,#(336)]
-// Release input[588] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[640]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[692]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-272)]
-// Release input[688] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[692]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r11,#(-464)]
-// Release input[640] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[644]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[696]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-256)]
-// Release input[692] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[696]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q0, Q0, r7
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-448)]
-// Release input[644] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[648]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[700]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-240)]
-// Release input[696] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[700]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q2, Q2, r7
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r11,#(-432)]
-// Release input[648] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[652]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[752]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-224)]
-// Release input[700] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[752]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q1, Q1, r7
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r11,#(-416)]
-// Release input[652] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[704]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[756]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-16)]
-// Release input[752] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[756]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vmul.u32 Q0, Q0, r7
-// input[724]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-208)]
-// Release input[704] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[708]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[760]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(0)]
-// Release input[756] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-128)]
-// Release input[724] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[760]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q2, Q2, r7
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r11,#(-192)]
-// Release input[708] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[712]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[764]:
Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(16)] -// Release input[760] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmul.u32 Q1, Q1, r7 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-176)] -// Release input[712] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[716]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[12]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r7 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-160)] -// Release input[716] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[28]: Load as Q2 
-vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q2, Q2, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[44]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, 
r10 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r7 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q2, Q2, r7 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 
-vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[92]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[92]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q1, Q1, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(368)] -// Release input[92] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[108]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r7 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, 
Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(432)] -// Release input[108] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q2, Q2, r7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[140]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// 
input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-384)] -// Release input[156] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vqrdmulh.s32 Q1, Q3, 
r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q1, Q1, r7 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q0, Q0, r7 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q1, 
Q0, r10 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q2, Q2, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] 
-vmul.u32 Q1, Q1, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(-176)] -// Release input[208] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q0, Q0, r7 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[240]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[268]: Already loaded as Q2 
-vqrdmulh.s32 Q0, Q2, r8 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vmul.u32 Q2, Q2, r7 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-48)] -// Release input[240] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[284]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[284]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q1, Q1, r7 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(128)] -// Release input[284] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, 
[r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[300]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[316]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[312]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 60)] -vmul.u32 Q2, Q2, r7 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[332]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(240)] -// Release input[312] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(224)] -// Release 
input[308] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[332]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmul.u32 Q1, Q1, r7 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[348]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(320)] -// Release input[332] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[348]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q0, Q0, r7 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[336]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[364]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(384)] -// Release input[348] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(368)] -// Release 
input[344] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[364]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(336)] -// Release input[336] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[352]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[380]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(448)] -// Release input[364] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[380]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[376]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 124)] -vmul.u32 Q1, Q1, r7 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(400)] -// Release input[352] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[396]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -108)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-496)] -// Release 
input[380] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(496)] -// Release input[376] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[396]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q0, Q0, r7 -// input[388]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[384]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[412]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-432)] -// Release input[396] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-464)] -// Release input[388] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[412]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q2, Q2, r7 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-480)] -// Release input[384] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[400]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[428]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 
Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-368)] -// Release input[412] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[428]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-416)] -// Release input[400] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[416]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -88)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[444]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-304)] -// Release input[428] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[444]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-352)] -// Release input[416] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[432]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// 
input[460]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-240)] -// Release input[444] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[460]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[456]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -48)] -vmul.u32 Q2, Q2, r7 -// input[452]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -52)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-288)] -// Release input[432] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[448]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[476]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-208)] -// Release input[452] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[476]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[472]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -32)] -vmul.u32 Q1, Q1, r7 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-224)] -// Release input[448] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[464]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r4 
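Each `vqrdmulh`/`vmul`/`vqrdmlah` triple in the deleted code above performs one modular multiplication by a root of unity for q = 33556993, using the rounding-multiplication trick: a rounded high product, plus a low product against a precomputed `zeta * q^(-1) mod 2^32` twin that cancels the remainder. A minimal Python model of the arithmetic (a sketch for readers of the generated code; the generator's exact sign conventions may differ, e.g. the assembly accumulates with `vqrdmlah` against a suitably negated constant where this model subtracts):

```python
Q = 33556993            # modulus named in the twiddle-table comments
R = 1 << 32
QINV = pow(Q, -1, R)    # q^(-1) mod 2^32

def s32(x):
    """Reinterpret a value as a signed 32-bit integer."""
    x &= R - 1
    return x - R if x >= R // 2 else x

def vqrdmulh(a, b):
    """Doubling multiply, rounded high 32 bits (saturation omitted)."""
    return (2 * a * b + (1 << 31)) >> 32

def montmul(a, zeta):
    """Multiply a by zeta mod Q, up to a fixed 2^-31 scaling factor."""
    zeta_tw = s32(zeta * QINV)   # the second word of each twiddle pair
    lo = s32(a * zeta_tw)        # vmul.u32, reinterpreted as signed
    return vqrdmulh(a, zeta) - vqrdmulh(lo, Q)
```

The result satisfies `montmul(a, zeta) ≡ a * zeta * 2^-31 (mod q)`, which is why the tables below store roots pre-scaled by 2^31.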
-vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[492]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-112)] -// Release input[476] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[492]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-160)] -// Release input[464] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[480]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-48)] -// Release input[492] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vmul.u32 Q2, Q2, r7 -// input[500]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -4)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-96)] -// Release input[480] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// 
input[496]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -8)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-16)] -// Release input[500] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[524]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[520]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 16)] -vmul.u32 Q1, Q1, r7 -// input[516]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 12)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-32)] -// Release input[496] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(48)] -// Release input[516] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[540]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[532]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 
-vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[556]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vmul.u32 Q2, Q2, r7 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[572]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vmul.u32 Q1, Q1, r7 -// input[564]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 60)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(160)] -// 
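The load/store offsets in these blocks encode coefficient indices directly: `input[k]` is addressed as `[base, #4*(k - anchor)]`, where each base register anchors a window small enough for the signed 7-bit word immediate of `vldrw`/`vstrw` (about ±127 words). From the offset/index pairs visible in this excerpt, `r12` appears to anchor `input[504]` and `r11` `input[756]`; that mapping is an inference, not stated anywhere in the generated code. A tiny helper makes the convention checkable:

```python
def offset_bytes(k, anchor):
    """Byte offset used to address input[k] from a base anchored at `anchor`.

    The anchor values (504 for r12, 756 for r11) are inferred from the
    visible offset comments, not documented by the generator.
    """
    d = k - anchor
    assert -127 <= d <= 127, "outside the vldrw immediate offset range"
    return 4 * d
```

For example, `input[424]` at `[r12, #(4 * -80)]` and `input[636]` at `[r11,#(-480)]` both follow this rule.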
Release input[544] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[588]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[588]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r7 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[576]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[604]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[604]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vmul.u32 Q2, Q2, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, 
[r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(288)] -// Release input[576] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[592]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[620]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(400)] -// Release input[604] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[620]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(352)] -// Release input[592] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[608]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(464)] -// Release input[620] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[632]: Load as Q3 -vldrw.u32 Q3, 
[r11, #(4 * -124)] -vmul.u32 Q0, Q0, r7 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(416)] -// Release input[608] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[652]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[652]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r7 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[668]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// 
input[668]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-464)] -// Release input[640] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[656]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[684]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-352)] -// Release input[668] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[684]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -// Release input[656] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[672]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-288)] -// Release input[684] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 
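Each repeating block above touches four coefficients at stride 4 together with three root pairs (the three `ldrd` loads from `r9`), i.e. it merges two Cooley–Tukey NTT layers into one pass: a shared root for the distance-2 butterflies, then two independent roots for the distance-1 butterflies. Schematically, in exact modular arithmetic (a structural sketch only; the assembly uses the rounding multiplication and a different register schedule):

```python
def ct_butterfly(a, b, zeta, q):
    """One Cooley-Tukey butterfly: scale the upper half, then add/subtract."""
    t = (zeta * b) % q
    return (a + t) % q, (a - t) % q

def two_merged_layers(c, z0, z1, z2, q):
    """Two NTT layers over four coefficients c[0..3], mirroring the three
    ldrd-loaded root pairs consumed per block above."""
    c0, c2 = ct_butterfly(c[0], c[2], z0, q)   # layer 1, shared root z0
    c1, c3 = ct_butterfly(c[1], c[3], z0, q)
    c0, c1 = ct_butterfly(c0, c1, z1, q)       # layer 2, roots z1 and z2
    c2, c3 = ct_butterfly(c2, c3, z2, q)
    return [c0, c1, c2, c3]
```

Merging layers this way halves the number of load/store sweeps over the coefficient array, which is the point of the interleaved scheduling in the generated code.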
-vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r11,#(-336)] -// Release input[672] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q1, Q1, r7 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-272)] -// Release input[688] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-176)] -// Release 
input[712] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-112)] -// Release input[728] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q2, Q2, r7 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release 
input[748] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[764]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[760]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vmul.u32 Q1, Q1, r7
-// input[756]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 0)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r11,#(-80)]
-// Release input[736] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[752]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-vqrdmulh.s32 Q0, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(32)]
-// Release input[764] from Q1
-vqrdmlah.s32 Q0, Q4, r10
-vstrw.u32 Q3, [r11,#(16)]
-// Release input[760] from Q3
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(0)]
-// Release input[756] from Q4
-vadd.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-16)]
-// Release input[752] from Q2
-.equ modulus_inv, 3919317503
-movw r8, #:lower16:modulus_inv
-movt r8, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 7467
-// Instruction count: 5655
\ No newline at end of file
diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_bitrev.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_bitrev.s
deleted file mode 100644
index 3d2050c..0000000
--- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_bitrev.s
+++ /dev/null
@@ -1,7499 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and
associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31 -.word 66220859 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31 -.word 1602345669 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31 -.word 12340695 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31 -.word 2018646057 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31 -.word 66384763 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 66384763 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 66220859 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31 -.word 1602345669 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31 -.word 50311979 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31 -.word 1261697749 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 
= 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31 -.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31 -.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 62228979 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31 -.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31 -.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31 -.word 2430825 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31 -.word 9032575 // zeta^736 * 2^31 = 299353^736 * 2^31 = 23624597 * 2^31 -.word 3659342465 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 299353^736 * 375649793 * 2^31 -.word 18922677 // zeta^752 * 2^31 = 299353^752 * 2^31 = 403828 * 2^31 -.word 25843019 // zeta^752 * f(q^(-1) 
mod 2^32) * 2^31 = 299353^752 * 375649793 * 2^31 -.word 38694841 // zeta^560 * 2^31 = 299353^560 * 2^31 = 994165 * 2^31 -.word 2211105351 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 299353^560 * 375649793 * 2^31 -.word 11561947 // zeta^544 * 2^31 = 299353^544 * 2^31 = 29356361 * 2^31 -.word 4026147365 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 299353^544 * 375649793 * 2^31 -.word 14211205 // zeta^656 * 2^31 = 299353^656 * 2^31 = 30845592 * 2^31 -.word 1973967227 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 299353^656 * 375649793 * 2^31 -.word 15880423 // zeta^464 * 2^31 = 299353^464 * 2^31 = 518908 * 2^31 -.word 33207577 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 299353^464 * 375649793 * 2^31 -.word 66220859 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31 -.word 1602345669 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31 -.word 12340695 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31 -.word 2018646057 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31 -.word 66384763 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 50311979 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31 -.word 1261697749 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31 -.word 66385749 // zeta^608 * 2^31 = 299353^608 * 2^31 = 9045021 * 2^31 -.word 2726320811 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 299353^608 * 375649793 * 2^31 -.word 59595857 // zeta^416 * 2^31 = 299353^416 * 2^31 = 32616688 * 2^31 -.word 2087308719 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 299353^416 * 375649793 * 2^31 -.word 12340695 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31 -.word 2018646057 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31 -.word 9032575 // zeta^736 * 2^31 = 299353^736 * 2^31 = 23624597 * 2^31 -.word 3659342465 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 299353^736 * 375649793 * 2^31 
-.word 11561947 // zeta^544 * 2^31 = 299353^544 * 2^31 = 29356361 * 2^31
-.word 4026147365 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 299353^544 * 375649793 * 2^31
-.word 66384763 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 3749829253 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 66220859 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31
-.word 1602345669 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31
-.word 50311979 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31
-.word 1261697749 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31
-.word 66385749 // zeta^608 * 2^31 = 299353^608 * 2^31 = 9045021 * 2^31
-.word 2726320811 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 299353^608 * 375649793 * 2^31
-.word 2707023 // zeta^688 * 2^31 = 299353^688 * 2^31 = 15739856 * 2^31
-.word 1007273905 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 299353^688 * 375649793 * 2^31
-.word 49061917 // zeta^496 * 2^31 = 299353^496 * 2^31 = 13108720 * 2^31
-.word 838894051 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 299353^496 * 375649793 * 2^31
-.word 59595857 // zeta^416 * 2^31 = 299353^416 * 2^31 = 32616688 * 2^31
-.word 2087308719 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 299353^416 * 375649793 * 2^31
-.word 44552409 // zeta^592 * 2^31 = 299353^592 * 2^31 = 21166324 * 2^31
-.word 1354541351 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 299353^592 * 375649793 * 2^31
-.word 37943863 // zeta^400 * 2^31 = 299353^400 * 2^31 = 9445248 * 2^31
-.word 604449737 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 299353^400 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31
-.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31
-.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31
-.word 62228979 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31
-.word 1321333773 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 2430825 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31
-.word 1203831447 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31
-.word 62228979 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31
-.word 1321333773 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31
-.word 12542317 // zeta^744 * 2^31 = 299353^744 * 2^31 = 3271804 * 2^31
-.word 209379475 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 299353^744 * 375649793 * 2^31
-.word 48811299 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 27114239 // zeta^648 * 2^31 = 299353^648 * 2^31 = 13128918 * 2^31
-.word 840186625 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 299353^648 * 375649793 * 2^31
-.word 21796399 // zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31
-.word 1211449297 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 299353^456 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 41196349 // zeta^696 * 2^31 = 299353^696 * 2^31 = 10953311 * 2^31
-.word 2848442051 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 299353^696 * 375649793 * 2^31
-.word 58757463 // zeta^504 * 2^31 = 299353^504 * 2^31 = 17352831 * 2^31
-.word 3257980073 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 299353^504 * 375649793 * 2^31
-.word 2430825 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31
-.word 1203831447 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31
-.word 40500013 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31
-.word 7832335 // zeta^408 * 2^31 = 299353^408 * 2^31 = 12267508 * 2^31
-.word 785060593 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 299353^408 * 375649793 * 2^31
-.word 12542317 // zeta^744 * 2^31 = 299353^744 * 2^31 = 3271804 * 2^31
-.word 209379475 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 299353^744 * 375649793 * 2^31
-.word 61099389 // zeta^756 * 2^31 = 299353^756 * 2^31 = 26362414 * 2^31
-.word 1687065731 // zeta^756 * f(q^(-1) mod 2^32) * 2^31 = 299353^756 * 375649793 * 2^31
-.word 31380141 // zeta^564 * 2^31 = 299353^564 * 2^31 = 2302061 * 2^31
-.word 2294804307 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 299353^564 * 375649793 * 2^31
-.word 48811299 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31
-.word 54335767 // zeta^660 * 2^31 = 299353^660 * 2^31 = 6490403 * 2^31
-.word 2562837737 // zeta^660 * f(q^(-1) mod 2^32) * 2^31 = 299353^660 * 375649793 * 2^31
-.word 46002083 // zeta^468 * 2^31 = 299353^468 * 2^31 = 19648405 * 2^31
-.word 3404885597 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 299353^468 * 375649793 * 2^31
-.word 27114239 // zeta^648 * 2^31 = 299353^648 * 2^31 = 13128918 * 2^31
-.word 840186625 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 299353^648 * 375649793 * 2^31
-.word 54842419 // zeta^708 * 2^31 = 299353^708 * 2^31 = 27028662 * 2^31
-.word 1729702349 // zeta^708 * f(q^(-1) mod 2^32) * 2^31 = 299353^708 * 375649793 * 2^31
-.word 35394733 // zeta^516 * 2^31 = 299353^516 * 2^31 = 9731484 * 2^31
-.word 622767443 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 299353^516 * 375649793 * 2^31
-.word 21796399 // zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31
-.word 1211449297 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 299353^456 * 375649793 * 2^31
-.word 56722355 // zeta^612 * 2^31 = 299353^612 * 2^31 = 31418183 * 2^31
-.word 4158093901 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 299353^612 * 375649793 * 2^31
-.word 1316163 // zeta^420 * 2^31 = 299353^420 * 2^31 = 14833295 * 2^31
-.word 3096742077 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 299353^420 * 375649793 * 2^31
-.word 41196349 // zeta^696 * 2^31 = 299353^696 * 2^31 = 10953311 * 2^31
-.word 2848442051 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 299353^696 * 375649793 * 2^31
-.word 9983051 // zeta^732 * 2^31 = 299353^732 * 2^31 = 5086187 * 2^31
-.word 2472974773 // zeta^732 * f(q^(-1) mod 2^32) * 2^31 = 299353^732 * 375649793 * 2^31
-.word 3784291 // zeta^540 * 2^31 = 299353^540 * 2^31 = 25309194 * 2^31
-.word 1619664797 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 299353^540 * 375649793 * 2^31
-.word 58757463 // zeta^504 * 2^31 = 299353^504 * 2^31 = 17352831 * 2^31
-.word 3257980073 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 299353^504 * 375649793 * 2^31
-.word 57730785 // zeta^636 * 2^31 = 299353^636 * 2^31 = 25085703 * 2^31
-.word 3752846111 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 299353^636 * 375649793 * 2^31
-.word 7721125 // zeta^444 * 2^31 = 299353^444 * 2^31 = 28113639 * 2^31
-.word 3946619227 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 299353^444 * 375649793 * 2^31
-.word 40500013 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31
-.word 8316793 // zeta^684 * 2^31 = 299353^684 * 2^31 = 9249292 * 2^31
-.word 591909511 // zeta^684 * f(q^(-1) mod 2^32) * 2^31 = 299353^684 * 375649793 * 2^31
-.word 16634213 // zeta^492 * 2^31 = 299353^492 * 2^31 = 29292862 * 2^31
-.word 1874600091 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 299353^492 * 375649793 * 2^31
-.word 7832335 // zeta^408 * 2^31 = 299353^408 * 2^31 = 12267508 * 2^31
-.word 785060593 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 299353^408 * 375649793 * 2^31
-.word 62080381 // zeta^588 * 2^31 = 299353^588 * 2^31 = 6865022 * 2^31
-.word 439327875 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 299353^588 * 375649793 * 2^31
-.word 26241327 // zeta^396 * 2^31 = 299353^396 * 2^31 = 572895 * 2^31
-.word 2184146129 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 299353^396 * 375649793 * 2^31
-.word 51789473 // zeta^760 * 2^31 = 299353^760 * 2^31 = 3793231 * 2^31
-.word 2390231903 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 299353^760 * 375649793 * 2^31
-.word 13776707 // zeta^764 * 2^31 = 299353^764 * 2^31 = 24424675 * 2^31
-.word 3710543549 // zeta^764 * f(q^(-1) mod 2^32) * 2^31 = 299353^764 * 375649793 * 2^31
-.word 4211035 // zeta^572 * 2^31 = 299353^572 * 2^31 = 10684514 * 2^31
-.word 683756709 // zeta^572 * f(q^(-1) mod 2^32) * 2^31 = 299353^572 * 375649793 * 2^31
-.word 38147821 // zeta^568 * 2^31 = 299353^568 * 2^31 = 27276494 * 2^31
-.word 1745562387 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 299353^568 * 375649793 * 2^31
-.word 2239199 // zeta^668 * 2^31 = 299353^668 * 2^31 = 28426918 * 2^31
-.word 1819183905 // zeta^668 * f(q^(-1) mod 2^32) * 2^31 = 299353^668 * 375649793 * 2^31
-.word 22779603 // zeta^476 * 2^31 = 299353^476 * 2^31 = 1602327 * 2^31
-.word 2250024749 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 299353^476 * 375649793 * 2^31
-.word 50846023 // zeta^664 * 2^31 = 299353^664 * 2^31 = 21424662 * 2^31
-.word 1371073721 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 299353^664 * 375649793 * 2^31
-.word 66165173 // zeta^716 * 2^31 = 299353^716 * 2^31 = 21853285 * 2^31
-.word 3545987147 // zeta^716 * f(q^(-1) mod 2^32) * 2^31 = 299353^716 * 375649793 * 2^31
-.word 50761721 // zeta^524 * 2^31 = 299353^524 * 2^31 = 7165643 * 2^31
-.word 2606049799 // zeta^524 * f(q^(-1) mod 2^32) * 2^31 = 299353^524 * 375649793 * 2^31
-.word 60005985 // zeta^472 * 2^31 = 299353^472 * 2^31 = 3334573 * 2^31
-.word 2360880031 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 299353^472 * 375649793 * 2^31
-.word 22799139 // zeta^620 * 2^31 = 299353^620 * 2^31 = 23553265 * 2^31
-.word 3654777565 // zeta^620 * f(q^(-1) mod 2^32) * 2^31 = 299353^620 * 375649793 * 2^31
-.word 58461905 // zeta^428 * 2^31 = 299353^428 * 2^31 = 5624346 * 2^31
-.word 359930671 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 299353^428 * 375649793 * 2^31
-.word 10205025 // zeta^712 * 2^31 = 299353^712 * 2^31 = 7514760 * 2^31
-.word 480907935 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 299353^712 * 375649793 * 2^31
-.word 26033017 // zeta^740 * 2^31 = 299353^740 * 2^31 = 25562521 * 2^31
-.word 3783360135 // zeta^740 * f(q^(-1) mod 2^32) * 2^31 = 299353^740 * 375649793 * 2^31
-.word 18756165 // zeta^548 * 2^31 = 299353^548 * 2^31 = 7312429 * 2^31
-.word 2615443387 // zeta^548 * f(q^(-1) mod 2^32) * 2^31 = 299353^548 * 375649793 * 2^31
-.word 62474397 // zeta^520 * 2^31 = 299353^520 * 2^31 = 24586938 * 2^31
-.word 1573443939 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 299353^520 * 375649793 * 2^31
-.word 23976243 // zeta^644 * 2^31 = 299353^644 * 2^31 = 3967426 * 2^31
-.word 253895885 // zeta^644 * f(q^(-1) mod 2^32) * 2^31 = 299353^644 * 375649793 * 2^31
-.word 12747875 // zeta^452 * 2^31 = 299353^452 * 2^31 = 25099490 * 2^31
-.word 1606244765 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 299353^452 * 375649793 * 2^31
-.word 40016325 // zeta^616 * 2^31 = 299353^616 * 2^31 = 23245647 * 2^31
-.word 3635091515 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 299353^616 * 375649793 * 2^31
-.word 60810771 // zeta^692 * 2^31 = 299353^692 * 2^31 = 15189991 * 2^31
-.word 3119568877 // zeta^692 * f(q^(-1) mod 2^32) * 2^31 = 299353^692 * 375649793 * 2^31
-.word 17968525 // zeta^500 * 2^31 = 299353^500 * 2^31 = 19827515 * 2^31
-.word 3416347763 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 299353^500 * 375649793 * 2^31
-.word 24437007 // zeta^424 * 2^31 = 299353^424 * 2^31 = 23981562 * 2^31
-.word 1534702833 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 299353^424 * 375649793 * 2^31
-.word 38196147 // zeta^596 * 2^31 = 299353^596 * 2^31 = 12078147 * 2^31
-.word 2920426061 // zeta^596 * f(q^(-1) mod 2^32) * 2^31 = 299353^596 * 375649793 * 2^31
-.word 61373823 // zeta^404 * 2^31 = 299353^404 * 2^31 = 8817795 * 2^31
-.word 2711779457 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 299353^404 * 375649793 * 2^31
-.word 9032575 // zeta^736 * 2^31 = 299353^736 * 2^31 = 23624597 * 2^31
-.word 3659342465 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 299353^736 * 375649793 * 2^31
-.word 18922677 // zeta^752 * 2^31 = 299353^752 * 2^31 = 403828 * 2^31
-.word 25843019 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 299353^752 * 375649793 * 2^31
-.word 38694841 // zeta^560 * 2^31 = 299353^560 * 2^31 = 994165 * 2^31
-.word 2211105351 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 299353^560 * 375649793 * 2^31
-.word 11561947 // zeta^544 * 2^31 = 299353^544 * 2^31 = 29356361 * 2^31
-.word 4026147365 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 299353^544 * 375649793 * 2^31
-.word 14211205 // zeta^656 * 2^31 = 299353^656 * 2^31 = 30845592 * 2^31
-.word 1973967227 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 299353^656 * 375649793 * 2^31
-.word 15880423 // zeta^464 * 2^31 = 299353^464 * 2^31 = 518908 * 2^31
-.word 33207577 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 299353^464 * 375649793 * 2^31
-.word 66220859 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31
-.word 1602345669 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31
-.word 12340695 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31
-.word 2018646057 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31
-.word 66384763 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 3749829253 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 50311979 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31
-.word 1261697749 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31
-.word 66385749 // zeta^608 * 2^31 = 299353^608 * 2^31 = 9045021 * 2^31
-.word 2726320811 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 299353^608 * 375649793 * 2^31
-.word 59595857 // zeta^416 * 2^31 = 299353^416 * 2^31 = 32616688 * 2^31
-.word 2087308719 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 299353^416 * 375649793 * 2^31
-.word 2707023 // zeta^688 * 2^31 = 299353^688 * 2^31 = 15739856 * 2^31
-.word 1007273905 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 299353^688 * 375649793 * 2^31
-.word 14051021 // zeta^728 * 2^31 = 299353^728 * 2^31 = 6580323 * 2^31
-.word 2568592179 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 299353^728 * 375649793 * 2^31
-.word 25121365 // zeta^536 * 2^31 = 299353^536 * 2^31 = 135177 * 2^31
-.word 2156134315 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 299353^536 * 375649793 * 2^31
-.word 49061917 // zeta^496 * 2^31 = 299353^496 * 2^31 = 13108720 * 2^31
-.word 838894051 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 299353^496 * 375649793 * 2^31
-.word 9875957 // zeta^632 * 2^31 = 299353^632 * 2^31 = 21146062 * 2^31
-.word 1353244683 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 299353^632 * 375649793 * 2^31
-.word 30508465 // zeta^440 * 2^31 = 299353^440 * 2^31 = 16323183 * 2^31
-.word 3192087631 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 299353^440 * 375649793 * 2^31
-.word 44552409 // zeta^592 * 2^31 = 299353^592 * 2^31 = 21166324 * 2^31
-.word 1354541351 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 299353^592 * 375649793 * 2^31
-.word 57931285 // zeta^680 * 2^31 = 299353^680 * 2^31 = 5756199 * 2^31
-.word 2515852267 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 299353^680 * 375649793 * 2^31
-.word 61031001 // zeta^488 * 2^31 = 299353^488 * 2^31 = 19973843 * 2^31
-.word 3425712039 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 299353^488 * 375649793 * 2^31
-.word 37943863 // zeta^400 * 2^31 = 299353^400 * 2^31 = 9445248 * 2^31
-.word 604449737 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 299353^400 * 375649793 * 2^31
-.word 65558417 // zeta^584 * 2^31 = 299353^584 * 2^31 = 26445100 * 2^31
-.word 1692357231 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 299353^584 * 375649793 * 2^31
-.word 1803165 // zeta^392 * 2^31 = 299353^392 * 2^31 = 11458020 * 2^31
-.word 733257315 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 299353^392 * 375649793 * 2^31
-.word 18922677 // zeta^752 * 2^31 = 299353^752 * 2^31 = 403828 * 2^31
-.word 25843019 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 299353^752 * 375649793 * 2^31
-.word 51789473 // zeta^760 * 2^31 = 299353^760 * 2^31 = 3793231 * 2^31
-.word 2390231903 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 299353^760 * 375649793 * 2^31
-.word 38147821 // zeta^568 * 2^31 = 299353^568 * 2^31 = 27276494 * 2^31
-.word 1745562387 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 299353^568 * 375649793 * 2^31
-.word 38694841 // zeta^560 * 2^31 = 299353^560 * 2^31 = 994165 * 2^31
-.word 2211105351 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 299353^560 * 375649793 * 2^31
-.word 50846023 // zeta^664 * 2^31 = 299353^664 * 2^31 = 21424662 * 2^31
-.word 1371073721 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 299353^664 * 375649793 * 2^31
-.word 60005985 // zeta^472 * 2^31 = 299353^472 * 2^31 = 3334573 * 2^31
-.word 2360880031 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 299353^472 * 375649793 * 2^31
-.word 14211205 // zeta^656 * 2^31 = 299353^656 * 2^31 = 30845592 * 2^31
-.word 1973967227 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 299353^656 * 375649793 * 2^31
-.word 10205025 // zeta^712 * 2^31 = 299353^712 * 2^31 = 7514760 * 2^31
-.word 480907935 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 299353^712 * 375649793 * 2^31
-.word 62474397 // zeta^520 * 2^31 = 299353^520 * 2^31 = 24586938 * 2^31
-.word 1573443939 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 299353^520 * 375649793 * 2^31
-.word 15880423 // zeta^464 * 2^31 = 299353^464 * 2^31 = 518908 * 2^31
-.word 33207577 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 299353^464 * 375649793 * 2^31
-.word 40016325 // zeta^616 * 2^31 = 299353^616 * 2^31 = 23245647 * 2^31
-.word 3635091515 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 299353^616 * 375649793 * 2^31
-.word 24437007 // zeta^424 * 2^31 = 299353^424 * 2^31 = 23981562 * 2^31
-.word 1534702833 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 299353^424 * 375649793 * 2^31
-.word 12340695 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31
-.word 2018646057 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31
-.word 9032575 // zeta^736 * 2^31 = 299353^736 * 2^31 = 23624597 * 2^31
-.word 3659342465 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 299353^736 * 375649793 * 2^31
-.word 11561947 // zeta^544 * 2^31 = 299353^544 * 2^31 = 29356361 * 2^31
-.word 4026147365 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 299353^544 * 375649793 * 2^31
-.word 66384763 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 3749829253 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 66220859 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31
-.word 1602345669 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31
-.word 50311979 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31
-.word 1261697749 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31
-.word 66385749 // zeta^608 * 2^31 = 299353^608 * 2^31 = 9045021 * 2^31
-.word 2726320811 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 299353^608 * 375649793 * 2^31
-.word 2707023 // zeta^688 * 2^31 = 299353^688 * 2^31 = 15739856 * 2^31
-.word 1007273905 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 299353^688 * 375649793 * 2^31
-.word 49061917 // zeta^496 * 2^31 = 299353^496 * 2^31 = 13108720 * 2^31
-.word 838894051 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 299353^496 * 375649793 * 2^31
-.word 59595857 // zeta^416 * 2^31 = 299353^416 * 2^31 = 32616688 * 2^31
-.word 2087308719 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 299353^416 * 375649793 * 2^31
-.word 44552409 // zeta^592 * 2^31 = 299353^592 * 2^31 = 21166324 * 2^31
-.word 1354541351 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 299353^592 * 375649793 * 2^31
-.word 37943863 // zeta^400 * 2^31 = 299353^400 * 2^31 = 9445248 * 2^31
-.word 604449737 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 299353^400 * 375649793 * 2^31
-.word 14051021 // zeta^728 * 2^31 = 299353^728 * 2^31 = 6580323 * 2^31
-.word 2568592179 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 299353^728 * 375649793 * 2^31
-.word 39721919 // zeta^748 * 2^31 = 299353^748 * 2^31 = 27817396 * 2^31
-.word 1780177473 // zeta^748 * f(q^(-1) mod 2^32) * 2^31 = 299353^748 * 375649793 * 2^31
-.word 33221705 // zeta^556 * 2^31 = 299353^556 * 2^31 = 14873638 * 2^31
-.word 951840183 // zeta^556 * f(q^(-1) mod 2^32) * 2^31 = 299353^556 * 375649793 * 2^31
-.word 25121365 // zeta^536 * 2^31 = 299353^536 * 2^31 = 135177 * 2^31
-.word 2156134315 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 299353^536 * 375649793 * 2^31
-.word 58077387 // zeta^652 * 2^31 = 299353^652 * 2^31 = 6592748 * 2^31
-.word 421903669 // zeta^652 * f(q^(-1) mod 2^32) * 2^31 = 299353^652 * 375649793 * 2^31
-.word 29472201 // zeta^460 * 2^31 = 299353^460 * 2^31 = 18568730 * 2^31
-.word 1188308023 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 299353^460 * 375649793 * 2^31
-.word 9875957 // zeta^632 * 2^31 = 299353^632 * 2^31 = 21146062 * 2^31
-.word 1353244683 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 299353^632 * 375649793 * 2^31
-.word 30046903 // zeta^700 * 2^31 = 299353^700 * 2^31 = 16127868 * 2^31
-.word 1032104777 // zeta^700 * f(q^(-1) mod 2^32) * 2^31 = 299353^700 * 375649793 * 2^31
-.word 10397085 // zeta^508 * 2^31 = 299353^508 * 2^31 = 661028 * 2^31
-.word 42302563 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 299353^508 * 375649793 * 2^31
-.word 30508465 // zeta^440 * 2^31 = 299353^440 * 2^31 = 16323183 * 2^31
-.word 3192087631 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 299353^440 * 375649793 * 2^31
-.word 66319647 // zeta^604 * 2^31 = 299353^604 * 2^31 = 6688514 * 2^31
-.word 428032225 // zeta^604 * f(q^(-1) mod 2^32) * 2^31 = 299353^604 * 375649793 * 2^31
-.word 35102085 // zeta^412 * 2^31 = 299353^412 * 2^31 = 30439269 * 2^31
-.word 4095448187 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 299353^412 * 375649793 * 2^31
-.word 57931285 // zeta^680 * 2^31 = 299353^680 * 2^31 = 5756199 * 2^31
-.word 2515852267 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 299353^680 * 375649793 * 2^31
-.word 25751057 // zeta^724 * 2^31 = 299353^724 * 2^31 = 25986735 * 2^31
-.word 3810507759 // zeta^724 * f(q^(-1) mod 2^32) * 2^31 = 299353^724 * 375649793 * 2^31
-.word 15038611 // zeta^532 * 2^31 = 299353^532 * 2^31 = 15308198 * 2^31
-.word 979649901 // zeta^532 * f(q^(-1) mod 2^32) * 2^31 = 299353^532 * 375649793 * 2^31
-.word 61031001 // zeta^488 * 2^31 = 299353^488 * 2^31 = 19973843 * 2^31
-.word 3425712039 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 299353^488 * 375649793 * 2^31
-.word 45510921 // zeta^628 * 2^31 = 299353^628 * 2^31 = 12632936 * 2^31
-.word 808446199 // zeta^628 * f(q^(-1) mod 2^32) * 2^31 = 299353^628 * 375649793 * 2^31
-.word 4126363 // zeta^436 * 2^31 = 299353^436 * 2^31 = 20669063 * 2^31
-.word 3470202725 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 299353^436 * 375649793 * 2^31
-.word 65558417 // zeta^584 * 2^31 = 299353^584 * 2^31 = 26445100 * 2^31
-.word 1692357231 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 299353^584 * 375649793 * 2^31
-.word 50996995 // zeta^676 * 2^31 = 299353^676 * 2^31 = 26036127 * 2^31
-.word 3813668605 // zeta^676 * f(q^(-1) mod 2^32) * 2^31 = 299353^676 * 375649793 * 2^31
-.word 64246331 // zeta^484 * 2^31 = 299353^484 * 2^31 = 5855662 * 2^31
-.word 374733765 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 299353^484 * 375649793 * 2^31
-.word 1803165 // zeta^392 * 2^31 = 299353^392 * 2^31 = 11458020 * 2^31
-.word 733257315 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 299353^392 * 375649793 * 2^31
-.word 34033301 // zeta^580 * 2^31 = 299353^580 * 2^31 = 18571159 * 2^31
-.word 3335947115 // zeta^580 * f(q^(-1) mod 2^32) * 2^31 = 299353^580 * 375649793 * 2^31
-.word 44975483 // zeta^388 * 2^31 = 299353^388 * 2^31 = 5764058 * 2^31
-.word 368871557 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 299353^388 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_768_u32_33556993_299353_incomplete_bitrev, %function
-.global ntt_768_u32_33556993_299353_incomplete_bitrev
-ntt_768_u32_33556993_299353_incomplete_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-.equ modulus, 33556993
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[424]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -80)]
-// input[596]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 92)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r8
-vadd.s32 Q5, Q0, Q1
-// Release input[424] from Q0
-vmul.u32 Q4, Q2, r7
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r10
-vsub.s32 Q4, Q0, Q1
-// Release input[596] from Q1
-// input[680]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -76)]
-vadd.s32 Q6, Q4, Q3
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r12,#(-320)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r12,#(368)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[680]: Already loaded as Q1
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q7
-// Release input[680] from Q1
-vmul.u32 Q3, Q0, r7
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmlah.s32 Q2, Q3, r10
-vsub.s32 Q3, Q1, Q7
-// Release input[212] from Q7
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q6, Q3, Q2
-// input[468]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -36)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r11,#(-304)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(-160)]
-vadd.s32 Q4, Q4, Q1
-// Release input[256] from Q1
-vstrw.u32 Q4, [r14,#(16)]
-// input[104]: Already loaded as Q5
-// input[468]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[104] from Q5
-vmul.u32 Q2, Q0, r7
-// input[512]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 8)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[468] from Q7
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q6, Q2, Q1
-// input[724]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-144)]
-vadd.s32 Q3, Q3, Q4
-// Release input[512] from Q4
-vstrw.u32 Q3, [r12,#(32)]
-// input[360]: Already loaded as Q5
-// input[724]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[360] from Q5
-vmul.u32 Q2, Q0, r7
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[724] from Q7
-// input[616]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(432)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[128] from Q4
-vstrw.u32 Q3, [r14,#(-496)]
-// input[616]: Already loaded as Q5
-// input[52]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[616] from Q5
-vmul.u32 Q2, Q0, r7
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[52] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[384] from Q4
-vstrw.u32 Q3, [r12,#(-480)]
-// input[232]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vmul.u32 Q2, Q0, r7
-// input[640]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[488]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -16)]
-vadd.s32 Q6, Q2, Q1
-// input[564]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 60)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[640] from Q4
-vstrw.u32 Q3, [r11,#(-464)]
-// input[488]: Already loaded as Q5
-// input[564]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[488] from Q5
-vmul.u32 Q2, Q0, r7
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[564] from Q7
-// input[744]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vadd.s32 Q6, Q2, Q1
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(240)]
-vadd.s32 Q3, Q3, Q4
-// Release input[64] from Q4
-vstrw.u32 Q3, [r0,#(256)]
-// input[744]: Already loaded as Q5
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[744] from Q5
-vmul.u32 Q2, Q0, r7
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[180] from Q7
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q6, Q2, Q1
-// input[436]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-48)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q4
-// Release input[320] from Q4
-vstrw.u32 Q3, [r14,#(272)]
-// input[24]: Already loaded as Q5
-// input[436]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[24] from Q5
-vmul.u32 Q2, Q0, r7
-// input[576]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[436] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[692]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(96)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[576] from Q4
-vstrw.u32 Q3, [r12,#(288)]
-// input[280]: Already loaded as Q5
-// input[692]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r7
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[692] from Q7
-// input[536]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 32)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[536]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[536] from Q5
-vmul.u32 Q2, Q0, r7
-// input[448]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -56)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[152]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -100)]
-vadd.s32 Q6, Q2, Q1
-// input[372]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 120)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[448] from Q4
-vstrw.u32 Q3, [r12,#(-224)]
-// input[152]: Already loaded as Q5
-// input[372]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[152] from Q5
-vmul.u32 Q2, Q0, r7
-// input[704]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -52)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[372] from Q7
-// input[408]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -96)]
-vadd.s32 Q6, Q2, Q1
-// input[628]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(480)]
-vadd.s32 Q3, Q3, Q4
-// Release input[704] from Q4
-vstrw.u32 Q3, [r11,#(-208)]
-// input[408]: Already loaded as Q5
-// input[628]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[408] from Q5
-vmul.u32 Q2, Q0, r7
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[628] from Q7
-// input[664]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-384)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[32] from Q4
-vstrw.u32 Q3, [r0,#(128)]
-// input[664]: Already loaded as Q5
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[664] from Q5
-vmul.u32 Q2, Q0, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[244] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[500]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[88]: Already loaded as Q5
-// input[500]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r7
-// input[544]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 40)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[500] from Q7
-// input[344]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 92)]
-vadd.s32 Q6, Q2, Q1
-// input[756]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 0)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[544] from Q4
-vstrw.u32 Q3, [r12,#(160)]
-// input[344]: Already loaded as Q5
-// input[756]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[344] from Q5
-vmul.u32 Q2, Q0, r7
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[756] from Q7
-// input[600]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 96)]
-vadd.s32 Q6, Q2, Q1
-// input[12]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 12)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(0)]
-vadd.s32 Q3, Q3, Q4
-// Release input[160] from Q4
-vstrw.u32 Q3, [r14,#(-368)]
-// input[600]: Already loaded as Q5
-// input[12]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[600] from Q5
-vmul.u32 Q2, Q0, r7
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[12] from Q7
-// input[216]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -36)]
-vadd.s32 Q6, Q2, Q1
-// input[268]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(384)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q3, Q3, Q4
-// Release input[416] from Q4
-vstrw.u32 Q3, [r12,#(-352)]
-// input[216]: Already loaded as Q5
-// input[268]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[216] from Q5
-vmul.u32 Q2, Q0, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[268] from Q7
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[524]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-144)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[672] from Q4
-vstrw.u32 Q3, [r11,#(-336)]
-// input[472]: Already loaded as Q5
-// input[524]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[472] from Q5
-vmul.u32 Q2, Q0, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[524] from Q7
-// input[728]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[728]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[728] from Q5
-vmul.u32 Q2, Q0, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[56]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 56)]
-vadd.s32 Q6, Q2, Q1
-// input[396]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -108)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[352] from Q4
-vstrw.u32 Q3, [r14,#(400)]
-// input[56]: Already loaded as Q5
-// input[396]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[56] from Q5
-vmul.u32 Q2, Q0, r7
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[396] from Q7
-// input[312]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 60)]
-vadd.s32 Q6, Q2, Q1
-// input[652]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-432)]
-vadd.s32 Q3, Q3, Q4
-// Release input[608] from Q4
-vstrw.u32 Q3, [r12,#(416)]
-// input[312]: Already loaded as Q5
-// input[652]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[312] from Q5
-vmul.u32 Q2, Q0, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[652] from Q7
-// input[568]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(240)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[224] from Q4
-vstrw.u32 Q3, [r14,#(-112)]
-// input[568]: Already loaded as Q5
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[568] from Q5
-vmul.u32 Q2, Q0, r7
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[76] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[480] from Q4
-vstrw.u32 Q3, [r12,#(-96)]
-// input[184]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vmul.u32 Q2, Q0, r7
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[440]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -64)]
-vadd.s32 Q6, Q2, Q1
-// input[588]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 84)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[736] from Q4
-vstrw.u32 Q3, [r11,#(-80)]
-// input[440]: Already loaded as Q5
-// input[588]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[440] from Q5
-vmul.u32 Q2, Q0, r7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[588] from Q7
-// input[696]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -60)]
-vadd.s32 Q6, Q2, Q1
-// input[204]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -48)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(336)]
-vadd.s32 Q3, Q3, Q4
-// Release input[16] from Q4
-vstrw.u32 Q3, [r0,#(64)]
-// input[696]: Already loaded as Q5
-// input[204]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 
Q3, Q5, Q7 -// Release input[696] from Q5 -vmul.u32 Q2, Q0, r7 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[204] from Q7 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[460]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-240)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-192)] -vadd.s32 Q3, Q3, Q4 -// Release input[272] from Q4 -vstrw.u32 Q3, [r14,#(80)] -// input[120]: Already loaded as Q5 -// input[460]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[120] from Q5 -vmul.u32 Q2, Q0, r7 -// input[528]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[460] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[716]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(480)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-176)] -vadd.s32 Q3, Q3, Q4 -// Release input[528] from Q4 -vstrw.u32 Q3, [r12,#(96)] -// input[376]: Already loaded as Q5 -// input[716]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[716] from Q7 -// input[632]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -124)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[632]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[632] from Q5 -vmul.u32 
Q2, Q0, r7 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[248]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -4)] -vadd.s32 Q6, Q2, Q1 -// input[300]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[400] from Q4 -vstrw.u32 Q3, [r12,#(-416)] -// input[248]: Already loaded as Q5 -// input[300]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[248] from Q5 -vmul.u32 Q2, Q0, r7 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[300] from Q7 -// input[504]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 0)] -vadd.s32 Q6, Q2, Q1 -// input[556]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(192)] -vadd.s32 Q3, Q3, Q4 -// Release input[656] from Q4 -vstrw.u32 Q3, [r11,#(-400)] -// input[504]: Already loaded as Q5 -// input[556]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[504] from Q5 -vmul.u32 Q2, Q0, r7 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[556] from Q7 -// input[760]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[172]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(0)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(208)] -vadd.s32 Q3, Q3, Q4 -// Release input[80] from Q4 -vstrw.u32 Q3, [r0,#(320)] -// input[760]: Already loaded as Q5 -// input[172]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[760] from Q5 -vmul.u32 Q2, Q0, r7 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, 
#(4 * 84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[172] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[428]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[4]: Already loaded as Q5 -// input[428]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r7 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[428] from Q7 -// input[260]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q2, Q1 -// input[684]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[592] from Q4 -vstrw.u32 Q3, [r12,#(352)] -// input[260]: Already loaded as Q5 -// input[684]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[260] from Q5 -vmul.u32 Q2, Q0, r7 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[684] from Q7 -// input[516]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q4 -// Release input[208] from Q4 -vstrw.u32 Q3, [r14,#(-176)] -// input[516]: Already loaded as Q5 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[516] from Q5 -vmul.u32 Q2, Q0, r7 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// 
Release input[108] from Q7 -// input[132]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -120)] -vadd.s32 Q6, Q2, Q1 -// input[364]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(48)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(432)] -vadd.s32 Q3, Q3, Q4 -// Release input[464] from Q4 -vstrw.u32 Q3, [r12,#(-160)] -// input[132]: Already loaded as Q5 -// input[364]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[132] from Q5 -vmul.u32 Q2, Q0, r7 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[364] from Q7 -// input[388]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[620]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-480)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(448)] -vadd.s32 Q3, Q3, Q4 -// Release input[720] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[388]: Already loaded as Q5 -// input[620]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[388] from Q5 -vmul.u32 Q2, Q0, r7 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[620] from Q7 -// input[644]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[644]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[644] from Q5 -vmul.u32 Q2, Q0, r7 -// input[304]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[68]: Load as 
Q5 -vldrw.u32 Q5, [r0, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// input[492]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[304] from Q4 -vstrw.u32 Q3, [r14,#(208)] -// input[68]: Already loaded as Q5 -// input[492]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[68] from Q5 -vmul.u32 Q2, Q0, r7 -// input[560]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[492] from Q7 -// input[324]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[748]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-48)] -vadd.s32 Q3, Q3, Q4 -// Release input[560] from Q4 -vstrw.u32 Q3, [r12,#(224)] -// input[324]: Already loaded as Q5 -// input[748]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[324] from Q5 -vmul.u32 Q2, Q0, r7 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[748] from Q7 -// input[580]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(288)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-32)] -vadd.s32 Q3, Q3, Q4 -// Release input[176] from Q4 -vstrw.u32 Q3, [r14,#(-304)] -// input[580]: Already loaded as Q5 -// input[28]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[580] from Q5 -vmul.u32 Q2, Q0, r7 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[28] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// 
input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q4 -// Release input[432] from Q4 -vstrw.u32 Q3, [r12,#(-288)] -// input[196]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r7 -// input[688]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -68)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[452]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -52)] -vadd.s32 Q6, Q2, Q1 -// input[540]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[688] from Q4 -vstrw.u32 Q3, [r11,#(-272)] -// input[452]: Already loaded as Q5 -// input[540]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[452] from Q5 -vmul.u32 Q2, Q0, r7 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[540] from Q7 -// input[708]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -48)] -vadd.s32 Q6, Q2, Q1 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(144)] -vadd.s32 Q3, Q3, Q4 -// Release input[112] from Q4 -vstrw.u32 Q3, [r0,#(448)] -// input[708]: Already loaded as Q5 -// input[156]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[708] from Q5 -vmul.u32 Q2, Q0, r7 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[156] from Q7 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[412]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 
-92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-192)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q4 -// Release input[368] from Q4 -vstrw.u32 Q3, [r14,#(464)] -// input[36]: Already loaded as Q5 -// input[412]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[36] from Q5 -vmul.u32 Q2, Q0, r7 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[412] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[668]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(144)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-368)] -vadd.s32 Q3, Q3, Q4 -// Release input[624] from Q4 -vstrw.u32 Q3, [r12,#(480)] -// input[292]: Already loaded as Q5 -// input[668]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r7 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[668] from Q7 -// input[548]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[548]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[548] from Q5 -vmul.u32 Q2, Q0, r7 -// input[496]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -8)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -88)] -vadd.s32 Q6, Q2, Q1 -// input[348]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(176)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[496] from Q4 -vstrw.u32 Q3, [r12,#(-32)] -// input[164]: Already loaded as Q5 -// input[348]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[164] from Q5 -vmul.u32 Q2, Q0, r7 -// input[752]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -4)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[348] from Q7 -// input[420]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -84)] -vadd.s32 Q6, Q2, Q1 -// input[604]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(384)] -vadd.s32 Q3, Q3, Q4 -// Release input[752] from Q4 -vstrw.u32 Q3, [r11,#(-16)] -// input[420]: Already loaded as Q5 -// input[604]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[420] from Q5 -vmul.u32 Q2, Q0, r7 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[604] from Q7 -// input[676]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[220]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-336)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(400)] -vadd.s32 Q3, Q3, Q4 -// Release input[8] from Q4 -vstrw.u32 Q3, [r0,#(32)] -// input[676]: Already loaded as Q5 -// input[220]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[676] from Q5 -vmul.u32 Q2, Q0, r7 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[220] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[476]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-128)] -vadd.s32 
Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[100]: Already loaded as Q5 -// input[476]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r7 -// input[520]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 16)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[476] from Q7 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[732]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[520] from Q4 -vstrw.u32 Q3, [r12,#(64)] -// input[356]: Already loaded as Q5 -// input[732]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[356] from Q5 -vmul.u32 Q2, Q0, r7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[732] from Q7 -// input[612]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[60]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-96)] -vadd.s32 Q3, Q3, Q4 -// Release input[136] from Q4 -vstrw.u32 Q3, [r14,#(-464)] -// input[612]: Already loaded as Q5 -// input[60]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[612] from Q5 -vmul.u32 Q2, Q0, r7 -// input[392]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -112)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[60] from Q7 -// input[228]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -24)] -vadd.s32 Q6, Q2, Q1 -// input[316]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(432)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(240)] -vadd.s32 Q3, Q3, Q4 -// Release input[392] from Q4 -vstrw.u32 Q3, 
[r12,#(-448)] -// input[228]: Already loaded as Q5 -// input[316]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[228] from Q5 -vmul.u32 Q2, Q0, r7 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[316] from Q7 -// input[484]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[572]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-96)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(256)] -vadd.s32 Q3, Q3, Q4 -// Release input[648] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[484]: Already loaded as Q5 -// input[572]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[484] from Q5 -vmul.u32 Q2, Q0, r7 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[572] from Q7 -// input[740]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[740]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[740] from Q5 -vmul.u32 Q2, Q0, r7 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[20]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[444]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[328] from Q4 -vstrw.u32 Q3, [r14,#(304)] -// input[20]: Already loaded as Q5 -// 
input[444]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[20] from Q5 -vmul.u32 Q2, Q0, r7 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[444] from Q7 -// input[276]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[700]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-240)] -vadd.s32 Q3, Q3, Q4 -// Release input[584] from Q4 -vstrw.u32 Q3, [r12,#(320)] -// input[276]: Already loaded as Q5 -// input[700]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[276] from Q5 -vmul.u32 Q2, Q0, r7 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[700] from Q7 -// input[532]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(96)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-224)] -vadd.s32 Q3, Q3, Q4 -// Release input[200] from Q4 -vstrw.u32 Q3, [r14,#(-208)] -// input[532]: Already loaded as Q5 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[532] from Q5 -vmul.u32 Q2, Q0, r7 -// input[456]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[124] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(496)] -vadd.s32 Q3, Q3, Q4 -// Release input[456] from Q4 -vstrw.u32 Q3, [r12,#(-192)] -// input[148]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, 
Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r7 -// input[712]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -44)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -// input[404]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -100)] -vadd.s32 Q6, Q2, Q1 -// input[636]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[712] from Q4 -vstrw.u32 Q3, [r11,#(-176)] -// input[404]: Already loaded as Q5 -// input[636]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[404] from Q5 -vmul.u32 Q2, Q0, r7 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[636] from Q7 -// input[660]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -96)] -vadd.s32 Q6, Q2, Q1 -// input[252]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 0)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q3, Q3, Q4 -// Release input[40] from Q4 -vstrw.u32 Q3, [r0,#(160)] -// input[660]: Already loaded as Q5 -// input[252]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[660] from Q5 -vmul.u32 Q2, Q0, r7 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[252] from Q7 -// input[84]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[508]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-384)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(0)] -vadd.s32 Q3, Q3, Q4 -// Release input[296] from Q4 -vstrw.u32 Q3, [r14,#(176)] -// input[84]: Already loaded as Q5 -// input[508]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// 
Release input[84] from Q5 -vmul.u32 Q2, Q0, r7 -// input[552]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[508] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[764]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(336)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(16)] -vadd.s32 Q3, Q3, Q4 -// Release input[552] from Q4 -vstrw.u32 Q3, [r12,#(192)] -// input[340]: Already loaded as Q5 -// input[764]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r7 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[764] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[8]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 8)] -vqrdmulh.s32 Q1, Q0, r8 -// input[592]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 88)] -vmul.u32 Q0, Q0, r7 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r10 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r10 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r10 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r0,#(32)] -// Release input[8] from Q0 -vqrdmlah.s32 Q6, Q3, r10 -vstrw.u32 Q2, [r12,#(352)] -// Release input[592] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 
-vadd.s32 Q1, Q1, Q6 -// input[264]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vmul.u32 Q4, Q4, r7 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmlah.s32 Q0, Q4, r10 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r8 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r10 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r10 -// input[520]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)] -vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(48)] -// Release input[264] from Q4 -vqrdmlah.s32 Q6, Q3, r10 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[520]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[464]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -40)] -vmul.u32 Q1, Q1, r7 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[136]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(64)] -// Release input[520] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-160)] -// Release input[464] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[136]: Already loaded as Q0 -vqrdmulh.s32 Q1, 
Q0, r8
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vmul.u32 Q0, Q0, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[392]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-464)]
-// Release input[136] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[392]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vmul.u32 Q2, Q2, r7
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[648]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-448)]
-// Release input[392] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[648]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q1, Q1, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-432)]
-// Release input[648] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[72]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vmul.u32 Q0, Q0, r7
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[328]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q2, Q2, r7
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[584]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 80)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[584]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[432]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -72)]
-vmul.u32 Q1, Q1, r7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(272)]
-// Release input[320] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[576]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(320)]
-// Release input[584] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-288)]
-// Release input[432] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[688]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -68)]
-vmul.u32 Q0, Q0, r7
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-272)]
-// Release input[688] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[456]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q2, Q2, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(96)]
-// Release input[528] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[712]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q1, Q1, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-224)]
-// Release input[448] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[704]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-176)]
-// Release input[712] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[40]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[624]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 120)]
-vmul.u32 Q0, Q0, r7
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-208)]
-// Release input[704] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(480)]
-// Release input[624] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[296]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmul.u32 Q2, Q2, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[552]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(176)]
-// Release input[296] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[552]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[496]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -8)]
-vmul.u32 Q1, Q1, r7
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[544]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[168]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -84)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(192)]
-// Release input[552] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-32)]
-// Release input[496] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[168]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[752]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -4)]
-vmul.u32 Q0, Q0, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(160)]
-// Release input[544] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-336)]
-// Release input[168] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[752] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[420]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vmul.u32 Q2, Q2, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[424]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -80)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[676]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-336)]
-// Release input[420] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[676]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vmul.u32 Q1, Q1, r7
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-320)]
-// Release input[424] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[680]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[100]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-320)]
-// Release input[676] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[100]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[516]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 12)]
-vmul.u32 Q0, Q0, r7
-// input[472]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-304)]
-// Release input[680] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 104)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(400)]
-// Release input[100] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(48)]
-// Release input[516] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-128)]
-// Release input[472] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[356]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vmul.u32 Q2, Q2, r7
-// input[728]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -28)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(416)]
-// Release input[104] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[360]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 108)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[612]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(416)]
-// Release input[356] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-112)]
-// Release input[728] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[612]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r7
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(432)]
-// Release input[360] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[616]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 112)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(432)]
-// Release input[612] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[644]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vmul.u32 Q0, Q0, r7
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(448)]
-// Release input[616] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[644] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(240)]
-// Release input[312] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vmul.u32 Q2, Q2, r7
-// input[568]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 64)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-80)]
-// Release input[232] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[488]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[740]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(256)]
-// Release input[568] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[740]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[324]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 72)]
-vmul.u32 Q1, Q1, r7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-64)]
-// Release input[488] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[744]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[20]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-64)]
-// Release input[740] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[20]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q0, Q0, r7
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-48)]
-// Release input[744] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(80)]
-// Release input[20] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-256)]
-// Release input[440] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[276]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[196]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -56)]
-vmul.u32 Q2, Q2, r7
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[532]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(96)]
-// Release input[276] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-224)]
-// Release input[196] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[532]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[452]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -52)]
-vmul.u32 Q1, Q1, r7
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[536]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[148]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -104)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(112)]
-// Release input[532] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-208)]
-// Release input[452] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[148]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[708]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vmul.u32 Q0, Q0, r7
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(128)]
-// Release input[536] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -100)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-416)]
-// Release input[148] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(496)]
-// Release input[376] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[404]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[632]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -124)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-400)]
-// Release input[152] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[408]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -96)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[660]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-400)]
-// Release input[404] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-496)]
-// Release input[632] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[660]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q1, Q1, r7
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-384)]
-// Release input[408] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[664]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[84]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-384)]
-// Release input[660] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[84]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q0, Q0, r7
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-368)]
-// Release input[664] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[88]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 88)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(336)]
-// Release input[84] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(0)]
-// Release input[504] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[340]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q2, Q2, r7
-// input[760]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 4)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(352)]
-// Release input[88] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[344]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 92)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[604]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(352)]
-// Release input[340] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(16)]
-// Release input[760] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[604]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q1, Q1, r7
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(368)]
-// Release input[344] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[596]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 92)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[220]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(400)]
-// Release input[604] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[220]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q0, Q0, r7
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(368)]
-// Release input[596] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[476]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-128)]
-// Release input[220] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[476]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q2, Q2, r7
-// input[524]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 20)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[468]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -36)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[732]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-112)]
-// Release input[476] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(80)]
-// Release input[524] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[732]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q1, Q1, r7
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[60]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-96)]
-// Release input[732] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[60]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q0, Q0, r7
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-128)]
-// Release input[724] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(240)]
-// Release input[60] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q2, Q2, r7
-// input[652]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[308]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 56)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[572]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[652] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[572]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q1, Q1, r7
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(224)]
-// Release input[308] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[564]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 60)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(272)]
-// Release input[572] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vmul.u32 Q0, Q0, r7
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(240)]
-// Release input[564] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[444]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-32)]
-// Release input[748] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[444]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vmul.u32 Q2, Q2, r7
-// input[588]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[700]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-240)]
-// Release input[444] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(336)]
-// Release input[588] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[700]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vmul.u32 Q1, Q1, r7
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -64)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-224)]
-// Release input[700] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[540]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 36)]
-vmul.u32 Q0, Q0, r7
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -44)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-256)]
-// Release input[692] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[116]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 116)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[380]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(144)]
-// Release input[540] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-176)]
-// Release input[460] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[716]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -40)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(464)]
-// Release input[116] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[372]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 120)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[636]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-160)]
-// Release input[716] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[636]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vmul.u32 Q1, Q1, r7
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(480)] -// Release input[372] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[628]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 124)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-480)] -// Release input[636] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q0, Q0, r7 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(496)] -// Release input[628] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q2, Q2, r7 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-32)] -// Release 
input[244] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q1, Q1, r7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[32]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[576]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 72)] -vmul.u32 Q0, Q0, r7 -// input[384]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(0)] -// Release 
input[756] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(288)] -// Release input[576] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-480)] -// Release input[384] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[288]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[640]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[544]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release input[640] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[544]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[448]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -56)] -vmul.u32 Q1, Q1, r7 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 64)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 
Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(160)] -// Release input[544] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-224)] -// Release input[448] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(256)] -// Release input[64] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[160]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[704]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -52)] -vmul.u32 Q0, Q0, r7 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[400]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-208)] -// Release input[704] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[400]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q2, Q2, r7 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 
-vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[416]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -88)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[656]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-416)] -// Release input[400] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[656]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vmul.u32 Q1, Q1, r7 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -28)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-352)] -// Release input[416] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[672]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-400)] -// Release input[656] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[80]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[528]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 24)] -vmul.u32 Q0, Q0, r7 -// input[480]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -24)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-336)] -// Release input[672] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, 
r10 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(96)] -// Release input[528] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-96)] -// Release input[480] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[336]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r7 -// input[736]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[352]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release input[736] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[624]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q1, Q1, r7 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(400)] -// Release input[352] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, 
r10 -// input[592]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(192)] -// Release input[48] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q0, Q0, r7 -// input[304]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(352)] -// Release input[592] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(208)] -// Release input[304] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vmul.u32 Q2, Q2, r7 -// input[560]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[464]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -40)] 
-vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[752]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(224)] -// Release input[560] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[752]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmul.u32 Q1, Q1, r7 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-160)] -// Release input[464] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[720]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-16)] -// Release input[752] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-304)] -// Release input[176] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[40]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r7 -// input[392]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -112)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-144)] -// Release input[720] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 
8)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[296]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-448)] -// Release input[392] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[296]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q2, Q2, r7 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-432)] -// Release input[648] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[552]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[456]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -48)] -vmul.u32 Q1, Q1, r7 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[520]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 
-vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[168]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(192)] -// Release input[552] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[168]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q0, Q0, r7 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(64)] -// Release input[520] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[408]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-336)] -// Release input[168] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[408]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q2, Q2, r7 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[424]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 
Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[664]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-384)] -// Release input[408] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[664]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q1, Q1, r7 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-320)] -// Release input[424] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[680]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-368)] -// Release input[664] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[88]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[488]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -16)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-304)] -// Release input[680] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// 
input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-64)] -// Release input[488] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[344]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q2, Q2, r7 -// input[744]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -12)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[360]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 108)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[632]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-48)] -// Release input[744] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[632]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q1, Q1, r7 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(432)] -// Release input[360] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[600]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, 
r10 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-496)] -// Release input[632] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(224)] -// Release input[56] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q0, Q0, r7 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(384)] -// Release input[600] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[504]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(240)] -// Release input[312] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[504]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q2, Q2, r7 -// input[568]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 64)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[472]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -32)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] 
-vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(0)] -// Release input[504] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(256)] -// Release input[568] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[376]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 124)] -vmul.u32 Q1, Q1, r7 -// input[184]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -68)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-128)] -// Release input[472] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[728]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[36]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(496)] -// Release input[376] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-272)] -// Release input[184] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[36]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[580]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 76)] -vmul.u32 Q0, Q0, r7 -// input[388]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-112)] -// Release input[728] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] 
-vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(144)] -// Release input[36] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(304)] -// Release input[580] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-464)] -// Release input[388] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[292]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmul.u32 Q2, Q2, r7 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[260]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[548]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[548]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[452]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -52)] -vmul.u32 Q1, Q1, r7 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(32)] -// Release input[260] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[516]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[164]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(176)]
-// Release input[548] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-208)]
-// Release input[452] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[164]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[708]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vmul.u32 Q0, Q0, r7
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-352)]
-// Release input[164] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[404]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q2, Q2, r7
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[420]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -84)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[660]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-400)]
-// Release input[404] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[660]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q1, Q1, r7
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-336)]
-// Release input[420] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[676]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -80)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[84]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-384)]
-// Release input[660] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[84]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[532]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]
-vmul.u32 Q0, Q0, r7
-// input[484]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-320)]
-// Release input[676] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[100]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 100)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(336)]
-// Release input[84] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(112)]
-// Release input[532] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[340]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q2, Q2, r7
-// input[740]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(400)]
-// Release input[100] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[356]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 104)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[628]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(352)]
-// Release input[340] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-64)]
-// Release input[740] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[628]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[436]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -68)]
-vmul.u32 Q1, Q1, r7
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(416)]
-// Release input[356] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[596]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 92)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(496)]
-// Release input[628] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-272)]
-// Release input[436] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[692]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -64)]
-vmul.u32 Q0, Q0, r7
-// input[308]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(368)]
-// Release input[596] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-32)]
-// Release input[244] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-256)]
-// Release input[692] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(224)]
-// Release input[308] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[500]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[116]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 116)]
-vmul.u32 Q2, Q2, r7
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[468]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -36)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[756]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(240)]
-// Release input[564] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[756]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[372]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 120)]
-vmul.u32 Q1, Q1, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[44]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(0)]
-// Release input[756] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(480)]
-// Release input[372] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[44]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[588]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 84)]
-vmul.u32 Q0, Q0, r7
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-128)]
-// Release input[724] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[300]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(176)]
-// Release input[44] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(336)]
-// Release input[588] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[300]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[204]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -48)]
-vmul.u32 Q2, Q2, r7
-// input[652]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[268]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[556]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(192)]
-// Release input[300] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-192)]
-// Release input[204] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[652] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[556]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[460]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -44)]
-vmul.u32 Q1, Q1, r7
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(64)]
-// Release input[268] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[524]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[172]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(208)]
-// Release input[556] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-176)]
-// Release input[460] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[172]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[716]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vmul.u32 Q0, Q0, r7
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(80)]
-// Release input[524] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-320)]
-// Release input[172] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-160)]
-// Release input[716] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[412]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vmul.u32 Q2, Q2, r7
-// input[620]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[428]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -76)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[668]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -88)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-368)]
-// Release input[412] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(464)]
-// Release input[620] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[668]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vmul.u32 Q1, Q1, r7
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-304)]
-// Release input[428] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[684]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -72)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[92]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-352)]
-// Release input[668] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[92]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[540]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 36)]
-vmul.u32 Q0, Q0, r7
-// input[492]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -12)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-288)]
-// Release input[684] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 108)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(368)]
-// Release input[92] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(144)]
-// Release input[540] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-48)]
-// Release input[492] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[348]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[748]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -8)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[364]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[636]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release input[348] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-32)]
-// Release input[748] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[636]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[444]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -60)]
-vmul.u32 Q1, Q1, r7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(448)]
-// Release input[364] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[604]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-480)]
-// Release input[636] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-240)]
-// Release input[444] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(240)]
-// Release input[60] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[700]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -56)]
-vmul.u32 Q0, Q0, r7
-// input[316]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(400)]
-// Release input[604] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-224)]
-// Release input[700] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(256)]
-// Release input[316] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vmul.u32 Q2, Q2, r7
-// input[572]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[476]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[764]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(272)]
-// Release input[572] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[764]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vmul.u32 Q1, Q1, r7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-112)]
-// Release input[476] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[732]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -24)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(32)]
-// Release input[764] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[128]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[512]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 8)]
-vmul.u32 Q0, Q0, r7
-// input[256]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-96)]
-// Release input[732] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(32)]
-// Release input[512] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(16)]
-// Release input[256] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[320]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vmul.u32 Q2, Q2, r7
-// input[640]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[704]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release input[640] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[704]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[448]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -56)]
-vmul.u32 Q1, Q1, r7
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[576]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-208)]
-// Release input[704] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-224)]
-// Release input[448] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[160]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q0, Q0, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[352]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 100)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-368)]
-// Release input[160] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[352]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q2, Q2, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[416]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -88)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[736]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(400)]
-// Release input[352] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[736]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q1, Q1, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-352)]
-// Release input[416] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[608]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 104)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -108)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-80)]
-// Release input[736] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[144]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[528]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 24)]
-vmul.u32 Q0, Q0, r7
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(416)]
-// Release input[608] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-432)]
-// Release input[144] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(96)]
-// Release input[528] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[336]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q2, Q2, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[720]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(336)]
-// Release input[336] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[720]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[464]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -40)]
-vmul.u32 Q1, Q1, r7
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-416)]
-// Release input[400] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[592]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 88)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release input[720] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-160)]
-// Release input[464] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[176]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vmul.u32 Q0, Q0, r7
-// input[304]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(352)]
-// Release input[592] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(208)]
-// Release input[304] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q2, Q2, r7
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -72)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[752]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[752]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[496]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -8)]
-vmul.u32 Q1, Q1, r7
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-288)]
-// Release input[432] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[624]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 120)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[136]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -116)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-16)]
-// Release input[752] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-32)]
-// Release input[496] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-48)]
-// Release input[240] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[136]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vmul.u32 Q0, Q0, r7
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(480)]
-// Release input[624] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-464)]
-// Release input[136] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(48)]
-// Release input[264] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[328]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q2, Q2, r7
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(304)]
-// Release input[328] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[712]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[456]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -48)]
-vmul.u32 Q1, Q1, r7
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r12,#(-448)]
-// Release input[392] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[584]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[168]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -84)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[168]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vmul.u32 Q0, Q0, r7 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(320)] -// Release input[584] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[360]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-336)] -// Release input[168] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[360]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r7 -// input[680]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -76)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(160)] -// Release input[40] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, 
r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[424]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -80)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(432)] -// Release input[360] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[744]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q1, Q1, r7 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-320)] -// Release input[424] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[616]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[152]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-48)] -// Release input[744] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[152]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(448)] -// Release 
input[616] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-400)] -// Release input[152] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[344]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r7 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[408]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -96)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[728]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[472]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -32)] -vmul.u32 Q1, Q1, r7 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-384)] -// Release input[408] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[600]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-112)] -// Release input[728] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vmul.u32 Q0, Q0, r7 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(384)] -// Release input[600] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(240)] -// Release input[312] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[376]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 
120)] -vmul.u32 Q2, Q2, r7 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[440]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-240)] -// Release input[696] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vmul.u32 Q1, Q1, r7 -// input[248]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -4)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-256)] -// Release input[440] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[632]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-16)] -// Release input[248] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[132]: Already loaded as Q0 
-vqrdmulh.s32 Q1, Q0, r8 -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vmul.u32 Q0, Q0, r7 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release input[632] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[324]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vmul.u32 Q2, Q2, r7 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[388]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[708]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(288)] -// Release input[324] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, 
[r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[708]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[452]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -52)] -vmul.u32 Q1, Q1, r7 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-464)] -// Release input[388] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[164]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-192)] -// Release input[708] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-208)] -// Release input[452] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[164]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(304)] -// Release input[580] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-352)] -// Release input[164] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// 
Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[356]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q2, Q2, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[420]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -84)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[740]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[740]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-336)] -// Release input[420] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[612]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 108)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[148]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-64)] -// Release input[740] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-80)] 
-// Release input[484] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[148]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vmul.u32 Q0, Q0, r7 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(432)] -// Release input[612] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[20]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-416)] -// Release input[148] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[340]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(80)] -// Release input[20] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[404]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -100)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[724]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(352)] -// Release 
input[340] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[724]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[468]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -36)] -vmul.u32 Q1, Q1, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-400)] -// Release input[404] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[596]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 92)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[180]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[724] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-144)] -// Release input[468] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[180]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[564]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 60)] -vmul.u32 Q0, Q0, r7 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(368)] -// Release input[596] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 
-vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-288)] -// Release input[180] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(240)] -// Release input[564] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[372]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q2, Q2, r7 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[756]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[500]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -4)] -vmul.u32 Q1, Q1, r7 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[628]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 124)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[140]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-16)] -// Release input[500] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[524]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 20)] -vmul.u32 Q0, Q0, r7 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(496)] -// Release input[628] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(80)] -// Release input[524] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[332]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmul.u32 Q2, Q2, r7 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[396]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 
-vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[460]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -44)] -vmul.u32 Q1, Q1, r7 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-432)] -// Release input[396] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[588]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-176)] -// Release input[460] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vmul.u32 Q0, Q0, r7 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(336)] -// Release input[588] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[44]: Load as Q1 -vldrw.u32 Q1, 
[r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[364]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(208)] -// Release input[556] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[364]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[428]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -76)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[748]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(448)] -// Release input[364] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[748]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q1, Q1, r7 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-304)] -// Release input[428] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 
Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[620]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 116)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[748] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vmul.u32 Q0, Q0, r7 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(464)] -// Release input[620] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-384)] -// Release input[156] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[348]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q2, Q2, r7 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 
-vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[412]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[732]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[732]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[476]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -28)] -vmul.u32 Q1, Q1, r7 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-368)] -// Release input[412] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[604]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 100)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-96)] -// Release input[732] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-112)] -// Release input[476] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[572]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 68)] -vmul.u32 Q0, Q0, r7 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] 
-vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(400)] -// Release input[604] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(272)] -// Release input[572] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[380]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vmul.u32 Q2, Q2, r7 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[444]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[508]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 4)] 
-vmul.u32 Q1, Q1, r7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-240)] -// Release input[444] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[636]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -vqrdmulh.s32 Q0, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q0, Q4, r10 -vstrw.u32 Q3, [r12,#(16)] -// Release input[508] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(0)] -// Release input[252] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-480)] -// Release input[636] from Q2 -.equ modulus_inv, 3919317503 -movw r8, #:lower16:modulus_inv -movt r8, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 7467 -// Instruction count: 5655 \ No newline at end of file diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_double.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_double.s deleted file mode 100644 index 6fa25e8..0000000 --- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_double.s +++ /dev/null @@ -1,8366 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this 
permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 
893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 
2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31 -.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 
299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 28419145 // zeta^176 * 2^31 
= 299353^176 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * 
f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31 -.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31 -.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31 -.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31 -.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 
375649793 * 2^31 -.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31 -.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31 -.word 40872659 // zeta^ 12 * 2^31 = 299353^ 12 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31 -.word 5033605 // zeta^204 * 2^31 = 299353^204 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 299353^204 * 375649793 * 2^31 -.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 50479773 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31 -.word 58797193 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31 -.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31 -.word 59392861 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31 -.word 9383201 // zeta^252 * 2^31 = 299353^252 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 299353^252 * 375649793 * 2^31 -.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31 -.word 63329695 // zeta^156 * 2^31 = 299353^156 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 299353^156 * 375649793 * 2^31 -.word 57130935 // zeta^348 * 2^31 
= 299353^348 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31 -.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31 -.word 65797823 // zeta^ 36 * 2^31 = 299353^ 36 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31 -.word 10391631 // zeta^228 * 2^31 = 299353^228 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 299353^228 * 375649793 * 2^31 -.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31 -.word 31719253 // zeta^132 * 2^31 = 299353^132 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 299353^132 * 375649793 * 2^31 -.word 12271567 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31 -.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31 -.word 21111903 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31 -.word 12778219 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31 -.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31 -.word 35733845 // zeta^180 * 2^31 = 299353^180 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 299353^180 * 375649793 * 2^31 -.word 6014597 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31 -.word 
2607901563 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31 -.word 65310821 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31 -.word 3561709979 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31 -.word 22138503 // zeta^ 4 * 2^31 = 299353^ 4 * 2^31 = 27792935 * 2^31 -.word 3926095737 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 4 * 375649793 * 2^31 -.word 33080685 // zeta^196 * 2^31 = 299353^196 * 2^31 = 14985834 * 2^31 -.word 959020179 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 299353^196 * 375649793 * 2^31 -.word 1555569 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31 -.word 2602610063 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31 -.word 2867655 // zeta^100 * 2^31 = 299353^100 * 2^31 = 27701331 * 2^31 -.word 3920233529 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 299353^100 * 375649793 * 2^31 -.word 16116991 // zeta^292 * 2^31 = 299353^292 * 2^31 = 7520866 * 2^31 -.word 481298689 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 299353^292 * 375649793 * 2^31 -.word 6082985 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31 -.word 869255255 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31 -.word 62987623 // zeta^ 52 * 2^31 = 299353^ 52 * 2^31 = 12887930 * 2^31 -.word 824764569 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 52 * 375649793 * 2^31 -.word 21603065 // zeta^244 * 2^31 = 299353^244 * 2^31 = 20924057 * 2^31 -.word 3486521095 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 299353^244 * 375649793 * 2^31 -.word 9182701 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31 -.word 1779115027 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31 -.word 52075375 // zeta^148 * 2^31 = 299353^148 * 2^31 = 18248795 * 2^31 -.word 3315317393 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 299353^148 * 375649793 * 2^31 -.word 41362929 // zeta^340 * 2^31 = 299353^340 * 2^31 = 7570258 * 2^31 -.word 484459535 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 299353^340 
* 375649793 * 2^31 -.word 36605521 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31 -.word 1102879663 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31 -.word 32011901 // zeta^ 28 * 2^31 = 299353^ 28 * 2^31 = 3117724 * 2^31 -.word 199519107 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 28 * 375649793 * 2^31 -.word 794339 // zeta^220 * 2^31 = 299353^220 * 2^31 = 26868479 * 2^31 -.word 3866935069 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 299353^220 * 375649793 * 2^31 -.word 57238029 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31 -.word 2941722611 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31 -.word 56716901 // zeta^124 * 2^31 = 299353^124 * 2^31 = 32895965 * 2^31 -.word 4252664731 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 299353^124 * 375649793 * 2^31 -.word 37067083 // zeta^316 * 2^31 = 299353^316 * 2^31 = 17429125 * 2^31 -.word 3262862517 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 299353^316 * 375649793 * 2^31 -.word 41992621 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31 -.word 2138832979 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31 -.word 37641785 // zeta^ 76 * 2^31 = 299353^ 76 * 2^31 = 14988263 * 2^31 -.word 3106659271 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 76 * 375649793 * 2^31 -.word 9036599 // zeta^268 * 2^31 = 299353^268 * 2^31 = 26964245 * 2^31 -.word 3873063625 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 299353^268 * 375649793 * 2^31 -.word 53062965 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31 -.word 1726375115 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31 -.word 33892281 // zeta^172 * 2^31 = 299353^172 * 2^31 = 18683355 * 2^31 -.word 3343127111 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 299353^172 * 375649793 * 2^31 -.word 27392067 // zeta^364 * 2^31 = 299353^364 * 2^31 = 5739597 * 2^31 -.word 2514789821 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 299353^364 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 
2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31 -.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 
4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 42676979 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31 -.word 2760264461 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31 -.word 27097661 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31 -.word 659875779 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31 -.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 4639589 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31 -.word 2721523355 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31 -.word 56908961 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 3814059359 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 7108001 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31 -.word 1934087263 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31 -.word 16267963 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 2923893573 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 299353^280 * 375649793 * 2^31 -.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 28966165 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31 -.word 2549404907 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31 -.word 15324513 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 1904735391 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 
= 299353^ 16 * 375649793 * 2^31 -.word 65310821 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31 -.word 3561709979 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31 -.word 1555569 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31 -.word 2602610063 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31 -.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 6082985 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31 -.word 869255255 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31 -.word 9182701 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31 -.word 1779115027 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31 -.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 36605521 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31 -.word 1102879663 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31 -.word 57238029 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31 -.word 2941722611 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31 -.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 41992621 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31 -.word 2138832979 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31 -.word 53062965 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31 -.word 1726375115 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 7518129 // 
zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 42676979 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31 -.word 2760264461 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31 -.word 5740163 // zeta^ 20 * 2^31 = 299353^ 20 * 2^31 = 24739198 * 2^31 -.word 
1583187837 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 20 * 375649793 * 2^31 -.word 28917839 // zeta^212 * 2^31 = 299353^212 * 2^31 = 21478846 * 2^31 -.word 1374541233 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 299353^212 * 375649793 * 2^31 -.word 27097661 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31 -.word 659875779 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31 -.word 49145461 // zeta^116 * 2^31 = 299353^116 * 2^31 = 13729478 * 2^31 -.word 878619531 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 299353^116 * 375649793 * 2^31 -.word 6303215 // zeta^308 * 2^31 = 299353^308 * 2^31 = 18367002 * 2^31 -.word 1175398417 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 299353^308 * 375649793 * 2^31 -.word 4639589 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31 -.word 2721523355 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31 -.word 54366111 // zeta^ 68 * 2^31 = 299353^ 68 * 2^31 = 8457503 * 2^31 -.word 2688722529 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 68 * 375649793 * 2^31 -.word 43137743 // zeta^260 * 2^31 = 299353^260 * 2^31 = 29589567 * 2^31 -.word 4041071409 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 299353^260 * 375649793 * 2^31 -.word 56908961 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 3814059359 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 48357821 // zeta^164 * 2^31 = 299353^164 * 2^31 = 26244564 * 2^31 -.word 1679523907 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 299353^164 * 375649793 * 2^31 -.word 41080969 // zeta^356 * 2^31 = 299353^356 * 2^31 = 7994472 * 2^31 -.word 511607159 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 299353^356 * 375649793 * 2^31 -.word 7108001 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31 -.word 1934087263 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31 -.word 8652081 // zeta^ 44 * 2^31 = 299353^ 44 * 2^31 = 27932647 * 2^31 -.word 3935036623 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 
299353^ 44 * 375649793 * 2^31 -.word 44314847 // zeta^236 * 2^31 = 299353^236 * 2^31 = 10003728 * 2^31 -.word 640189729 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 299353^236 * 375649793 * 2^31 -.word 16267963 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 2923893573 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 299353^280 * 375649793 * 2^31 -.word 16352265 // zeta^140 * 2^31 = 299353^140 * 2^31 = 26391350 * 2^31 -.word 1688917495 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 299353^140 * 375649793 * 2^31 -.word 948813 // zeta^332 * 2^31 = 299353^332 * 2^31 = 11703708 * 2^31 -.word 748980147 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 299353^332 * 375649793 * 2^31 -.word 28966165 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31 -.word 2549404907 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31 -.word 44334383 // zeta^ 92 * 2^31 = 299353^ 92 * 2^31 = 31954666 * 2^31 -.word 2044942545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 92 * 375649793 * 2^31 -.word 64874787 // zeta^284 * 2^31 = 299353^284 * 2^31 = 5130075 * 2^31 -.word 2475783389 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 299353^284 * 375649793 * 2^31 -.word 15324513 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 1904735391 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 62902951 // zeta^188 * 2^31 = 299353^188 * 2^31 = 22872479 * 2^31 -.word 3611210585 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 299353^188 * 375649793 * 2^31 -.word 53337279 // zeta^380 * 2^31 = 299353^380 * 2^31 = 9132318 * 2^31 -.word 584423745 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 299353^380 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_768_u32_33556993_299353_incomplete_double, %function -.global ntt_768_u32_33556993_299353_incomplete_double -ntt_768_u32_33556993_299353_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add 
r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -// Use r11 as marker for r0 + 3024 -add r11, r12, #1008 -.equ modulus, 33556993 -movw r10, #:lower16:modulus -movt r10, #:upper16:modulus -ldr r9, roots_addr -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[512]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 8)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r8 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r7 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r10 -vsub.s32 Q4, Q0, Q1 -// Release input[512] from Q1 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vadd.s32 Q6, Q4, Q3 -// input[516]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 12)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r12,#(32)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[260]: Already loaded as Q1 -// input[516]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r8 -vadd.s32 Q4, Q1, Q7 -// Release input[260] from Q1 -vmul.u32 Q3, Q0, r7 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q3, r10 -vsub.s32 Q3, Q1, Q7 -// Release input[516] from Q7 -// input[264]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 12)] -vadd.s32 Q6, Q3, Q2 -// input[520]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(32)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r12,#(48)] -vadd.s32 Q4, Q4, Q1 -// Release input[4] from Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[264]: Already loaded as Q5 -// input[520]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[264] from Q5 -vmul.u32 Q2, Q0, r7 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[520] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 
-// input[524]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(48)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(64)] -vadd.s32 Q3, Q3, Q4 -// Release input[8] from Q4 -vstrw.u32 Q3, [r0,#(32)] -// input[268]: Already loaded as Q5 -// input[524]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r7 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[524] from Q7 -// input[272]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[528]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[272]: Already loaded as Q5 -// input[528]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[272] from Q5 -vmul.u32 Q2, Q0, r7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[528] from Q7 -// input[276]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[532]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(96)] -vadd.s32 Q3, Q3, Q4 -// Release input[16] from Q4 -vstrw.u32 Q3, [r0,#(64)] -// input[276]: Already loaded as Q5 -// input[532]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[276] from Q5 -vmul.u32 Q2, Q0, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[532] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[536]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q6, [r14,#(96)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(112)] -vadd.s32 Q3, Q3, Q4 -// Release input[20] from Q4 -vstrw.u32 Q3, [r0,#(80)] -// input[280]: Already loaded as Q5 -// input[536]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r7 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[536] from Q7 -// input[284]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[540]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[284]: Already loaded as Q5 -// input[540]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[284] from Q5 -vmul.u32 Q2, Q0, r7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[540] from Q7 -// input[288]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[544]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(144)] -vadd.s32 Q3, Q3, Q4 -// Release input[28] from Q4 -vstrw.u32 Q3, [r0,#(112)] -// input[288]: Already loaded as Q5 -// input[544]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[288] from Q5 -vmul.u32 Q2, Q0, r7 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[544] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[548]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(144)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r12,#(160)] -vadd.s32 Q3, Q3, Q4 -// Release input[32] from Q4 -vstrw.u32 Q3, [r0,#(128)] -// input[292]: Already loaded as Q5 -// input[548]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[548] from Q7 -// input[296]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[552]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[296]: Already loaded as Q5 -// input[552]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[296] from Q5 -vmul.u32 Q2, Q0, r7 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[552] from Q7 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[556]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(192)] -vadd.s32 Q3, Q3, Q4 -// Release input[40] from Q4 -vstrw.u32 Q3, [r0,#(160)] -// input[300]: Already loaded as Q5 -// input[556]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[300] from Q5 -vmul.u32 Q2, Q0, r7 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[556] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[560]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(192)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(208)] -vadd.s32 Q3, Q3, Q4 -// Release input[44] from Q4 
-vstrw.u32 Q3, [r0,#(176)] -// input[304]: Already loaded as Q5 -// input[560]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r7 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[560] from Q7 -// input[308]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// input[564]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[308]: Already loaded as Q5 -// input[564]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[308] from Q5 -vmul.u32 Q2, Q0, r7 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[564] from Q7 -// input[312]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[568]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(240)] -vadd.s32 Q3, Q3, Q4 -// Release input[52] from Q4 -vstrw.u32 Q3, [r0,#(208)] -// input[312]: Already loaded as Q5 -// input[568]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[312] from Q5 -vmul.u32 Q2, Q0, r7 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[568] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[572]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(240)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(256)] -vadd.s32 Q3, Q3, Q4 -// Release input[56] from Q4 -vstrw.u32 Q3, [r0,#(224)] -// input[316]: Already loaded as Q5 -// 
input[572]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[572] from Q7 -// input[320]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// input[576]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[320]: Already loaded as Q5 -// input[576]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[320] from Q5 -vmul.u32 Q2, Q0, r7 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[576] from Q7 -// input[324]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[580]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(288)] -vadd.s32 Q3, Q3, Q4 -// Release input[64] from Q4 -vstrw.u32 Q3, [r0,#(256)] -// input[324]: Already loaded as Q5 -// input[580]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[324] from Q5 -vmul.u32 Q2, Q0, r7 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[580] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[584]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(288)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(304)] -vadd.s32 Q3, Q3, Q4 -// Release input[68] from Q4 -vstrw.u32 Q3, [r0,#(272)] -// input[328]: Already loaded as Q5 -// input[584]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 
Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r7 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[584] from Q7 -// input[332]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// input[588]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[332]: Already loaded as Q5 -// input[588]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[332] from Q5 -vmul.u32 Q2, Q0, r7 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[588] from Q7 -// input[336]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[592]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(336)] -vadd.s32 Q3, Q3, Q4 -// Release input[76] from Q4 -vstrw.u32 Q3, [r0,#(304)] -// input[336]: Already loaded as Q5 -// input[592]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[336] from Q5 -vmul.u32 Q2, Q0, r7 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[592] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[596]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(336)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(352)] -vadd.s32 Q3, Q3, Q4 -// Release input[80] from Q4 -vstrw.u32 Q3, [r0,#(320)] -// input[340]: Already loaded as Q5 -// input[596]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 
-vmul.u32 Q2, Q0, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[596] from Q7 -// input[344]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[600]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[344]: Already loaded as Q5 -// input[600]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[344] from Q5 -vmul.u32 Q2, Q0, r7 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[600] from Q7 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[604]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(384)] -vadd.s32 Q3, Q3, Q4 -// Release input[88] from Q4 -vstrw.u32 Q3, [r0,#(352)] -// input[348]: Already loaded as Q5 -// input[604]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[348] from Q5 -vmul.u32 Q2, Q0, r7 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[604] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[608]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(384)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(400)] -vadd.s32 Q3, Q3, Q4 -// Release input[92] from Q4 -vstrw.u32 Q3, [r0,#(368)] -// input[352]: Already loaded as Q5 -// input[608]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r7 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, 
#(4 * 96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[608] from Q7 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[612]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[356]: Already loaded as Q5 -// input[612]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[356] from Q5 -vmul.u32 Q2, Q0, r7 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[612] from Q7 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[616]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(432)] -vadd.s32 Q3, Q3, Q4 -// Release input[100] from Q4 -vstrw.u32 Q3, [r0,#(400)] -// input[360]: Already loaded as Q5 -// input[616]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[360] from Q5 -vmul.u32 Q2, Q0, r7 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[616] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[620]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(432)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(448)] -vadd.s32 Q3, Q3, Q4 -// Release input[104] from Q4 -vstrw.u32 Q3, [r0,#(416)] -// input[364]: Already loaded as Q5 -// input[620]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r7 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, 
Q7 -// Release input[620] from Q7 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q6, Q2, Q1 -// input[624]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[368]: Already loaded as Q5 -// input[624]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[368] from Q5 -vmul.u32 Q2, Q0, r7 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[624] from Q7 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[628]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(480)] -vadd.s32 Q3, Q3, Q4 -// Release input[112] from Q4 -vstrw.u32 Q3, [r0,#(448)] -// input[372]: Already loaded as Q5 -// input[628]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[372] from Q5 -vmul.u32 Q2, Q0, r7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[628] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[632]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(480)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(496)] -vadd.s32 Q3, Q3, Q4 -// Release input[116] from Q4 -vstrw.u32 Q3, [r0,#(464)] -// input[376]: Already loaded as Q5 -// input[632]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r7 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[632] from Q7 -// input[380]: Load 
as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q6, Q2, Q1 -// input[636]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[380]: Already loaded as Q5 -// input[636]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[380] from Q5 -vmul.u32 Q2, Q0, r7 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[636] from Q7 -// input[384]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -120)] -vadd.s32 Q6, Q2, Q1 -// input[640]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q3, Q3, Q4 -// Release input[124] from Q4 -vstrw.u32 Q3, [r0,#(496)] -// input[384]: Already loaded as Q5 -// input[640]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[384] from Q5 -vmul.u32 Q2, Q0, r7 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -124)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[640] from Q7 -// input[388]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[644]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-480)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-464)] -vadd.s32 Q3, Q3, Q4 -// Release input[128] from Q4 -vstrw.u32 Q3, [r14,#(-496)] -// input[388]: Already loaded as Q5 -// input[644]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[388] from Q5 -vmul.u32 Q2, Q0, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[644] from Q7 -// input[392]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 
-112)] -vadd.s32 Q6, Q2, Q1 -// input[648]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[392]: Already loaded as Q5 -// input[648]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[392] from Q5 -vmul.u32 Q2, Q0, r7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[648] from Q7 -// input[396]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -108)] -vadd.s32 Q6, Q2, Q1 -// input[652]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q3, Q3, Q4 -// Release input[136] from Q4 -vstrw.u32 Q3, [r14,#(-464)] -// input[396]: Already loaded as Q5 -// input[652]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[396] from Q5 -vmul.u32 Q2, Q0, r7 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[652] from Q7 -// input[400]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[656]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-432)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q4 -// Release input[140] from Q4 -vstrw.u32 Q3, [r14,#(-448)] -// input[400]: Already loaded as Q5 -// input[656]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[400] from Q5 -vmul.u32 Q2, Q0, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[656] from Q7 -// input[404]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -100)] -vadd.s32 Q6, Q2, Q1 -// 
input[660]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[404]: Already loaded as Q5 -// input[660]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[404] from Q5 -vmul.u32 Q2, Q0, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[660] from Q7 -// input[408]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -96)] -vadd.s32 Q6, Q2, Q1 -// input[664]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q4 -// Release input[148] from Q4 -vstrw.u32 Q3, [r14,#(-416)] -// input[408]: Already loaded as Q5 -// input[664]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[408] from Q5 -vmul.u32 Q2, Q0, r7 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[664] from Q7 -// input[412]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[668]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-384)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q4 -// Release input[152] from Q4 -vstrw.u32 Q3, [r14,#(-400)] -// input[412]: Already loaded as Q5 -// input[668]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[412] from Q5 -vmul.u32 Q2, Q0, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[668] from Q7 -// input[416]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -88)] -vadd.s32 Q6, Q2, Q1 -// input[672]: Load as Q7 -vldrw.u32 Q7, 
[r11, #(4 * -84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[416]: Already loaded as Q5 -// input[672]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[416] from Q5 -vmul.u32 Q2, Q0, r7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[672] from Q7 -// input[420]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -84)] -vadd.s32 Q6, Q2, Q1 -// input[676]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q4 -// Release input[160] from Q4 -vstrw.u32 Q3, [r14,#(-368)] -// input[420]: Already loaded as Q5 -// input[676]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[420] from Q5 -vmul.u32 Q2, Q0, r7 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[676] from Q7 -// input[424]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[680]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-336)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q4 -// Release input[164] from Q4 -vstrw.u32 Q3, [r14,#(-352)] -// input[424]: Already loaded as Q5 -// input[680]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[424] from Q5 -vmul.u32 Q2, Q0, r7 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[680] from Q7 -// input[428]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -76)] -vadd.s32 Q6, Q2, Q1 -// input[684]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q6, [r12,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[428]: Already loaded as Q5 -// input[684]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[428] from Q5 -vmul.u32 Q2, Q0, r7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[684] from Q7 -// input[432]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -72)] -vadd.s32 Q6, Q2, Q1 -// input[688]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q4 -// Release input[172] from Q4 -vstrw.u32 Q3, [r14,#(-320)] -// input[432]: Already loaded as Q5 -// input[688]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[432] from Q5 -vmul.u32 Q2, Q0, r7 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[688] from Q7 -// input[436]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[692]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-288)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q4 -// Release input[176] from Q4 -vstrw.u32 Q3, [r14,#(-304)] -// input[436]: Already loaded as Q5 -// input[692]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[436] from Q5 -vmul.u32 Q2, Q0, r7 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[692] from Q7 -// input[440]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -64)] -vadd.s32 Q6, Q2, Q1 -// input[696]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-272)] -vsub.s32 
Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[440]: Already loaded as Q5
-// input[696]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[440] from Q5
-vmul.u32 Q2, Q0, r7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[696] from Q7
-// input[444]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -60)]
-vadd.s32 Q6, Q2, Q1
-// input[700]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-240)]
-vadd.s32 Q3, Q3, Q4
-// Release input[184] from Q4
-vstrw.u32 Q3, [r14,#(-272)]
-// input[444]: Already loaded as Q5
-// input[700]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[444] from Q5
-vmul.u32 Q2, Q0, r7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[700] from Q7
-// input[448]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[704]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-240)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[188] from Q4
-vstrw.u32 Q3, [r14,#(-256)]
-// input[448]: Already loaded as Q5
-// input[704]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[448] from Q5
-vmul.u32 Q2, Q0, r7
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[704] from Q7
-// input[452]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -52)]
-vadd.s32 Q6, Q2, Q1
-// input[708]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -48)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[452]: Already loaded as Q5
-// input[708]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[452] from Q5
-vmul.u32 Q2, Q0, r7
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[708] from Q7
-// input[456]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -48)]
-vadd.s32 Q6, Q2, Q1
-// input[712]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-192)]
-vadd.s32 Q3, Q3, Q4
-// Release input[196] from Q4
-vstrw.u32 Q3, [r14,#(-224)]
-// input[456]: Already loaded as Q5
-// input[712]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[456] from Q5
-vmul.u32 Q2, Q0, r7
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[712] from Q7
-// input[460]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[716]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-192)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[200] from Q4
-vstrw.u32 Q3, [r14,#(-208)]
-// input[460]: Already loaded as Q5
-// input[716]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[460] from Q5
-vmul.u32 Q2, Q0, r7
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[716] from Q7
-// input[464]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -40)]
-vadd.s32 Q6, Q2, Q1
-// input[720]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -36)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[464]: Already loaded as Q5
-// input[720]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[464] from Q5
-vmul.u32 Q2, Q0, r7
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[720] from Q7
-// input[468]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -36)]
-vadd.s32 Q6, Q2, Q1
-// input[724]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-144)]
-vadd.s32 Q3, Q3, Q4
-// Release input[208] from Q4
-vstrw.u32 Q3, [r14,#(-176)]
-// input[468]: Already loaded as Q5
-// input[724]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[468] from Q5
-vmul.u32 Q2, Q0, r7
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[724] from Q7
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[728]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-144)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[212] from Q4
-vstrw.u32 Q3, [r14,#(-160)]
-// input[472]: Already loaded as Q5
-// input[728]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[472] from Q5
-vmul.u32 Q2, Q0, r7
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[728] from Q7
-// input[476]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -28)]
-vadd.s32 Q6, Q2, Q1
-// input[732]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -24)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[476]: Already loaded as Q5
-// input[732]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[476] from Q5
-vmul.u32 Q2, Q0, r7
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[732] from Q7
-// input[480]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -24)]
-vadd.s32 Q6, Q2, Q1
-// input[736]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-96)]
-vadd.s32 Q3, Q3, Q4
-// Release input[220] from Q4
-vstrw.u32 Q3, [r14,#(-128)]
-// input[480]: Already loaded as Q5
-// input[736]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[480] from Q5
-vmul.u32 Q2, Q0, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[736] from Q7
-// input[484]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[740]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-96)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[224] from Q4
-vstrw.u32 Q3, [r14,#(-112)]
-// input[484]: Already loaded as Q5
-// input[740]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[484] from Q5
-vmul.u32 Q2, Q0, r7
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[740] from Q7
-// input[488]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -16)]
-vadd.s32 Q6, Q2, Q1
-// input[744]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -12)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[488]: Already loaded as Q5
-// input[744]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[488] from Q5
-vmul.u32 Q2, Q0, r7
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[744] from Q7
-// input[492]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -12)]
-vadd.s32 Q6, Q2, Q1
-// input[748]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-48)]
-vadd.s32 Q3, Q3, Q4
-// Release input[232] from Q4
-vstrw.u32 Q3, [r14,#(-80)]
-// input[492]: Already loaded as Q5
-// input[748]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[492] from Q5
-vmul.u32 Q2, Q0, r7
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[748] from Q7
-// input[496]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[752]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-48)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[236] from Q4
-vstrw.u32 Q3, [r14,#(-64)]
-// input[496]: Already loaded as Q5
-// input[752]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[496] from Q5
-vmul.u32 Q2, Q0, r7
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[752] from Q7
-// input[500]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -4)]
-vadd.s32 Q6, Q2, Q1
-// input[756]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 0)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[500]: Already loaded as Q5
-// input[756]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[500] from Q5
-vmul.u32 Q2, Q0, r7
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -8)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[756] from Q7
-// input[504]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 0)]
-vadd.s32 Q6, Q2, Q1
-// input[760]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(0)]
-vadd.s32 Q3, Q3, Q4
-// Release input[244] from Q4
-vstrw.u32 Q3, [r14,#(-32)]
-// input[504]: Already loaded as Q5
-// input[760]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[504] from Q5
-vmul.u32 Q2, Q0, r7
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[760] from Q7
-// input[508]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[764]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(0)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[248] from Q4
-vstrw.u32 Q3, [r14,#(-16)]
-// input[508]: Already loaded as Q5
-// input[764]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[508] from Q5
-vmul.u32 Q2, Q0, r7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[764] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r12,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r8
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q0, Q0, r7
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r10
-vqrdmulh.s32 Q4, Q2, r8
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r10
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r10
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q3, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r10
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vmul.u32 Q4, Q4, r7
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r10
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r10
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r10
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q3, r6
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r10
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r7
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q0, Q0, r7
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r7
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r7
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r7
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r7
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r7
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r7
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[448]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[384]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -120)]
-vmul.u32 Q2, Q2, r7
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[452]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-480)]
-// Release input[384] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[452]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[388]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -116)]
-vmul.u32 Q1, Q1, r7
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(16)]
-// Release input[256] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-208)]
-// Release input[452] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-464)]
-// Release input[388] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(288)]
-// Release input[324] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[456]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[392]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -112)]
-vmul.u32 Q0, Q0, r7
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[396]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -108)]
-vmul.u32 Q2, Q2, r7
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[268]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[464]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-432)]
-// Release input[396] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[464]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q1, Q1, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(64)]
-// Release input[268] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[468]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-160)]
-// Release input[464] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[468]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[404]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -100)]
-vmul.u32 Q0, Q0, r7
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[276]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-144)]
-// Release input[468] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-400)]
-// Release input[404] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(96)]
-// Release input[276] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[476]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[476]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -92)]
-vmul.u32 Q1, Q1, r7
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[480]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-112)]
-// Release input[476] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-368)]
-// Release input[412] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[480]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-96)]
-// Release input[480] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r10
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[488]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[488]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[492]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-64)]
-// Release input[488] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[492]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[428]:
Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-48)] -// Release input[492] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vmul.u32 Q2, Q2, r7 -// input[368]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(464)] -// Release input[368] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[436]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -68)] -vmul.u32 Q1, Q1, 
r7 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[504]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-272)] -// Release input[436] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[504]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(0)] -// Release input[504] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[380]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -124)] 
-vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[704]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-240)] -// Release input[444] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-496)] -// Release input[380] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[704]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vmul.u32 Q1, Q1, r7 -// input[576]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[708]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-208)] -// Release input[704] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(288)] -// Release input[576] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[708]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmul.u32 Q0, Q0, r7 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, 
#(4 * 76)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[516]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[712]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-192)] -// Release input[708] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[712]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r7 -// input[584]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(48)] -// Release input[516] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[520]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-176)] -// Release input[712] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(320)] -// Release input[584] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[652]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmul.u32 Q1, Q1, r7 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, 
[r12,#(64)] -// Release input[520] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[720]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-416)] -// Release input[652] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(336)] -// Release input[588] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[720]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[656]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vmul.u32 Q0, Q0, r7 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-144)] -// Release input[720] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[656] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[724]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 Q1, 
Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[532]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-128)] -// Release input[724] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[728]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r7 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(112)] -// Release input[532] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[536]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-112)] -// Release input[728] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q0, Q0, r7 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(128)] -// Release input[536] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 
Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[540]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[736]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[736]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[608]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(144)] -// Release input[540] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[740]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-80)] -// Release input[736] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(416)] -// Release input[608] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[740]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(160)] -// Release input[544] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[548]: Load as 
Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[744]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-64)] -// Release input[740] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[744]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[616]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(176)] -// Release input[548] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-48)] -// Release input[744] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(448)] -// Release input[616] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r7 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(192)] -// Release input[552] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[556]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 
-vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[752]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[752]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[688]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vmul.u32 Q1, Q1, r7 -// input[624]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(208)] -// Release input[556] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[756]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-16)] -// Release input[752] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release input[688] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(480)] -// Release input[624] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[756]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[692]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[564]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 
-vqrdmlah.s32 Q5, Q0, r10 -// input[760]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(0)] -// Release input[756] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-256)] -// Release input[692] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[760]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[632]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -124)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(240)] -// Release input[564] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[568]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(16)] -// Release input[760] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-496)] -// Release input[632] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmul.u32 Q1, Q1, r7 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(256)] -// Release input[568] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[48]: Load as Q0 -vldrw.u32 
Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-480)] -// Release input[636] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[48]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q2, Q2, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] 
-vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[112]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q2, Q2, r7 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[116]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q1, Q1, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(464)] -// Release 
input[116] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[120]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r7 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(480)] -// Release input[120] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(432)] -// Release 
input[108] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[176]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[180]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[180]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-288)] -// Release input[180] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, 
[r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[136]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(-464)] -// Release input[136] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 
-vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q2, Q2, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-96)] -// Release 
input[228] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, 
[r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[304]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vmul.u32 Q2, Q2, r7 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(80)] -// Release input[272] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[312]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, 
[r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[312]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(240)] -// Release input[312] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmul.u32 Q2, Q2, r7 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[268]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[368]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd 
r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[368]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q1, Q1, r7 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(64)] -// Release input[268] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[372]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(464)] -// Release input[368] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[372]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmul.u32 Q0, Q0, r7 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(480)] -// Release input[372] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q1, Q1, Q6 
-// input[376]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[380]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[380]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q1, Q1, r7 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[432]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-496)] -// Release input[380] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, 
[r9], #+8 -// input[432]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q0, Q0, r7 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[384]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[436]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-288)] -// Release input[432] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-416)] -// Release input[400] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[436]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-480)] -// Release input[384] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[388]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[440]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-272)] -// Release input[436] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[440]: Already loaded as Q1 
-vqrdmulh.s32 Q2, Q1, r8 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-464)] -// Release input[388] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[392]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[444]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-256)] -// Release input[440] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[444]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-448)] -// Release input[392] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[396]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-240)] -// Release input[444] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[496]: 
Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vmul.u32 Q2, Q2, r7 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-432)] -// Release input[396] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[448]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-160)] -// Release input[464] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-224)] -// Release input[448] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[452]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[504]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[504]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// 
input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-208)] -// Release input[452] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[456]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(0)] -// Release input[504] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q2, Q2, r7 -// input[476]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -28)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-192)] -// Release input[456] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[460]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[560]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-112)] -// Release input[476] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[560]: Already loaded as Q1 -vqrdmulh.s32 Q2, 
Q1, r8 -// input[544]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[528]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-176)] -// Release input[460] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[564]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(224)] -// Release input[560] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(160)] -// Release input[544] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(96)] -// Release input[528] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[564]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[532]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[516]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[568]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(240)] -// Release input[564] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[568]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] 
-vmul.u32 Q2, Q2, r7 -// input[536]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(48)] -// Release input[516] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[520]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(256)] -// Release input[568] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(128)] -// Release input[536] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[572]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vmul.u32 Q1, Q1, r7 -// input[540]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(64)] -// Release input[520] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[624]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(208)] -// Release input[556] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(144)] -// Release input[540] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[624]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[608]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 
104)] -vmul.u32 Q0, Q0, r7 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[576]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[628]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(480)] -// Release input[624] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(416)] -// Release input[608] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[628]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(288)] -// Release input[576] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[580]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[632]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(496)] -// Release input[628] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[632]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r7 -// input[600]: Load as Q4 -vldrw.u32 
Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(304)] -// Release input[580] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[584]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-496)] -// Release input[632] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q0, Q0, r7 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(320)] -// Release input[584] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[588]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[688]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(464)] -// Release input[620] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[688]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[656]: Load as Q4 
-vldrw.u32 Q4, [r11, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(336)] -// Release input[588] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[692]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-272)] -// Release input[688] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[692]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-464)] -// Release input[640] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[644]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[696]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-256)] -// Release input[692] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[696]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vqrdmlah.s32 Q1, 
Q0, r10 -vstrw.u32 Q2, [r11,#(-448)] -// Release input[644] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[648]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-240)] -// Release input[696] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q2, Q2, r7 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r11,#(-432)] -// Release input[648] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[652]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[752]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[752]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[736]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 
-36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-416)] -// Release input[652] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[756]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-16)] -// Release input[752] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-144)] -// Release input[720] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[756]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[740]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[708]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[760]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(0)] -// Release input[756] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-64)] -// Release input[740] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[760]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q2, Q2, r7 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r11,#(-192)] 
-// Release input[708] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r10
-// input[712]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10
-// input[764]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(16)]
-// Release input[760] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-112)]
-// Release input[728] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[764]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vmul.u32 Q1, Q1, r7
-// input[732]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -24)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q0, [r11,#(-176)]
-// Release input[712] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r10
-// input[716]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r10
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(32)]
-// Release input[764] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-32)]
-// Release input[748] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-96)]
-// Release input[732] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[12]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r8
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vmul.u32 Q0, Q0, r7
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-160)]
-// Release input[716] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r10
-// input[28]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vstrw.u32 Q0, [r1,#(96)]
-vqrdmulh.s32 Q7, Q0, r4
-vadd.s32 Q3, Q3, Q5
-vmul.u32 Q0, Q0, r3
-vstrw.u32 Q3, [r1,#(64)]
-vqrdmlah.s32 Q7, Q0, r10
-// Release input[12] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(112)]
-vqrdmulh.s32 Q7, Q3, r4
-vsub.s32 Q4, Q1, Q6
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q4, [r1,#(32)]
-vqrdmlah.s32 Q7, Q3, r10
-vstrw.u32 Q7, [r1,#(80)]
-// Release input[8] from Q3
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q6
-vmul.u32 Q4, Q4, r5
-vstrw.u32 Q1, [r1,#(0)]!
-vqrdmlah.s32 Q7, Q4, r10
-vneg.s32 Q7, Q7
-// Release input[4] from Q4
-vqrdmulh.s32 Q0, Q1, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q1, Q1, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q0, [r1,#(16)]
-// Release input[0] from Q1
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[28]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r8
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vmul.u32 Q2, Q2, r7
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmlah.s32 Q0, Q2, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q2, Q3, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q2, r10
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q3, r6
-vsub.s32 Q2, Q1, Q5
-vmul.u32 Q3, Q3, r5
-vstrw.u32 Q2, [r1,#(224)]
-vqrdmulh.s32 Q7, Q2, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q2, Q2, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q2, r10
-// Release input[28] from Q2
-vqrdmlah.s32 Q6, Q3, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q3, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q3, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q7, Q3, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q3, Q3, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q3, r10
-vneg.s32 Q7, Q7
-// Release input[20] from Q3
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[16] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[44]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vmul.u32 Q4, Q4, r7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[44] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[36] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[32] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[60]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[56]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 56)]
-vmul.u32 Q3, Q3, r7
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[60] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[56] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[52] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[48] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[76]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vmul.u32 Q4, Q4, r7
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[76] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[68] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[64] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[92]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[88]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 88)]
-vmul.u32 Q3, Q3, r7
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[92] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[88] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[84] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[80] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[108]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 104)]
-vmul.u32 Q4, Q4, r7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[108] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[104] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[100] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[96] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[124]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vmul.u32 Q3, Q3, r7
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[124] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[116] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[112] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[140]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vmul.u32 Q4, Q4, r7
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[140] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[132] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[128] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[156]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -100)]
-vmul.u32 Q3, Q3, r7
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[156] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[152] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[148] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[144] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[172]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vmul.u32 Q4, Q4, r7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[172] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[168] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[164] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[160] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[188]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-vmul.u32 Q3, Q3, r7
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[188] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[184] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[180] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[176] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[204]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vmul.u32 Q4, Q4, r7
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[204] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[196] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[192] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[220]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vmul.u32 Q3, Q3, r7
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[220] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[212] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[208] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[236]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vmul.u32 Q4, Q4, r7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[236] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[232] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[228] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[224] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[252]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vmul.u32 Q3, Q3, r7
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[252] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[248] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[244] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[240] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[268]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vmul.u32 Q4, Q4, r7
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[268] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[264] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[260] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[256] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[284]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[280]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 28)]
-vmul.u32 Q3, Q3, r7
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[272]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[284] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[280] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[276] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[272] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[300]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vmul.u32 Q4, Q4, r7
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[316]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[300] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[296] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[292] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[288] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[316]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q3, Q3, r7
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[316] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[312] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[308] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[304] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[332]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[328]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 76)]
-vmul.u32 Q4, Q4, r7
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[320]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[332] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[324] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[320] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r7 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[336]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[348] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[344] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[340] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[336] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q4, Q4, r7 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[352]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[364] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[360] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[356] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[352] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vmul.u32 Q3, Q3, r7 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[380] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[376] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[372] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[368] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[396]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[392]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -112)] -vmul.u32 Q4, Q4, r7 -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[384]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[396] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[392] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[388] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[384] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[412]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[408]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -96)] -vmul.u32 Q3, Q3, r7 -// input[404]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[400]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[412] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[408] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[404] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[400] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[428]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[424]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -80)] -vmul.u32 Q4, Q4, r7 -// input[420]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -84)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[416]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -88)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[444]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[428] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[424] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[420] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[416] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[444]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[440]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -64)] -vmul.u32 Q3, Q3, r7 -// input[436]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -68)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[432]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -72)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[460]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[444] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[440] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[436] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[432] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[460]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[456]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -48)] -vmul.u32 Q4, Q4, r7 -// input[452]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -52)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[448]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[476]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[460] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[456] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[452] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[448] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[476]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[472]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -32)] -vmul.u32 Q3, Q3, r7 -// input[468]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -36)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[464]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -40)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[476] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[472] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[468] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[464] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[492]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[488]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -16)] -vmul.u32 Q4, Q4, r7 -// input[484]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -20)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[480]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -24)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[508]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[492] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[488] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[484] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[480] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[508]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[504]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 0)] -vmul.u32 Q3, Q3, r7 -// input[500]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -4)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[496]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -8)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[524]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[508] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[504] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[500] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[496] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[524]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[520]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)] -vmul.u32 Q4, Q4, r7 -// input[516]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[512]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[524] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[520] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[516] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[512] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[540]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[536]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 32)] -vmul.u32 Q3, Q3, r7 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[540] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[536] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[532] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[528] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[556]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 48)] -vmul.u32 Q4, Q4, r7 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[572]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[556] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[552] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[548] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[544] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[572]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[568]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 64)] -vmul.u32 Q3, Q3, r7 -// input[564]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 60)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[560]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[588]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 84)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[572] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[568] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[564] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[560] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[588]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[584]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 80)] -vmul.u32 Q4, Q4, r7 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[588] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[584] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[580] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[576] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[604]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[600]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 96)] -vmul.u32 Q3, Q3, r7 -// input[596]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[592]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[604] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[600] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[596] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[592] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[620]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[616]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 112)] -vmul.u32 Q4, Q4, r7 -// input[612]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[608]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[620] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[616] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[612] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[608] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[636]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[632]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q3, Q3, r7 -// input[628]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[624]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[636] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[632] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[628] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[624] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[652]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[648]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vmul.u32 Q4, Q4, r7 -// input[644]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[652] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[648] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[644] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[640] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[668]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[664]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q3, Q3, r7 -// input[660]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[656]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[668] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[664] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[660] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[656] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[684]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[680]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -76)] -vmul.u32 Q4, Q4, r7 -// input[676]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[672]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[700]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[684] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[680] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[676] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[672] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[700]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[696]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -60)] -vmul.u32 Q3, Q3, r7 -// input[692]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[716]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[700] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[696] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[692] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[688] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[716]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[712]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vmul.u32 Q4, Q4, r7 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[716] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[712] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[708] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[704] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[732]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[728]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q3, Q3, r7 -// input[724]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[720]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[732] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[728] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[724] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[720] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[748]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -12)] -vmul.u32 Q4, Q4, r7 -// input[740]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[764]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[748] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[744] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[740] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[736] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[764]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vmul.u32 Q3, Q3, r7 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -vqrdmulh.s32 Q4, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q6, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q6, Q3, r10 -// Release input[764] from Q3 -vqrdmlah.s32 Q4, Q2, r10 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r1,#(240)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q6, Q1, r10 -vstrw.u32 Q6, [r1,#(208)] -// Release input[760] from Q1 -vqrdmulh.s32 Q6, Q2, r6 -vadd.s32 Q0, Q0, Q4 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q6, Q6 -// Release input[756] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q6, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[752] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -.equ modulus_inv, 3919317503 -movw r8, #:lower16:modulus_inv -movt r8, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 8334 -// Instruction count: 6522 \ No newline at end of file diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good.s deleted file mode 100644 index 623ebe7..0000000 --- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good.s +++ /dev/null @@ -1,7130 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_768_u32_33556993_299353_incomplete_good_twiddles -ntt_768_u32_33556993_299353_incomplete_good_twiddles: // For base multiplication -.word 22568483 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2863202269 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 14800813 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31 -.word 3019303507 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31 -.word 29445319 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 1937930553 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 490723 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 2973981 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 2600037 // zeta^544 * 2^31 = 299353^544 * 2^31 = 29356361 * 2^31 -.word 2267039131 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 299353^544 * 375649793 * 2^31 -.word 50737913 // zeta^416 * 2^31 = 299353^416 * 2^31 = 32616688 * 2^31 -.word 3954411271 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 299353^416 * 375649793 * 2^31 -.word 12865833 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 1885554903 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 25786061 // zeta^736 * 2^31 = 299353^736 * 2^31 = 23624597 * 2^31 -.word 1460046643 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 299353^736 * 375649793 * 2^31 -.word 23929393 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2965686223 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 64126327 // 
zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 4170721417 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 7890165 // zeta^592 * 2^31 = 299353^592 * 2^31 = 21166324 * 2^31 -.word 902114571 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 299353^592 * 375649793 * 2^31 -.word 12060473 // zeta^464 * 2^31 = 299353^464 * 2^31 = 518908 * 2^31 -.word 2001482439 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 299353^464 * 375649793 * 2^31 -.word 22718411 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 2347189813 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 42631425 // zeta^688 * 2^31 = 299353^688 * 2^31 = 15739856 * 2^31 -.word 3961227519 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 299353^688 * 375649793 * 2^31 -.word 59046925 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 2604631539 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 21096031 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 1788989345 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 13060269 // zeta^520 * 2^31 = 299353^520 * 2^31 = 24586938 * 2^31 -.word 2256238931 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 299353^520 * 375649793 * 2^31 -.word 42196639 // zeta^392 * 2^31 = 299353^392 * 2^31 = 11458020 * 2^31 -.word 2215794529 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 299353^392 * 375649793 * 2^31 -.word 21384651 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31 -.word 3900153909 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31 -.word 59814003 // zeta^712 * 2^31 = 299353^712 * 2^31 = 7514760 * 2^31 -.word 3563572621 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 299353^712 * 375649793 * 2^31 -.word 37926881 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31 -.word 1862799903 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31 -.word 15144207 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 
2^31 -.word 2182054129 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31 -.word 34389191 // zeta^616 * 2^31 = 299353^616 * 2^31 = 23245647 * 2^31 -.word 2778138937 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 299353^616 * 375649793 * 2^31 -.word 61314877 // zeta^488 * 2^31 = 299353^488 * 2^31 = 19973843 * 2^31 -.word 913099459 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 299353^488 * 375649793 * 2^31 -.word 5537745 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31 -.word 1761978927 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31 -.word 45573613 // zeta^664 * 2^31 = 299353^664 * 2^31 = 21424662 * 2^31 -.word 4172274707 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 299353^664 * 375649793 * 2^31 -.word 41269183 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31 -.word 1713094209 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31 -.word 8430951 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31 -.word 2411290777 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 47819683 // zeta^568 * 2^31 = 299353^568 * 2^31 = 27276494 * 2^31 -.word 1882108509 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 299353^568 * 375649793 * 2^31 -.word 3519557 // zeta^440 * 2^31 = 299353^440 * 2^31 = 16323183 * 2^31 -.word 3219193275 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 299353^440 * 375649793 * 2^31 -.word 16723085 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31 -.word 1906988403 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31 -.word 23154315 // zeta^760 * 2^31 = 299353^760 * 2^31 = 3793231 * 2^31 -.word 727981941 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 299353^760 * 375649793 * 2^31 -.word 29464415 // zeta^260 * 2^31 = 299353^260 * 2^31 = 29589567 * 2^31 -.word 1124867745 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 299353^260 * 375649793 * 2^31 -.word 16834411 // zeta^132 * 2^31 = 299353^132 * 2^31 = 23825509 * 2^31 -.word 2414694037 // zeta^132 * f(q^(-1) mod 
2^32) * 2^31 = 299353^132 * 375649793 * 2^31 -.word 3336971 // zeta^580 * 2^31 = 299353^580 * 2^31 = 18571159 * 2^31 -.word 984580853 // zeta^580 * f(q^(-1) mod 2^32) * 2^31 = 299353^580 * 375649793 * 2^31 -.word 35987869 // zeta^452 * 2^31 = 299353^452 * 2^31 = 25099490 * 2^31 -.word 3520528483 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 299353^452 * 375649793 * 2^31 -.word 65924417 // zeta^ 36 * 2^31 = 299353^ 36 * 2^31 = 18723698 * 2^31 -.word 66231487 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31 -.word 56270029 // zeta^676 * 2^31 = 299353^676 * 2^31 = 26036127 * 2^31 -.word 2830198067 // zeta^676 * f(q^(-1) mod 2^32) * 2^31 = 299353^676 * 375649793 * 2^31 -.word 6393295 // zeta^356 * 2^31 = 299353^356 * 2^31 = 7994472 * 2^31 -.word 2689370161 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 299353^356 * 375649793 * 2^31 -.word 9671303 // zeta^228 * 2^31 = 299353^228 * 2^31 = 2138810 * 2^31 -.word 3966350201 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 299353^228 * 375649793 * 2^31 -.word 39920455 // zeta^532 * 2^31 = 299353^532 * 2^31 = 15308198 * 2^31 -.word 3879969465 // zeta^532 * f(q^(-1) mod 2^32) * 2^31 = 299353^532 * 375649793 * 2^31 -.word 12915337 // zeta^404 * 2^31 = 299353^404 * 2^31 = 8817795 * 2^31 -.word 2926593911 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 299353^404 * 375649793 * 2^31 -.word 28711753 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31 -.word 752271031 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31 -.word 49951167 // zeta^724 * 2^31 = 299353^724 * 2^31 = 25986735 * 2^31 -.word 3931849793 // zeta^724 * f(q^(-1) mod 2^32) * 2^31 = 299353^724 * 375649793 * 2^31 -.word 32477005 // zeta^308 * 2^31 = 299353^308 * 2^31 = 18367002 * 2^31 -.word 3247796915 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 299353^308 * 375649793 * 2^31 -.word 52956899 // zeta^180 * 2^31 = 299353^180 * 2^31 = 31254932 * 2^31 -.word 985714461 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 299353^180 * 375649793 * 2^31 -.word 
49240327 // zeta^628 * 2^31 = 299353^628 * 2^31 = 12632936 * 2^31 -.word 4123979001 // zeta^628 * f(q^(-1) mod 2^32) * 2^31 = 299353^628 * 375649793 * 2^31 -.word 4015583 // zeta^500 * 2^31 = 299353^500 * 2^31 = 19827515 * 2^31 -.word 4016140321 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 299353^500 * 375649793 * 2^31 -.word 4106643 // zeta^ 12 * 2^31 = 299353^ 12 * 2^31 = 32984098 * 2^31 -.word 2465535085 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31 -.word 3455577 // zeta^652 * 2^31 = 299353^652 * 2^31 = 6592748 * 2^31 -.word 2655960999 // zeta^652 * f(q^(-1) mod 2^32) * 2^31 = 299353^652 * 375649793 * 2^31 -.word 24352595 // zeta^332 * 2^31 = 299353^332 * 2^31 = 11703708 * 2^31 -.word 1141483181 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 299353^332 * 375649793 * 2^31 -.word 7734269 // zeta^204 * 2^31 = 299353^204 * 2^31 = 26691971 * 2^31 -.word 1323163139 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 299353^204 * 375649793 * 2^31 -.word 46722155 // zeta^556 * 2^31 = 299353^556 * 2^31 = 14873638 * 2^31 -.word 1252475285 // zeta^556 * f(q^(-1) mod 2^32) * 2^31 = 299353^556 * 375649793 * 2^31 -.word 40466367 // zeta^428 * 2^31 = 299353^428 * 2^31 = 5624346 * 2^31 -.word 3953655361 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 299353^428 * 375649793 * 2^31 -.word 22115915 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31 -.word 2248243125 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31 -.word 22497809 // zeta^748 * 2^31 = 299353^748 * 2^31 = 27817396 * 2^31 -.word 48848879 // zeta^748 * f(q^(-1) mod 2^32) * 2^31 = 299353^748 * 375649793 * 2^31 -.word 8747555 // zeta^284 * 2^31 = 299353^284 * 2^31 = 5130075 * 2^31 -.word 2123621341 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 299353^284 * 375649793 * 2^31 -.word 6917847 // zeta^156 * 2^31 = 299353^156 * 2^31 = 8247799 * 2^31 -.word 3643725609 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 299353^156 * 375649793 * 2^31 -.word 25107397 // zeta^604 * 2^31 = 299353^604 * 2^31 = 
6688514 * 2^31 -.word 782407227 // zeta^604 * f(q^(-1) mod 2^32) * 2^31 = 299353^604 * 375649793 * 2^31 -.word 179365 // zeta^476 * 2^31 = 299353^476 * 2^31 = 1602327 * 2^31 -.word 1021818203 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 299353^476 * 375649793 * 2^31 -.word 17007163 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31 -.word 822528965 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31 -.word 53305899 // zeta^700 * 2^31 = 299353^700 * 2^31 = 16127868 * 2^31 -.word 3084667861 // zeta^700 * f(q^(-1) mod 2^32) * 2^31 = 299353^700 * 375649793 * 2^31 -.word 42721649 // zeta^380 * 2^31 = 299353^380 * 2^31 = 9132318 * 2^31 -.word 2921236623 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 299353^380 * 375649793 * 2^31 -.word 8805149 // zeta^252 * 2^31 = 299353^252 * 2^31 = 8471290 * 2^31 -.word 699713251 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 299353^252 * 375649793 * 2^31 -.word 52313173 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 1275663787 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 41324663 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 4138866057 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 66623263 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31 -.word 4291993313 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31 -.word 62511589 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31 -.word 1934956571 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31 -.word 16376073 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 340556023 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 52533103 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 2607595153 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 41327925 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 2834920651 // zeta^352 * f(q^(-1) mod 
2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 20636765 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 425508259 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 2987659 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31 -.word 124245877 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31 -.word 60474045 // zeta^400 * 2^31 = 299353^400 * 2^31 = 9445248 * 2^31 -.word 3089932099 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 299353^400 * 375649793 * 2^31 -.word 55053513 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 2293484855 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 29386685 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31 -.word 3195599427 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31 -.word 24482561 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 333739775 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 13643979 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2680929589 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 46017955 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31 -.word 2505977949 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31 -.word 4393901 // zeta^496 * 2^31 = 299353^496 * 2^31 = 13108720 * 2^31 -.word 815642195 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 299353^496 * 375649793 * 2^31 -.word 24917347 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31 -.word 2079172765 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31 -.word 4420623 // zeta^648 * 2^31 = 299353^648 * 2^31 = 13128918 * 2^31 -.word 40444401 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 299353^648 * 375649793 * 2^31 -.word 7299983 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 731394673 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 62241627 
// zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31 -.word 336581285 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31 -.word 51969779 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31 -.word 2112913165 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31 -.word 56339667 // zeta^424 * 2^31 = 299353^424 * 2^31 = 23981562 * 2^31 -.word 3975713069 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 299353^424 * 375649793 * 2^31 -.word 5799109 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31 -.word 3381867835 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31 -.word 6631307 // zeta^744 * 2^31 = 299353^744 * 2^31 = 3271804 * 2^31 -.word 1865039477 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 299353^744 * 375649793 * 2^31 -.word 21540373 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 122692587 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 299353^280 * 375649793 * 2^31 -.word 60635111 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31 -.word 1884671513 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31 -.word 58683035 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31 -.word 1883676517 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31 -.word 66395225 // zeta^472 * 2^31 = 299353^472 * 2^31 = 3334573 * 2^31 -.word 3596770727 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 299353^472 * 375649793 * 2^31 -.word 63594429 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31 -.word 1075774019 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31 -.word 10743133 // zeta^696 * 2^31 = 299353^696 * 2^31 = 10953311 * 2^31 -.word 2957882531 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 299353^696 * 375649793 * 2^31 -.word 43959671 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 3566985353 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 27125763 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 
2^31 -.word 1179006461 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31 -.word 50279575 // zeta^516 * 2^31 = 299353^516 * 2^31 = 9731484 * 2^31 -.word 1880273257 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 299353^516 * 375649793 * 2^31 -.word 46186997 // zeta^388 * 2^31 = 299353^388 * 2^31 = 5764058 * 2^31 -.word 3005141003 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 299353^388 * 375649793 * 2^31 -.word 31126117 // zeta^ 68 * 2^31 = 299353^ 68 * 2^31 = 8457503 * 2^31 -.word 774438811 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 68 * 375649793 * 2^31 -.word 906095 // zeta^708 * 2^31 = 299353^708 * 2^31 = 27028662 * 2^31 -.word 1759019665 // zeta^708 * f(q^(-1) mod 2^32) * 2^31 = 299353^708 * 375649793 * 2^31 -.word 10843957 // zeta^292 * 2^31 = 299353^292 * 2^31 = 7520866 * 2^31 -.word 1464769227 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 299353^292 * 375649793 * 2^31 -.word 43211381 // zeta^164 * 2^31 = 299353^164 * 2^31 = 26244564 * 2^31 -.word 1531000715 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 299353^164 * 375649793 * 2^31 -.word 57442683 // zeta^612 * 2^31 = 299353^612 * 2^31 = 31418183 * 2^31 -.word 328617093 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 299353^612 * 375649793 * 2^31 -.word 30278985 // zeta^484 * 2^31 = 299353^484 * 2^31 = 5855662 * 2^31 -.word 3017987255 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 299353^484 * 375649793 * 2^31 -.word 54198649 // zeta^ 20 * 2^31 = 299353^ 20 * 2^31 = 24739198 * 2^31 -.word 1368373383 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 20 * 375649793 * 2^31 -.word 60562111 // zeta^660 * 2^31 = 299353^660 * 2^31 = 6490403 * 2^31 -.word 953375553 // zeta^660 * f(q^(-1) mod 2^32) * 2^31 = 299353^660 * 375649793 * 2^31 -.word 17162819 // zeta^340 * 2^31 = 299353^340 * 2^31 = 7570258 * 2^31 -.word 363117501 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 299353^340 * 375649793 * 2^31 -.word 12317579 // zeta^212 * 2^31 = 299353^212 * 2^31 = 21478846 * 2^31 -.word 1115388533 // zeta^212 * f(q^(-1) mod 2^32) 
* 2^31 = 299353^212 * 375649793 * 2^31 -.word 14157087 // zeta^564 * 2^31 = 299353^564 * 2^31 = 2302061 * 2^31 -.word 3309252833 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 299353^564 * 375649793 * 2^31 -.word 13077099 // zeta^436 * 2^31 = 299353^436 * 2^31 = 20669063 * 2^31 -.word 2262082453 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 299353^436 * 375649793 * 2^31 -.word 63098403 // zeta^116 * 2^31 = 299353^116 * 2^31 = 13729478 * 2^31 -.word 278826973 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 299353^116 * 375649793 * 2^31 -.word 11667751 // zeta^756 * 2^31 = 299353^756 * 2^31 = 26362414 * 2^31 -.word 107838681 // zeta^756 * f(q^(-1) mod 2^32) * 2^31 = 299353^756 * 375649793 * 2^31 -.word 63658409 // zeta^268 * 2^31 = 299353^268 * 2^31 = 26964245 * 2^31 -.word 1639006295 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 299353^268 * 375649793 * 2^31 -.word 34208059 // zeta^140 * 2^31 = 299353^140 * 2^31 = 26391350 * 2^31 -.word 4104541381 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 299353^140 * 375649793 * 2^31 -.word 59379717 // zeta^588 * 2^31 = 299353^588 * 2^31 = 6865022 * 2^31 -.word 2971804155 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 299353^588 * 375649793 * 2^31 -.word 50175319 // zeta^460 * 2^31 = 299353^460 * 2^31 = 18568730 * 2^31 -.word 4113287337 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 299353^460 * 375649793 * 2^31 -.word 26647619 // zeta^ 44 * 2^31 = 299353^ 44 * 2^31 = 27932647 * 2^31 -.word 341311933 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 44 * 375649793 * 2^31 -.word 39812781 // zeta^684 * 2^31 = 299353^684 * 2^31 = 9249292 * 2^31 -.word 1593787219 // zeta^684 * f(q^(-1) mod 2^32) * 2^31 = 299353^684 * 375649793 * 2^31 -.word 44616177 // zeta^364 * 2^31 = 299353^364 * 2^31 = 5739597 * 2^31 -.word 4246118415 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 299353^364 * 375649793 * 2^31 -.word 33175099 // zeta^236 * 2^31 = 299353^236 * 2^31 = 10003728 * 2^31 -.word 2199394245 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 299353^236 * 375649793 * 2^31 -.word 
60196139 // zeta^540 * 2^31 = 299353^540 * 2^31 = 25309194 * 2^31
-.word 651241685 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 299353^540 * 375649793 * 2^31
-.word 35386701 // zeta^412 * 2^31 = 299353^412 * 2^31 = 30439269 * 2^31
-.word 2774863027 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 299353^412 * 375649793 * 2^31
-.word 66934621 // zeta^ 92 * 2^31 = 299353^ 92 * 2^31 = 31954666 * 2^31
-.word 3273149091 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 92 * 375649793 * 2^31
-.word 58485025 // zeta^732 * 2^31 = 299353^732 * 2^31 = 5086187 * 2^31
-.word 4055556319 // zeta^732 * f(q^(-1) mod 2^32) * 2^31 = 299353^732 * 375649793 * 2^31
-.word 13808087 // zeta^316 * 2^31 = 299353^316 * 2^31 = 17429125 * 2^31
-.word 1210299433 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 299353^316 * 375649793 * 2^31
-.word 64372243 // zeta^188 * 2^31 = 299353^188 * 2^31 = 22872479 * 2^31
-.word 2032828397 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 299353^188 * 375649793 * 2^31
-.word 58308837 // zeta^636 * 2^31 = 299353^636 * 2^31 = 25085703 * 2^31
-.word 3595254043 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 299353^636 * 375649793 * 2^31
-.word 359507 // zeta^508 * 2^31 = 299353^508 * 2^31 = 661028 * 2^31
-.word 2221523373 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 299353^508 * 375649793 * 2^31
-.word 25789323 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 156101237 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 44545503 // zeta^384 * 2^31 = 299353^384 * 2^31 = 33556992 * 2^31
-.word 1431765025 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 299353^384 * 375649793 * 2^31
-.word 4602397 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31
-.word 2360010723 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31
-.word 37668667 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31
-.word 2357036741 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31
-.word 14580883 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 1687372141 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 64513949 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31
-.word 2027928163 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31
-.word 46477221 // zeta^608 * 2^31 = 299353^608 * 2^31 = 9045021 * 2^31
-.word 3869459035 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 299353^608 * 375649793 * 2^31
-.word 54248153 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31
-.word 2409412391 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31
-.word 6639941 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31
-.word 1205035195 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31
-.word 43184593 // zeta^656 * 2^31 = 299353^656 * 2^31 = 30845592 * 2^31
-.word 1329281071 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 299353^656 * 375649793 * 2^31
-.word 37727301 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 1099367867 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 59223821 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31
-.word 3392852723 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31
-.word 53470007 // zeta^560 * 2^31 = 299353^560 * 2^31 = 994165 * 2^31
-.word 1614037705 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 299353^560 * 375649793 * 2^31
-.word 44395575 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31
-.word 1947777481 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31
-.word 62720085 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31
-.word 3479325099 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31
-.word 8067061 // zeta^752 * 2^31 = 299353^752 * 2^31 = 403828 * 2^31
-.word 1690335755 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 299353^752 * 375649793 * 2^31
-.word 62693363 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 4254522893 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 54053717 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31
-.word 2038728363 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31
-.word 4872359 // zeta^584 * 2^31 = 299353^584 * 2^31 = 26445100 * 2^31
-.word 3958386009 // zeta^584 * f(q^(-1) mod 2^32) * 2^31 = 299353^584 * 375649793 * 2^31
-.word 45729335 // zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31
-.word 394813385 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 299353^456 * 375649793 * 2^31
-.word 10774319 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31
-.word 319254225 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31
-.word 29187105 // zeta^680 * 2^31 = 299353^680 * 2^31 = 5756199 * 2^31
-.word 2432167391 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 299353^680 * 375649793 * 2^31
-.word 60482679 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 2429927817 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 32724795 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31
-.word 1516828357 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31
-.word 6478875 // zeta^536 * 2^31 = 299353^536 * 2^31 = 135177 * 2^31
-.word 2410295781 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 299353^536 * 375649793 * 2^31
-.word 61576241 // zeta^408 * 2^31 = 299353^408 * 2^31 = 12267508 * 2^31
-.word 2532988367 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 299353^408 * 375649793 * 2^31
-.word 718761 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31
-.word 698196567 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31
-.word 25844803 // zeta^728 * 2^31 = 299353^728 * 2^31 = 6580323 * 2^31
-.word 2581873085 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 299353^728 * 375649793 * 2^31
-.word 56370853 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1337084763 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 19294303 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31
-.word 2412858785 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31
-.word 39988223 // zeta^632 * 2^31 = 299353^632 * 2^31 = 21146062 * 2^31
-.word 3115960833 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 299353^632 * 375649793 * 2^31
-.word 50390901 // zeta^504 * 2^31 = 299353^504 * 2^31 = 17352831 * 2^31
-.word 2387978891 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 299353^504 * 375649793 * 2^31
-.word 20926989 // zeta^ 4 * 2^31 = 299353^ 4 * 2^31 = 27792935 * 2^31
-.word 1289826291 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 4 * 375649793 * 2^31
-.word 37649571 // zeta^644 * 2^31 = 299353^644 * 2^31 = 3967426 * 2^31
-.word 3170099549 // zeta^644 * f(q^(-1) mod 2^32) * 2^31 = 299353^644 * 375649793 * 2^31
-.word 66207891 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31
-.word 2535947629 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31
-.word 63777015 // zeta^196 * 2^31 = 299353^196 * 2^31 = 14985834 * 2^31
-.word 3310386441 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 299353^196 * 375649793 * 2^31
-.word 23902605 // zeta^548 * 2^31 = 299353^548 * 2^31 = 7312429 * 2^31
-.word 2763966579 // zeta^548 * f(q^(-1) mod 2^32) * 2^31 = 299353^548 * 375649793 * 2^31
-.word 1189569 // zeta^420 * 2^31 = 299353^420 * 2^31 = 14833295 * 2^31
-.word 4228735807 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 299353^420 * 375649793 * 2^31
-.word 36835001 // zeta^100 * 2^31 = 299353^100 * 2^31 = 27701331 * 2^31
-.word 1276980039 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 299353^100 * 375649793 * 2^31
-.word 60720691 // zeta^740 * 2^31 = 299353^740 * 2^31 = 25562521 * 2^31
-.word 1605597133 // zeta^740 * f(q^(-1) mod 2^32) * 2^31 = 299353^740 * 375649793 * 2^31
-.word 6551875 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31
-.word 3341591741 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31
-.word 27193531 // zeta^148 * 2^31 = 299353^148 * 2^31 = 18248795 * 2^31
-.word 414997829 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 299353^148 * 375649793 * 2^31
-.word 54796407 // zeta^596 * 2^31 = 299353^596 * 2^31 = 12078147 * 2^31
-.word 3179578761 // zeta^596 * f(q^(-1) mod 2^32) * 2^31 = 299353^596 * 375649793 * 2^31
-.word 38402233 // zeta^468 * 2^31 = 299353^468 * 2^31 = 19648405 * 2^31
-.word 3542696263 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 299353^468 * 375649793 * 2^31
-.word 54036887 // zeta^ 52 * 2^31 = 299353^ 52 * 2^31 = 12887930 * 2^31
-.word 2032884841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 52 * 375649793 * 2^31
-.word 34636981 // zeta^692 * 2^31 = 299353^692 * 2^31 = 15189991 * 2^31
-.word 1047170379 // zeta^692 * f(q^(-1) mod 2^32) * 2^31 = 299353^692 * 375649793 * 2^31
-.word 55446235 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31
-.word 4187128613 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31
-.word 17873659 // zeta^244 * 2^31 = 299353^244 * 2^31 = 20924057 * 2^31
-.word 170988293 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 299353^244 * 375649793 * 2^31
-.word 32905927 // zeta^524 * 2^31 = 299353^524 * 2^31 = 7165643 * 2^31
-.word 190425913 // zeta^524 * f(q^(-1) mod 2^32) * 2^31 = 299353^524 * 375649793 * 2^31
-.word 63007343 // zeta^396 * 2^31 = 299353^396 * 2^31 = 572895 * 2^31
-.word 1829432209 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 299353^396 * 375649793 * 2^31
-.word 16938667 // zeta^ 76 * 2^31 = 299353^ 76 * 2^31 = 14988263 * 2^31
-.word 181679957 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 76 * 375649793 * 2^31
-.word 42761391 // zeta^716 * 2^31 = 299353^716 * 2^31 = 21853285 * 2^31
-.word 3153484113 // zeta^716 * f(q^(-1) mod 2^32) * 2^31 = 299353^716 * 375649793 * 2^31
-.word 27301205 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31
-.word 2701180075 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31
-.word 20391831 // zeta^172 * 2^31 = 299353^172 * 2^31 = 18683355 * 2^31
-.word 3042492009 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 299353^172 * 375649793 * 2^31
-.word 33938887 // zeta^620 * 2^31 = 299353^620 * 2^31 = 23553265 * 2^31
-.word 2095573049 // zeta^620 * f(q^(-1) mod 2^32) * 2^31 = 299353^620 * 375649793 * 2^31
-.word 44998071 // zeta^492 * 2^31 = 299353^492 * 2^31 = 29292862 * 2^31
-.word 2046724169 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 299353^492 * 375649793 * 2^31
-.word 31727285 // zeta^ 28 * 2^31 = 299353^ 28 * 2^31 = 3117724 * 2^31
-.word 1520104267 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 28 * 375649793 * 2^31
-.word 58366431 // zeta^668 * 2^31 = 299353^668 * 2^31 = 28426918 * 2^31
-.word 2171345953 // zeta^668 * f(q^(-1) mod 2^32) * 2^31 = 299353^668 * 375649793 * 2^31
-.word 8628961 // zeta^348 * 2^31 = 299353^348 * 2^31 = 28470806 * 2^31
-.word 239410975 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31
-.word 42006589 // zeta^220 * 2^31 = 299353^220 * 2^31 = 26868479 * 2^31
-.word 3512560067 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 299353^220 * 375649793 * 2^31
-.word 2741743 // zeta^572 * 2^31 = 299353^572 * 2^31 = 10684514 * 2^31
-.word 2262138897 // zeta^572 * f(q^(-1) mod 2^32) * 2^31 = 299353^572 * 375649793 * 2^31
-.word 50106823 // zeta^444 * 2^31 = 299353^444 * 2^31 = 28113639 * 2^31
-.word 3472438329 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 299353^444 * 375649793 * 2^31
-.word 66754479 // zeta^124 * 2^31 = 299353^124 * 2^31 = 32895965 * 2^31
-.word 2073443921 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 299353^124 * 375649793 * 2^31
-.word 24392337 // zeta^764 * 2^31 = 299353^764 * 2^31 = 24424675 * 2^31
-.word 1373730671 // zeta^764 * f(q^(-1) mod 2^32) * 2^31 = 299353^764 * 375649793 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_768_u32_33556993_299353_incomplete_good_scale
-ntt_768_u32_33556993_299353_incomplete_good_scale: // Constants for scaling by 1/N
-.word 22568483 // 1/192
-.word 2863202269 // 1/192 twisted
-.data
-roots:
-.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 48811299 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31
-.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31
-.word 40500013 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31
-.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 35394733 // zeta^516 * 2^31 = 299353^516 * 2^31 = 9731484 * 2^31
-.word 622767443 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 299353^516 * 375649793 * 2^31
-.word 12271567 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31
-.word 2565264945 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31
-.word 65797823 // zeta^ 36 * 2^31 = 299353^ 36 * 2^31 = 18723698 * 2^31
-.word 1198225217 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31
-.word 56722355 // zeta^612 * 2^31 = 299353^612 * 2^31 = 31418183 * 2^31
-.word 4158093901 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 299353^612 * 375649793 * 2^31
-.word 48811299 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31
-.word 12778219 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31
-.word 1732129557 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31
-.word 21111903 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31
-.word 890081697 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31
-.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 31380141 // zeta^564 * 2^31 = 299353^564 * 2^31 = 2302061 * 2^31
-.word 2294804307 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 299353^564 * 375649793 * 2^31
-.word 6014597 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31
-.word 2607901563 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31
-.word 40872659 // zeta^ 12 * 2^31 = 299353^ 12 * 2^31 = 32984098 * 2^31
-.word 2110821165 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31
-.word 62080381 // zeta^588 * 2^31 = 299353^588 * 2^31 = 6865022 * 2^31
-.word 439327875 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 299353^588 * 375649793 * 2^31
-.word 40500013 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31
-.word 58797193 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31
-.word 3703057783 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31
-.word 50479773 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31
-.word 2420367203 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31
-.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 3784291 // zeta^540 * 2^31 = 299353^540 * 2^31 = 25309194 * 2^31
-.word 1619664797 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 299353^540 * 375649793 * 2^31
-.word 57130935 // zeta^348 * 2^31 = 299353^348 * 2^31 = 28470806 * 2^31
-.word 1821992521 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31
-.word 59392861 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31
-.word 348348067 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31
-.word 57730785 // zeta^636 * 2^31 = 299353^636 * 2^31 = 25085703 * 2^31
-.word 3752846111 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 299353^636 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_768_u32_33556993_299353_incomplete_good, %function
-.global ntt_768_u32_33556993_299353_incomplete_good
-ntt_768_u32_33556993_299353_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-.equ modulus, 33556993
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[512]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 8)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r8
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vmul.u32 Q4, Q2, r7
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r10
-vsub.s32 Q4, Q0, Q1
-// Release input[512] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r12,#(32)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vmul.u32 Q3, Q0, r7
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmlah.s32 Q2, Q3, r10
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[520]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 16)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[516] from Q1
-vstrw.u32 Q4, [r12,#(48)]
-// input[520]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[520] from Q5
-vmul.u32 Q2, Q0, r7
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[524]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[524]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r7
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[524] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vmul.u32 Q2, Q0, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[532]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[528] from Q4
-vstrw.u32 Q3, [r12,#(96)]
-// input[532]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[532] from Q5
-vmul.u32 Q2, Q0, r7
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[536]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[536]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r7
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[536] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vmul.u32 Q2, Q0, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[544]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[540] from Q4
-vstrw.u32 Q3, [r12,#(144)]
-// input[544]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[544] from Q5
-vmul.u32 Q2, Q0, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[548]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[548]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r7
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[548] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vmul.u32 Q2, Q0, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[556]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[552] from Q4
-vstrw.u32 Q3, [r12,#(192)]
-// input[556]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[556] from Q5
-vmul.u32 Q2, Q0, r7
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[560]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[560]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vmul.u32 Q2, Q0, r7
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[560] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r7
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[568]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[564] from Q4
-vstrw.u32 Q3, [r12,#(240)]
-// input[568]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[568] from Q5
-vmul.u32 Q2, Q0, r7
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[572]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[572]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vmul.u32 Q2, Q0, r7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[572] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vmul.u32 Q2, Q0, r7
-// input[576]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[580]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[576] from Q4
-vstrw.u32 Q3, [r12,#(288)]
-// input[580]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[580] from Q5
-vmul.u32 Q2, Q0, r7
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[584]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[584]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vmul.u32 Q2, Q0, r7
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[584] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r7
-// input[588]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[592]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[588] from Q4
-vstrw.u32 Q3, [r12,#(336)]
-// input[592]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[592] from Q5
-vmul.u32 Q2, Q0, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[596]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[596]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r7
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[596] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[604]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[600] from Q4
-vstrw.u32 Q3, [r12,#(384)]
-// input[604]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[604] from Q5
-vmul.u32 Q2, Q0, r7
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[608]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[608]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vmul.u32 Q2, Q0, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[608] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r7
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[616]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[612] from Q4
-vstrw.u32 Q3, [r12,#(432)]
-// input[616]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[616] from Q5
-vmul.u32 Q2, Q0, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[620]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[620]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r7
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[620] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r7
-// input[624]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[628]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[624] from Q4
-vstrw.u32 Q3, [r12,#(480)]
-// input[628]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[628] from Q5
-vmul.u32 Q2, Q0, r7
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[632]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[376]: Already loaded as Q5 -// input[632]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r7 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[632] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r7 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -// input[640]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[128]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[636] from Q4 -vstrw.u32 Q3, [r11,#(-480)] -// input[640]: Already loaded as Q5 -// input[128]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[640] from Q5 -vmul.u32 Q2, Q0, r7 -// input[384]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -120)] 
-vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[128] from Q7 -// input[388]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[644]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[384] from Q4 -vstrw.u32 Q3, [r12,#(-480)] -// input[388]: Already loaded as Q5 -// input[644]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[388] from Q5 -vmul.u32 Q2, Q0, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[644] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[392]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[392]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r7 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[392] from Q7 -// input[652]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[648] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[652]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[652] from Q5 -vmul.u32 Q2, Q0, r7 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 
-vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[400]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[656]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[396] from Q4 -vstrw.u32 Q3, [r12,#(-432)] -// input[400]: Already loaded as Q5 -// input[656]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[400] from Q5 -vmul.u32 Q2, Q0, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[656] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[404]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[404]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[404] from Q7 -// input[664]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[660] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[664]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[664] from Q5 -vmul.u32 Q2, Q0, r7 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release 
input[152] from Q7 -// input[412]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[668]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[408] from Q4 -vstrw.u32 Q3, [r12,#(-384)] -// input[412]: Already loaded as Q5 -// input[668]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[412] from Q5 -vmul.u32 Q2, Q0, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[668] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[416]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[416]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r7 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[416] from Q7 -// input[676]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[672] from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[676]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[676] from Q5 -vmul.u32 Q2, Q0, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[424]: Load 
as Q5 -vldrw.u32 Q5, [r12, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[680]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[420] from Q4 -vstrw.u32 Q3, [r12,#(-336)] -// input[424]: Already loaded as Q5 -// input[680]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[424] from Q5 -vmul.u32 Q2, Q0, r7 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[680] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[428]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[428]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[428] from Q7 -// input[688]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[684] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[688]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[688] from Q5 -vmul.u32 Q2, Q0, r7 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[436]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -68)] 
-vadd.s32 Q6, Q2, Q1 -// input[692]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[432] from Q4 -vstrw.u32 Q3, [r12,#(-288)] -// input[436]: Already loaded as Q5 -// input[692]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[436] from Q5 -vmul.u32 Q2, Q0, r7 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[692] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[440]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[440]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r7 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[440] from Q7 -// input[700]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[696] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[700]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[700] from Q5 -vmul.u32 Q2, Q0, r7 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[448]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[704]: 
Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[444] from Q4 -vstrw.u32 Q3, [r12,#(-240)] -// input[448]: Already loaded as Q5 -// input[704]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[448] from Q5 -vmul.u32 Q2, Q0, r7 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[704] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[452]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[452]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r7 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[452] from Q7 -// input[712]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[708] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[712]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[712] from Q5 -vmul.u32 Q2, Q0, r7 -// input[456]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[460]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[716]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 
-40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[456] from Q4 -vstrw.u32 Q3, [r12,#(-192)] -// input[460]: Already loaded as Q5 -// input[716]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[460] from Q5 -vmul.u32 Q2, Q0, r7 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[716] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[464]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[464]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r7 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[464] from Q7 -// input[724]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[720] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[724]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[724] from Q5 -vmul.u32 Q2, Q0, r7 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[472]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[728]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 
Q6, [r11,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[468] from Q4 -vstrw.u32 Q3, [r12,#(-144)] -// input[472]: Already loaded as Q5 -// input[728]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[472] from Q5 -vmul.u32 Q2, Q0, r7 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[728] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[476]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[476]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r7 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[476] from Q7 -// input[736]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[732] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// input[736]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[736] from Q5 -vmul.u32 Q2, Q0, r7 -// input[480]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[484]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[740]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-80)] -vsub.s32 Q2, Q2, Q0 
-vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[480] from Q4 -vstrw.u32 Q3, [r12,#(-96)] -// input[484]: Already loaded as Q5 -// input[740]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[484] from Q5 -vmul.u32 Q2, Q0, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[740] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[488]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[488]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r7 -// input[744]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[488] from Q7 -// input[748]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[744] from Q4 -vstrw.u32 Q3, [r11,#(-48)] -// input[748]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[748] from Q5 -vmul.u32 Q2, Q0, r7 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[496]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[752]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 
-// Release input[492] from Q4 -vstrw.u32 Q3, [r12,#(-48)] -// input[496]: Already loaded as Q5 -// input[752]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[496] from Q5 -vmul.u32 Q2, Q0, r7 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[752] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[500]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[500]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r7 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[500] from Q7 -// input[760]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[756] from Q4 -vstrw.u32 Q3, [r11,#(0)] -// input[760]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[760] from Q5 -vmul.u32 Q2, Q0, r7 -// input[504]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[508]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[764]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[504] from Q4 -vstrw.u32 Q3, [r12,#(0)] -// 
input[508]: Already loaded as Q5 -// input[764]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[508] from Q5 -vmul.u32 Q2, Q0, r7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[764] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r12,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r4 -// input[384]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[192] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[708]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[384] from Q4 -vmul.u32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// input[324]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r12,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[576] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r14,#(-240)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r12,#(-480)] -// input[324]: Already loaded as Q7 -// input[708]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r4 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vadd.s32 Q7, Q7, Q6 -// Release input[708] from Q6 -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[456]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -48)] -vadd.s32 Q3, Q3, Q2 -// Release input[132] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(288)] 
-vadd.s32 Q3, Q3, Q7 -// Release input[324] from Q7 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[72]: Already loaded as Q6 -// input[456]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[648]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vadd.s32 Q6, Q6, Q5 -// Release input[456] from Q5 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[204]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -48)] -vadd.s32 Q3, Q3, Q2 -// Release input[648] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[588]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q6 -// Release input[72] from Q6 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-432)] -// input[588]: Already loaded as Q7 -// input[204]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[396]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -108)] -vadd.s32 Q7, Q7, Q5 -// Release input[204] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[720]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -36)] -vadd.s32 Q3, Q3, Q2 -// Release input[396] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[336]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[588] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-432)] -// input[336]: Already loaded as Q6 -// input[720]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vadd.s32 Q6, Q6, Q5 -// Release 
input[720] from Q5
-// input[528]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[468]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -36)]
-vadd.s32 Q3, Q3, Q2
-// Release input[144] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[336] from Q6
-vstrw.u32 Q3, [r12,#(96)]
-// Release input[528] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-144)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-432)]
-// input[84]: Already loaded as Q7
-// input[468]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[660]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[468] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[216]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -36)]
-vadd.s32 Q3, Q3, Q2
-// Release input[660] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[600]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(336)]
-vadd.s32 Q3, Q3, Q7
-// Release input[84] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-144)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-384)]
-// input[600]: Already loaded as Q6
-// input[216]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[408]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -96)]
-vadd.s32 Q6, Q6, Q5
-// Release input[216] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[732]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[408] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[348]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(384)]
-vadd.s32 Q3, Q3, Q6
-// Release input[600] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-144)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-384)]
-// input[348]: Already loaded as Q7
-// input[732]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[732] from Q5
-// input[540]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[480]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[156] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[96]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(384)]
-vadd.s32 Q3, Q3, Q7
-// Release input[348] from Q7
-vstrw.u32 Q3, [r12,#(144)]
-// Release input[540] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-384)]
-// input[96]: Already loaded as Q6
-// input[480]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[672]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[480] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[228]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[672] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[612]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q3, Q3, Q6
-// Release input[96] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-336)]
-// input[612]: Already loaded as Q7
-// input[228]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[228] from Q5
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[744]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[420] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[360]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[612] from Q7
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-336)]
-// input[360]: Already loaded as Q6
-// input[744]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[744] from Q5
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[492]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[168] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[108]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q6
-// Release input[360] from Q6
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-336)]
-// input[108]: Already loaded as Q7
-// input[492]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[684]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -72)]
-vadd.s32 Q7, Q7, Q5
-// Release input[492] from Q5
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[684] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[624]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[108] from Q7
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-288)]
-// input[624]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[756]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[432] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[372]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(480)]
-vadd.s32 Q3, Q3, Q6
-// Release input[624] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-288)]
-// input[372]: Already loaded as Q7
-// input[756]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q7, Q7, Q5
-// Release input[756] from Q5
-// input[564]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 60)]
-vsub.s32 Q4, Q3, Q2
-// input[504]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[120]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[372] from Q7
-vstrw.u32 Q3, [r12,#(240)]
-// Release input[564] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q6
-// input[504]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[696]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vadd.s32 Q6, Q6, Q5
-// Release input[504] from Q5
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[696] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[636]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q6
-// Release input[120] from Q6
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-240)]
-// input[636]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[444]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -60)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vsub.s32 Q4, Q3, Q2
-// input[448]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -56)]
-vadd.s32 Q3, Q3, Q2
-// Release input[444] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[636] from Q7
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-240)]
-// input[64]: Already loaded as Q6
-// input[448]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vadd.s32 Q6, Q6, Q5
-// Release input[448] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q3, Q3, Q2
-// Release input[640] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[580]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(256)]
-vadd.s32 Q3, Q3, Q6
-// Release input[64] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-464)]
-// input[580]: Already loaded as Q7
-// input[196]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[196] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[712]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[388] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[328]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[580] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-464)]
-// input[328]: Already loaded as Q6
-// input[712]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vadd.s32 Q6, Q6, Q5
-// Release input[712] from Q5
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[460]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[136] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[328] from Q6
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-464)]
-// input[76]: Already loaded as Q7
-// input[460]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[652]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[460] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[652] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[592]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[76] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-416)]
-// input[592]: Already loaded as Q6
-// input[208]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[400]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[208] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[724]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[400] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[340]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[592] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-416)]
-// input[340]: Already loaded as Q7
-// input[724]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[724] from Q5
-// input[532]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[340] from Q7
-vstrw.u32 Q3, [r12,#(112)]
-// Release input[532] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q6
-// input[472]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[664]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[472] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[664] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[604]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[88] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-368)]
-// input[604]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[736]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[412] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[352]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(400)]
-vadd.s32 Q3, Q3, Q7
-// Release input[604] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-368)]
-// input[352]: Already loaded as Q6
-// input[736]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[736] from Q5
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vsub.s32 Q4, Q3, Q2
-// input[484]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[160] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[100]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[352] from Q6
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-368)]
-// input[100]: Already loaded as Q7
-// input[484]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[676]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[484] from Q5
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[676] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[616]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(400)]
-vadd.s32 Q3, Q3, Q7
-// Release input[100] from Q7
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-320)]
-// input[616]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[424]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vsub.s32 Q4, Q3, Q2
-// input[748]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[424] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[364]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(448)]
-vadd.s32 Q3, Q3, Q6
-// Release input[616] from Q6
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-320)]
-// input[364]: Already loaded as Q7
-// input[748]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[748] from Q5
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[496]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[172] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[112]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(448)]
-vadd.s32 Q3, Q3, Q7
-// Release input[364] from Q7
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-320)]
-// input[112]: Already loaded as Q6
-// input[496]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[688]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[496] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[688] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[628]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(448)]
-vadd.s32 Q3, Q3, Q6
-// Release input[112] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-272)]
-// input[628]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[760]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[436] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[376]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[628] from Q7
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-272)]
-// input[376]: Already loaded as Q6
-// input[760]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[760] from Q5
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[508]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[184] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[124]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q6
-// Release input[376] from Q6
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-272)]
-// input[124]: Already loaded as Q7
-// input[508]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[700]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[508] from Q5
-// input[316]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[704]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -52)]
-vadd.s32 Q3, Q3, Q2
-// Release input[700] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[320]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 68)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[124] from Q7
-vstrw.u32 Q3, [r14,#(256)]
-// Release input[316] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// input[320]: Already loaded as Q6
-// input[704]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vadd.s32 Q6, Q6, Q5
-// Release input[704] from Q5
-// input[512]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[452]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -52)]
-vadd.s32 Q3, Q3, Q2
-// Release input[128] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[320] from Q6
-vstrw.u32 Q3, [r12,#(32)]
-// Release input[512] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-496)]
-// input[68]: Already loaded as Q7
-// input[452]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[644]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -112)]
-vadd.s32 Q7, Q7, Q5
-// Release input[452] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[200]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -52)]
-vadd.s32 Q3, Q3, Q2
-// Release input[644] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[584]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q7
-// Release input[68] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-448)]
-// input[584]: Already loaded as Q6
-// input[200]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[392]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -112)]
-vadd.s32 Q6, Q6, Q5
-// Release input[200] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[716]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[392] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release input[584] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-448)]
-// input[332]: Already loaded as Q7
-// input[716]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vadd.s32 Q7, Q7, Q5
-// Release input[716] from Q5
-// input[524]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[464]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[140] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[332] from Q7
-vstrw.u32 Q3, [r12,#(80)]
-// Release input[524] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-448)]
-// input[80]: Already loaded as Q6
-// input[464]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[656]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vadd.s32 Q6, Q6, Q5
-// Release input[464] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[212]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[656] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[596]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release input[80] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-400)]
-// input[596]: Already loaded as Q7
-// input[212]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[212] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[728]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[404] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[344]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[596] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-400)]
-// input[344]: Already loaded as Q6
-// input[728]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vadd.s32 Q6, Q6, Q5
-// Release input[728] from Q5
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[476]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[152] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[344] from Q6
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-400)]
-// input[92]: Already loaded as Q7
-// input[476]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[668]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[476] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[668] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[608]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[92] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-352)]
-// input[608]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[416]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[740]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[416] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(416)]
-vadd.s32 Q3, Q3, Q6
-// Release input[608] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-352)]
-// input[356]: Already loaded as Q7
-// input[740]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[740] from Q5
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vsub.s32 Q4, Q3, Q2
-// input[488]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[104]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[356] from Q7
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q6
-// input[488]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[680]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[488] from Q5
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[680] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[620]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q6
-// Release input[104] from Q6
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-304)]
-// input[620]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[428]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vsub.s32 Q4, Q3, Q2
-// input[752]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[428] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[368]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(464)]
-vadd.s32 Q3, Q3, Q7
-// Release input[620] from Q7
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-304)]
-// input[368]: Already loaded as Q6
-// input[752]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[752] from Q5
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[500]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[176] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[368] from Q6
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-304)]
-// input[116]: Already loaded as Q7
-// input[500]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -64)]
-vadd.s32 Q7, Q7, Q5
-// Release input[500] from Q5
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[692] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[632]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q7
-// Release input[116] from Q7
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-256)]
-// input[632]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[764]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[440] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-496)]
-vadd.s32 Q3, Q3, Q6
-// Release input[632] from Q6
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-256)]
-// input[380]: Already loaded as Q7
-// input[764]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vadd.s32 Q7, Q7, Q5
-// Release input[764] from Q5
-// input[572]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[188] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[528]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[380] from Q7
-vstrw.u32 Q3, [r12,#(272)]
-// Release input[572] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-256)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r8
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vmul.u32 Q5, Q5, r7
-// input[528]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r10
-vqrdmulh.s32 Q2, Q1, r8
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r10
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r4
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r10
-// input[564]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 60)]
-vqrdmulh.s32 Q4, Q6, r6
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r5
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r10
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r12,#(96)]
-// Release input[528] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[564]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[312]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(240)]
-// Release input[564] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[312]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q0, Q0, r7
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(48)]
-// Release input[516] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(240)]
-// Release input[312] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q1, Q1, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[304]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q0, Q0, r7
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[568]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release
input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[568]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[280]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[520]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 16)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(256)] -// Release input[568] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(112)] -// Release input[280] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[316]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vmul.u32 Q2, Q2, r7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(64)] -// Release input[520] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[560]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(208)] -// 
Release input[556] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[560]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[272]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(224)] -// Release input[560] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(80)] -// Release input[272] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[308]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vmul.u32 Q1, Q1, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[260]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// 
Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q2, Q2, r7 -// input[536]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(32)] -// Release input[260] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(128)] -// Release input[536] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[572]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(272)] -// Release input[572] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, 
[r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[624]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q1, Q1, r7 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(288)] -// Release input[576] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q0 
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q0, Q0, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[636]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(480)]
-// Release input[120] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(384)]
-// Release input[600] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[636]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q1, Q1, r7
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[588]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 84)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[112]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-480)]
-// Release input[636] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(384)]
-// Release input[348] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[112]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vmul.u32 Q2, Q2, r7
-// input[592]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 88)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(336)]
-// Release input[588] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[628]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(448)]
-// Release input[112] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(352)]
-// Release input[592] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[628]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q0, Q0, r7
-// input[340]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[580]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 76)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(496)]
-// Release input[628] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(352)]
-// Release input[340] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q1, Q1, r7
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(304)]
-// Release input[580] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q2, Q2, r7
-// input[604]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 100)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(400)]
-// Release input[604] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q0, Q0, r7
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[116]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q1, Q1, r7
-// input[596]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 92)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[68]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[632]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(464)]
-// Release input[116] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(368)]
-// Release input[596] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[632]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q2, Q2, r7
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(272)]
-// Release input[68] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[584]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-496)]
-// Release input[632] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q0, Q0, r7
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(320)]
-// Release input[584] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[432]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vmul.u32 Q1, Q1, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[420]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -84)]
-vmul.u32 Q2, Q2, r7
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(-480)]
-// Release input[384] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[696]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-336)]
-// Release input[420] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[696]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q0, Q0, r7
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[648]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[444]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-240)]
-// Release input[696] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[444]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vmul.u32 Q1, Q1, r7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r11,#(-432)]
-// Release input[648] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[396]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[688]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-240)]
-// Release input[444] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[688]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q2, Q2, r7
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(-432)]
-// Release input[396] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[640]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[436]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-272)]
-// Release input[688] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[436]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vmul.u32 Q0, Q0, r7
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r11,#(-464)]
-// Release input[640] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-272)]
-// Release input[436] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[424]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -80)]
-vmul.u32 Q1, Q1, r7
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(-464)]
-// Release input[388] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[136]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[700]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-320)]
-// Release input[424] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[700]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q2, Q2, r7
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(-464)]
-// Release input[136] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[652]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -104)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-224)]
-// Release input[700] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[176]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vmul.u32 Q0, Q0, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r11,#(-416)]
-// Release input[652] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[692]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[692]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r7
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[644]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-256)]
-// Release input[692] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vmul.u32 Q2, Q2, r7
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r11,#(-448)]
-// Release input[644] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[392]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -76)]
-vmul.u32 Q0, Q0, r7
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(-448)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-304)]
-// Release input[428] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[480]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -24)]
-vmul.u32 Q1, Q1, r7
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[756]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-96)]
-// Release input[480] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-144)]
-// Release input[720] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[756]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q2, Q2, r7
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[708]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[504]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(0)]
-// Release input[756] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-144)]
-// Release input[468] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[504]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vmul.u32 Q0, Q0, r7
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r11,#(-192)]
-// Release input[708] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(0)]
-// Release input[504] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -12)]
-vmul.u32 Q1, Q1, r7
-// input[732]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -24)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-48)]
-// Release input[492] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-96)]
-// Release input[732] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vmul.u32 Q2, Q2, r7
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[244]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4,
r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q0, Q0, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[712]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -44)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, 
Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmul.u32 Q2, Q2, r7 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-176)] -// Release input[712] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[752]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-16)] -// Release input[752] from 
Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-160)] -// Release input[464] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[740]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmul.u32 Q1, Q1, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[452]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-64)] -// Release input[740] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q2, Q2, r7 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] 
-// Release input[488] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[476]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -28)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[716]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-112)] -// Release input[476] from Q4 -vadd.s32 Q2, Q2, Q6 -vstrw.u32 Q2, [r11,#(-160)] -// Release input[716] from Q2 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q1, Q1, r7 -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vqrdmlah.s32 Q0, Q1, r10 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q1, Q3, Q0 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q4, Q2, r10 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q1, r10 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q3, r10 -vstrw.u32 Q2, [r14,#(48)] 
-// Release input[264] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q1, Q4, r8 -// input[520]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)] -vmul.u32 Q4, Q4, r7 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q4, r10 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q2, r8 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q0, Q2, r10 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q2, Q1, Q0 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q4, r10 -// input[524]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 20)] -vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vqrdmlah.s32 Q6, Q3, r10 -vstrw.u32 Q2, [r12,#(64)] -// Release input[520] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[524]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r7 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[540]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(80)] -// Release input[524] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(32)] -// Release 
input[260] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[540]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r7 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(144)] -// Release input[540] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q2, Q2, r7 -// input[532]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(96)] -// Release input[528] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] 
from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[300]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vmul.u32 Q1, Q1, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 
-vadd.s32 Q0, Q0, Q6 -// input[556]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r7 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[544]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(160)] -// Release input[544] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, 
[r9], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[312]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 60)] -vmul.u32 Q1, Q1, r7 -// input[564]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 60)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(240)] -// Release input[312] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[316]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vmul.u32 Q2, Q2, r7 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[572]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// 
input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r7 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[588]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(272)] -// Release input[572] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[588]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q1, Q1, r7 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(336)] -// Release input[588] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// 
input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmul.u32 Q2, Q2, r7 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(288)] -// Release input[576] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[332]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[332]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r7 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[348]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(320)] -// Release input[332] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[348]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// 
input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vmul.u32 Q1, Q1, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[336]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[604]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(384)] -// Release input[348] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[604]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r7 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(336)] -// Release input[336] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[592]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(400)] -// Release input[604] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q0, Q0, r7 -// 
input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(352)] -// Release input[592] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q1, Q1, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[364]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[364]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q2, Q2, r7 -// 
input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[620]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(448)] -// Release input[364] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[620]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r7 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[608]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[636]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(464)] -// Release input[620] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[636]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r7 
-// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(416)] -// Release input[608] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[624]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-480)] -// Release input[636] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[376]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 124)] -vmul.u32 Q2, Q2, r7 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(480)] -// Release input[624] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(496)] -// Release input[376] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[632]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vmul.u32 Q0, Q0, r7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] 
-vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[396]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -108)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[396]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q1, Q1, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[384]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[652]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-432)] -// Release input[396] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[652]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r7 -// input[388]: Load as Q4 -vldrw.u32 
Q4, [r12, #(4 * -116)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-480)] -// Release input[384] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[640]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-464)] -// Release input[388] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q0, Q0, r7 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-464)] -// Release input[640] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q1, Q1, r7 -// 
input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[412]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[412]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q2, Q2, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[400]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[668]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-368)] -// Release input[412] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[668]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r7 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 
-100)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-416)] -// Release input[400] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[656]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[684]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-352)] -// Release input[668] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[684]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-400)] -// Release input[656] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[672]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-288)] -// Release input[684] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q2, Q2, r7 -// input[676]: Load as Q4 
-vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-336)] -// Release input[672] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[428]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[428]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[416]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -88)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[444]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-304)] -// Release input[428] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[444]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q1, Q1, r7 -// 
input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(-352)] -// Release input[416] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[432]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -72)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-240)] -// Release input[444] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r7 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-288)] -// Release input[432] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[688]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 
-64)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-272)] -// Release input[688] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[456]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -48)] -vmul.u32 Q1, Q1, r7 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[460]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[460]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q2, Q2, r7 -// input[196]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[448]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[716]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[716]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q0, Q0, r7 -// input[452]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -52)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[732]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-160)] -// Release input[716] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-208)] -// Release input[452] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[732]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q1, Q1, r7 -// 
input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[720]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-96)] -// Release input[732] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[472]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -32)] -vmul.u32 Q2, Q2, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-144)] -// Release input[720] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[476]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[476]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-40)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[464]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -40)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[492]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-112)] -// Release input[476] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-112)] -// Release input[728] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[492]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q1, Q1, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(-160)] -// Release input[464] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[480]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -24)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-48)] -// Release input[492] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r7 -// input[484]: Load as Q4 -vldrw.u32 
Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-96)] -// Release input[480] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[736]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[236]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[236]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[736] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-64)] -// Release input[236] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vmul.u32 Q1, Q1, r7 -// input[756]: Load as Q4 
-vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[760]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vmul.u32 Q2, Q2, r7 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[496]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -8)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q0, Q0, r7 -// input[500]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -4)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, 
[r12,#(-32)] -// Release input[496] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[752]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r12,#(-16)] -// Release input[500] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -.equ modulus_inv, 3919317503 -movw r8, #:lower16:modulus_inv -movt r8, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 7097 -// Instruction count: 5236 \ No newline at end of file diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_bitrev.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_bitrev.s deleted file mode 100644 index 2734488..0000000 --- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_bitrev.s +++ /dev/null @@ -1,6737 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 
2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 2430825 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31 -.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 62228979 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 23796181 // 
zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 23796181 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 2430825 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31 -.word 52637069 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 62228979 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 58757463 // zeta^504 * 2^31 = 299353^504 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 299353^504 * 375649793 * 2^31 -.word 41196349 // zeta^696 * 2^31 = 299353^696 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 299353^696 * 375649793 * 2^31 -.word 2430825 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31 -.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 
23642097 * 2^31 -.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 7832335 // zeta^408 * 2^31 = 299353^408 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 299353^408 * 375649793 * 2^31 -.word 62228979 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31 -.word 12542317 // zeta^744 * 2^31 = 299353^744 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 299353^744 * 375649793 * 2^31 -.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 21796399 // zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 299353^456 * 375649793 * 2^31 -.word 27114239 // zeta^648 * 2^31 = 299353^648 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 299353^648 * 375649793 * 2^31 -.word 58757463 // zeta^504 * 2^31 = 299353^504 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 299353^504 * 375649793 * 2^31 -.word 9383201 // zeta^252 * 2^31 = 299353^252 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 299353^252 * 375649793 * 2^31 -.word 7721125 // zeta^444 * 2^31 = 299353^444 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 299353^444 * 375649793 * 2^31 -.word 41196349 // zeta^696 * 2^31 = 299353^696 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 299353^696 * 375649793 * 2^31 -.word 9983051 // zeta^732 * 2^31 = 299353^732 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^732 * f(q^(-1) 
mod 2^32) * 2^31 = 299353^732 * 375649793 * 2^31
-.word 63329695 // zeta^156 * 2^31 = 299353^156 * 2^31 = 8247799 * 2^31
-.word 2675302497 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 299353^156 * 375649793 * 2^31
-.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31
-.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31
-.word 16634213 // zeta^492 * 2^31 = 299353^492 * 2^31 = 29292862 * 2^31
-.word 1874600091 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 299353^492 * 375649793 * 2^31
-.word 8316793 // zeta^684 * 2^31 = 299353^684 * 2^31 = 9249292 * 2^31
-.word 591909511 // zeta^684 * f(q^(-1) mod 2^32) * 2^31 = 299353^684 * 375649793 * 2^31
-.word 7832335 // zeta^408 * 2^31 = 299353^408 * 2^31 = 12267508 * 2^31
-.word 785060593 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 299353^408 * 375649793 * 2^31
-.word 5033605 // zeta^204 * 2^31 = 299353^204 * 2^31 = 26691971 * 2^31
-.word 3855639419 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 299353^204 * 375649793 * 2^31
-.word 26241327 // zeta^396 * 2^31 = 299353^396 * 2^31 = 572895 * 2^31
-.word 2184146129 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 299353^396 * 375649793 * 2^31
-.word 12542317 // zeta^744 * 2^31 = 299353^744 * 2^31 = 3271804 * 2^31
-.word 209379475 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 299353^744 * 375649793 * 2^31
-.word 61099389 // zeta^756 * 2^31 = 299353^756 * 2^31 = 26362414 * 2^31
-.word 1687065731 // zeta^756 * f(q^(-1) mod 2^32) * 2^31 = 299353^756 * 375649793 * 2^31
-.word 35733845 // zeta^180 * 2^31 = 299353^180 * 2^31 = 31254932 * 2^31
-.word 2000162987 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 299353^180 * 375649793 * 2^31
-.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31
-.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31
-.word 46002083 // zeta^468 * 2^31 = 299353^468 * 2^31 = 19648405 * 2^31
-.word 3404885597 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 299353^468 * 375649793 * 2^31
-.word 54335767 // zeta^660 * 2^31 = 299353^660 * 2^31 = 6490403 * 2^31
-.word 2562837737 // zeta^660 * f(q^(-1) mod 2^32) * 2^31 = 299353^660 * 375649793 * 2^31
-.word 21796399 // zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31
-.word 1211449297 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 299353^456 * 375649793 * 2^31
-.word 10391631 // zeta^228 * 2^31 = 299353^228 * 2^31 = 2138810 * 2^31
-.word 136873393 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 299353^228 * 375649793 * 2^31
-.word 1316163 // zeta^420 * 2^31 = 299353^420 * 2^31 = 14833295 * 2^31
-.word 3096742077 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 299353^420 * 375649793 * 2^31
-.word 27114239 // zeta^648 * 2^31 = 299353^648 * 2^31 = 13128918 * 2^31
-.word 840186625 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 299353^648 * 375649793 * 2^31
-.word 54842419 // zeta^708 * 2^31 = 299353^708 * 2^31 = 27028662 * 2^31
-.word 1729702349 // zeta^708 * f(q^(-1) mod 2^32) * 2^31 = 299353^708 * 375649793 * 2^31
-.word 31719253 // zeta^132 * 2^31 = 299353^132 * 2^31 = 23825509 * 2^31
-.word 3672199851 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 299353^132 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_768_u32_33556993_299353_incomplete_good_bitrev, %function
-.global ntt_768_u32_33556993_299353_incomplete_good_bitrev
-ntt_768_u32_33556993_299353_incomplete_good_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-.equ modulus, 33556993
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[512]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 8)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r8
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vmul.u32 Q4, Q2, r7
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r10
-vsub.s32 Q4, Q0, Q1
-// Release input[512] from Q1
-// input[640]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -116)]
-vadd.s32 Q6, Q4, Q3
-// input[128]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -124)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r12,#(32)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[640]: Already loaded as Q1
-// input[128]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q7
-// Release input[640] from Q1
-vmul.u32 Q3, Q0, r7
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmlah.s32 Q2, Q3, r10
-vsub.s32 Q3, Q1, Q7
-// Release input[128] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q3, Q2
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r11,#(-464)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(-496)]
-vadd.s32 Q4, Q4, Q1
-// Release input[384] from Q1
-vstrw.u32 Q4, [r12,#(-480)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vmul.u32 Q2, Q0, r7
-// input[576]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[448]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[704]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[576] from Q4
-vstrw.u32 Q3, [r12,#(288)]
-// input[448]: Already loaded as Q5
-// input[704]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[448] from Q5
-vmul.u32 Q2, Q0, r7
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[704] from Q7
-// input[544]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[544]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[544] from Q5
-vmul.u32 Q2, Q0, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[416]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[160]: Already loaded as Q5
-// input[416]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vmul.u32 Q2, Q0, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[416] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[608]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[672] from Q4
-vstrw.u32 Q3, [r11,#(-336)]
-// input[352]: Already loaded as Q5
-// input[608]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vmul.u32 Q2, Q0, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[608] from Q7
-// input[736]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[736]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[736] from Q5
-vmul.u32 Q2, Q0, r7
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[480] from Q4
-vstrw.u32 Q3, [r12,#(-96)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vmul.u32 Q2, Q0, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[400]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[656]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[528] from Q4
-vstrw.u32 Q3, [r12,#(96)]
-// input[400]: Already loaded as Q5
-// input[656]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[400] from Q5
-vmul.u32 Q2, Q0, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[656] from Q7
-// input[592]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[592]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[592] from Q5
-vmul.u32 Q2, Q0, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[464]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[208]: Already loaded as Q5
-// input[464]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vmul.u32 Q2, Q0, r7
-// input[720]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[464] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[560]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[720] from Q4
-vstrw.u32 Q3, [r11,#(-144)]
-// input[304]: Already loaded as Q5
-// input[560]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vmul.u32 Q2, Q0, r7
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[560] from Q7
-// input[688]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[688]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[688] from Q5
-vmul.u32 Q2, Q0, r7
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[432] from Q4
-vstrw.u32 Q3, [r12,#(-288)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r7
-// input[624]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[496]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[752]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[624] from Q4
-vstrw.u32 Q3, [r12,#(480)]
-// input[496]: Already loaded as Q5
-// input[752]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[496] from Q5
-vmul.u32 Q2, Q0, r7
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[752] from Q7
-// input[520]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[520]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[520] from Q5
-vmul.u32 Q2, Q0, r7
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[392]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[136]: Already loaded as Q5
-// input[392]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vmul.u32 Q2, Q0, r7
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[392] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[584]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[648] from Q4
-vstrw.u32 Q3, [r11,#(-432)]
-// input[328]: Already loaded as Q5
-// input[584]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vmul.u32 Q2, Q0, r7
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[584] from Q7
-// input[712]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[712]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[712] from Q5
-vmul.u32 Q2, Q0, r7
-// input[456]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[456] from Q4
-vstrw.u32 Q3, [r12,#(-192)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vmul.u32 Q2, Q0, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[424]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[680]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[552] from Q4
-vstrw.u32 Q3, [r12,#(192)]
-// input[424]: Already loaded as Q5
-// input[680]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[424] from Q5
-vmul.u32 Q2, Q0, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[680] from Q7
-// input[616]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[616]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[616] from Q5
-vmul.u32 Q2, Q0, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[488]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[232]: Already loaded as Q5
-// input[488]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vmul.u32 Q2, Q0, r7
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[488] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[536]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[744] from Q4
-vstrw.u32 Q3, [r11,#(-48)]
-// input[280]: Already loaded as Q5
-// input[536]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r7
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[536] from Q7
-// input[664]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[664]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[664] from Q5
-vmul.u32 Q2, Q0, r7
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[408] from Q4
-vstrw.u32 Q3, [r12,#(-384)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[728]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[600] from Q4
-vstrw.u32 Q3, [r12,#(384)]
-// input[472]: Already loaded as Q5
-// input[728]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[472] from Q5
-vmul.u32 Q2, Q0, r7
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[728] from Q7
-// input[568]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[568]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[568] from Q5
-vmul.u32 Q2, Q0, r7
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[440]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[184]: Already loaded as Q5
-// input[440]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vmul.u32 Q2, Q0, r7
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[440] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[632]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[696] from Q4
-vstrw.u32 Q3, [r11,#(-240)]
-// input[376]: Already loaded as Q5
-// input[632]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vmul.u32 Q2, Q0, r7
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[632] from Q7
-// input[760]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[760]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[760] from Q5
-vmul.u32 Q2, Q0, r7
-// input[504]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vadd.s32 Q6, Q2, Q1
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[504] from Q4
-vstrw.u32 Q3, [r12,#(0)]
-// input[4]: Already loaded as Q5
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[4] from Q5
-vmul.u32 Q2, Q0, r7
-// input[516]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[260] from Q7
-// input[388]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[644]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[516] from Q4
-vstrw.u32 Q3, [r12,#(48)]
-// input[388]: Already loaded as Q5
-// input[644]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[388] from Q5
-vmul.u32 Q2, Q0, r7
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[644] from Q7
-// input[580]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[132] from Q4
-vstrw.u32 Q3, [r14,#(-480)]
-// input[580]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[580] from Q5
-vmul.u32 Q2, Q0, r7
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[452]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[196]: Already loaded as Q5
-// input[452]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vmul.u32 Q2, Q0, r7
-// input[708]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[452] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[548]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[708] from Q4
-vstrw.u32 Q3, [r11,#(-192)]
-// input[292]: Already loaded as Q5
-// input[548]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r7
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[548] from Q7
-// input[676]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[676]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[676] from Q5
-vmul.u32 Q2, Q0, r7
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[420] from Q4
-vstrw.u32 Q3, [r12,#(-336)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r7
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[484]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[740]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[612] from Q4
-vstrw.u32 Q3, [r12,#(432)]
-// input[484]: Already loaded as Q5
-// input[740]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[484] from Q5
-vmul.u32 Q2, Q0, r7
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[740] from Q7
-// input[532]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[532]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[532] from Q5
-vmul.u32 Q2, Q0, r7
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[404]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[148]: Already loaded as Q5
-// input[404]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vmul.u32 Q2, Q0, r7
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[404] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[596]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[660] from Q4
-vstrw.u32 Q3, [r11,#(-384)]
-// input[340]: Already loaded as Q5
-// input[596]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r7
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[596] from Q7
-// input[724]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[724]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[724] from Q5
-vmul.u32 Q2, Q0, r7
-// input[468]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[468] from Q4
-vstrw.u32 Q3, [r12,#(-144)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r7
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[436]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[692]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[564] from Q4
-vstrw.u32 Q3, [r12,#(240)]
-// input[436]: Already loaded as Q5
-// input[692]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[436] from Q5
-vmul.u32 Q2, Q0, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[692] from Q7
-// input[628]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[628]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[628] from Q5
-vmul.u32 Q2, Q0, r7
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[500]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[244]: Already loaded as Q5
-// input[500]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vmul.u32 Q2, Q0, r7
-// input[756]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[500] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[524]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[756] from Q4
-vstrw.u32 Q3, [r11,#(0)]
-// input[268]: Already loaded as Q5
-// input[524]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r7
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[524] from Q7
-// input[652]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[652]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[652] from Q5
-vmul.u32 Q2, Q0, r7
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[396] from Q4
-vstrw.u32 Q3, [r12,#(-432)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r7
-// input[588]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[460]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[716]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[588] from Q4
-vstrw.u32 Q3, [r12,#(336)]
-// input[460]: Already loaded as Q5
-// input[716]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[460] from Q5
-vmul.u32 Q2, Q0, r7
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[716] from Q7
-// input[556]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[556]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[556] from Q5
-vmul.u32 Q2, Q0, r7
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[428]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[172]: Already loaded as Q5 -// input[428]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[428] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[620]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[684] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[364]: Already loaded as Q5 -// input[620]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r7 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[620] from Q7 -// input[748]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[748]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[748] from Q5 -vmul.u32 Q2, Q0, r7 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[492] from Q4 -vstrw.u32 Q3, [r12,#(-48)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r7 -// input[540]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[412]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[668]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[540] from Q4 -vstrw.u32 Q3, [r12,#(144)] -// input[412]: Already loaded as Q5 -// input[668]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[412] from Q5 -vmul.u32 Q2, Q0, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[668] from Q7 -// input[604]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[604]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[604] from Q5 -vmul.u32 Q2, Q0, r7 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[476]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] 
-vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[220]: Already loaded as Q5 -// input[476]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r7 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[476] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[572]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[732] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// input[316]: Already loaded as Q5 -// input[572]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[572] from Q7 -// input[700]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[700]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[700] from Q5 -vmul.u32 Q2, Q0, r7 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[444] from Q4 
-vstrw.u32 Q3, [r12,#(-240)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r7 -// input[636]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -// input[508]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[764]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[636] from Q4 -vstrw.u32 Q3, [r11,#(-480)] -// input[508]: Already loaded as Q5 -// input[764]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[508] from Q5 -vmul.u32 Q2, Q0, r7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[764] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r12,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r4 -// input[516]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 12)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[396]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * -108)] -vadd.s32 Q1, Q1, Q4 -// Release input[516] from Q4 -vmul.u32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// input[648]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -108)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r14,#(48)] -vadd.s32 Q1, Q1, Q0 -// Release input[264] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 
-vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r12,#(48)] -// input[648]: Already loaded as Q7 -// input[396]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r4 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vadd.s32 Q7, Q7, Q6 -// Release input[396] from Q6 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[588]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vadd.s32 Q3, Q3, Q2 -// Release input[132] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q3, Q3, Q7 -// Release input[648] from Q7 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[72]: Already loaded as Q6 -// input[588]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release input[588] from Q5 -// input[576]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[204]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -48)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[456]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -48)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q6 -// Release input[72] from Q6 -vstrw.u32 Q3, [r12,#(288)] -// Release input[576] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(336)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[456]: Already loaded as Q7 -// input[204]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[708]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release input[204] from Q5 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vsub.s32 Q4, Q3, Q2 -// 
input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[708] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[552]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-192)] -vadd.s32 Q3, Q3, Q7 -// Release input[456] from Q7 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// input[552]: Already loaded as Q6 -// input[300]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q6, Q6, Q5 -// Release input[300] from Q5 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vsub.s32 Q4, Q3, Q2 -// input[684]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -72)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release input[552] from Q6 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[168]: Already loaded as Q7 -// input[684]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[420]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -84)] -vadd.s32 Q7, Q7, Q5 -// Release input[684] from Q5 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[420] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[360]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r11,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-336)] -// input[360]: Already loaded as Q6 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[612]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 108)] -vadd.s32 Q6, Q6, Q5 -// Release input[108] from Q5 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[492]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release input[612] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[744]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -12)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(432)] -vadd.s32 Q3, Q3, Q6 -// Release input[360] from Q6 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(432)] -// input[744]: Already loaded as Q7 -// input[492]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.s32 Q7, Q7, Q5 -// Release input[492] from Q5 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vsub.s32 Q4, Q3, Q2 -// input[540]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vadd.s32 Q3, Q3, Q2 -// Release input[228] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-48)] -vadd.s32 Q3, Q3, Q7 -// Release input[744] from Q7 -vstrw.u32 Q3, [r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-96)] -// input[24]: Already loaded as Q6 -// input[540]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.s32 Q6, Q6, Q5 -// Release input[540] from Q5 -// input[528]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 
-96)] -vadd.s32 Q3, Q3, Q2 -// Release input[276] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[408]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r12,#(96)] -// Release input[528] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(144)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(96)] -// input[408]: Already loaded as Q7 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[660]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vadd.s32 Q7, Q7, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q3, Q3, Q2 -// Release input[660] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[600]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[408] from Q7 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-384)] -// input[600]: Already loaded as Q6 -// input[348]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[348] from Q5 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vsub.s32 Q4, Q3, Q2 -// input[732]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -24)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[216]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(384)] -vadd.s32 Q3, Q3, Q6 -// Release input[600] from Q6 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// 
input[216]: Already loaded as Q7 -// input[732]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[468]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -36)] -vadd.s32 Q7, Q7, Q5 -// Release input[732] from Q5 -// input[720]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -36)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[468] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-144)] -vadd.s32 Q3, Q3, Q7 -// Release input[216] from Q7 -vstrw.u32 Q3, [r11,#(-144)] -// Release input[720] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-144)] -// input[312]: Already loaded as Q6 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[564]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 60)] -vadd.s32 Q6, Q6, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[444]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -60)] -vadd.s32 Q3, Q3, Q2 -// Release input[564] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[696]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -60)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release input[312] from Q6 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(240)] -// input[696]: Already loaded as Q7 -// input[444]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[444] from Q5 -// input[432]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -72)] -vsub.s32 Q4, Q3, Q2 -// input[636]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -120)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 
-vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-240)] -vadd.s32 Q3, Q3, Q7 -// Release input[696] from Q7 -vstrw.u32 Q3, [r12,#(-288)] -// Release input[432] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[120]: Already loaded as Q6 -// input[636]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vadd.s32 Q6, Q6, Q5 -// Release input[636] from Q5 -// input[624]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 120)] -vsub.s32 Q4, Q3, Q2 -// input[252]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release input[372] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[504]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 0)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q6 -// Release input[120] from Q6 -vstrw.u32 Q3, [r12,#(480)] -// Release input[624] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -// input[504]: Already loaded as Q7 -// input[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vadd.s32 Q7, Q7, Q5 -// Release input[252] from Q5 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vsub.s32 Q4, Q3, Q2 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q3, Q3, Q2 -// Release input[756] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[520]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(0)] -vadd.s32 Q3, Q3, Q7 -// Release input[504] from Q7 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(0)] -// input[520]: Already loaded as Q6 -// input[268]: Already 
loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[268] from Q5 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[652]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -104)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[136]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[520] from Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[136]: Already loaded as Q7 -// input[652]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -vadd.s32 Q7, Q7, Q5 -// Release input[652] from Q5 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[388] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[328]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q7 -// Release input[136] from Q7 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-464)] -// input[328]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[460]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -44)] -vadd.s32 Q3, Q3, Q2 -// Release input[580] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[712]: Load as Q7 
-vldrw.u32 Q7, [r11, #(4 * -44)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(304)] -vadd.s32 Q3, Q3, Q6 -// Release input[328] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(304)] -// input[712]: Already loaded as Q7 -// input[460]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release input[460] from Q5 -// input[448]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -56)] -vsub.s32 Q4, Q3, Q2 -// input[556]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[196] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[40]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-176)] -vadd.s32 Q3, Q3, Q7 -// Release input[712] from Q7 -vstrw.u32 Q3, [r12,#(-224)] -// Release input[448] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -// input[40]: Already loaded as Q6 -// input[556]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[556] from Q5 -// input[544]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 40)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[292] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[424]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release input[40] from Q6 -vstrw.u32 Q3, [r12,#(160)] -// Release input[544] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -// input[424]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// 
input[676]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[676] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[616]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release input[424] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// input[616]: Already loaded as Q6 -// input[364]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q6, Q6, Q5 -// Release input[364] from Q5 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vsub.s32 Q4, Q3, Q2 -// input[748]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[232]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -20)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(448)] -vadd.s32 Q3, Q3, Q6 -// Release input[616] from Q6 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[232]: Already loaded as Q7 -// input[748]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[484]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -20)] -vadd.s32 Q7, Q7, Q5 -// Release input[748] from Q5 -// input[736]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -20)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[484] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[280]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, 
Q0, r10 -vstrw.u32 Q2, [r14,#(-80)] -vadd.s32 Q3, Q3, Q7 -// Release input[232] from Q7 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-80)] -// input[280]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 28)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[412]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -92)] -vadd.s32 Q3, Q3, Q2 -// Release input[532] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[664]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -92)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(112)] -vadd.s32 Q3, Q3, Q6 -// Release input[280] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(112)] -// input[664]: Already loaded as Q7 -// input[412]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q7, Q7, Q5 -// Release input[412] from Q5 -// input[400]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// input[604]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q7 -// Release input[664] from Q7 -vstrw.u32 Q3, [r12,#(-416)] -// Release input[400] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[88]: Already loaded as Q6 -// input[604]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] 
-vadd.s32 Q6, Q6, Q5
-// Release input[604] from Q5
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[472]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -32)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[88] from Q6
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[472]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[724] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[568]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-128)]
-vadd.s32 Q3, Q3, Q7
-// Release input[472] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-128)]
-// input[568]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[700]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -56)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[184]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(256)]
-vadd.s32 Q3, Q3, Q6
-// Release input[568] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[184]: Already loaded as Q7
-// input[700]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[700] from Q5
-// input[688]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -68)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[436] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[376]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q7
-// Release input[184] from Q7
-vstrw.u32 Q3, [r11,#(-272)]
-// Release input[688] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-272)]
-// input[376]: Already loaded as Q6
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[628]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 124)]
-vadd.s32 Q6, Q6, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[508]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[628] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[760]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q6
-// Release input[376] from Q6
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(496)]
-// input[760]: Already loaded as Q7
-// input[508]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[508] from Q5
-// input[496]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -8)]
-vsub.s32 Q4, Q3, Q2
-// input[524]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[760] from Q7
-vstrw.u32 Q3, [r12,#(-32)]
-// Release input[496] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[524]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[524] from Q5
-// input[512]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[392]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r12,#(32)]
-// Release input[512] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[392]: Already loaded as Q7
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[644]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -112)]
-vadd.s32 Q7, Q7, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[644] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[584]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-448)]
-vadd.s32 Q3, Q3, Q7
-// Release input[392] from Q7
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-448)]
-// input[584]: Already loaded as Q6
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[716]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release input[584] from Q6
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[200]: Already loaded as Q7
-// input[716]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-vadd.s32 Q7, Q7, Q5
-// Release input[716] from Q5
-// input[704]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -52)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[452] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r11,#(-208)]
-// Release input[704] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-208)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[548]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[428]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[548] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[680]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(176)]
-// input[680]: Already loaded as Q7
-// input[428]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[428] from Q5
-// input[416]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -88)]
-vsub.s32 Q4, Q3, Q2
-// input[620]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[104]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[680] from Q7
-vstrw.u32 Q3, [r12,#(-352)]
-// Release input[416] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q6
-// input[620]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[620] from Q5
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[488]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -16)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q6
-// Release input[104] from Q6
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[488]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[740]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -16)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[740] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[536]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-64)]
-vadd.s32 Q3, Q3, Q7
-// Release input[488] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-64)]
-// input[536]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[668]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -88)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[536] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[152]: Already loaded as Q7
-// input[668]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[668] from Q5
-// input[656]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[404] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[344]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q7
-// Release input[152] from Q7
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[656] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-352)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-400)]
-// input[344]: Already loaded as Q6
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[596]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[476]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[596] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[728]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -28)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[344] from Q6
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(368)]
-// input[728]: Already loaded as Q7
-// input[476]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[476] from Q5
-// input[464]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -40)]
-vsub.s32 Q4, Q3, Q2
-// input[572]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 68)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-112)]
-vadd.s32 Q3, Q3, Q7
-// Release input[728] from Q7
-vstrw.u32 Q3, [r12,#(-160)]
-// Release input[464] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[572]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[572] from Q5
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[440]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(272)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[440]: Already loaded as Q7
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -64)]
-vadd.s32 Q7, Q7, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[692] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[632]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[440] from Q7
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-256)]
-// input[632]: Already loaded as Q6
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q6, Q6, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[764]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-496)]
-vadd.s32 Q3, Q3, Q6
-// Release input[632] from Q6
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-// input[248]: Already loaded as Q7
-// input[764]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-vadd.s32 Q7, Q7, Q5
-// Release input[764] from Q5
-// input[752]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -4)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[500] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[752] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-16)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r8
-// input[528]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 24)]
-vmul.u32 Q5, Q5, r7
-// input[288]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r10
-vqrdmulh.s32 Q2, Q1, r8
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r10
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r4
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r10
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-vqrdmulh.s32 Q4, Q6, r6
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r5
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r10
-vstrw.u32 Q1, [r12,#(96)]
-// Release input[528] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[432]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r7
-// input[672]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[384]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[624]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[624]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q0, Q0, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(-480)]
-// Release input[384] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[576]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(480)]
-// Release input[624] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vmul.u32 Q1, Q1, r7
-// input[480]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -24)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[304]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q2, Q2, r7
-// input[544]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 40)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[688]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[400]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -104)]
-vmul.u32 Q0, Q0, r7
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release input[688] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-416)]
-// Release input[400] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q1, Q1, r7
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r11,#(-464)]
-// Release input[640] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q2, Q2, r7
-// input[736]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[448]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -56)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[560]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release input[736] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[560]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q0, Q0, r7
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(224)]
-// Release input[560] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[176]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[656]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vmul.u32 Q1, Q1, r7
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[368]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-304)]
-// Release input[176] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[656] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q2, Q2, r7
-// input[608]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 104)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[752]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[752]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[464]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -40)]
-vmul.u32 Q0, Q0, r7
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[704]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-16)]
-// Release input[752] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-160)]
-// Release input[464] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r11,#(-208)]
-// Release input[704] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[696]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[696]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[408]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -96)]
-vmul.u32 Q2, Q2, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[648]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[120]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-240)]
-// Release input[696] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-384)]
-// Release input[408] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q0, Q0, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r11,#(-432)]
-// Release input[648] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(480)]
-// Release input[120] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q1, Q1, r7
-// input[744]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -12)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[456]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -48)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[568]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-48)]
-// Release input[744] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[568]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q2, Q2, r7
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(-192)]
-// Release input[456] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(256)]
-// Release input[568] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vmul.u32 Q0, Q0, r7
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(64)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r7
-// input[616]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[760]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[760]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vmul.u32 Q2, Q2, r7
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -44)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(16)]
-// Release input[760] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(-128)]
-// Release input[472] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vmul.u32 Q0, Q0, r7
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r11,#(-176)]
-// Release input[712] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[440]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q1, Q1, r7
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[392]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -112)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[632]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-256)] -// Release input[440] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[632]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q2, Q2, r7 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-448)] -// Release input[392] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[584]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-496)] -// Release input[632] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[488]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(320)] -// Release input[584] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, 
Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[564]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-112)] -// Release input[728] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-64)] -// Release input[488] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[564]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q1, Q1, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[516]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(240)] -// Release input[564] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[660]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmul.u32 Q2, Q2, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(48)] -// Release input[516] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, 
Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[372]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-384)] -// Release input[660] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[372]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q0, Q0, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[756]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(480)] -// Release input[372] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[468]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -36)] -vmul.u32 Q1, Q1, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(288)] -// Release input[324] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[708]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[52]: Load as Q2 
-vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-144)] -// Release input[468] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vmul.u32 Q2, Q2, r7 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-192)] -// Release input[708] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[436]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q0, Q0, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[628]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, 
Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[628]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r7 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(-464)] -// Release input[388] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[580]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(496)] -// Release input[628] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[724]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -32)] -vmul.u32 Q2, Q2, r7 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(304)] -// Release input[580] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[308]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, 
[r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-128)] -// Release input[724] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[308]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r7 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[692]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(224)] -// Release input[308] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[692]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[404]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -100)] -vmul.u32 Q1, Q1, r7 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[644]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-256)] -// Release input[692] from Q1 -vqrdmlah.s32 Q6, Q4, 
r10 -vstrw.u32 Q3, [r12,#(-400)] -// Release input[404] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[116]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vmul.u32 Q2, Q2, r7 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-448)] -// Release input[644] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[500]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(464)] -// Release input[116] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[500]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r7 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[452]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-16)] -// Release input[500] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 
-vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vmul.u32 Q1, Q1, r7 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(-208)] -// Release input[452] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[444]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[444]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[396]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-240)] -// Release input[444] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] 
from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q0, Q0, r7 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-432)] -// Release input[396] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[588]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[732]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -24)] -vmul.u32 Q1, Q1, r7 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(336)] -// Release input[588] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-96)] -// Release input[732] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-48)] -// 
Release input[492] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[316]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r7 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[700]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(208)] -// Release input[556] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[700]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[412]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -92)] -vmul.u32 Q0, Q0, r7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[652]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-224)] -// Release input[700] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-368)] -// Release input[412] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: 
Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vmul.u32 Q1, Q1, r7 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmul.u32 Q2, Q2, r7 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -8)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[572]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[284]: Load as 
Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(272)] -// Release input[572] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[668]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -88)] -vmul.u32 Q1, Q1, r7 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-352)] -// Release input[668] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[380]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q2, Q2, r7 -// 
input[620]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[332]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(464)] -// Release input[620] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[476]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(320)] -// Release input[332] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[716]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-112)] -// Release input[476] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q6 -vstrw.u32 Q2, [r11,#(-160)] -// Release input[716] from Q2 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[192]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[576]: 
Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 72)] -vmul.u32 Q1, Q1, r7 -// input[384]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -120)] -vqrdmlah.s32 Q0, Q1, r10 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q1, Q3, Q0 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q4, Q2, r10 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q1, r10 -// input[448]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmlah.s32 Q6, Q3, r10 -vstrw.u32 Q2, [r12,#(288)] -// Release input[576] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r12,#(-480)] -// Release input[384] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[448]: Already loaded as Q4 -vqrdmulh.s32 Q1, Q4, r8 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vmul.u32 Q4, Q4, r7 -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmlah.s32 Q1, Q4, r10 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q2, r8 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q0, Q2, r10 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q2, Q1, Q0 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q4, r10 -// input[704]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r12,#(-224)] -// Release input[448] from Q4 -vqrdmlah.s32 Q6, Q3, r10 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[704]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[320]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 68)] -vmul.u32 Q0, Q0, r7 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -124)] 
-vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[480]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-208)] -// Release input[704] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(272)] -// Release input[320] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-496)] -// Release input[128] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[480]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q1, Q1, r7 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[736]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-96)] -// Release input[480] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release input[672] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[736]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q2, Q2, r7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-92)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[544]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-80)]
-// Release input[736] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[224]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[608]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 104)]
-vmul.u32 Q0, Q0, r7
-// input[416]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -88)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(160)]
-// Release input[544] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[720]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(416)]
-// Release input[608] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[720]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q1, Q1, r7
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[528]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 24)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release input[720] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[592]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 88)]
-vmul.u32 Q2, Q2, r7
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(96)]
-// Release input[528] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(352)]
-// Release input[592] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[464]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r7
-// input[656]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -100)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-160)]
-// Release input[464] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-400)]
-// Release input[656] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[624]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 120)]
-vmul.u32 Q1, Q1, r7
-// input[432]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -72)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(480)]
-// Release input[624] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-288)]
-// Release input[432] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[496]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q2, Q2, r7
-// input[688]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -68)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(192)]
-// Release input[48] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[304]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[752]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-272)]
-// Release input[688] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[752]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q0, Q0, r7
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(208)]
-// Release input[304] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[560]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-16)]
-// Release input[752] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[456]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q1, Q1, r7
-// input[648]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[712]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-192)]
-// Release input[456] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-432)]
-// Release input[648] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[712]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vmul.u32 Q2, Q2, r7
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[520]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[200]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-176)]
-// Release input[712] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[200]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[584]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 80)]
-vmul.u32 Q0, Q0, r7
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(64)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[744]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-208)]
-// Release input[200] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[744]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vmul.u32 Q1, Q1, r7
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[552]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-48)]
-// Release input[744] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[616]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 112)]
-vmul.u32 Q2, Q2, r7
-// input[424]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -80)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(192)]
-// Release input[552] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[488]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(448)]
-// Release input[616] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[488]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q0, Q0, r7
-// input[680]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -76)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(160)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[296]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-64)]
-// Release input[488] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-304)]
-// Release input[680] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[600]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 96)]
-vmul.u32 Q1, Q1, r7
-// input[408]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(176)]
-// Release input[296] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[24]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(384)]
-// Release input[600] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-384)]
-// Release input[408] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[472]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q2, Q2, r7
-// input[664]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(96)]
-// Release input[24] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[280]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[728]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-368)]
-// Release input[664] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[728]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r7
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(112)]
-// Release input[280] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[536]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[504]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-112)]
-// Release input[728] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q1, Q1, r7
-// input[696]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(128)]
-// Release input[536] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[312]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 60)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[760]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-240)]
-// Release input[696] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[760]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[376]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 124)]
-vmul.u32 Q2, Q2, r7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(240)]
-// Release input[312] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[568]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(16)]
-// Release input[760] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[632]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vmul.u32 Q0, Q0, r7
-// input[440]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -64)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(256)]
-// Release input[568] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[708]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-496)]
-// Release input[632] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-256)]
-// Release input[440] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[708]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[324]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 72)]
-vmul.u32 Q1, Q1, r7
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[516]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-192)]
-// Release input[708] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(288)]
-// Release input[324] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[196]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[580]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 76)]
-vmul.u32 Q2, Q2, r7
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(48)]
-// Release input[516] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[452]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(304)]
-// Release input[580] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[452]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vmul.u32 Q0, Q0, r7
-// input[644]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-208)]
-// Release input[452] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release input[644] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[228]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[612]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 108)]
-vmul.u32 Q1, Q1, r7
-// input[420]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[36]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(432)]
-// Release input[612] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[484]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q2, Q2, r7
-// input[676]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(144)]
-// Release input[36] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[292]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[740]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[740]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[356]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 104)]
-vmul.u32 Q0, Q0, r7
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(160)]
-// Release input[292] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[548]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[468]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-64)]
-// Release input[740] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(416)]
-// Release input[356] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[468]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q1, Q1, r7
-// input[660]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(176)]
-// Release input[548] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[276]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 24)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-144)]
-// Release input[468] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-384)]
-// Release input[660] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[724]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q2, Q2, r7
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(96)]
-// Release input[276] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[532]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[212]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-128)]
-// Release input[724] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[212]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[596]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 92)]
-vmul.u32 Q0, Q0, r7
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(112)]
-// Release input[532] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[756]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-160)]
-// Release input[212] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(368)]
-// Release input[596] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-400)]
-// Release input[404] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[756]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[372]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 120)]
-vmul.u32 Q1, Q1, r7
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[564]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 60)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(0)]
-// Release input[756] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(480)]
-// Release input[372] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[628]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 124)]
-vmul.u32 Q2, Q2, r7
-// input[436]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -68)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(240)]
-// Release input[564] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[500]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(496)]
-// Release input[628] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-272)]
-// Release input[436] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[500]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[116]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 116)]
-vmul.u32 Q0, Q0, r7
-// input[692]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -64)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-16)]
-// Release input[500] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-256)]
-// Release input[692] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[204]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[588]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 84)]
-vmul.u32 Q1, Q1, r7
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(336)]
-// Release input[588] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[460]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 76)]
-vmul.u32 Q2, Q2, r7
-// input[652]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -104)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[716]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[652] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[716]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[332]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 80)]
-vmul.u32 Q0, Q0, r7
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[524]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[492]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-160)]
-// Release input[716] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(320)]
-// Release input[332] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[492]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q1, Q1, r7
-// input[684]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(80)]
-// Release input[524] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[300]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[748]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(-48)]
-// Release input[492] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-288)]
-// Release input[684] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[748]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[364]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 112)]
-vmul.u32 Q2, Q2, r7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(192)]
-// Release input[300] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[556]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[236]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-32)]
-// Release input[748] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(448)]
-// Release input[364] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[236]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 116)]
-vmul.u32 Q0, Q0, r7
-// input[428]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -76)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(208)]
-// Release input[556] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[732]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-64)]
-// Release input[236] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(464)]
-// Release input[620] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(-304)]
-// Release input[428] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[732]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vmul.u32 Q1, Q1, r7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[540]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-96)]
-// Release input[732] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[604]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 100)]
-vmul.u32 Q2, Q2, r7
-// input[412]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -92)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(144)]
-// Release input[540] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[476]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(400)]
-// Release input[604] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r12,#(-368)]
-// Release input[412] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[476]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vmul.u32 Q0, Q0, r7
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[284]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-112)]
-// Release input[476] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[668] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[636]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vmul.u32 Q1, Q1, r7
-// input[444]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -60)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(128)]
-// Release input[284] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[60]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r11,#(-480)]
-// Release input[636] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(-240)]
-// Release input[444] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[508]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vmul.u32 Q2, Q2, r7
-// input[700]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -56)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(240)]
-// Release input[60] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmul.u32 Q0, Q0, r7 -// input[188]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[572]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-256)] -// Release input[188] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(272)] -// Release input[572] from Q2 -.equ modulus_inv, 3919317503 -movw r8, #:lower16:modulus_inv -movt r8, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 6705 -// Instruction count: 4845 \ No newline at end of file diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_double.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_double.s deleted file mode 100644 index 5333654..0000000 --- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_good_double.s +++ /dev/null @@ -1,8101 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm 
Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_768_u32_33556993_299353_incomplete_good_double_twiddles -ntt_768_u32_33556993_299353_incomplete_good_double_twiddles: // For base multiplication -.word 22568483 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2863202269 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 14800813 // zeta^640 * 2^31 = 299353^640 * 2^31 = 25038562 * 2^31 -.word 3019303507 // zeta^640 * f(q^(-1) mod 2^32) * 2^31 = 299353^640 * 375649793 * 2^31 -.word 29445319 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 1937930553 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 490723 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 2973981 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 2600037 // zeta^544 * 2^31 = 299353^544 * 2^31 = 29356361 * 2^31 -.word 2267039131 // zeta^544 * f(q^(-1) mod 2^32) * 2^31 = 299353^544 * 375649793 * 2^31 -.word 50737913 // zeta^416 * 2^31 = 299353^416 * 2^31 = 32616688 * 2^31 -.word 3954411271 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 299353^416 * 375649793 * 2^31 -.word 12865833 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 1885554903 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 25786061 // zeta^736 * 2^31 = 299353^736 * 2^31 = 23624597 * 2^31 -.word 1460046643 // zeta^736 * f(q^(-1) mod 2^32) * 2^31 = 299353^736 * 375649793 * 2^31 -.word 23929393 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2965686223 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 64126327 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 4170721417 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 7890165 // zeta^592 * 2^31 = 299353^592 * 2^31 = 21166324 * 2^31 -.word 902114571 // zeta^592 * f(q^(-1) mod 2^32) * 2^31 = 299353^592 * 375649793 * 2^31 -.word 12060473 // zeta^464 * 2^31 = 299353^464 * 2^31 
= 518908 * 2^31 -.word 2001482439 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 299353^464 * 375649793 * 2^31 -.word 22718411 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 2347189813 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 42631425 // zeta^688 * 2^31 = 299353^688 * 2^31 = 15739856 * 2^31 -.word 3961227519 // zeta^688 * f(q^(-1) mod 2^32) * 2^31 = 299353^688 * 375649793 * 2^31 -.word 59046925 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 2604631539 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 21096031 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 1788989345 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 13060269 // zeta^520 * 2^31 = 299353^520 * 2^31 = 24586938 * 2^31 -.word 2256238931 // zeta^520 * f(q^(-1) mod 2^32) * 2^31 = 299353^520 * 375649793 * 2^31 -.word 42196639 // zeta^392 * 2^31 = 299353^392 * 2^31 = 11458020 * 2^31 -.word 2215794529 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 299353^392 * 375649793 * 2^31 -.word 21384651 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31 -.word 3900153909 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31 -.word 59814003 // zeta^712 * 2^31 = 299353^712 * 2^31 = 7514760 * 2^31 -.word 3563572621 // zeta^712 * f(q^(-1) mod 2^32) * 2^31 = 299353^712 * 375649793 * 2^31 -.word 37926881 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31 -.word 1862799903 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31 -.word 15144207 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31 -.word 2182054129 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31 -.word 34389191 // zeta^616 * 2^31 = 299353^616 * 2^31 = 23245647 * 2^31 -.word 2778138937 // zeta^616 * f(q^(-1) mod 2^32) * 2^31 = 299353^616 * 375649793 * 2^31 -.word 61314877 // zeta^488 * 2^31 = 299353^488 * 2^31 = 19973843 * 2^31 -.word 913099459 // zeta^488 
* f(q^(-1) mod 2^32) * 2^31 = 299353^488 * 375649793 * 2^31 -.word 5537745 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31 -.word 1761978927 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31 -.word 45573613 // zeta^664 * 2^31 = 299353^664 * 2^31 = 21424662 * 2^31 -.word 4172274707 // zeta^664 * f(q^(-1) mod 2^32) * 2^31 = 299353^664 * 375649793 * 2^31 -.word 41269183 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31 -.word 1713094209 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31 -.word 8430951 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31 -.word 2411290777 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 47819683 // zeta^568 * 2^31 = 299353^568 * 2^31 = 27276494 * 2^31 -.word 1882108509 // zeta^568 * f(q^(-1) mod 2^32) * 2^31 = 299353^568 * 375649793 * 2^31 -.word 3519557 // zeta^440 * 2^31 = 299353^440 * 2^31 = 16323183 * 2^31 -.word 3219193275 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 299353^440 * 375649793 * 2^31 -.word 16723085 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31 -.word 1906988403 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31 -.word 23154315 // zeta^760 * 2^31 = 299353^760 * 2^31 = 3793231 * 2^31 -.word 727981941 // zeta^760 * f(q^(-1) mod 2^32) * 2^31 = 299353^760 * 375649793 * 2^31 -.word 29464415 // zeta^260 * 2^31 = 299353^260 * 2^31 = 29589567 * 2^31 -.word 1124867745 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 299353^260 * 375649793 * 2^31 -.word 16834411 // zeta^132 * 2^31 = 299353^132 * 2^31 = 23825509 * 2^31 -.word 2414694037 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 299353^132 * 375649793 * 2^31 -.word 3336971 // zeta^580 * 2^31 = 299353^580 * 2^31 = 18571159 * 2^31 -.word 984580853 // zeta^580 * f(q^(-1) mod 2^32) * 2^31 = 299353^580 * 375649793 * 2^31 -.word 35987869 // zeta^452 * 2^31 = 299353^452 * 2^31 = 25099490 * 2^31 -.word 3520528483 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 299353^452 * 
375649793 * 2^31 -.word 65924417 // zeta^ 36 * 2^31 = 299353^ 36 * 2^31 = 18723698 * 2^31 -.word 66231487 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31 -.word 56270029 // zeta^676 * 2^31 = 299353^676 * 2^31 = 26036127 * 2^31 -.word 2830198067 // zeta^676 * f(q^(-1) mod 2^32) * 2^31 = 299353^676 * 375649793 * 2^31 -.word 6393295 // zeta^356 * 2^31 = 299353^356 * 2^31 = 7994472 * 2^31 -.word 2689370161 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 299353^356 * 375649793 * 2^31 -.word 9671303 // zeta^228 * 2^31 = 299353^228 * 2^31 = 2138810 * 2^31 -.word 3966350201 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 299353^228 * 375649793 * 2^31 -.word 39920455 // zeta^532 * 2^31 = 299353^532 * 2^31 = 15308198 * 2^31 -.word 3879969465 // zeta^532 * f(q^(-1) mod 2^32) * 2^31 = 299353^532 * 375649793 * 2^31 -.word 12915337 // zeta^404 * 2^31 = 299353^404 * 2^31 = 8817795 * 2^31 -.word 2926593911 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 299353^404 * 375649793 * 2^31 -.word 28711753 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31 -.word 752271031 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31 -.word 49951167 // zeta^724 * 2^31 = 299353^724 * 2^31 = 25986735 * 2^31 -.word 3931849793 // zeta^724 * f(q^(-1) mod 2^32) * 2^31 = 299353^724 * 375649793 * 2^31 -.word 32477005 // zeta^308 * 2^31 = 299353^308 * 2^31 = 18367002 * 2^31 -.word 3247796915 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 299353^308 * 375649793 * 2^31 -.word 52956899 // zeta^180 * 2^31 = 299353^180 * 2^31 = 31254932 * 2^31 -.word 985714461 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 299353^180 * 375649793 * 2^31 -.word 49240327 // zeta^628 * 2^31 = 299353^628 * 2^31 = 12632936 * 2^31 -.word 4123979001 // zeta^628 * f(q^(-1) mod 2^32) * 2^31 = 299353^628 * 375649793 * 2^31 -.word 4015583 // zeta^500 * 2^31 = 299353^500 * 2^31 = 19827515 * 2^31 -.word 4016140321 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 299353^500 * 375649793 * 2^31 -.word 4106643 // zeta^ 12 * 2^31 = 
299353^ 12 * 2^31 = 32984098 * 2^31 -.word 2465535085 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31 -.word 3455577 // zeta^652 * 2^31 = 299353^652 * 2^31 = 6592748 * 2^31 -.word 2655960999 // zeta^652 * f(q^(-1) mod 2^32) * 2^31 = 299353^652 * 375649793 * 2^31 -.word 24352595 // zeta^332 * 2^31 = 299353^332 * 2^31 = 11703708 * 2^31 -.word 1141483181 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 299353^332 * 375649793 * 2^31 -.word 7734269 // zeta^204 * 2^31 = 299353^204 * 2^31 = 26691971 * 2^31 -.word 1323163139 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 299353^204 * 375649793 * 2^31 -.word 46722155 // zeta^556 * 2^31 = 299353^556 * 2^31 = 14873638 * 2^31 -.word 1252475285 // zeta^556 * f(q^(-1) mod 2^32) * 2^31 = 299353^556 * 375649793 * 2^31 -.word 40466367 // zeta^428 * 2^31 = 299353^428 * 2^31 = 5624346 * 2^31 -.word 3953655361 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 299353^428 * 375649793 * 2^31 -.word 22115915 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31 -.word 2248243125 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31 -.word 22497809 // zeta^748 * 2^31 = 299353^748 * 2^31 = 27817396 * 2^31 -.word 48848879 // zeta^748 * f(q^(-1) mod 2^32) * 2^31 = 299353^748 * 375649793 * 2^31 -.word 8747555 // zeta^284 * 2^31 = 299353^284 * 2^31 = 5130075 * 2^31 -.word 2123621341 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 299353^284 * 375649793 * 2^31 -.word 6917847 // zeta^156 * 2^31 = 299353^156 * 2^31 = 8247799 * 2^31 -.word 3643725609 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 299353^156 * 375649793 * 2^31 -.word 25107397 // zeta^604 * 2^31 = 299353^604 * 2^31 = 6688514 * 2^31 -.word 782407227 // zeta^604 * f(q^(-1) mod 2^32) * 2^31 = 299353^604 * 375649793 * 2^31 -.word 179365 // zeta^476 * 2^31 = 299353^476 * 2^31 = 1602327 * 2^31 -.word 1021818203 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 299353^476 * 375649793 * 2^31 -.word 17007163 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31 -.word 822528965 // 
zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31 -.word 53305899 // zeta^700 * 2^31 = 299353^700 * 2^31 = 16127868 * 2^31 -.word 3084667861 // zeta^700 * f(q^(-1) mod 2^32) * 2^31 = 299353^700 * 375649793 * 2^31 -.word 42721649 // zeta^380 * 2^31 = 299353^380 * 2^31 = 9132318 * 2^31 -.word 2921236623 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 299353^380 * 375649793 * 2^31 -.word 8805149 // zeta^252 * 2^31 = 299353^252 * 2^31 = 8471290 * 2^31 -.word 699713251 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 299353^252 * 375649793 * 2^31 -.word 52313173 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 1275663787 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 41324663 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 4138866057 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 66623263 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31 -.word 4291993313 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31 -.word 62511589 // zeta^448 * 2^31 = 299353^448 * 2^31 = 19715532 * 2^31 -.word 1934956571 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 299353^448 * 375649793 * 2^31 -.word 16376073 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 340556023 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 52533103 // zeta^672 * 2^31 = 299353^672 * 2^31 = 30296666 * 2^31 -.word 2607595153 // zeta^672 * f(q^(-1) mod 2^32) * 2^31 = 299353^672 * 375649793 * 2^31 -.word 41327925 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 2834920651 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 20636765 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 425508259 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 2987659 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31 -.word 124245877 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 
375649793 * 2^31 -.word 60474045 // zeta^400 * 2^31 = 299353^400 * 2^31 = 9445248 * 2^31 -.word 3089932099 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 299353^400 * 375649793 * 2^31 -.word 55053513 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 2293484855 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 29386685 // zeta^720 * 2^31 = 299353^720 * 2^31 = 20647416 * 2^31 -.word 3195599427 // zeta^720 * f(q^(-1) mod 2^32) * 2^31 = 299353^720 * 375649793 * 2^31 -.word 24482561 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 333739775 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 13643979 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2680929589 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 46017955 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31 -.word 2505977949 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31 -.word 4393901 // zeta^496 * 2^31 = 299353^496 * 2^31 = 13108720 * 2^31 -.word 815642195 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 299353^496 * 375649793 * 2^31 -.word 24917347 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31 -.word 2079172765 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31 -.word 4420623 // zeta^648 * 2^31 = 299353^648 * 2^31 = 13128918 * 2^31 -.word 40444401 // zeta^648 * f(q^(-1) mod 2^32) * 2^31 = 299353^648 * 375649793 * 2^31 -.word 7299983 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 731394673 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 62241627 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31 -.word 336581285 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31 -.word 51969779 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31 -.word 2112913165 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31 -.word 56339667 // zeta^424 * 2^31 = 
299353^424 * 2^31 = 23981562 * 2^31 -.word 3975713069 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 299353^424 * 375649793 * 2^31 -.word 5799109 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31 -.word 3381867835 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31 -.word 6631307 // zeta^744 * 2^31 = 299353^744 * 2^31 = 3271804 * 2^31 -.word 1865039477 // zeta^744 * f(q^(-1) mod 2^32) * 2^31 = 299353^744 * 375649793 * 2^31 -.word 21540373 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 122692587 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 299353^280 * 375649793 * 2^31 -.word 60635111 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31 -.word 1884671513 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31 -.word 58683035 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31 -.word 1883676517 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31 -.word 66395225 // zeta^472 * 2^31 = 299353^472 * 2^31 = 3334573 * 2^31 -.word 3596770727 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 299353^472 * 375649793 * 2^31 -.word 63594429 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31 -.word 1075774019 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31 -.word 10743133 // zeta^696 * 2^31 = 299353^696 * 2^31 = 10953311 * 2^31 -.word 2957882531 // zeta^696 * f(q^(-1) mod 2^32) * 2^31 = 299353^696 * 375649793 * 2^31 -.word 43959671 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 3566985353 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 27125763 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31 -.word 1179006461 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31 -.word 50279575 // zeta^516 * 2^31 = 299353^516 * 2^31 = 9731484 * 2^31 -.word 1880273257 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 299353^516 * 375649793 * 2^31 -.word 46186997 // zeta^388 * 2^31 = 299353^388 * 2^31 = 5764058 * 2^31 -.word 
3005141003 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 299353^388 * 375649793 * 2^31 -.word 31126117 // zeta^ 68 * 2^31 = 299353^ 68 * 2^31 = 8457503 * 2^31 -.word 774438811 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 68 * 375649793 * 2^31 -.word 906095 // zeta^708 * 2^31 = 299353^708 * 2^31 = 27028662 * 2^31 -.word 1759019665 // zeta^708 * f(q^(-1) mod 2^32) * 2^31 = 299353^708 * 375649793 * 2^31 -.word 10843957 // zeta^292 * 2^31 = 299353^292 * 2^31 = 7520866 * 2^31 -.word 1464769227 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 299353^292 * 375649793 * 2^31 -.word 43211381 // zeta^164 * 2^31 = 299353^164 * 2^31 = 26244564 * 2^31 -.word 1531000715 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 299353^164 * 375649793 * 2^31 -.word 57442683 // zeta^612 * 2^31 = 299353^612 * 2^31 = 31418183 * 2^31 -.word 328617093 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 299353^612 * 375649793 * 2^31 -.word 30278985 // zeta^484 * 2^31 = 299353^484 * 2^31 = 5855662 * 2^31 -.word 3017987255 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 299353^484 * 375649793 * 2^31 -.word 54198649 // zeta^ 20 * 2^31 = 299353^ 20 * 2^31 = 24739198 * 2^31 -.word 1368373383 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 20 * 375649793 * 2^31 -.word 60562111 // zeta^660 * 2^31 = 299353^660 * 2^31 = 6490403 * 2^31 -.word 953375553 // zeta^660 * f(q^(-1) mod 2^32) * 2^31 = 299353^660 * 375649793 * 2^31 -.word 17162819 // zeta^340 * 2^31 = 299353^340 * 2^31 = 7570258 * 2^31 -.word 363117501 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 299353^340 * 375649793 * 2^31 -.word 12317579 // zeta^212 * 2^31 = 299353^212 * 2^31 = 21478846 * 2^31 -.word 1115388533 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 299353^212 * 375649793 * 2^31 -.word 14157087 // zeta^564 * 2^31 = 299353^564 * 2^31 = 2302061 * 2^31 -.word 3309252833 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 299353^564 * 375649793 * 2^31 -.word 13077099 // zeta^436 * 2^31 = 299353^436 * 2^31 = 20669063 * 2^31 -.word 2262082453 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 
299353^436 * 375649793 * 2^31 -.word 63098403 // zeta^116 * 2^31 = 299353^116 * 2^31 = 13729478 * 2^31 -.word 278826973 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 299353^116 * 375649793 * 2^31 -.word 11667751 // zeta^756 * 2^31 = 299353^756 * 2^31 = 26362414 * 2^31 -.word 107838681 // zeta^756 * f(q^(-1) mod 2^32) * 2^31 = 299353^756 * 375649793 * 2^31 -.word 63658409 // zeta^268 * 2^31 = 299353^268 * 2^31 = 26964245 * 2^31 -.word 1639006295 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 299353^268 * 375649793 * 2^31 -.word 34208059 // zeta^140 * 2^31 = 299353^140 * 2^31 = 26391350 * 2^31 -.word 4104541381 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 299353^140 * 375649793 * 2^31 -.word 59379717 // zeta^588 * 2^31 = 299353^588 * 2^31 = 6865022 * 2^31 -.word 2971804155 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 299353^588 * 375649793 * 2^31 -.word 50175319 // zeta^460 * 2^31 = 299353^460 * 2^31 = 18568730 * 2^31 -.word 4113287337 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 299353^460 * 375649793 * 2^31 -.word 26647619 // zeta^ 44 * 2^31 = 299353^ 44 * 2^31 = 27932647 * 2^31 -.word 341311933 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 44 * 375649793 * 2^31 -.word 39812781 // zeta^684 * 2^31 = 299353^684 * 2^31 = 9249292 * 2^31 -.word 1593787219 // zeta^684 * f(q^(-1) mod 2^32) * 2^31 = 299353^684 * 375649793 * 2^31 -.word 44616177 // zeta^364 * 2^31 = 299353^364 * 2^31 = 5739597 * 2^31 -.word 4246118415 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 299353^364 * 375649793 * 2^31 -.word 33175099 // zeta^236 * 2^31 = 299353^236 * 2^31 = 10003728 * 2^31 -.word 2199394245 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 299353^236 * 375649793 * 2^31 -.word 60196139 // zeta^540 * 2^31 = 299353^540 * 2^31 = 25309194 * 2^31 -.word 651241685 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 299353^540 * 375649793 * 2^31 -.word 35386701 // zeta^412 * 2^31 = 299353^412 * 2^31 = 30439269 * 2^31 -.word 2774863027 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 299353^412 * 375649793 * 2^31 -.word 66934621 // 
zeta^ 92 * 2^31 = 299353^ 92 * 2^31 = 31954666 * 2^31 -.word 3273149091 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 92 * 375649793 * 2^31 -.word 58485025 // zeta^732 * 2^31 = 299353^732 * 2^31 = 5086187 * 2^31 -.word 4055556319 // zeta^732 * f(q^(-1) mod 2^32) * 2^31 = 299353^732 * 375649793 * 2^31 -.word 13808087 // zeta^316 * 2^31 = 299353^316 * 2^31 = 17429125 * 2^31 -.word 1210299433 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 299353^316 * 375649793 * 2^31 -.word 64372243 // zeta^188 * 2^31 = 299353^188 * 2^31 = 22872479 * 2^31 -.word 2032828397 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 299353^188 * 375649793 * 2^31 -.word 58308837 // zeta^636 * 2^31 = 299353^636 * 2^31 = 25085703 * 2^31 -.word 3595254043 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 299353^636 * 375649793 * 2^31 -.word 359507 // zeta^508 * 2^31 = 299353^508 * 2^31 = 661028 * 2^31 -.word 2221523373 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 299353^508 * 375649793 * 2^31 -.word 25789323 // zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 156101237 // zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 44545503 // zeta^384 * 2^31 = 299353^384 * 2^31 = 33556992 * 2^31 -.word 1431765025 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 299353^384 * 375649793 * 2^31 -.word 4602397 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 2360010723 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 37668667 // zeta^704 * 2^31 = 299353^704 * 2^31 = 31543752 * 2^31 -.word 2357036741 // zeta^704 * f(q^(-1) mod 2^32) * 2^31 = 299353^704 * 375649793 * 2^31 -.word 14580883 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 1687372141 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 64513949 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 2027928163 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 46477221 // zeta^608 * 2^31 = 299353^608 * 2^31 = 9045021 * 2^31 
-.word 3869459035 // zeta^608 * f(q^(-1) mod 2^32) * 2^31 = 299353^608 * 375649793 * 2^31 -.word 54248153 // zeta^480 * 2^31 = 299353^480 * 2^31 = 18977417 * 2^31 -.word 2409412391 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 299353^480 * 375649793 * 2^31 -.word 6639941 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 1205035195 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31 -.word 43184593 // zeta^656 * 2^31 = 299353^656 * 2^31 = 30845592 * 2^31 -.word 1329281071 // zeta^656 * f(q^(-1) mod 2^32) * 2^31 = 299353^656 * 375649793 * 2^31 -.word 37727301 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 -.word 1099367867 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 59223821 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 3392852723 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 53470007 // zeta^560 * 2^31 = 299353^560 * 2^31 = 994165 * 2^31 -.word 1614037705 // zeta^560 * f(q^(-1) mod 2^32) * 2^31 = 299353^560 * 375649793 * 2^31 -.word 44395575 // zeta^432 * 2^31 = 299353^432 * 2^31 = 18811302 * 2^31 -.word 1947777481 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 299353^432 * 375649793 * 2^31 -.word 62720085 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3479325099 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 8067061 // zeta^752 * 2^31 = 299353^752 * 2^31 = 403828 * 2^31 -.word 1690335755 // zeta^752 * f(q^(-1) mod 2^32) * 2^31 = 299353^752 * 375649793 * 2^31 -.word 62693363 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31 -.word 4254522893 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31 -.word 54053717 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31 -.word 2038728363 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31 -.word 4872359 // zeta^584 * 2^31 = 299353^584 * 2^31 = 26445100 * 2^31 -.word 3958386009 // zeta^584 * f(q^(-1) mod 2^32) 
* 2^31 = 299353^584 * 375649793 * 2^31
-.word 45729335 // zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31
-.word 394813385 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 299353^456 * 375649793 * 2^31
-.word 10774319 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31
-.word 319254225 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31
-.word 29187105 // zeta^680 * 2^31 = 299353^680 * 2^31 = 5756199 * 2^31
-.word 2432167391 // zeta^680 * f(q^(-1) mod 2^32) * 2^31 = 299353^680 * 375649793 * 2^31
-.word 60482679 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 2429927817 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 32724795 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31
-.word 1516828357 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31
-.word 6478875 // zeta^536 * 2^31 = 299353^536 * 2^31 = 135177 * 2^31
-.word 2410295781 // zeta^536 * f(q^(-1) mod 2^32) * 2^31 = 299353^536 * 375649793 * 2^31
-.word 61576241 // zeta^408 * 2^31 = 299353^408 * 2^31 = 12267508 * 2^31
-.word 2532988367 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 299353^408 * 375649793 * 2^31
-.word 718761 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31
-.word 698196567 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31
-.word 25844803 // zeta^728 * 2^31 = 299353^728 * 2^31 = 6580323 * 2^31
-.word 2581873085 // zeta^728 * f(q^(-1) mod 2^32) * 2^31 = 299353^728 * 375649793 * 2^31
-.word 56370853 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1337084763 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 19294303 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31
-.word 2412858785 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31
-.word 39988223 // zeta^632 * 2^31 = 299353^632 * 2^31 = 21146062 * 2^31
-.word 3115960833 // zeta^632 * f(q^(-1) mod 2^32) * 2^31 = 299353^632 * 375649793 * 2^31
-.word 50390901 // zeta^504 * 2^31 = 299353^504 * 2^31 = 17352831 * 2^31
-.word 2387978891 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 299353^504 * 375649793 * 2^31
-.word 20926989 // zeta^ 4 * 2^31 = 299353^ 4 * 2^31 = 27792935 * 2^31
-.word 1289826291 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 4 * 375649793 * 2^31
-.word 37649571 // zeta^644 * 2^31 = 299353^644 * 2^31 = 3967426 * 2^31
-.word 3170099549 // zeta^644 * f(q^(-1) mod 2^32) * 2^31 = 299353^644 * 375649793 * 2^31
-.word 66207891 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31
-.word 2535947629 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31
-.word 63777015 // zeta^196 * 2^31 = 299353^196 * 2^31 = 14985834 * 2^31
-.word 3310386441 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 299353^196 * 375649793 * 2^31
-.word 23902605 // zeta^548 * 2^31 = 299353^548 * 2^31 = 7312429 * 2^31
-.word 2763966579 // zeta^548 * f(q^(-1) mod 2^32) * 2^31 = 299353^548 * 375649793 * 2^31
-.word 1189569 // zeta^420 * 2^31 = 299353^420 * 2^31 = 14833295 * 2^31
-.word 4228735807 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 299353^420 * 375649793 * 2^31
-.word 36835001 // zeta^100 * 2^31 = 299353^100 * 2^31 = 27701331 * 2^31
-.word 1276980039 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 299353^100 * 375649793 * 2^31
-.word 60720691 // zeta^740 * 2^31 = 299353^740 * 2^31 = 25562521 * 2^31
-.word 1605597133 // zeta^740 * f(q^(-1) mod 2^32) * 2^31 = 299353^740 * 375649793 * 2^31
-.word 6551875 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31
-.word 3341591741 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31
-.word 27193531 // zeta^148 * 2^31 = 299353^148 * 2^31 = 18248795 * 2^31
-.word 414997829 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 299353^148 * 375649793 * 2^31
-.word 54796407 // zeta^596 * 2^31 = 299353^596 * 2^31 = 12078147 * 2^31
-.word 3179578761 // zeta^596 * f(q^(-1) mod 2^32) * 2^31 = 299353^596 * 375649793 * 2^31
-.word 38402233 // zeta^468 * 2^31 = 299353^468 * 2^31 = 19648405 * 2^31
-.word 3542696263 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 299353^468 * 375649793 * 2^31
-.word 54036887 // zeta^ 52 * 2^31 = 299353^ 52 * 2^31 = 12887930 * 2^31
-.word 2032884841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 52 * 375649793 * 2^31
-.word 34636981 // zeta^692 * 2^31 = 299353^692 * 2^31 = 15189991 * 2^31
-.word 1047170379 // zeta^692 * f(q^(-1) mod 2^32) * 2^31 = 299353^692 * 375649793 * 2^31
-.word 55446235 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31
-.word 4187128613 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31
-.word 17873659 // zeta^244 * 2^31 = 299353^244 * 2^31 = 20924057 * 2^31
-.word 170988293 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 299353^244 * 375649793 * 2^31
-.word 32905927 // zeta^524 * 2^31 = 299353^524 * 2^31 = 7165643 * 2^31
-.word 190425913 // zeta^524 * f(q^(-1) mod 2^32) * 2^31 = 299353^524 * 375649793 * 2^31
-.word 63007343 // zeta^396 * 2^31 = 299353^396 * 2^31 = 572895 * 2^31
-.word 1829432209 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 299353^396 * 375649793 * 2^31
-.word 16938667 // zeta^ 76 * 2^31 = 299353^ 76 * 2^31 = 14988263 * 2^31
-.word 181679957 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 76 * 375649793 * 2^31
-.word 42761391 // zeta^716 * 2^31 = 299353^716 * 2^31 = 21853285 * 2^31
-.word 3153484113 // zeta^716 * f(q^(-1) mod 2^32) * 2^31 = 299353^716 * 375649793 * 2^31
-.word 27301205 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31
-.word 2701180075 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31
-.word 20391831 // zeta^172 * 2^31 = 299353^172 * 2^31 = 18683355 * 2^31
-.word 3042492009 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 299353^172 * 375649793 * 2^31
-.word 33938887 // zeta^620 * 2^31 = 299353^620 * 2^31 = 23553265 * 2^31
-.word 2095573049 // zeta^620 * f(q^(-1) mod 2^32) * 2^31 = 299353^620 * 375649793 * 2^31
-.word 44998071 // zeta^492 * 2^31 = 299353^492 * 2^31 = 29292862 * 2^31
-.word 2046724169 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 299353^492 * 375649793 * 2^31
-.word 31727285 // zeta^ 28 * 2^31 = 299353^ 28 * 2^31 = 3117724 * 2^31
-.word 1520104267 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 28 * 375649793 * 2^31
-.word 58366431 // zeta^668 * 2^31 = 299353^668 * 2^31 = 28426918 * 2^31
-.word 2171345953 // zeta^668 * f(q^(-1) mod 2^32) * 2^31 = 299353^668 * 375649793 * 2^31
-.word 8628961 // zeta^348 * 2^31 = 299353^348 * 2^31 = 28470806 * 2^31
-.word 239410975 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31
-.word 42006589 // zeta^220 * 2^31 = 299353^220 * 2^31 = 26868479 * 2^31
-.word 3512560067 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 299353^220 * 375649793 * 2^31
-.word 2741743 // zeta^572 * 2^31 = 299353^572 * 2^31 = 10684514 * 2^31
-.word 2262138897 // zeta^572 * f(q^(-1) mod 2^32) * 2^31 = 299353^572 * 375649793 * 2^31
-.word 50106823 // zeta^444 * 2^31 = 299353^444 * 2^31 = 28113639 * 2^31
-.word 3472438329 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 299353^444 * 375649793 * 2^31
-.word 66754479 // zeta^124 * 2^31 = 299353^124 * 2^31 = 32895965 * 2^31
-.word 2073443921 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 299353^124 * 375649793 * 2^31
-.word 24392337 // zeta^764 * 2^31 = 299353^764 * 2^31 = 24424675 * 2^31
-.word 1373730671 // zeta^764 * f(q^(-1) mod 2^32) * 2^31 = 299353^764 * 375649793 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_768_u32_33556993_299353_incomplete_good_double_scale
-ntt_768_u32_33556993_299353_incomplete_good_double_scale: // Constants for scaling by 1/N
-.word 22568483 // 1/192
-.word 2863202269 // 1/192 twisted
-.data
-roots:
-.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31
-.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 29095681 // zeta^576 * 2^31 = 299353^576 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^576 * f(q^(-1) mod 2^32) * 2^31 = 299353^576 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 18598075 // zeta^528 * 2^31 = 299353^528 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^528 * f(q^(-1) mod 2^32) * 2^31 = 299353^528 * 375649793 * 2^31
-.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31
-.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31
-.word 48811299 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31
-.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31
-.word 40500013 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31
-.word 34427601 // zeta^624 * 2^31 = 299353^624 * 2^31 = 13512548 * 2^31
-.word 864737071 // zeta^624 * f(q^(-1) mod 2^32) * 2^31 = 299353^624 * 375649793 * 2^31
-.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31
-.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31
-.word 35394733 // zeta^516 * 2^31 = 299353^516 * 2^31 = 9731484 * 2^31
-.word 622767443 // zeta^516 * f(q^(-1) mod 2^32) * 2^31 = 299353^516 * 375649793 * 2^31
-.word 12271567 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31
-.word 2565264945 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31
-.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31
-.word 65797823 // zeta^ 36 * 2^31 = 299353^ 36 * 2^31 = 18723698 * 2^31
-.word 1198225217 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31
-.word 56722355 // zeta^612 * 2^31 = 299353^612 * 2^31 = 31418183 * 2^31
-.word 4158093901 // zeta^612 * f(q^(-1) mod 2^32) * 2^31 = 299353^612 * 375649793 * 2^31
-.word 48811299 // zeta^552 * 2^31 = 299353^552 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^552 * f(q^(-1) mod 2^32) * 2^31 = 299353^552 * 375649793 * 2^31
-.word 12778219 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31
-.word 1732129557 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31
-.word 21111903 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31
-.word 890081697 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31
-.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31
-.word 31380141 // zeta^564 * 2^31 = 299353^564 * 2^31 = 2302061 * 2^31
-.word 2294804307 // zeta^564 * f(q^(-1) mod 2^32) * 2^31 = 299353^564 * 375649793 * 2^31
-.word 6014597 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31
-.word 2607901563 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31
-.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31
-.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31
-.word 40872659 // zeta^ 12 * 2^31 = 299353^ 12 * 2^31 = 32984098 * 2^31
-.word 2110821165 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31
-.word 62080381 // zeta^588 * 2^31 = 299353^588 * 2^31 = 6865022 * 2^31
-.word 439327875 // zeta^588 * f(q^(-1) mod 2^32) * 2^31 = 299353^588 * 375649793 * 2^31
-.word 40500013 // zeta^600 * 2^31 = 299353^600 * 2^31 = 9914896 * 2^31
-.word 634504915 // zeta^600 * f(q^(-1) mod 2^32) * 2^31 = 299353^600 * 375649793 * 2^31
-.word 58797193 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31
-.word 3703057783 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31
-.word 50479773 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31
-.word 2420367203 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31
-.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31
-.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31
-.word 3784291 // zeta^540 * 2^31 = 299353^540 * 2^31 = 25309194 * 2^31
-.word 1619664797 // zeta^540 * f(q^(-1) mod 2^32) * 2^31 = 299353^540 * 375649793 * 2^31
-.word 57130935 // zeta^348 * 2^31 = 299353^348 * 2^31 = 28470806 * 2^31
-.word 1821992521 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31
-.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31
-.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31
-.word 59392861 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31
-.word 348348067 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31
-.word 57730785 // zeta^636 * 2^31 = 299353^636 * 2^31 = 25085703 * 2^31
-.word 3752846111 // zeta^636 * f(q^(-1) mod 2^32) * 2^31 = 299353^636 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_768_u32_33556993_299353_incomplete_good_double, %function
-.global ntt_768_u32_33556993_299353_incomplete_good_double
-ntt_768_u32_33556993_299353_incomplete_good_double:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-// Use r11 as marker for r0 + 3024
-add r11, r12, #1008
-.equ modulus, 33556993
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[512]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 8)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r8
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vmul.u32 Q4, Q2, r7
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmlah.s32 Q3, Q4, r10
-vsub.s32 Q4, Q0, Q1
-// Release input[512] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r12,#(32)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vqrdmulh.s32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vmul.u32 Q3, Q0, r7
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmlah.s32 Q2, Q3, r10
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[520]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 16)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
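The `.word` annotations in the deleted twiddle tables all follow one pattern: each pair stores zeta^k (and a Montgomery-twisted companion), both scaled by 2^31, for the modulus and root named in the file's symbols. As a sketch (not part of the patch) of how these constants can be recomputed, assuming q = 33556993 (the modulus set via `.equ` above) and zeta = 299353 (a 768th root of unity mod q, per the `ntt_768_u32_33556993_299353` naming):

```python
# Recompute the plain twiddle values spelled out in the ".word ... // zeta^k ..." comments.
# Assumptions (taken from the deleted file, not re-derived here):
#   q    = 33556993  -- matches ".equ modulus, 33556993"
#   zeta = 299353    -- 768th root of unity mod q, per the symbol names
q = 33556993
zeta = 299353

def twiddle(k):
    """zeta^k mod q: the value each comment writes as '... = <value> * 2^31'."""
    return pow(zeta, k, q)

# e.g. the comment "zeta^456 * 2^31 = 299353^456 * 2^31 = 18930340 * 2^31":
assert twiddle(456) == 18930340
# The stored .word pairs are this value and its Montgomery-twisted companion,
# both scaled by 2^31; the exact twisting is not reproduced in this sketch.
```

The `.word` integers themselves additionally fold in the 2^31 scaling and the f(q^(-1) mod 2^32) twist used by the vqrdmulh/vqrdmlah multiplication sequence, which this sketch deliberately leaves out.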
-vadd.s32 Q4, Q4, Q1
-// Release input[516] from Q1
-vstrw.u32 Q4, [r12,#(48)]
-// input[520]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[520] from Q5
-vmul.u32 Q2, Q0, r7
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[524]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[524]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r7
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[524] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vmul.u32 Q2, Q0, r7
-// input[528]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[532]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[528] from Q4
-vstrw.u32 Q3, [r12,#(96)]
-// input[532]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[532] from Q5
-vmul.u32 Q2, Q0, r7
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[536]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[536]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r7
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[536] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vmul.u32 Q2, Q0, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[544]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[540] from Q4
-vstrw.u32 Q3, [r12,#(144)]
-// input[544]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[544] from Q5
-vmul.u32 Q2, Q0, r7
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[548]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[548]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r7
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[548] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vmul.u32 Q2, Q0, r7
-// input[552]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[556]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[552] from Q4
-vstrw.u32 Q3, [r12,#(192)]
-// input[556]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[556] from Q5
-vmul.u32 Q2, Q0, r7
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[560]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[560]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vmul.u32 Q2, Q0, r7
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[560] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r7
-// input[564]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[568]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[564] from Q4
-vstrw.u32 Q3, [r12,#(240)]
-// input[568]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[568] from Q5
-vmul.u32 Q2, Q0, r7
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[572]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[572]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vmul.u32 Q2, Q0, r7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[572] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vmul.u32 Q2, Q0, r7
-// input[576]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[580]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[576] from Q4
-vstrw.u32 Q3, [r12,#(288)]
-// input[580]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[580] from Q5
-vmul.u32 Q2, Q0, r7
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[584]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[584]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vmul.u32 Q2, Q0, r7
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[584] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r7
-// input[588]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[592]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[588] from Q4
-vstrw.u32 Q3, [r12,#(336)]
-// input[592]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[592] from Q5
-vmul.u32 Q2, Q0, r7
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[596]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[596]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r7
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[596] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r7
-// input[600]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[604]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[600] from Q4
-vstrw.u32 Q3, [r12,#(384)]
-// input[604]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[604] from Q5
-vmul.u32 Q2, Q0, r7
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[608]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[608]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vmul.u32 Q2, Q0, r7
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[608] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r7
-// input[612]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[616]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[612] from Q4
-vstrw.u32 Q3, [r12,#(432)]
-// input[616]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[616] from Q5
-vmul.u32 Q2, Q0, r7
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[620]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[620]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r7
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[620] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r7
-// input[624]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[628]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[624] from Q4
-vstrw.u32 Q3, [r12,#(480)]
-// input[628]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[628] from Q5
-vmul.u32 Q2, Q0, r7
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[632]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r12,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[632]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vmul.u32 Q2, Q0, r7
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[632] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r11,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vmul.u32 Q2, Q0, r7
-// input[636]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-// input[640]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[128]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[636] from Q4
-vstrw.u32 Q3, [r11,#(-480)]
-// input[640]: Already loaded as Q5
-// input[128]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r8
-vadd.s32 Q3, Q5, Q7
-// Release input[640] from Q5
-vmul.u32 Q2, Q0, r7
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vqrdmlah.s32 Q1, Q2, r10
-vsub.s32 Q2, Q4, Q7
-// Release input[128] from Q7
-// input[388]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -116)]
-vadd.s32 Q6, Q2, Q1
-// input[644]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r11,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-496)]
-vadd.s32
Q3, Q3, Q4 -// Release input[384] from Q4 -vstrw.u32 Q3, [r12,#(-480)] -// input[388]: Already loaded as Q5 -// input[644]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[388] from Q5 -vmul.u32 Q2, Q0, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[644] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[392]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[392]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r7 -// input[648]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[392] from Q7 -// input[652]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[648] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[652]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[652] from Q5 -vmul.u32 Q2, Q0, r7 -// input[396]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[400]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[656]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release 
input[396] from Q4 -vstrw.u32 Q3, [r12,#(-432)] -// input[400]: Already loaded as Q5 -// input[656]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[400] from Q5 -vmul.u32 Q2, Q0, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[656] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[404]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[404]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[404] from Q7 -// input[664]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[660] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[664]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[664] from Q5 -vmul.u32 Q2, Q0, r7 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[412]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[668]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[408] from Q4 -vstrw.u32 Q3, 
[r12,#(-384)] -// input[412]: Already loaded as Q5 -// input[668]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[412] from Q5 -vmul.u32 Q2, Q0, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[668] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[416]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[416]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r7 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[416] from Q7 -// input[676]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[672] from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[676]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[676] from Q5 -vmul.u32 Q2, Q0, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[424]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[680]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[420] from Q4 -vstrw.u32 Q3, [r12,#(-336)] -// input[424]: Already 
loaded as Q5 -// input[680]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[424] from Q5 -vmul.u32 Q2, Q0, r7 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[680] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[428]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[428]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r7 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[428] from Q7 -// input[688]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[684] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[688]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[688] from Q5 -vmul.u32 Q2, Q0, r7 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[436]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[692]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[432] from Q4 -vstrw.u32 Q3, [r12,#(-288)] -// input[436]: Already loaded as Q5 -// input[692]: Already 
loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[436] from Q5 -vmul.u32 Q2, Q0, r7 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[692] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[440]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[440]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r7 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[440] from Q7 -// input[700]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[696] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[700]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[700] from Q5 -vmul.u32 Q2, Q0, r7 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[448]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[704]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[444] from Q4 -vstrw.u32 Q3, [r12,#(-240)] -// input[448]: Already loaded as Q5 -// input[704]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 
-vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[448] from Q5 -vmul.u32 Q2, Q0, r7 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[704] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[452]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[452]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r7 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[452] from Q7 -// input[712]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[708] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[712]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[712] from Q5 -vmul.u32 Q2, Q0, r7 -// input[456]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[460]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[716]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[456] from Q4 -vstrw.u32 Q3, [r12,#(-192)] -// input[460]: Already loaded as Q5 -// input[716]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, 
Q5, Q7 -// Release input[460] from Q5 -vmul.u32 Q2, Q0, r7 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[716] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[464]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[464]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r7 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[464] from Q7 -// input[724]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[720] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[724]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[724] from Q5 -vmul.u32 Q2, Q0, r7 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[472]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[728]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[468] from Q4 -vstrw.u32 Q3, [r12,#(-144)] -// input[472]: Already loaded as Q5 -// input[728]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[472] from Q5 
-vmul.u32 Q2, Q0, r7 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[728] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[476]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[476]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r7 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[476] from Q7 -// input[736]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[732] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// input[736]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[736] from Q5 -vmul.u32 Q2, Q0, r7 -// input[480]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[484]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[740]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[480] from Q4 -vstrw.u32 Q3, [r12,#(-96)] -// input[484]: Already loaded as Q5 -// input[740]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[484] from Q5 -vmul.u32 Q2, Q0, r7 -// input[228]: Load 
as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[740] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[488]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[488]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r7 -// input[744]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[488] from Q7 -// input[748]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[744] from Q4 -vstrw.u32 Q3, [r11,#(-48)] -// input[748]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[748] from Q5 -vmul.u32 Q2, Q0, r7 -// input[492]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[496]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[752]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[492] from Q4 -vstrw.u32 Q3, [r12,#(-48)] -// input[496]: Already loaded as Q5 -// input[752]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[496] from Q5 -vmul.u32 Q2, Q0, r7 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 
Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[752] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[500]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r12,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[500]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r7 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[500] from Q7 -// input[760]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[756] from Q4 -vstrw.u32 Q3, [r11,#(0)] -// input[760]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[760] from Q5 -vmul.u32 Q2, Q0, r7 -// input[504]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[508]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[764]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[504] from Q4 -vstrw.u32 Q3, [r12,#(0)] -// input[508]: Already loaded as Q5 -// input[764]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r8 -vadd.s32 Q3, Q5, Q7 -// Release input[508] from Q5 -vmul.u32 Q2, Q0, r7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r10 -vsub.s32 Q2, Q4, Q7 -// Release input[764] from Q7 
-vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r12,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r4 -// input[384]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[192] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[708]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[384] from Q4 -vmul.u32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// input[324]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r12,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[576] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r14,#(-240)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r12,#(-480)] -// input[324]: Already loaded as Q7 -// input[708]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r4 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vadd.s32 Q7, Q7, Q6 -// Release input[708] from Q6 -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[456]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -48)] -vadd.s32 Q3, Q3, Q2 -// Release input[132] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(288)] -vadd.s32 Q3, Q3, Q7 -// Release input[324] from Q7 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[72]: Already loaded as Q6 -// input[456]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, 
r4 -// input[648]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vadd.s32 Q6, Q6, Q5 -// Release input[456] from Q5 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[204]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -48)] -vadd.s32 Q3, Q3, Q2 -// Release input[648] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[588]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q6 -// Release input[72] from Q6 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-432)] -// input[588]: Already loaded as Q7 -// input[204]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[396]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -108)] -vadd.s32 Q7, Q7, Q5 -// Release input[204] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[720]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -36)] -vadd.s32 Q3, Q3, Q2 -// Release input[396] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[336]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[588] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-432)] -// input[336]: Already loaded as Q6 -// input[720]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vadd.s32 Q6, Q6, Q5 -// Release input[720] from Q5 -// input[528]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[468]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -36)] -vadd.s32 Q3, Q3, Q2 -// Release input[144] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, 
Q0, r10 -vstrw.u32 Q2, [r14,#(336)] -vadd.s32 Q3, Q3, Q6 -// Release input[336] from Q6 -vstrw.u32 Q3, [r12,#(96)] -// Release input[528] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-144)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-432)] -// input[84]: Already loaded as Q7 -// input[468]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[660]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vadd.s32 Q7, Q7, Q5 -// Release input[468] from Q5 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[216]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -36)] -vadd.s32 Q3, Q3, Q2 -// Release input[660] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// input[600]: Load as Q6 -vldrw.u32 Q6, [r12, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-144)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-384)] -// input[600]: Already loaded as Q6 -// input[216]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[408]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -96)] -vadd.s32 Q6, Q6, Q5 -// Release input[216] from Q5 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[732]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -24)] -vadd.s32 Q3, Q3, Q2 -// Release input[408] from Q2 -vmul.u32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// input[348]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(384)] -vadd.s32 Q3, Q3, Q6 -// Release input[600] from Q6 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-144)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r12,#(-384)] -// input[348]: Already loaded as Q7 -// input[732]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r4 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] 
-vadd.s32 Q7, Q7, Q5
-// Release input[732] from Q5
-// input[540]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[480]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[156] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[96]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(384)]
-vadd.s32 Q3, Q3, Q7
-// Release input[348] from Q7
-vstrw.u32 Q3, [r12,#(144)]
-// Release input[540] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-384)]
-// input[96]: Already loaded as Q6
-// input[480]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[672]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[480] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[228]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[672] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[612]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q3, Q3, Q6
-// Release input[96] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-336)]
-// input[612]: Already loaded as Q7
-// input[228]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[228] from Q5
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[744]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[420] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[360]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[612] from Q7
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-336)]
-// input[360]: Already loaded as Q6
-// input[744]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[744] from Q5
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[492]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[168] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[108]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q6
-// Release input[360] from Q6
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-336)]
-// input[108]: Already loaded as Q7
-// input[492]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[684]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -72)]
-vadd.s32 Q7, Q7, Q5
-// Release input[492] from Q5
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[684] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[624]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[108] from Q7
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-288)]
-// input[624]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[756]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[432] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[372]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(480)]
-vadd.s32 Q3, Q3, Q6
-// Release input[624] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-288)]
-// input[372]: Already loaded as Q7
-// input[756]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q7, Q7, Q5
-// Release input[756] from Q5
-// input[564]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 60)]
-vsub.s32 Q4, Q3, Q2
-// input[504]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[120]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[372] from Q7
-vstrw.u32 Q3, [r12,#(240)]
-// Release input[564] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q6
-// input[504]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[696]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -60)]
-vadd.s32 Q6, Q6, Q5
-// Release input[504] from Q5
-// input[312]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 60)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[696] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[636]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -120)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q6
-// Release input[120] from Q6
-vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-240)]
-// input[636]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[444]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -60)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vsub.s32 Q4, Q3, Q2
-// input[448]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -56)]
-vadd.s32 Q3, Q3, Q2
-// Release input[444] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[636] from Q7
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-240)]
-// input[64]: Already loaded as Q6
-// input[448]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[640]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -116)]
-vadd.s32 Q6, Q6, Q5
-// Release input[448] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q3, Q3, Q2
-// Release input[640] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[580]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(256)]
-vadd.s32 Q3, Q3, Q6
-// Release input[64] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-464)]
-// input[580]: Already loaded as Q7
-// input[196]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[196] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[712]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[388] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[328]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[580] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-224)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-464)]
-// input[328]: Already loaded as Q6
-// input[712]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vadd.s32 Q6, Q6, Q5
-// Release input[712] from Q5
-// input[520]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[460]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[136] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[328] from Q6
-vstrw.u32 Q3, [r12,#(64)]
-// Release input[520] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-464)]
-// input[76]: Already loaded as Q7
-// input[460]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[652]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[460] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[652] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[592]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[76] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-416)]
-// input[592]: Already loaded as Q6
-// input[208]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[400]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[208] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[724]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[400] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[340]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[592] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-416)]
-// input[340]: Already loaded as Q7
-// input[724]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[724] from Q5
-// input[532]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[472]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[340] from Q7
-vstrw.u32 Q3, [r12,#(112)]
-// Release input[532] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q6
-// input[472]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[664]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[472] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[664] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[604]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[88] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-368)]
-// input[604]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[736]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[412] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[352]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(400)]
-vadd.s32 Q3, Q3, Q7
-// Release input[604] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-368)]
-// input[352]: Already loaded as Q6
-// input[736]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[736] from Q5
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vsub.s32 Q4, Q3, Q2
-// input[484]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[160] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[100]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[352] from Q6
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-368)]
-// input[100]: Already loaded as Q7
-// input[484]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[676]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[484] from Q5
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[676] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[616]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(400)]
-vadd.s32 Q3, Q3, Q7
-// Release input[100] from Q7
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-320)]
-// input[616]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[424]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vsub.s32 Q4, Q3, Q2
-// input[748]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[424] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[364]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(448)]
-vadd.s32 Q3, Q3, Q6
-// Release input[616] from Q6
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-320)]
-// input[364]: Already loaded as Q7
-// input[748]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[748] from Q5
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[496]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[172] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[112]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(448)]
-vadd.s32 Q3, Q3, Q7
-// Release input[364] from Q7
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-320)]
-// input[112]: Already loaded as Q6
-// input[496]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[688]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[496] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[688] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[628]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(448)]
-vadd.s32 Q3, Q3, Q6
-// Release input[112] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-272)]
-// input[628]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[760]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[436] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[376]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[628] from Q7
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-272)]
-// input[376]: Already loaded as Q6
-// input[760]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[760] from Q5
-// input[568]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[508]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * 4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[184] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[124]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q6
-// Release input[376] from Q6
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-272)]
-// input[124]: Already loaded as Q7
-// input[508]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[700]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[508] from Q5
-// input[316]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[704]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -52)]
-vadd.s32 Q3, Q3, Q2
-// Release input[700] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[320]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 68)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[124] from Q7
-vstrw.u32 Q3, [r14,#(256)]
-// Release input[316] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// input[320]: Already loaded as Q6
-// input[704]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vadd.s32 Q6, Q6, Q5
-// Release input[704] from Q5
-// input[512]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[452]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -52)]
-vadd.s32 Q3, Q3, Q2
-// Release input[128] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[320] from Q6
-vstrw.u32 Q3, [r12,#(32)]
-// Release input[512] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-496)]
-// input[68]: Already loaded as Q7
-// input[452]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[644]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -112)]
-vadd.s32 Q7, Q7, Q5
-// Release input[452] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[200]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -52)]
-vadd.s32 Q3, Q3, Q2
-// Release input[644] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[584]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q7
-// Release input[68] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-448)]
-// input[584]: Already loaded as Q6
-// input[200]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[392]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -112)]
-vadd.s32 Q6, Q6, Q5
-// Release input[200] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[716]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[392] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release input[584] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-208)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-448)]
-// input[332]: Already loaded as Q7
-// input[716]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vadd.s32 Q7, Q7, Q5
-// Release input[716] from Q5
-// input[524]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[464]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[140] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[332] from Q7
-vstrw.u32 Q3, [r12,#(80)]
-// Release input[524] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-448)]
-// input[80]: Already loaded as Q6
-// input[464]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[656]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vadd.s32 Q6, Q6, Q5
-// Release input[464] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[212]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -40)]
-vadd.s32 Q3, Q3, Q2
-// Release input[656] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[596]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q6
-// Release input[80] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-400)]
-// input[596]: Already loaded as Q7
-// input[212]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[212] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[728]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[404] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[344]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[596] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-160)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-400)]
-// input[344]: Already loaded as Q6
-// input[728]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vadd.s32 Q6, Q6, Q5
-// Release input[728] from Q5
-// input[536]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[476]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[152] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[344] from Q6
-vstrw.u32 Q3, [r12,#(128)]
-// Release input[536] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-400)]
-// input[92]: Already loaded as Q7
-// input[476]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[668]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[476] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[668] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[608]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[92] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-352)]
-// input[608]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[416]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[740]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[416] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(416)]
-vadd.s32 Q3, Q3, Q6
-// Release input[608] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-352)]
-// input[356]: Already loaded as Q7
-// input[740]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[740] from Q5
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vsub.s32 Q4, Q3, Q2
-// input[488]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[104]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[356] from Q7
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q6
-// input[488]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[680]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[488] from Q5
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[680] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[620]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q6
-// Release input[104] from Q6
-vstrw.u32 Q3, [r14,#(176)]
-// Release input[296] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-304)]
-// input[620]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[428]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vsub.s32 Q4, Q3, Q2
-// input[752]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[428] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[368]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(464)]
-vadd.s32 Q3, Q3, Q7
-// Release input[620] from Q7
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-304)]
-// input[368]: Already loaded as Q6
-// input[752]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[752] from Q5
-// input[560]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[500]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[176] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[368] from Q6
-vstrw.u32 Q3, [r12,#(224)]
-// Release input[560] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-304)]
-// input[116]: Already loaded as Q7
-// input[500]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -64)]
-vadd.s32 Q7, Q7, Q5
-// Release input[500] from Q5
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[692] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[632]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q7
-// Release input[116] from Q7
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-256)]
-// input[632]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vsub.s32 Q4, Q3, Q2
-// input[764]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[440] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q6
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r11,#(-496)]
-vadd.s32 Q3, Q3, Q6
-// Release input[632] from Q6
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r12,#(-256)]
-// input[380]: Already loaded as Q7
-// input[764]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r4
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vadd.s32 Q7, Q7, Q5
-// Release input[764] from Q5
-// input[572]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[188] from Q2
-vmul.u32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// input[528]: Load as Q6
-vldrw.u32 Q6, [r12, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[380] from Q7
-vstrw.u32 Q3, [r12,#(272)]
-// Release input[572] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-256)]
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r8
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vmul.u32 Q5, Q5, r7
-// input[528]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r10
-vqrdmulh.s32 Q2, Q1, r8
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r10
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r4
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r10
-// input[564]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 60)]
-vqrdmulh.s32 Q4, Q6, r6
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r5
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r10
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r12,#(96)]
-// Release input[528] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[564]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[516]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[312]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r12,#(240)]
-// Release input[564] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[312]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[552]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 48)]
-vmul.u32 Q0, Q0, r7
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r12,#(48)]
-// Release input[516] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(240)]
-// Release input[312] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(192)]
-// Release input[552] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[300]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 48)]
-vmul.u32 Q1, Q1, r7
-// input[540]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(192)]
-// Release input[300] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r12,#(144)]
-// Release input[540] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[304]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[544]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 40)]
-vmul.u32 Q2, Q2, r7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(160)]
-// Release input[544] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[292]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 40)]
-vmul.u32 Q0, Q0, r7
-// input[532]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[568]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r14,#(160)]
-// Release input[292] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r12,#(112)]
-// Release input[532] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[568]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q1, Q1, r7
-// input[280]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[520]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 16)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r12,#(256)]
-// Release input[568] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(112)]
-// Release input[280] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[316]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[556]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 52)]
-vmul.u32 Q2, Q2, r7
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r12,#(64)]
-// Release input[520] from Q0
-vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r10
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r4
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r10
-// input[560]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(208)]
-// Release input[556] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[560]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r8
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmul.u32 Q0, Q0, r7
-// input[272]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 20)]
-vqrdmlah.s32 Q2, Q0, r10
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q3, r8
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r10
-// input[512]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r4
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r3
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r10
-// input[308]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(224)]
-// Release input[560] from Q0
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(80)]
-// Release input[272] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[308]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r8
-// input[548]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 44)]
-vmul.u32 Q1, Q1, r7
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r12,#(32)]
-// Release input[512] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r10
-// input[260]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r10
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(224)]
-// Release input[308] from Q1
-vqrdmlah.s32 Q6, Q4, r10
-vstrw.u32 Q3, [r12,#(176)]
-// Release input[548] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r8
-// input[296]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 44)]
-vmul.u32 Q2, Q2, r7
-// input[536]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 32)]
-vqrdmlah.s32 Q1, Q2, r10
-vstrw.u32 Q0, [r14,#(32)]
-// Release
input[260] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(128)] -// Release input[536] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[572]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[524]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(272)] -// Release input[572] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(128)] -// Release input[284] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[624]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q1, Q1, r7 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(80)] -// Release input[524] 
from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(288)] -// Release input[576] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[120]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q0, Q0, r7 -// input[600]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 96)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, 
r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[636]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(480)] -// Release input[120] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(384)] -// Release input[600] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[636]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q1, Q1, r7 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[588]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 84)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-480)] -// Release input[636] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[112]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vmul.u32 Q2, Q2, r7 -// input[592]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 88)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[64]: Load 
as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[628]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(352)] -// Release input[592] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[628]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q0, Q0, r7 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 76)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(496)] -// Release input[628] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r7 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(304)] -// Release input[580] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 
Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(496)] -// Release input[376] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[364]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 112)] -vmul.u32 Q2, Q2, r7 -// input[604]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[368]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(448)] -// Release input[364] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(400)] -// Release input[604] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[368]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[608]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 104)] -vmul.u32 Q0, Q0, r7 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, 
r10 -// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(464)] -// Release input[368] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(416)] -// Release input[608] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmul.u32 Q1, Q1, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[68]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[632]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(464)] -// Release input[116] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[632]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r7 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r0,#(272)] -// Release input[68] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[584]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] 
-vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-496)] -// Release input[632] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vmul.u32 Q0, Q0, r7 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(320)] -// Release input[584] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[432]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(464)] -// Release input[620] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[432]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[672]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vmul.u32 Q1, Q1, r7 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[384]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 
* -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-288)] -// Release input[432] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release input[672] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[420]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-480)] -// Release input[384] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[696]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-336)] -// Release input[420] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[696]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r7 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[648]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[444]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, 
Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-240)] -// Release input[696] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[444]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[684]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vmul.u32 Q1, Q1, r7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-432)] -// Release input[648] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[396]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[688]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-240)] -// Release input[444] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-288)] -// Release input[684] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[688]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q2, Q2, r7 -// input[400]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-432)] -// Release input[396] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[640]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[436]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q2, [r11,#(-272)] -// Release input[688] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-416)] -// Release input[400] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[436]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[676]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -80)] -vmul.u32 Q0, Q0, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-464)] -// Release input[640] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-272)] -// Release input[436] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-320)] -// Release input[676] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(-464)] -// Release input[388] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[136]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release 
input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q2, Q2, r7 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-464)] -// Release input[136] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[652]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-368)] -// Release input[412] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[416]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -88)] -vmul.u32 Q0, Q0, r7 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-416)] -// Release input[652] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[692]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r10 
-vstrw.u32 Q3, [r12,#(-352)] -// Release input[416] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[692]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q1, Q1, r7 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[644]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[440]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-256)] -// Release input[692] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[440]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q2, Q2, r7 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-448)] -// Release input[644] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[392]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-256)] -// Release input[440] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-304)] -// Release 
input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[428]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-448)] -// Release input[392] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[240]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-304)] -// Release input[428] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[240]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[480]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -24)] -vmul.u32 Q1, Q1, r7 -// input[720]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -36)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-48)] -// Release input[240] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, 
[r12,#(-96)] -// Release input[480] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-144)] -// Release input[720] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[756]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q2, Q2, r7 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[708]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[504]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release input[756] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[504]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q0, Q0, r7 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-192)] -// Release input[708] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[456]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(0)] -// Release input[504] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q2, Q6 
-vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[492]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -12)] -vmul.u32 Q1, Q1, r7 -// input[732]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r12,#(-192)] -// Release input[456] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[496]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-48)] -// Release input[492] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-96)] -// Release input[732] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[496]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[736]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -20)] -vmul.u32 Q2, Q2, r7 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[448]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[244]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-32)] -// Release input[496] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 
-vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vmul.u32 Q0, Q0, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-224)] -// Release input[448] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-32)] -// Release input[244] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[760]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[472]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -32)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[712]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -44)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(16)] -// Release input[760] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-128)] -// Release input[472] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[508]: Already loaded as Q2 
-vqrdmulh.s32 Q1, Q2, r8 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmul.u32 Q2, Q2, r7 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-176)] -// Release input[712] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[460]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[752]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[464]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -40)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r12,#(-176)] -// Release input[460] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[500]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-16)] -// Release input[752] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-160)] -// Release input[464] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[500]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[740]: Load as Q3 
-vldrw.u32 Q3, [r11, #(4 * -16)] -vmul.u32 Q1, Q1, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r10 -// input[452]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r10 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-16)] -// Release input[500] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-64)] -// Release input[740] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r8 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q2, Q2, r7 -// input[728]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -28)] -vqrdmlah.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r12,#(-208)] -// Release input[452] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-112)] -// Release input[728] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[764]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// 
input[476]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -28)] -vqrdmlah.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r10 -// input[716]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r10 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(32)] -// Release input[764] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-112)] -// Release input[476] from Q4 -vadd.s32 Q2, Q2, Q6 -vstrw.u32 Q2, [r11,#(-160)] -// Release input[716] from Q2 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r8 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q1, Q1, r7 -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vqrdmlah.s32 Q0, Q1, r10 -// Use r14 as marker for r0 + 1024 -add r14, r0, #1024 -// Use r12 as marker for r0 + 2048 -add r12, r14, #1024 -// Use r11 as marker for r1 + 2048 -add r11, r1, #2048 -// Use r2 as marker for r1 + 4096 -add r2, r11, #2048 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q1, Q3, Q0 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q4, Q2, r10 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)]! -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q1, r10 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)]! 
-vqrdmulh.s32 Q6, Q3, r6 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(96)] -vqrdmulh.s32 Q7, Q1, r4 -vadd.s32 Q2, Q2, Q5 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(64)] -vqrdmlah.s32 Q7, Q1, r10 -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q3, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(112)] -vqrdmulh.s32 Q7, Q2, r4 -vsub.s32 Q3, Q0, Q6 -vmul.u32 Q2, Q2, r3 -vstrw.u32 Q3, [r2,#(32)] -vqrdmlah.s32 Q7, Q2, r10 -vstrw.u32 Q7, [r11,#(80)] -// Release input[264] from Q2 -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q0, [r1,#(0)]! -vqrdmlah.s32 Q7, Q3, r10 -vneg.s32 Q7, Q7 -// Release input[516] from Q3 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[0] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[520]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 8)]! -vmul.u32 Q4, Q4, r7 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[524]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r11,#(96)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r2,#(64)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[268] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(112)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(80)] -// Release input[520] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(0)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[4] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[256] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[524]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[512]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[540]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r2,#(96)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(64)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[524] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r2,#(112)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(80)] -// Release input[8] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(0)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[260] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r11,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[512] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[540]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)]! -vmul.u32 Q4, Q4, r7 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[528]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r2,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[540] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r2,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[24] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[276] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r11,#(176)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[528] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[28]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[532]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[284]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r11,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[28] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r2,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r11,#(208)] -// Release input[280] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[532] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[16] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[284]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[536]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)]! -vmul.u32 Q4, Q4, r7 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[272]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r11,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r2,#(64)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[284] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(80)] -// Release input[536] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[20] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[272] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[552]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)]! -vmul.u32 Q3, Q3, r7 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 16)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[556]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r11,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r2,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[300] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(208)] -// Release input[552] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(176)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[288] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[556]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)]! -vmul.u32 Q4, Q4, r7 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r2,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[556] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r2,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[40] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[292] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r11,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[544] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[548]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r11,#(64)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[44] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r2,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r11,#(80)] -// Release input[296] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[548] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[32] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[60]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)]! -vmul.u32 Q4, Q4, r7 -// input[564]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[316]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r11,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[60] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r2,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r11,#(208)] -// Release input[312] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[564] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(176)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[48] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[316]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[568]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[572]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r11,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r2,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[316] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(208)] -// Release input[568] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[304] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[572]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)]! -vmul.u32 Q4, Q4, r7 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[560]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[588]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r2,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(64)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[572] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r2,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(80)] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[308] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r11,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[560] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[588]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)]! -vmul.u32 Q3, Q3, r7 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[576]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r2,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[588] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r2,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[324] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r11,#(176)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[576] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[328]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)]! -vmul.u32 Q4, Q4, r7 -// input[580]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r11,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r2,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r11,#(208)] -// Release input[328] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[580] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[64] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[584]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[320]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r11,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r2,#(64)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[332] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(80)] -// Release input[584] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[320] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[600]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)]! -vmul.u32 Q4, Q4, r7 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 16)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[336]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r11,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r2,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[348] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(208)]
-// Release input[600] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[84] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[336] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[604]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[88]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[592]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r2,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[604] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[88] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[340] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[592] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[92]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[344]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[596]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r11,#(64)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[92] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(80)]
-// Release input[344] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[596] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[80] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[108]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[360]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]!
-vmul.u32 Q3, Q3, r7
-// input[612]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r11,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[108] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(208)]
-// Release input[360] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[612] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[96] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[364]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[616]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[352]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[620]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r11,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r2,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[364] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(208)]
-// Release input[616] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[100] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[352] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[620]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[104]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[608]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[636]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r2,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(64)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[620] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(80)]
-// Release input[104] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[356] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[608] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[636]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]!
-vmul.u32 Q4, Q4, r7
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[624]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r2,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[636] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[120] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[372] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[624] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[124]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[628]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[380]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r11,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[124] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(208)]
-// Release input[376] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[628] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[112] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[380]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[632]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[396]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r11,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r2,#(64)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[380] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(80)]
-// Release input[632] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[116] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[368] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[396]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[648]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]!
-vmul.u32 Q3, Q3, r7
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[384]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[652]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r11,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r2,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[396] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(208)]
-// Release input[648] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[132] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[384] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[652]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[388]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[640]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r2,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[652] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[388] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[640] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[140]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[392]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[644]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r11,#(64)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[140] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(80)]
-// Release input[392] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[644] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[128] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[156]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[408]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]!
-vmul.u32 Q4, Q4, r7
-// input[660]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[412]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r11,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[156] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(208)]
-// Release input[408] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[660] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[144] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[412]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[664]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[400]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[668]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r11,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r2,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[412] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(208)]
-// Release input[664] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[148] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[400] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[668]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[656]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r2,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(64)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[668] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(80)]
-// Release input[152] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[404] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[656] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[684]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]!
-vmul.u32 Q3, Q3, r7
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[672]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r2,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[684] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[168] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[420] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[672] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[172]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[424]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[676]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[160]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[428]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r11,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[172] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(208)]
-// Release input[424] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[676] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[160] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[428]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[680]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[416]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[444]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r11,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r2,#(64)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[428] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(80)]
-// Release input[680] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[164] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[416] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[444]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[696]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 16)]!
-vmul.u32 Q4, Q4, r7
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[432]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[700]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r11,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r2,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[444] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(208)]
-// Release input[696] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[180] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[432] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[700]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[688]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r2,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[700] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[184] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[436] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[688] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[188]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[440]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[692]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[204]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r1,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r11,#(64)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[188] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(80)]
-// Release input[440] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[692] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[176] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[204]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]!
-vmul.u32 Q3, Q3, r7
-// input[708]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[460]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r11,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[204] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(208)]
-// Release input[456] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[708] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[192] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[460]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[712]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[448]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[716]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r11,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r2,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[460] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r11,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r1,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r2,#(208)]
-// Release input[712] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r11,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[196] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r1,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r11,#(16)]
-// Release input[448] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[716]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[704]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[732]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r2,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r1,#(64)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[716] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(80)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[452] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[704] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[732]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]!
-vmul.u32 Q4, Q4, r7
-// input[468]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[720]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q4, [r2,#(224)]
-vqrdmulh.s32 Q7, Q4, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r3
-vstrw.u32 Q1, [r1,#(192)]
-vqrdmlah.s32 Q7, Q4, r10
-// Release input[732] from Q4
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r2,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r11,#(160)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r1,#(208)]
-// Release input[216] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r2,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[468] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r11,#(176)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r2,#(16)]
-// Release input[720] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[220]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r8
-// input[472]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]!
-vmul.u32 Q3, Q3, r7
-// input[724]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q3, r10
-vqrdmulh.s32 Q4, Q1, r8
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r10
-// input[208]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q3, r4
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r10
-// input[476]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r6
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#(224)]
-vqrdmulh.s32 Q7, Q3, r4
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r3
-vstrw.u32 Q1, [r11,#(192)]
-vqrdmlah.s32 Q7, Q3, r10
-// Release input[220] from Q3
-vqrdmlah.s32 Q6, Q2, r10
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1,#(240)]
-vqrdmulh.s32 Q7, Q1, r4
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r3
-vstrw.u32 Q2, [r2,#(32)]
-vqrdmlah.s32 Q7, Q1, r10
-vstrw.u32 Q7, [r11,#(208)]
-// Release input[472] from Q1
-vqrdmulh.s32 Q7, Q2, r6
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q0, [r1,#(128)]!
-vqrdmlah.s32 Q7, Q2, r10
-vneg.s32 Q7, Q7
-// Release input[724] from Q2
-vqrdmulh.s32 Q1, Q0, r6
-vstrw.u32 Q7, [r2,#(48)]
-vmul.u32 Q0, Q0, r5
-ldrd r8, r7, [r9], #+8
-vqrdmlah.s32 Q1, Q0, r10
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[208] from Q0
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[476]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r8
-// input[728]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * 4)]!
-vmul.u32 Q4, Q4, r7
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]!
-vqrdmlah.s32 Q0, Q4, r10
-vqrdmulh.s32 Q3, Q1, r8
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r7
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r10
-// input[464]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]!
-vqrdmulh.s32 Q5, Q4, r4
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r3
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r10
-// input[492]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]!
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r11,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r2,#(64)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[476] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(80)] -// Release input[728] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[464] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[492]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)]! -vmul.u32 Q3, Q3, r7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 16)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[480]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[748]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r11,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r2,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[492] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(208)] -// Release input[744] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(176)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[480] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[748]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)]! -vmul.u32 Q4, Q4, r7 -// input[484]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r2,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[748] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r2,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r1,#(208)] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[484] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r11,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[736] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[488]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[740]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r11,#(64)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[236] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r2,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r11,#(80)] -// Release input[488] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[740] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[224] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[252]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[504]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)]! -vmul.u32 Q4, Q4, r7 -// input[756]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 16)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -// input[508]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r11,#(192)] -vqrdmlah.s32 Q7, Q4, r10 -// Release input[252] from Q4 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r2,#(160)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r11,#(208)] -// Release input[504] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r1,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[756] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r2,#(176)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r1,#(16)] -// Release input[240] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[508]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[760]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 4)]! -vmul.u32 Q3, Q3, r7 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)]! -vqrdmlah.s32 Q0, Q3, r10 -vqrdmulh.s32 Q4, Q1, r8 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r10 -// input[496]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)]! -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r10 -// input[764]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 4)]! 
-vqrdmulh.s32 Q6, Q2, r6 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r11,#(224)] -vqrdmulh.s32 Q7, Q3, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r3 -vstrw.u32 Q1, [r2,#(192)] -vqrdmlah.s32 Q7, Q3, r10 -// Release input[508] from Q3 -vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r11,#(240)] -vqrdmulh.s32 Q7, Q1, r4 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r1,#(32)] -vqrdmlah.s32 Q7, Q1, r10 -vstrw.u32 Q7, [r2,#(208)] -// Release input[760] from Q1 -vqrdmulh.s32 Q7, Q2, r6 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r11,#(128)]! -vqrdmlah.s32 Q7, Q2, r10 -vneg.s32 Q7, Q7 -// Release input[244] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r11,#(16)] -// Release input[496] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[764]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r8 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)]! -vmul.u32 Q4, Q4, r7 -// input[500]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)]! -vqrdmlah.s32 Q0, Q4, r10 -vqrdmulh.s32 Q3, Q1, r8 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r7 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r10 -// input[752]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)]! -vqrdmulh.s32 Q5, Q4, r4 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r3 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r10 -vqrdmulh.s32 Q3, Q2, r6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q4, [r2,#(224)] -vqrdmulh.s32 Q6, Q4, r4 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r3 -vstrw.u32 Q1, [r1,#(64)] -vqrdmlah.s32 Q6, Q4, r10 -// Release input[764] from Q4 -vqrdmlah.s32 Q3, Q2, r10 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r2,#(240)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q2, Q0, Q3 -vmul.u32 Q1, Q1, r3 -vstrw.u32 Q2, [r11,#(32)] -vqrdmlah.s32 Q6, Q1, r10 -vstrw.u32 Q6, [r1,#(80)] -// Release input[248] from Q1 -vqrdmulh.s32 Q6, Q2, r6 -vadd.s32 Q0, Q0, Q3 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q0, [r2,#(128)]! 
-vqrdmlah.s32 Q6, Q2, r10 -vneg.s32 Q6, Q6 -// Release input[500] from Q2 -vqrdmulh.s32 Q1, Q0, r6 -vstrw.u32 Q6, [r11,#(48)] -vmul.u32 Q0, Q0, r5 -ldrd r8, r7, [r9], #+8 -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q1, [r2,#(16)] -// Release input[752] from Q0 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -.equ modulus_inv, 3919317503 -movw r8, #:lower16:modulus_inv -movt r8, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 8068 -// Instruction count: 6203 \ No newline at end of file diff --git a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_rev4.s b/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_rev4.s deleted file mode 100644 index 04965a9..0000000 --- a/tests/ntt_768/auto/ntt_768_u32_33556993_299353_incomplete_rev4.s +++ /dev/null @@ -1,7378 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 893127 /// zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 66384763 /// zeta^512 * 2^31 = 299353^512 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^512 * f(q^(-1) mod 2^32) * 2^31 = 299353^512 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 
54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 
-.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31 -.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 
299353^352 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 
2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 299353^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 38018305 // zeta^192 * 2^31 = 299353^192 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 299353^192 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 43317805 // zeta^ 96 * 2^31 = 299353^ 96 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 96 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 14476917 // zeta^288 * 2^31 = 299353^288 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 299353^288 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 64683161 // zeta^ 48 * 2^31 = 299353^ 48 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 
48 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 48 * 375649793 * 2^31 -.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 375649793 * 2^31 -.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 32686385 // zeta^240 * 2^31 = 299353^240 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 299353^240 * 375649793 * 2^31 -.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31 -.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31 -.word 48515911 // zeta^144 * 2^31 = 299353^144 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 299353^144 * 375649793 * 2^31 -.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31 -.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31 -.word 4885007 // zeta^336 * 2^31 = 299353^336 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 299353^336 * 375649793 * 2^31 -.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31 -.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31 -.word 59281651 // zeta^ 24 * 2^31 = 299353^ 24 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 24 * 
375649793 * 2^31 -.word 40872659 // zeta^ 12 * 2^31 = 299353^ 12 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 12 * 375649793 * 2^31 -.word 5033605 // zeta^204 * 2^31 = 299353^204 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 299353^204 * 375649793 * 2^31 -.word 26613973 // zeta^216 * 2^31 = 299353^216 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 299353^216 * 375649793 * 2^31 -.word 50479773 // zeta^108 * 2^31 = 299353^108 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 299353^108 * 375649793 * 2^31 -.word 58797193 // zeta^300 * 2^31 = 299353^300 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 299353^300 * 375649793 * 2^31 -.word 8356523 // zeta^120 * 2^31 = 299353^120 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 299353^120 * 375649793 * 2^31 -.word 59392861 // zeta^ 60 * 2^31 = 299353^ 60 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 60 * 375649793 * 2^31 -.word 9383201 // zeta^252 * 2^31 = 299353^252 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 299353^252 * 375649793 * 2^31 -.word 25917637 // zeta^312 * 2^31 = 299353^312 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 299353^312 * 375649793 * 2^31 -.word 63329695 // zeta^156 * 2^31 = 299353^156 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 299353^156 * 375649793 * 2^31 -.word 57130935 // zeta^348 * 2^31 = 299353^348 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 299353^348 * 375649793 * 2^31 -.word 45317587 // zeta^ 72 * 2^31 = 299353^ 72 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 72 * 375649793 * 2^31 -.word 65797823 // zeta^ 36 * 2^31 
= 299353^ 36 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 36 * 375649793 * 2^31 -.word 10391631 // zeta^228 * 2^31 = 299353^228 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 299353^228 * 375649793 * 2^31 -.word 39999747 // zeta^264 * 2^31 = 299353^264 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 299353^264 * 375649793 * 2^31 -.word 31719253 // zeta^132 * 2^31 = 299353^132 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 299353^132 * 375649793 * 2^31 -.word 12271567 // zeta^324 * 2^31 = 299353^324 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 299353^324 * 375649793 * 2^31 -.word 18302687 // zeta^168 * 2^31 = 299353^168 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 299353^168 * 375649793 * 2^31 -.word 21111903 // zeta^ 84 * 2^31 = 299353^ 84 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 84 * 375649793 * 2^31 -.word 12778219 // zeta^276 * 2^31 = 299353^276 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 299353^276 * 375649793 * 2^31 -.word 54571669 // zeta^360 * 2^31 = 299353^360 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 299353^360 * 375649793 * 2^31 -.word 35733845 // zeta^180 * 2^31 = 299353^180 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 299353^180 * 375649793 * 2^31 -.word 6014597 // zeta^372 * 2^31 = 299353^372 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 299353^372 * 375649793 * 2^31 -.word 65310821 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31 -.word 3561709979 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31 -.word 22138503 // zeta^ 4 * 2^31 = 299353^ 4 * 2^31 = 27792935 * 2^31 -.word 3926095737 
// zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 4 * 375649793 * 2^31 -.word 33080685 // zeta^196 * 2^31 = 299353^196 * 2^31 = 14985834 * 2^31 -.word 959020179 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 299353^196 * 375649793 * 2^31 -.word 1555569 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31 -.word 2602610063 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 299353^200 * 375649793 * 2^31 -.word 2867655 // zeta^100 * 2^31 = 299353^100 * 2^31 = 27701331 * 2^31 -.word 3920233529 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 299353^100 * 375649793 * 2^31 -.word 16116991 // zeta^292 * 2^31 = 299353^292 * 2^31 = 7520866 * 2^31 -.word 481298689 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 299353^292 * 375649793 * 2^31 -.word 6082985 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31 -.word 869255255 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31 -.word 62987623 // zeta^ 52 * 2^31 = 299353^ 52 * 2^31 = 12887930 * 2^31 -.word 824764569 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 52 * 375649793 * 2^31 -.word 21603065 // zeta^244 * 2^31 = 299353^244 * 2^31 = 20924057 * 2^31 -.word 3486521095 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 299353^244 * 375649793 * 2^31 -.word 9182701 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31 -.word 1779115027 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31 -.word 52075375 // zeta^148 * 2^31 = 299353^148 * 2^31 = 18248795 * 2^31 -.word 3315317393 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 299353^148 * 375649793 * 2^31 -.word 41362929 // zeta^340 * 2^31 = 299353^340 * 2^31 = 7570258 * 2^31 -.word 484459535 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 299353^340 * 375649793 * 2^31 -.word 36605521 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31 -.word 1102879663 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31 -.word 32011901 // zeta^ 28 * 2^31 = 299353^ 28 * 2^31 = 3117724 * 2^31 -.word 199519107 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 28 * 
375649793 * 2^31 -.word 794339 // zeta^220 * 2^31 = 299353^220 * 2^31 = 26868479 * 2^31 -.word 3866935069 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 299353^220 * 375649793 * 2^31 -.word 57238029 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31 -.word 2941722611 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31 -.word 56716901 // zeta^124 * 2^31 = 299353^124 * 2^31 = 32895965 * 2^31 -.word 4252664731 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 299353^124 * 375649793 * 2^31 -.word 37067083 // zeta^316 * 2^31 = 299353^316 * 2^31 = 17429125 * 2^31 -.word 3262862517 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 299353^316 * 375649793 * 2^31 -.word 41992621 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31 -.word 2138832979 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31 -.word 37641785 // zeta^ 76 * 2^31 = 299353^ 76 * 2^31 = 14988263 * 2^31 -.word 3106659271 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 76 * 375649793 * 2^31 -.word 9036599 // zeta^268 * 2^31 = 299353^268 * 2^31 = 26964245 * 2^31 -.word 3873063625 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 299353^268 * 375649793 * 2^31 -.word 53062965 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31 -.word 1726375115 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31 -.word 33892281 // zeta^172 * 2^31 = 299353^172 * 2^31 = 18683355 * 2^31 -.word 3343127111 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 299353^172 * 375649793 * 2^31 -.word 27392067 // zeta^364 * 2^31 = 299353^364 * 2^31 = 5739597 * 2^31 -.word 2514789821 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 299353^364 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31 -.word 22561577 // zeta^208 * 
2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 893127 // zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 42676979 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31 -.word 2760264461 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31 -.word 27097661 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31 -.word 
659875779 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31 -.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 4639589 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31 -.word 2721523355 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31 -.word 56908961 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 3814059359 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 7108001 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31 -.word 1934087263 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31 -.word 16267963 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 2923893573 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 299353^280 * 375649793 * 2^31 -.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 28966165 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31 -.word 2549404907 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31 -.word 15324513 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 1904735391 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 29170123 // zeta^ 16 * 2^31 = 299353^ 16 * 2^31 = 24111745 * 2^31 -.word 3690517557 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 16 * 375649793 * 2^31 -.word 65310821 // zeta^ 8 * 2^31 = 299353^ 8 * 2^31 = 22098973 * 2^31 -.word 3561709979 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 8 * 375649793 * 2^31 -.word 1555569 // zeta^200 * 2^31 = 299353^200 * 2^31 = 7111893 * 2^31 -.word 2602610063 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 
299353^200 * 375649793 * 2^31 -.word 22561577 // zeta^208 * 2^31 = 299353^208 * 2^31 = 12390669 * 2^31 -.word 2940425943 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 299353^208 * 375649793 * 2^31 -.word 6082985 // zeta^104 * 2^31 = 299353^104 * 2^31 = 13583150 * 2^31 -.word 869255255 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 299353^104 * 375649793 * 2^31 -.word 9182701 // zeta^296 * 2^31 = 299353^296 * 2^31 = 27800794 * 2^31 -.word 1779115027 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 299353^296 * 375649793 * 2^31 -.word 18052069 // zeta^112 * 2^31 = 299353^112 * 2^31 = 20448273 * 2^31 -.word 3456073243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 299353^112 * 375649793 * 2^31 -.word 36605521 // zeta^ 56 * 2^31 = 299353^ 56 * 2^31 = 17233810 * 2^31 -.word 1102879663 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 56 * 375649793 * 2^31 -.word 57238029 // zeta^248 * 2^31 = 299353^248 * 2^31 = 12410931 * 2^31 -.word 2941722611 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 299353^248 * 375649793 * 2^31 -.word 64406963 // zeta^304 * 2^31 = 299353^304 * 2^31 = 17817137 * 2^31 -.word 3287693389 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 299353^304 * 375649793 * 2^31 -.word 41992621 // zeta^152 * 2^31 = 299353^152 * 2^31 = 33421816 * 2^31 -.word 2138832979 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 299353^152 * 375649793 * 2^31 -.word 53062965 // zeta^344 * 2^31 = 299353^344 * 2^31 = 26976670 * 2^31 -.word 1726375115 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 299353^344 * 375649793 * 2^31 -.word 16802007 // zeta^ 64 * 2^31 = 299353^ 64 * 2^31 = 13841461 * 2^31 -.word 3033269545 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 64 * 375649793 * 2^31 -.word 7518129 // zeta^ 32 * 2^31 = 299353^ 32 * 2^31 = 940305 * 2^31 -.word 2207658575 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 32 * 375649793 * 2^31 -.word 728237 // zeta^224 * 2^31 = 299353^224 * 2^31 = 24511972 * 2^31 -.word 1568646483 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 299353^224 * 375649793 * 2^31 -.word 893127 // 
zeta^256 * 2^31 = 299353^256 * 2^31 = 8518431 * 2^31 -.word 2692621625 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 299353^256 * 375649793 * 2^31 -.word 729223 // zeta^128 * 2^31 = 299353^128 * 2^31 = 8518432 * 2^31 -.word 545138041 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 299353^128 * 375649793 * 2^31 -.word 54773291 // zeta^320 * 2^31 = 299353^320 * 2^31 = 2013241 * 2^31 -.word 2276321237 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 299353^320 * 375649793 * 2^31 -.word 55552039 // zeta^160 * 2^31 = 299353^160 * 2^31 = 4200632 * 2^31 -.word 268819929 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 299353^160 * 375649793 * 2^31 -.word 51233563 // zeta^ 80 * 2^31 = 299353^ 80 * 2^31 = 33038085 * 2^31 -.word 4261759717 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 80 * 375649793 * 2^31 -.word 52902781 // zeta^272 * 2^31 = 299353^272 * 2^31 = 2711401 * 2^31 -.word 2321000067 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 299353^272 * 375649793 * 2^31 -.word 58081411 // zeta^352 * 2^31 = 299353^352 * 2^31 = 9932396 * 2^31 -.word 635624829 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 299353^352 * 375649793 * 2^31 -.word 28419145 // zeta^176 * 2^31 = 299353^176 * 2^31 = 32562828 * 2^31 -.word 2083861943 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 299353^176 * 375649793 * 2^31 -.word 48191309 // zeta^368 * 2^31 = 299353^368 * 2^31 = 33153165 * 2^31 -.word 4269124275 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 299353^368 * 375649793 * 2^31 -.word 42676979 // zeta^ 40 * 2^31 = 299353^ 40 * 2^31 = 9575431 * 2^31 -.word 2760264461 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 40 * 375649793 * 2^31 -.word 5740163 // zeta^ 20 * 2^31 = 299353^ 20 * 2^31 = 24739198 * 2^31 -.word 1583187837 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 20 * 375649793 * 2^31 -.word 28917839 // zeta^212 * 2^31 = 299353^212 * 2^31 = 21478846 * 2^31 -.word 1374541233 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 299353^212 * 375649793 * 2^31 -.word 27097661 // zeta^232 * 2^31 = 299353^232 * 2^31 = 10311346 * 2^31 
-.word 659875779 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 299353^232 * 375649793 * 2^31 -.word 49145461 // zeta^116 * 2^31 = 299353^116 * 2^31 = 13729478 * 2^31 -.word 878619531 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 299353^116 * 375649793 * 2^31 -.word 6303215 // zeta^308 * 2^31 = 299353^308 * 2^31 = 18367002 * 2^31 -.word 1175398417 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 299353^308 * 375649793 * 2^31 -.word 4639589 // zeta^136 * 2^31 = 299353^136 * 2^31 = 8970055 * 2^31 -.word 2721523355 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 299353^136 * 375649793 * 2^31 -.word 54366111 // zeta^ 68 * 2^31 = 299353^ 68 * 2^31 = 8457503 * 2^31 -.word 2688722529 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 68 * 375649793 * 2^31 -.word 43137743 // zeta^260 * 2^31 = 299353^260 * 2^31 = 29589567 * 2^31 -.word 4041071409 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 299353^260 * 375649793 * 2^31 -.word 56908961 // zeta^328 * 2^31 = 299353^328 * 2^31 = 26042233 * 2^31 -.word 3814059359 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 299353^328 * 375649793 * 2^31 -.word 48357821 // zeta^164 * 2^31 = 299353^164 * 2^31 = 26244564 * 2^31 -.word 1679523907 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 299353^164 * 375649793 * 2^31 -.word 41080969 // zeta^356 * 2^31 = 299353^356 * 2^31 = 7994472 * 2^31 -.word 511607159 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 299353^356 * 375649793 * 2^31 -.word 7108001 // zeta^ 88 * 2^31 = 299353^ 88 * 2^31 = 30222420 * 2^31 -.word 1934087263 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 88 * 375649793 * 2^31 -.word 8652081 // zeta^ 44 * 2^31 = 299353^ 44 * 2^31 = 27932647 * 2^31 -.word 3935036623 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 44 * 375649793 * 2^31 -.word 44314847 // zeta^236 * 2^31 = 299353^236 * 2^31 = 10003728 * 2^31 -.word 640189729 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 299353^236 * 375649793 * 2^31 -.word 16267963 // zeta^280 * 2^31 = 299353^280 * 2^31 = 12132331 * 2^31 -.word 2923893573 // zeta^280 * f(q^(-1) mod 2^32) * 
2^31 = 299353^280 * 375649793 * 2^31 -.word 16352265 // zeta^140 * 2^31 = 299353^140 * 2^31 = 26391350 * 2^31 -.word 1688917495 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 299353^140 * 375649793 * 2^31 -.word 948813 // zeta^332 * 2^31 = 299353^332 * 2^31 = 11703708 * 2^31 -.word 748980147 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 299353^332 * 375649793 * 2^31 -.word 28966165 // zeta^184 * 2^31 = 299353^184 * 2^31 = 6280499 * 2^31 -.word 2549404907 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 299353^184 * 375649793 * 2^31 -.word 44334383 // zeta^ 92 * 2^31 = 299353^ 92 * 2^31 = 31954666 * 2^31 -.word 2044942545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 299353^ 92 * 375649793 * 2^31 -.word 64874787 // zeta^284 * 2^31 = 299353^284 * 2^31 = 5130075 * 2^31 -.word 2475783389 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 299353^284 * 375649793 * 2^31 -.word 15324513 // zeta^376 * 2^31 = 299353^376 * 2^31 = 29763762 * 2^31 -.word 1904735391 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 299353^376 * 375649793 * 2^31 -.word 62902951 // zeta^188 * 2^31 = 299353^188 * 2^31 = 22872479 * 2^31 -.word 3611210585 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 299353^188 * 375649793 * 2^31 -.word 53337279 // zeta^380 * 2^31 = 299353^380 * 2^31 = 9132318 * 2^31 -.word 584423745 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 299353^380 * 375649793 * 2^31 -.text -rev4: .byte 3*4 - .byte 2*4 - .byte 1*4 - .byte 0*4 -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_768_u32_33556993_299353_incomplete_rev4, %function -.global ntt_768_u32_33556993_299353_incomplete_rev4 -ntt_768_u32_33556993_299353_incomplete_rev4: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -adr r4, rev4 -vldrb.u32 Q0, [r4] -vadd.u32 Q0, Q0, r0 -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -// Use r11 as marker for r0 + 3024 -add r11, r12, #1008 -.equ modulus, 33556993 -movw r10, #:lower16:modulus -movt r10, #:upper16:modulus -ldr 
r9, roots_addr -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -// input[512]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 8)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(0)] -vsub.s32 Q4, Q1, Q3 -// Release input[0] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(32)] -// Release input[512] from Q3 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -// input[516]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 12)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(16)] -vsub.s32 Q4, Q1, Q3 -// Release input[4] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(48)] -// Release input[516] from Q3 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -// input[520]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 16)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(32)] -vsub.s32 Q4, Q1, Q3 -// Release input[8] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -// input[524]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 20)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(48)] 
-vsub.s32 Q4, Q1, Q3 -// Release input[12] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(80)] -// Release input[524] from Q3 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -// input[528]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 24)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(64)] -vsub.s32 Q4, Q1, Q3 -// Release input[16] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(96)] -// Release input[528] from Q3 -// input[20]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 20)] -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -// input[532]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 28)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(80)] -vsub.s32 Q4, Q1, Q3 -// Release input[20] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(112)] -// Release input[532] from Q3 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(96)] -vsub.s32 Q4, Q1, Q3 -// Release input[24] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(128)] -// 
Release input[536] from Q3 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -// input[540]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 36)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(112)] -vsub.s32 Q4, Q1, Q3 -// Release input[28] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(144)] -// Release input[540] from Q3 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -// input[544]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 40)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(128)] -vsub.s32 Q4, Q1, Q3 -// Release input[32] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(160)] -// Release input[544] from Q3 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(144)] -vsub.s32 Q4, Q1, Q3 -// Release input[36] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -// input[296]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 44)] -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(160)] -vsub.s32 Q4, Q1, Q3 
-// Release input[40] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(176)] -// Release input[296] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -// input[300]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 48)] -// input[556]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 52)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(176)] -vsub.s32 Q4, Q1, Q3 -// Release input[44] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(192)] -// Release input[300] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(208)] -// Release input[556] from Q3 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -// input[560]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 56)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(192)] -vsub.s32 Q4, Q1, Q3 -// Release input[48] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(224)] -// Release input[560] from Q3 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -// input[564]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 60)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(208)] -vsub.s32 Q4, Q1, Q3 -// Release input[52] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(240)] -// Release 
input[564] from Q3 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -// input[312]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 60)] -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(224)] -vsub.s32 Q4, Q1, Q3 -// Release input[56] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(240)] -// Release input[312] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -// input[572]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 68)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(240)] -vsub.s32 Q4, Q1, Q3 -// Release input[60] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(272)] -// Release input[572] from Q3 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -// input[576]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 72)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(256)] -vsub.s32 Q4, Q1, Q3 -// Release input[64] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(288)] -// Release input[576] from Q3 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -// input[580]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 76)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(272)] -vsub.s32 Q4, Q1, Q3 -// 
Release input[68] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(288)] -// Release input[324] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(304)] -// Release input[580] from Q3 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(288)] -vsub.s32 Q4, Q1, Q3 -// Release input[72] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -// input[588]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 84)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(304)] -vsub.s32 Q4, Q1, Q3 -// Release input[76] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(336)] -// Release input[588] from Q3 -// input[80]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 80)] -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -// input[592]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 88)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(320)] -vsub.s32 Q4, Q1, Q3 -// Release input[80] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(352)] -// Release input[592] 
from Q3 -// input[84]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 84)] -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -// input[596]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 92)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(336)] -vsub.s32 Q4, Q1, Q3 -// Release input[84] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(352)] -// Release input[340] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(368)] -// Release input[596] from Q3 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -// input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(352)] -vsub.s32 Q4, Q1, Q3 -// Release input[88] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -// input[92]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 92)] -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -// input[604]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 100)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(368)] -vsub.s32 Q4, Q1, Q3 -// Release input[92] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(400)] -// Release input[604] from Q3 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -// input[608]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 104)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(384)] -vsub.s32 Q4, Q1, Q3 -// Release 
input[96] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(416)] -// Release input[608] from Q3 -// input[100]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 100)] -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -// input[612]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 108)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(400)] -vsub.s32 Q4, Q1, Q3 -// Release input[100] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(432)] -// Release input[612] from Q3 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -// input[360]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 108)] -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(416)] -vsub.s32 Q4, Q1, Q3 -// Release input[104] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(432)] -// Release input[360] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -// input[364]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 112)] -// input[620]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 116)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(432)] -vsub.s32 Q4, Q1, Q3 -// Release input[108] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(448)] -// Release input[364] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(464)] -// Release 
input[620] from Q3 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -// input[624]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 120)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(448)] -vsub.s32 Q4, Q1, Q3 -// Release input[112] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(480)] -// Release input[624] from Q3 -// input[116]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 116)] -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -// input[628]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 124)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(464)] -vsub.s32 Q4, Q1, Q3 -// Release input[116] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(496)] -// Release input[628] from Q3 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -// input[632]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(480)] -vsub.s32 Q4, Q1, Q3 -// Release input[120] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -// input[380]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -124)] -// input[636]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(496)] 
-vsub.s32 Q4, Q1, Q3 -// Release input[124] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r12,#(-496)] -// Release input[380] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-480)] -// Release input[636] from Q3 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -// input[384]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -120)] -// input[640]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-496)] -vsub.s32 Q4, Q1, Q3 -// Release input[128] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r12,#(-480)] -// Release input[384] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-464)] -// Release input[640] from Q3 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -// input[388]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -116)] -// input[644]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-480)] -vsub.s32 Q4, Q1, Q3 -// Release input[132] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r12,#(-464)] -// Release input[388] from Q2 -vsub.s32 Q3, Q4, Q5 -vsub.s32 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[644] from Q3 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -// input[392]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -112)] -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vadd.s32 Q4, Q1, Q2 -vadd.s32 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-464)] -vsub.s32 Q4, Q1, Q3 -// Release input[136] from Q1 -vsub.s32 Q1, Q2, Q3 -vqrdmulh.s32 Q5, Q1, r8 -vmul.u32 Q6, Q1, r7 -vqrdmlah.s32 Q5, Q6, r10 -vadd.s32 Q2, Q4, Q5 -vstrw.u32 Q2, [r12,#(-448)] -// Release input[392] from Q2 -vsub.s32 Q3, Q4, Q5 
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-432)]
-// Release input[648] from Q3
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-// input[396]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -108)]
-// input[652]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-448)]
-vsub.s32 Q4, Q1, Q3
-// Release input[140] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-416)]
-// Release input[652] from Q3
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-// input[400]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -104)]
-// input[656]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-432)]
-vsub.s32 Q4, Q1, Q3
-// Release input[144] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-416)]
-// Release input[400] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[656] from Q3
-// input[148]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -104)]
-// input[404]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -100)]
-// input[660]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-416)]
-vsub.s32 Q4, Q1, Q3
-// Release input[148] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-400)]
-// Release input[404] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-384)]
-// Release input[660] from Q3
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -100)]
-// input[408]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -96)]
-// input[664]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -92)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-400)]
-vsub.s32 Q4, Q1, Q3
-// Release input[152] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-384)]
-// Release input[408] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-368)]
-// Release input[664] from Q3
-// input[156]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -96)]
-// input[412]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -92)]
-// input[668]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -88)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-384)]
-vsub.s32 Q4, Q1, Q3
-// Release input[156] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-368)]
-// Release input[412] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-352)]
-// Release input[668] from Q3
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -92)]
-// input[416]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -88)]
-// input[672]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -84)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-368)]
-vsub.s32 Q4, Q1, Q3
-// Release input[160] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-352)]
-// Release input[416] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-336)]
-// Release input[672] from Q3
-// input[164]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -88)]
-// input[420]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -84)]
-// input[676]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -80)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-352)]
-vsub.s32 Q4, Q1, Q3
-// Release input[164] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-336)]
-// Release input[420] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-320)]
-// Release input[676] from Q3
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-// input[424]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -80)]
-// input[680]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-336)]
-vsub.s32 Q4, Q1, Q3
-// Release input[168] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-320)]
-// Release input[424] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[680] from Q3
-// input[172]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -80)]
-// input[428]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -76)]
-// input[684]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-320)]
-vsub.s32 Q4, Q1, Q3
-// Release input[172] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-304)]
-// Release input[428] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-288)]
-// Release input[684] from Q3
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-// input[432]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -72)]
-// input[688]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -68)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-304)]
-vsub.s32 Q4, Q1, Q3
-// Release input[176] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-288)]
-// Release input[432] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-272)]
-// Release input[688] from Q3
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-// input[436]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -68)]
-// input[692]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -64)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-288)]
-vsub.s32 Q4, Q1, Q3
-// Release input[180] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-272)]
-// Release input[436] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-256)]
-// Release input[692] from Q3
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-// input[440]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -64)]
-// input[696]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -60)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-272)]
-vsub.s32 Q4, Q1, Q3
-// Release input[184] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-240)]
-// Release input[696] from Q3
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-// input[444]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -60)]
-// input[700]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -56)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-256)]
-vsub.s32 Q4, Q1, Q3
-// Release input[188] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-240)]
-// Release input[444] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-224)]
-// Release input[700] from Q3
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-// input[448]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -56)]
-// input[704]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -52)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-240)]
-vsub.s32 Q4, Q1, Q3
-// Release input[192] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-224)]
-// Release input[448] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-208)]
-// Release input[704] from Q3
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-// input[708]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -48)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-224)]
-vsub.s32 Q4, Q1, Q3
-// Release input[196] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-208)]
-// Release input[452] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-192)]
-// Release input[708] from Q3
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-// input[456]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -48)]
-// input[712]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-208)]
-vsub.s32 Q4, Q1, Q3
-// Release input[200] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-176)]
-// Release input[712] from Q3
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-// input[460]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -44)]
-// input[716]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-192)]
-vsub.s32 Q4, Q1, Q3
-// Release input[204] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-176)]
-// Release input[460] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-160)]
-// Release input[716] from Q3
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-// input[464]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -40)]
-// input[720]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -36)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-176)]
-vsub.s32 Q4, Q1, Q3
-// Release input[208] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-160)]
-// Release input[464] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-144)]
-// Release input[720] from Q3
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-// input[468]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -36)]
-// input[724]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -32)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-160)]
-vsub.s32 Q4, Q1, Q3
-// Release input[212] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-144)]
-// Release input[468] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-128)]
-// Release input[724] from Q3
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-// input[472]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -32)]
-// input[728]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -28)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-144)]
-vsub.s32 Q4, Q1, Q3
-// Release input[216] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-128)]
-// Release input[472] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[728] from Q3
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-// input[476]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -28)]
-// input[732]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -24)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-128)]
-vsub.s32 Q4, Q1, Q3
-// Release input[220] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-112)]
-// Release input[476] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-96)]
-// Release input[732] from Q3
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-// input[480]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -24)]
-// input[736]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -20)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-112)]
-vsub.s32 Q4, Q1, Q3
-// Release input[224] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-96)]
-// Release input[480] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-80)]
-// Release input[736] from Q3
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-// input[484]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -20)]
-// input[740]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -16)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-96)]
-vsub.s32 Q4, Q1, Q3
-// Release input[228] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-80)]
-// Release input[484] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[740] from Q3
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-// input[488]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -16)]
-// input[744]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-80)]
-vsub.s32 Q4, Q1, Q3
-// Release input[232] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-64)]
-// Release input[488] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-48)]
-// Release input[744] from Q3
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-// input[492]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -12)]
-// input[748]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-64)]
-vsub.s32 Q4, Q1, Q3
-// Release input[236] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-48)]
-// Release input[492] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-32)]
-// Release input[748] from Q3
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-// input[496]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -8)]
-// input[752]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -4)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-48)]
-vsub.s32 Q4, Q1, Q3
-// Release input[240] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-32)]
-// Release input[496] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[752] from Q3
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-// input[500]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -4)]
-// input[756]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 0)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-32)]
-vsub.s32 Q4, Q1, Q3
-// Release input[244] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(0)]
-// Release input[756] from Q3
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-// input[504]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 0)]
-// input[760]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-16)]
-vsub.s32 Q4, Q1, Q3
-// Release input[248] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(0)]
-// Release input[504] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(16)]
-// Release input[760] from Q3
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-// input[508]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * 4)]
-// input[764]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vadd.s32 Q4, Q1, Q2
-vadd.s32 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(0)]
-vsub.s32 Q4, Q1, Q3
-// Release input[252] from Q1
-vsub.s32 Q1, Q2, Q3
-vqrdmulh.s32 Q5, Q1, r8
-vmul.u32 Q6, Q1, r7
-vqrdmlah.s32 Q5, Q6, r10
-vadd.s32 Q2, Q4, Q5
-vstrw.u32 Q2, [r12,#(16)]
-// Release input[508] from Q2
-vsub.s32 Q3, Q4, Q5
-vsub.s32 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(32)]
-// Release input[764] from Q3
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q1, r8
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vmul.u32 Q1, Q1, r7
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmlah.s32 Q2, Q1, r10
-vqrdmulh.s32 Q5, Q3, r8
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q3, r10
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q3, Q2, Q5
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q1, r10
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vqrdmulh.s32 Q7, Q4, r6
-vsub.s32 Q1, Q3, Q6
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q6
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmlah.s32 Q7, Q4, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q4, Q2, Q7
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vadd.s32 Q2, Q2, Q7
-// input[196]: Already loaded as Q5
-vqrdmulh.s32 Q1, Q5, r8
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vmul.u32 Q5, Q5, r7
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q5, r10
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r8
-vsub.s32 Q5, Q4, Q1
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r10
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vqrdmulh.s32 Q6, Q5, r4
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q5, Q5, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q5, r10
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q7, Q4, r6
-vsub.s32 Q5, Q3, Q6
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q6
-vstrw.u32 Q5, [r14,#(-224)]
-// Release input[196] from Q5
-vqrdmlah.s32 Q7, Q4, r10
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q4, Q1, Q7
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q1, Q1, Q7
-// input[200]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vmul.u32 Q2, Q2, r7
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 72)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[204]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vmul.u32 Q1, Q1, r7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[12]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(304)]
-// Release input[76] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[208]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmul.u32 Q3, Q3, r7
-// input[80]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 80)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(320)]
-// Release input[80] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[212]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vmul.u32 Q2, Q2, r7
-// input[84]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 84)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-160)]
-// Release input[212] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(336)]
-// Release input[84] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vmul.u32 Q1, Q1, r7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(352)]
-// Release input[88] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[220]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmul.u32 Q3, Q3, r7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[224]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmul.u32 Q2, Q2, r7
-// input[96]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 96)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(384)]
-// Release input[96] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[228]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmul.u32 Q1, Q1, r7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(400)]
-// Release input[100] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[232]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmul.u32 Q3, Q3, r7
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 40)]
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(416)]
-// Release input[104] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[236]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmul.u32 Q2, Q2, r7
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(160)]
-// Release input[40] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[240]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -12)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(432)]
-// Release input[108] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vmul.u32 Q1, Q1, r7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[244]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[244]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmul.u32 Q3, Q3, r7
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-32)]
-// Release input[244] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(464)]
-// Release input[116] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -68)]
-vmul.u32 Q2, Q2, r7
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-272)]
-// Release input[184] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(480)]
-// Release input[120] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vmul.u32 Q1, Q1, r7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[448]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -56)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[448]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[384]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -120)]
-vmul.u32 Q3, Q3, r7
-// input[320]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[452]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-224)]
-// Release input[448] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-480)]
-// Release input[384] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(272)]
-// Release input[320] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[452]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[388]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -116)]
-vmul.u32 Q2, Q2, r7
-// input[324]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 72)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[456]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -48)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-208)]
-// Release input[452] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-464)]
-// Release input[388] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(288)]
-// Release input[324] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[456]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[392]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -112)]
-vmul.u32 Q1, Q1, r7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[460]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -44)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-192)]
-// Release input[456] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-448)]
-// Release input[392] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(304)]
-// Release input[328] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[460]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[396]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -108)]
-vmul.u32 Q3, Q3, r7
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[268]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 16)]
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[464]: Load as Q2
-vldrw.u32 Q2, [r12, #(4 * -40)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-176)]
-// Release input[460] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-432)]
-// Release input[396] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(320)]
-// Release input[332] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[464]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[400]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -104)]
-vmul.u32 Q2, Q2, r7
-// input[336]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 84)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[468]: Load as Q1
-vldrw.u32 Q1, [r12, #(4 * -36)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-160)]
-// Release input[464] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-416)]
-// Release input[400] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[468]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[404]: Load as Q4
-vldrw.u32 Q4, [r12, #(4 * -100)]
-vmul.u32 Q1, Q1, r7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[472]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -32)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-144)]
-// Release input[468] from Q1
-vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(352)] -// Release input[340] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[472]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[408]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -96)] -vmul.u32 Q3, Q3, r7 -// input[344]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[476]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -28)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-384)] -// Release input[408] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(368)] -// Release input[344] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[476]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[412]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -92)] -vmul.u32 Q2, Q2, r7 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[480]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -24)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(-112)] -// Release input[476] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-368)] -// 
Release input[412] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(384)] -// Release input[348] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[480]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[416]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -88)] -vmul.u32 Q1, Q1, r7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[484]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -20)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(-96)] -// Release input[480] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-352)] -// Release input[416] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(400)] -// Release input[352] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[484]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vmul.u32 Q3, Q3, r7 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[292]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 40)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[488]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -16)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(-80)] -// Release input[484] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 
Q5, [r14,#(416)] -// Release input[356] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[488]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[424]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -80)] -vmul.u32 Q2, Q2, r7 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r14,#(160)] -// Release input[292] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[492]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -12)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(-64)] -// Release input[488] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-320)] -// Release input[424] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(432)] -// Release input[360] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[492]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[428]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -76)] -vmul.u32 Q1, Q1, r7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[300]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[496]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -8)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(-48)] -// Release input[492] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-304)] -// Release input[428] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(448)] -// Release input[364] from Q5 -vadd.s32 Q2, 
Q2, Q7 -// input[496]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[432]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -72)] -vmul.u32 Q3, Q3, r7 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r14,#(192)] -// Release input[300] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[500]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -4)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(-32)] -// Release input[496] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-288)] -// Release input[432] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r14,#(464)] -// Release input[368] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[500]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vmul.u32 Q2, Q2, r7 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[504]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 0)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r12,#(-16)] -// Release input[500] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r14,#(480)] -// Release input[372] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[504]: Already loaded as Q1 -vqrdmulh.s32 Q2, 
Q1, r8 -// input[440]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -64)] -vmul.u32 Q1, Q1, r7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[312]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[508]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 4)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r12,#(0)] -// Release input[504] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-256)] -// Release input[440] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r14,#(496)] -// Release input[376] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[508]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[444]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -60)] -vmul.u32 Q3, Q3, r7 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r14,#(240)] -// Release input[312] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r12,#(16)] -// Release input[508] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r12,#(-240)] -// Release input[444] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(-496)] -// Release input[380] from Q5 -vadd.s32 Q1, Q1, Q7 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[704]: Already loaded as Q2 
-vqrdmulh.s32 Q3, Q2, r8 -// input[640]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmul.u32 Q2, Q2, r7 -// input[576]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[512]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 8)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[708]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -48)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-464)] -// Release input[640] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(288)] -// Release input[576] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[708]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmul.u32 Q1, Q1, r7 -// input[580]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 76)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r12,#(32)] -// Release input[512] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[516]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 12)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-192)] -// Release input[708] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-448)] -// Release input[644] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(304)] -// Release input[580] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[712]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[648]: Load as Q4 
-vldrw.u32 Q4, [r11, #(4 * -108)] -vmul.u32 Q3, Q3, r7 -// input[584]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 80)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r12,#(48)] -// Release input[516] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[520]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 16)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[716]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-432)] -// Release input[648] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(320)] -// Release input[584] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[716]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[652]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmul.u32 Q2, Q2, r7 -// input[588]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r12,#(64)] -// Release input[520] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[524]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 20)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-160)] -// Release input[716] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[652] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(336)] -// Release input[588] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[720]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[656]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -100)] -vmul.u32 Q1, Q1, r7 -// 
input[592]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 88)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r12,#(80)] -// Release input[524] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[528]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 24)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[724]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -32)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-400)] -// Release input[656] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(352)] -// Release input[592] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[724]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vmul.u32 Q3, Q3, r7 -// input[596]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 92)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r12,#(96)] -// Release input[528] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[532]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 28)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[728]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-128)] -// Release input[724] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(368)] -// Release input[596] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[728]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[664]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -92)] -vmul.u32 Q2, Q2, r7 -// input[600]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] 
-vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r12,#(112)] -// Release input[532] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[732]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-112)] -// Release input[728] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-368)] -// Release input[664] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(384)] -// Release input[600] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[732]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[668]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmul.u32 Q1, Q1, r7 -// input[604]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[540]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[736]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -20)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-96)] -// Release input[732] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-352)] -// Release input[668] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(400)] -// Release input[604] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[736]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[672]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmul.u32 Q3, Q3, r7 -// input[608]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 104)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r12,#(144)] -// 
Release input[540] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[544]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 40)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[740]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -16)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-80)] -// Release input[736] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-336)] -// Release input[672] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(416)] -// Release input[608] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[740]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmul.u32 Q2, Q2, r7 -// input[612]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r12,#(160)] -// Release input[544] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[548]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 44)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[744]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -12)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-64)] -// Release input[740] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(432)] -// Release input[612] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[744]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[680]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -76)] -vmul.u32 Q1, Q1, r7 -// input[616]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 112)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r12,#(176)] -// Release input[548] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 
Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[552]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 48)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[748]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-48)] -// Release input[744] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(448)] -// Release input[616] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[748]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[684]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmul.u32 Q3, Q3, r7 -// input[620]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 116)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r12,#(192)] -// Release input[552] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[556]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[752]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-32)] -// Release input[748] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r12,#(464)] -// Release input[620] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[752]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[688]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -68)] -vmul.u32 Q2, Q2, r7 -// input[624]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 120)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r12,#(208)] -// Release input[556] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 
-vqrdmlah.s32 Q1, Q4, r10 -// input[560]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 56)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[756]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 0)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-272)] -// Release input[688] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r12,#(480)] -// Release input[624] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vmul.u32 Q1, Q1, r7 -// input[628]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 124)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r12,#(224)] -// Release input[560] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[564]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 60)] -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[760]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r12,#(496)] -// Release input[628] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[760]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[696]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vmul.u32 Q3, Q3, r7 -// input[632]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -124)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r12,#(240)] -// Release input[564] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[568]: Load as Q1 -vldrw.u32 
Q1, [r12, #(4 * 64)] -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[764]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-240)] -// Release input[696] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(-496)] -// Release input[632] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[764]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[700]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -56)] -vmul.u32 Q2, Q2, r7 -// input[636]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -120)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r12,#(256)] -// Release input[568] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[572]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -vqrdmulh.s32 Q1, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(32)] -// Release input[764] from Q2 -vqrdmlah.s32 Q1, Q5, r10 -vstrw.u32 Q4, [r11,#(-224)] -// Release input[700] from Q4 -vsub.s32 Q5, Q3, Q1 -vstrw.u32 Q5, [r11,#(-480)] -// Release input[636] from Q5 -vadd.s32 Q3, Q3, Q1 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[48]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 48)] -vqrdmulh.s32 Q2, Q1, r8 -// input[32]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 32)] -vmul.u32 Q1, Q1, r7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 16)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r12,#(272)] -// Release input[572] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[0]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 0)]! 
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[52]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(64)]
-// Release input[16] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[52]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[20]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[4]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[56]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(80)]
-// Release input[20] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[24]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[60]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[12]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[112]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(112)]
-// Release input[28] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[112]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r7
-// input[80]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(48)]
-// Release input[12] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[64]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[116]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(320)]
-// Release input[80] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[116]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[84]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[68]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[120]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r0,#(464)]
-// Release input[116] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r0,#(336)]
-// Release input[84] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[72]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[124]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r0,#(352)]
-// Release input[88] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[124]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[76]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[176]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r7
-// input[144]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[128]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[180]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-432)]
-// Release input[144] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[132]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(-416)]
-// Release input[148] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[184]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[152]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[136]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[188]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(-400)]
-// Release input[152] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[156]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[240]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-384)]
-// Release input[156] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[240]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[192]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[244]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(-48)]
-// Release input[240] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[244]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[212]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[196]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[248]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(-32)]
-// Release input[244] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(-160)]
-// Release input[212] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[216]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[200]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[252]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(-144)]
-// Release input[216] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[236]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[204]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[304]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(-128)]
-// Release input[220] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[304]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r7
-// input[272]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[256]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[308]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(144)]
-// Release input[288] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(80)]
-// Release input[272] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[276]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[260]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[312]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(96)]
-// Release input[276] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[264]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[316]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(112)]
-// Release input[280] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[316]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[284]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[268]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[368]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(256)]
-// Release input[316] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(128)]
-// Release input[284] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[368]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r7
-// input[336]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(64)]
-// Release input[268] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[320]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[372]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r14,#(464)]
-// Release input[368] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(336)]
-// Release input[336] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[372]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[324]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[376]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r14,#(480)]
-// Release input[372] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r14,#(352)]
-// Release input[340] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[376]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[344]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r14,#(288)]
-// Release input[324] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[328]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[380]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r14,#(496)]
-// Release input[376] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r14,#(368)]
-// Release input[344] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[380]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[348]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r14,#(304)]
-// Release input[328] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[332]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[432]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-496)]
-// Release input[380] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r14,#(384)]
-// Release input[348] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[432]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[416]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r7
-// input[400]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r14,#(320)]
-// Release input[332] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[384]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[436]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-288)]
-// Release input[432] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-352)]
-// Release input[416] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-416)]
-// Release input[400] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[436]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[420]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[404]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r12,#(-480)]
-// Release input[384] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[388]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[440]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-272)]
-// Release input[436] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-336)]
-// Release input[420] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-400)]
-// Release input[404] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[440]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[424]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[408]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r12,#(-464)]
-// Release input[388] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[392]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[444]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-256)]
-// Release input[440] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-320)]
-// Release input[424] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-384)]
-// Release input[408] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[444]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[428]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[412]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r12,#(-448)]
-// Release input[392] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[396]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[496]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(-240)]
-// Release input[444] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-304)]
-// Release input[428] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-368)]
-// Release input[412] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[496]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[480]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r7
-// input[464]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r12,#(-432)]
-// Release input[396] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[448]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[500]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(-32)]
-// Release input[496] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-96)]
-// Release input[480] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-160)]
-// Release input[464] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[500]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[484]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[468]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r12,#(-224)]
-// Release input[448] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[452]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[504]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(-16)]
-// Release input[500] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-80)]
-// Release input[484] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(-144)]
-// Release input[468] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[504]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[488]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[472]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r12,#(-208)]
-// Release input[452] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[456]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[508]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(0)]
-// Release input[504] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-64)]
-// Release input[488] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(-128)]
-// Release input[472] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[508]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[492]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[476]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r12,#(-192)]
-// Release input[456] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[460]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[560]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(16)]
-// Release input[508] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(-48)]
-// Release input[492] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(-112)]
-// Release input[476] from Q5
-vadd.s32 Q1, Q1, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[560]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[544]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q2, Q2, r7
-// input[528]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r12,#(-176)]
-// Release input[460] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[512]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[564]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(224)]
-// Release input[560] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(160)]
-// Release input[544] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(96)]
-// Release input[528] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[564]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[548]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[532]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r12,#(32)]
-// Release input[512] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[516]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[568]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(240)]
-// Release input[564] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(176)]
-// Release input[548] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(112)]
-// Release input[532] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[568]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[552]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[536]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r12,#(48)]
-// Release input[516] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[520]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[572]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(256)]
-// Release input[568] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(192)]
-// Release input[552] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(128)]
-// Release input[536] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[572]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[556]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[540]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r12,#(64)]
-// Release input[520] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[524]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[624]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r12,#(272)]
-// Release input[572] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(208)]
-// Release input[556] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(144)]
-// Release input[540] from Q5
-vadd.s32 Q3, Q3, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[624]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[608]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q1, Q1, r7
-// input[592]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r12,#(80)]
-// Release input[524] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[576]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[628]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r12,#(480)]
-// Release input[624] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(416)]
-// Release input[608] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(352)]
-// Release input[592] from Q5
-vadd.s32 Q2, Q2, Q7
-// input[628]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[612]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q3, Q3, r7
-// input[596]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r12,#(288)]
-// Release input[576] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[580]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[632]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r12,#(496)]
-// Release input[628] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(432)]
-// Release input[612] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r12,#(368)]
-// Release input[596] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[632]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[616]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[600]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r12,#(304)]
-// Release input[580] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[584]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[636]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-496)]
-// Release input[632] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(448)]
-// Release input[616] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r12,#(384)]
-// Release input[600] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[636]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[620]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[604]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r12,#(320)]
-// Release input[584] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[588]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4
-vsub.s32 Q4, Q2, Q3
-vmul.u32 Q1, Q1, r3
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q1, r10
-// input[688]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 100)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q1, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q1, [r11,#(-480)]
-// Release input[636] from Q1
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r12,#(464)]
-// Release input[620] from Q4
-vsub.s32 Q5, Q2, Q7
-vstrw.u32 Q5, [r12,#(400)]
-// Release input[604] from Q5
-vadd.s32 Q2, Q2, Q7
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// input[688]: Already loaded as Q3
-vqrdmulh.s32 Q1, Q3, r8
-// input[672]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 84)]
-vmul.u32 Q3, Q3, r7
-// input[656]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q3, r10
-vstrw.u32 Q2, [r12,#(336)]
-// Release input[588] from Q2
-vqrdmulh.s32 Q2, Q4, r8
-vsub.s32 Q3, Q5, Q1
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q2, Q4, r10
-// input[640]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q3, r4
-vsub.s32 Q4, Q1, Q2
-vmul.u32 Q3, Q3, r3
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q6, Q3, r10
-// input[692]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q3, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q3, [r11,#(-272)]
-// Release input[688] from Q3
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r11,#(-336)]
-// Release input[672] from Q4
-vsub.s32 Q5, Q1, Q7
-vstrw.u32 Q5, [r11,#(-400)]
-// Release input[656] from Q5
-vadd.s32 Q1, Q1, Q7
-// input[692]: Already loaded as Q2
-vqrdmulh.s32 Q3, Q2, r8
-// input[676]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q2, Q2, r7
-// input[660]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q3, Q2, r10
-vstrw.u32 Q1, [r11,#(-464)]
-// Release input[640] from Q1
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q2, Q5, Q3
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q4, r10
-// input[644]: Load as Q3
-vldrw.u32 Q3, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q2, r4
-vsub.s32 Q4, Q3, Q1
-vmul.u32 Q2, Q2, r3
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q6, Q2, r10
-// input[696]: Load as Q1
-vldrw.u32 Q1, [Q0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q5, r6
-vsub.s32 Q2, Q4, Q6
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q4, Q4, Q6
-vstrw.u32 Q2, [r11,#(-256)]
-// Release input[692] from Q2
-vqrdmlah.s32 Q7, Q5, r10
-vstrw.u32 Q4, [r11,#(-320)]
-// Release input[676] from Q4
-vsub.s32 Q5, Q3, Q7
-vstrw.u32 Q5, [r11,#(-384)]
-// Release input[660] from Q5
-vadd.s32 Q3, Q3, Q7
-// input[696]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8
-// input[680]: Load as Q4
-vldrw.u32 Q4, [Q0, #(4 * 36)]
-vmul.u32 Q1, Q1, r7
-// input[664]: Load as Q5
-vldrw.u32 Q5, [Q0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r10
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[644] from Q3
-vqrdmulh.s32 Q3, Q4, r8
-vsub.s32 Q1, Q5, Q2
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q4, r10
-// input[648]: Load as Q2
-vldrw.u32 Q2, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[700]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(-240)] -// Release input[696] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-304)] -// Release input[680] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(-368)] -// Release input[664] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[700]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[684]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r7 -// input[668]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r11,#(-432)] -// Release input[648] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[652]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[752]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 100)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(-224)] -// Release input[700] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-288)] -// Release input[684] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(-352)] -// Release input[668] from Q5 -vadd.s32 Q1, Q1, Q7 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[752]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[736]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 84)] -vmul.u32 Q2, Q2, r7 -// input[720]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 68)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r11,#(-416)] -// Release input[652] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[704]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)]!
-vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -// input[756]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-80)] -// Release input[736] from Q4 -vsub.s32 Q5, Q3, Q7 -vstrw.u32 Q5, [r11,#(-144)] -// Release input[720] from Q5 -vadd.s32 Q3, Q3, Q7 -// input[756]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[740]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q1, Q1, r7 -// input[724]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q3, [r11,#(-208)] -// Release input[704] from Q3 -vqrdmulh.s32 Q3, Q4, r8 -vsub.s32 Q1, Q5, Q2 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q4, r10 -// input[708]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q1, r4 -vsub.s32 Q4, Q2, Q3 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q1, r10 -// input[760]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q1, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q1, [r11,#(0)] -// Release input[756] from Q1 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vsub.s32 Q5, Q2, Q7 -vstrw.u32 Q5, [r11,#(-128)] -// Release input[724] from Q5 -vadd.s32 Q2, Q2, Q7 -// input[760]: Already loaded as Q3 -vqrdmulh.s32 Q1, Q3, r8 -// input[744]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q3, Q3, r7 -// input[728]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q1, Q3, r10 -vstrw.u32 Q2, [r11,#(-192)] -// Release input[708] from Q2 -vqrdmulh.s32 Q2, Q4, r8 -vsub.s32 Q3, Q5, Q1 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q2, Q4, r10 -// input[712]: Load as Q1 -vldrw.u32 Q1, [Q0, #(4 * 4)]!
-vqrdmulh.s32 Q6, Q3, r4 -vsub.s32 Q4, Q1, Q2 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q6, Q3, r10 -// input[764]: Load as Q2 -vldrw.u32 Q2, [Q0, #(4 * 52)] -vqrdmulh.s32 Q7, Q5, r6 -vsub.s32 Q3, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vqrdmlah.s32 Q7, Q5, r10 -vstrw.u32 Q4, [r11,#(-48)] -// Release input[744] from Q4 -vsub.s32 Q5, Q1, Q7 -vstrw.u32 Q5, [r11,#(-112)] -// Release input[728] from Q5 -vadd.s32 Q1, Q1, Q7 -// input[764]: Already loaded as Q2 -vqrdmulh.s32 Q3, Q2, r8 -// input[748]: Load as Q4 -vldrw.u32 Q4, [Q0, #(4 * 36)] -vmul.u32 Q2, Q2, r7 -// input[732]: Load as Q5 -vldrw.u32 Q5, [Q0, #(4 * 20)] -vqrdmlah.s32 Q3, Q2, r10 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[712] from Q1 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q2, Q5, Q3 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q4, r10 -// input[716]: Load as Q3 -vldrw.u32 Q3, [Q0, #(4 * 4)]! -vqrdmulh.s32 Q6, Q2, r4 -vsub.s32 Q4, Q3, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q6, Q2, r10 -vqrdmulh.s32 Q1, Q5, r6 -vsub.s32 Q2, Q4, Q6 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q4, Q4, Q6 -vstrw.u32 Q2, [r11,#(32)] -// Release input[764] from Q2 -vqrdmlah.s32 Q1, Q5, r10 -vstrw.u32 Q4, [r11,#(-32)] -// Release input[748] from Q4 -vsub.s32 Q5, Q3, Q1 -vstrw.u32 Q5, [r11,#(-96)] -// Release input[732] from Q5 -vadd.s32 Q3, Q3, Q1 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q1, Q0, r8 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r7 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q3, [r11,#(-160)] -// Release input[716] from Q3 -vqrdmulh.s32 Q3, Q2, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q3, Q2, r10 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q2, Q1, Q3 
-vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q3 -vqrdmlah.s32 Q5, Q0, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[28]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r8 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r7 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r10 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r8 -vsub.s32 Q3, Q4, Q0 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q2, r10 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q3, r4 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q3, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q3, r10 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q3, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[44]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q1, Q1, r7 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r4
-vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[60]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[60]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r7 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(240)] -// Release input[60] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q2, Q2, r7 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[92]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[92]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q1, Q1, r7 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[108]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(368)] -// Release input[92] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[108]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r7 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 
-vqrdmlah.s32 Q2, Q3, r10 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(432)] -// Release input[108] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q2, Q2, r7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r0,#(384)] -// Release input[96] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[140]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r7 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r0,#(448)] -// Release input[112] from Q0 -vqrdmulh.s32 Q0, Q3, r8
-vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[156]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[156]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-384)] -// Release input[156] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r7 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[188]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q1, Q1, r7 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(-368)] -// Release input[160] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q0, Q0, r7 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q2, Q2, r7 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r8 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q1, Q1, r7 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(-176)] -// Release input[208] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q0, Q0, r7 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[240]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[268]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vmul.u32 Q2, Q2, r7 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(-48)] -// Release input[240] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[284]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[284]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q1, Q1, r7 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(16)] -// Release input[256] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[300]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(128)] -// Release input[284] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[300]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[296]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 44)] -vmul.u32 Q0, Q0, r7 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(192)] -// Release input[300] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(176)] -// Release input[296] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[316]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[312]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 60)] -vmul.u32 Q2, Q2, r7 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[332]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 80)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(240)]
-// Release input[312] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[332]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmul.u32 Q1, Q1, r7 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[348]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(320)] -// Release input[332] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[348]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q0, Q0, r7 -// input[340]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[336]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 84)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[364]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(384)]
-// Release input[348] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(352)] -// Release input[340] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[364]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmul.u32 Q2, Q2, r7 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r14,#(336)] -// Release input[336] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[352]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[380]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(448)] -// Release input[364] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[380]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[376]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 124)] -vmul.u32 Q1, Q1, r7 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r14,#(400)] -// Release input[352] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[396]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -108)] -vqrdmulh.s32 Q6, Q4, r6
-vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-496)] -// Release input[380] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r14,#(496)] -// Release input[376] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[396]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[392]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -112)] -vmul.u32 Q0, Q0, r7 -// input[388]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[384]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -120)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[412]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -92)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-432)] -// Release input[396] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-448)] -// Release input[392] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-464)] -// Release input[388] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[412]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[408]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -96)] -vmul.u32 Q2, Q2, r7 -// input[404]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -100)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-480)] -// Release input[384] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[400]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -104)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r10 -// input[428]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-368)] -// Release input[412] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-384)] -// Release input[408] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-400)] -// Release input[404] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[428]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[424]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -80)] -vmul.u32 Q1, Q1, r7 -// input[420]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -84)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-416)] -// Release input[400] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[416]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -88)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[444]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -60)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-304)] -// Release input[428] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-320)] -// Release input[424] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-336)] -// Release input[420] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[444]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[440]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -64)] -vmul.u32 Q0, Q0, r7 -// input[436]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-352)] -// Release input[416] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[432]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 
-72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[460]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-240)] -// Release input[444] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-256)] -// Release input[440] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-272)] -// Release input[436] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[460]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[456]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -48)] -vmul.u32 Q2, Q2, r7 -// input[452]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -52)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-288)] -// Release input[432] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[448]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -56)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[476]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(-176)] -// Release input[460] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-192)] -// Release input[456] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-208)] -// Release input[452] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[476]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[472]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -32)] -vmul.u32 Q1, Q1, r7 -// input[468]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -36)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-224)] -// Release input[448] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 
Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[464]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * -40)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[492]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(-112)] -// Release input[476] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-128)] -// Release input[472] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(-144)] -// Release input[468] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[492]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[488]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -16)] -vmul.u32 Q0, Q0, r7 -// input[484]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -20)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(-160)] -// Release input[464] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[480]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * -24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[508]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 4)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-48)] -// Release input[492] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(-64)] -// Release input[488] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(-80)] -// Release input[484] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[508]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[504]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 0)] -vmul.u32 Q2, Q2, r7 -// input[500]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * -4)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(-96)] -// Release input[480] 
from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[496]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -8)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[524]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 20)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(16)] -// Release input[508] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(0)] -// Release input[504] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(-16)] -// Release input[500] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[524]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[520]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 16)] -vmul.u32 Q1, Q1, r7 -// input[516]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 12)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(-32)] -// Release input[496] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[512]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[540]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 36)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(80)] -// Release input[524] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(64)] -// Release input[520] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(48)] -// Release input[516] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[540]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[536]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 32)] -vmul.u32 Q0, Q0, r7 -// input[532]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 28)] 
-vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(32)] -// Release input[512] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[528]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[556]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(144)] -// Release input[540] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(128)] -// Release input[536] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(112)] -// Release input[532] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[556]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[552]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 48)] -vmul.u32 Q2, Q2, r7 -// input[548]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(96)] -// Release input[528] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[544]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[572]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 68)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(208)] -// Release input[556] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(192)] -// Release input[552] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(176)] -// Release input[548] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[572]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[568]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 64)] 
-vmul.u32 Q1, Q1, r7 -// input[564]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 60)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(160)] -// Release input[544] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[560]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[588]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 84)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(272)] -// Release input[572] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(256)] -// Release input[568] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(240)] -// Release input[564] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[588]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[584]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 80)] -vmul.u32 Q0, Q0, r7 -// input[580]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(224)] -// Release input[560] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[576]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 72)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[604]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 100)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(336)] -// Release input[588] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(320)] -// Release input[584] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(304)] -// Release input[580] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[604]: Already loaded as Q2 
-vqrdmulh.s32 Q0, Q2, r8 -// input[600]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 96)] -vmul.u32 Q2, Q2, r7 -// input[596]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(288)] -// Release input[576] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[592]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 88)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[620]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r12,#(400)] -// Release input[604] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(384)] -// Release input[600] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r12,#(368)] -// Release input[596] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[620]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[616]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * 112)] -vmul.u32 Q1, Q1, r7 -// input[612]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r12,#(352)] -// Release input[592] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[608]: Load as Q2 -vldrw.u32 Q2, [r12, #(4 * 104)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[636]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r12,#(464)] -// Release input[620] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r12,#(448)] -// Release input[616] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r12,#(432)] -// Release input[612] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 
-ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[636]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[632]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vmul.u32 Q0, Q0, r7 -// input[628]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r12,#(416)] -// Release input[608] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[624]: Load as Q1 -vldrw.u32 Q1, [r12, #(4 * 120)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[652]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -104)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-480)] -// Release input[636] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-496)] -// Release input[632] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r12,#(496)] -// Release input[628] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[652]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[648]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vmul.u32 Q2, Q2, r7 -// input[644]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r12,#(480)] -// Release input[624] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[640]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[668]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -88)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-416)] -// Release input[652] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-432)] -// Release input[648] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 
Q4, [r11,#(-448)] -// Release input[644] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[668]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[664]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -92)] -vmul.u32 Q1, Q1, r7 -// input[660]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -96)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-464)] -// Release input[640] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[656]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[684]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-352)] -// Release input[668] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-368)] -// Release input[664] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-384)] -// Release input[660] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[684]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[680]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vmul.u32 Q0, Q0, r7 -// input[676]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -// Release input[656] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[672]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[700]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-288)] -// Release input[684] from Q0 -vqrdmlah.s32 Q6, Q4, r10 
-vstrw.u32 Q3, [r11,#(-304)] -// Release input[680] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release input[676] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[700]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[696]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -60)] -vmul.u32 Q2, Q2, r7 -// input[692]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -64)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r11,#(-336)] -// Release input[672] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[688]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[716]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-224)] -// Release input[700] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-240)] -// Release input[696] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-256)] -// Release input[692] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[716]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[712]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vmul.u32 Q1, Q1, r7 -// input[708]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-272)] -// Release input[688] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[704]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -// input[732]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, 
Q3, Q5 -vstrw.u32 Q1, [r11,#(-160)] -// Release input[716] from Q1 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-176)] -// Release input[712] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release input[708] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[732]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r8 -// input[728]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -28)] -vmul.u32 Q0, Q0, r7 -// input[724]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vqrdmlah.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-208)] -// Release input[704] from Q2 -vqrdmulh.s32 Q2, Q3, r8 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r10 -// input[720]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q5, Q0, r4 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r3 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r10 -// input[748]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-96)] -// Release input[732] from Q0 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-112)] -// Release input[728] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-128)] -// Release input[724] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[748]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r8 -// input[744]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vmul.u32 Q2, Q2, r7 -// input[740]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vqrdmlah.s32 Q0, Q2, r10 -vstrw.u32 Q1, [r11,#(-144)] -// Release input[720] from Q1 -vqrdmulh.s32 Q1, Q3, r8 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r10 -// input[736]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -20)] -vqrdmulh.s32 Q5, Q2, r4 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r3 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r10 -// input[764]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 
* 8)] -vqrdmulh.s32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-32)] -// Release input[748] from Q2 -vqrdmlah.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-48)] -// Release input[744] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-64)] -// Release input[740] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// input[764]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r8 -// input[760]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vmul.u32 Q1, Q1, r7 -// input[756]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 0)] -vqrdmlah.s32 Q2, Q1, r10 -vstrw.u32 Q0, [r11,#(-80)] -// Release input[736] from Q0 -vqrdmulh.s32 Q0, Q3, r8 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r10 -// input[752]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vqrdmulh.s32 Q5, Q1, r4 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r3 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r10 -vqrdmulh.s32 Q0, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(32)] -// Release input[764] from Q1 -vqrdmlah.s32 Q0, Q4, r10 -vstrw.u32 Q3, [r11,#(16)] -// Release input[760] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(0)] -// Release input[756] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -// Release input[752] from Q2 -.equ modulus_inv, 3919317503 -movw r8, #:lower16:modulus_inv -movt r8, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 7346 -// Instruction count: 5662 \ No newline at end of file diff --git a/tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_complete.s b/tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index 02c5c34..0000000 --- a/tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,3394 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// 
SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots_inv: -.word 20558213 -.word 66424611 -.word 59465515 -.word 39560591 -.word 2042724475 -.word 2817904349 -.word 2405453525 -.word 2621436017 -.word 35339857 -.word 13377101 -.word 33252123 -.word 16713319 -.word 10815985 -.word 56247925 -.word 26943959 -.word 51316823 -.word 3650773007 -.word 4021439371 -.word 1538999337 -.word 3611844009 -.word 42042379 -.word 26419651 -.word 61522009 -.word 23758817 -.word 2254105077 -.word 3415374909 -.word 3742677415 -.word 3187687967 -.word 35776599 -.word 6731445 -.word 3030459 -.word 41085059 -.word 6685305 -.word 24840267 -.word 21119839 -.word 32376869 -.word 2658056071 -.word 495707573 -.word 440627873 -.word 3991890395 -.word 11319751 -.word 57449959 -.word 47736605 -.word 25310795 -.word 316214329 -.word 2994890777 -.word 2883238627 -.word 1834006453 -.word 5649915 -.word 25847843 -.word 62444027 -.word 57855139 -.word 43953263 -.word 3973257 -.word 45754835 -.word 47438647 -.word 1254205841 -.word 3800349047 -.word 3397129261 -.word 3896527561 -.word 34946213 -.word 33401995 -.word 57707227 -.word 43655235 -.word 4090836315 -.word 2389950837 -.word 1383072549 -.word 2793176509 -.word 30218957 -.word 13073717 -.word 41547715 -.word 51082899 -.word 6539853 -.word 52712977 -.word 15171525 -.word 41070365 -.word 1097807795 -.word 1402229743 -.word 857879099 -.word 2467328739 -.word 1421525 -.word 5608953 -.word 3344309 -.word 54192527 -.word 2006884651 -.word 1547838471 -.word 1835403851 -.word 3288902769 -.word 55532487 -.word 25878283 -.word 7519477 -.word 10400227 -.word 66449241 -.word 4428811 -.word 30618985 -.word 46942975 -.word 1923058343 -.word 3711490549 -.word 1530848407 -.word 3263539969 -.word 34238409 -.word 7278675 -.word 26316985 -.word 1738533 -.word 1976527415 -.word 3553111469 -.word 1070704967 -.word 280554203 -.word 29493541 -.word 46179537 -.word 61070425 -.word 47641435 -.word 8700655 -.word 49217369 -.word 14037329 -.word 57068693 -.word 2143064849 -.word 3997596327 -.word 
594737327 -.word 1214449003 -.word 5988919 -.word 27781261 -.word 33650523 -.word 40314383 -.word 2046739401 -.word 2556008819 -.word 2602309285 -.word 3711528945 -.word 25356533 -.word 59712043 -.word 59431885 -.word 42783775 -.word 15118727 -.word 16104593 -.word 66551101 -.word 27099659 -.word 256676985 -.word 2042883439 -.word 2098783427 -.word 1730866165 -.word 52622279 -.word 48542309 -.word 28412919 -.word 61490063 -.word 111596089 -.word 2392801179 -.word 122296841 -.word 4112339569 -.word 17544659 -.word 26761761 -.word 28138345 -.word 6006005 -.word 49338991 -.word 59052279 -.word 54131019 -.word 49172137 -.word 2285599633 -.word 1420334345 -.word 1832318133 -.word 203443031 -.word 41164657 -.word 23553921 -.word 51075303 -.word 11244857 -.word 2292337295 -.word 2218762879 -.word 3660688665 -.word 2196022471 -.word 27161421 -.word 12259351 -.word 42183787 -.word 260949 -.word 49379395 -.word 45318697 -.word 65417737 -.word 60522221 -.word 2945787325 -.word 2724075479 -.word 2827626487 -.word 482722579 -.word 3629237 -.word 60326323 -.word 30569867 -.word 31921231 -.word 3571167563 -.word 3851189325 -.word 1517877365 -.word 1275593137 -.word 51477925 -.word 23177153 -.word 42516129 -.word 23261199 -.word 50523083 -.word 29024109 -.word 62634975 -.word 5116371 -.word 2363949621 -.word 2792055443 -.word 3296655905 -.word 4093127725 -.word 55626043 -.word 15630981 -.word 43717491 -.word 14342369 -.word 2004845765 -.word 3862343547 -.word 2436590221 -.word 2109337887 -.word 6776583 -.word 33530533 -.word 43598203 -.word 59373651 -.word 37946425 -.word 47668559 -.word 10775673 -.word 3826249 -.word 262354375 -.word 703707313 -.word 2790542727 -.word 2635626423 -.word 53733071 -.word 10734019 -.word 25306471 -.word 54139625 -.word 284438321 -.word 3541161021 -.word 2646073497 -.word 3100573463 -.word 1468391 -.word 4426959 -.word 42735737 -.word 38665093 -.word 33133879 -.word 7139481 -.word 8438111 -.word 50341189 -.word 3126759625 -.word 523569511 -.word 
1408300193 -.word 2172685499 -.word 47558821 -.word 33268441 -.word 63536237 -.word 26272521 -.word 664584539 -.word 2409420583 -.word 3799958931 -.word 835286775 -.word 1854317 -.word 2223865 -.word 22962475 -.word 36888515 -.word 59868297 -.word 15191207 -.word 59108143 -.word 4355773 -.word 538432887 -.word 3252336985 -.word 1330506449 -.word 4169984835 -.word 27411989 -.word 52176833 -.word 52660121 -.word 23140553 -.word 652643307 -.word 4178403903 -.word 1113879143 -.word 3574776119 -.word 50275685 -.word 12903773 -.word 25228433 -.word 55395235 -.word 3868449 -.word 66432231 -.word 31236859 -.word 13658415 -.word 2938651359 -.word 814700825 -.word 1618291461 -.word 49245393 -.word 34409967 -.word 12619783 -.word 54561811 -.word 61632377 -.word 2233616401 -.word 2820912633 -.word 684470765 -.word 3345631879 -.word 7605279 -.word 58319315 -.word 16342937 -.word 48148431 -.word 62377755 -.word 35459369 -.word 27513701 -.word 18346679 -.word 4057153253 -.word 3867838679 -.word 589962907 -.word 1692873545 -.word 1824951 -.word 40410247 -.word 25935987 -.word 53409853 -.word 3034533193 -.word 1425582457 -.word 1695333773 -.word 2628741571 -.word 44896477 -.word 66621379 -.word 35702907 -.word 44158149 -.word 32881793 -.word 18033685 -.word 29367795 -.word 16787671 -.word 3741535615 -.word 3094455787 -.word 3934216205 -.word 2459712809 -.word 57730785 -.word 3752846111 -.word 42601623 -.word 2096617833 -.word 43352521 -.word 3690485815 -.word 59392861 -.word 348348067 -.word 65052633 -.word 2878986791 -.word 58217677 -.word 4056132915 -.word 57130935 -.word 1821992521 -.word 14439459 -.word 3133213149 -.word 30030779 -.word 2105479749 -.word 3784291 -.word 1619664797 -.word 48646815 -.word 736619361 -.word 15892551 -.word 1112819129 -.word 50479773 -.word 2420367203 -.word 20532335 -.word 3597076881 -.word 46242673 -.word 523030159 -.word 58797193 -.word 3703057783 -.word 34903951 -.word 1308294769 -.word 48022295 -.word 1841701609 -.word 62080381 -.word 439327875 
-.word 55892463
-.word 2714926097
-.word 5286953
-.word 1617227223
-.word 40872659
-.word 2110821165
-.word 42133307
-.word 3044632261
-.word 54343827
-.word 2777449837
-.word 6014597
-.word 2607901563
-.word 25291403
-.word 2170258293
-.word 14166063
-.word 3026038225
-.word 31380141
-.word 2294804307
-.word 31709009
-.word 3537982127
-.word 12550399
-.word 749630721
-.word 21111903
-.word 890081697
-.word 65984707
-.word 3797730621
-.word 52266271
-.word 2046406881
-.word 12778219
-.word 1732129557
-.word 39517177
-.word 3090726407
-.word 12656259
-.word 1564737405
-.word 56722355
-.word 4158093901
-.word 27185869
-.word 1015623475
-.word 14750755
-.word 3929819613
-.word 65797823
-.word 1198225217
-.word 13164949
-.word 3956469867
-.word 1145583
-.word 526375697
-.word 12271567
-.word 2565264945
-.word 22449375
-.word 789457185
-.word 31982975
-.word 4273139841
-.word 35394733
-.word 622767443
-.word 23998611
-.word 2970324333
-.word 62038423
-.word 3718333545
-.word 32686385
-.word 3430230223
-.word 58757463
-.word 3257980073
-.word 41196349
-.word 2848442051
-.word 2430825
-.word 1203831447
-.word 26613973
-.word 3660462379
-.word 7832335
-.word 785060593
-.word 62228979
-.word 1321333773
-.word 12542317
-.word 209379475
-.word 18302687
-.word 244412193
-.word 48515911
-.word 1716550329
-.word 21796399
-.word 1211449297
-.word 27114239
-.word 840186625
-.word 33591847
-.word 2457400281
-.word 23796181
-.word 3361945643
-.word 52637069
-.word 1938838643
-.text
-.align 4
-roots_addr: .word roots_inv
-.syntax unified
-.type inv_ntt_n256_u32_33556993_28678040, %function
-.global inv_ntt_n256_u32_33556993_28678040
-inv_ntt_n256_u32_33556993_28678040:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-.equ modulus_inv, 3919317503
-movw r4, #:lower16:modulus_inv
-movt r4, #:upper16:modulus_inv
-vldrw.s32 Q4, [r0, #0]
-vldrw.s32 Q5, [r0, #16]
-vsub.s32 Q6, Q4, Q5
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vqrdmulh.s32 Q5, Q6, Q5
-vmul.u32 Q6, Q6, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q6, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q6, Q1, Q2
-vsub.s32 Q3, Q4, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q4, Q7
-vqrdmlah.s32 Q6, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q4, Q5, Q6
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q6
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q4, Q7
-vldrw.s32 Q6, [r0, #(64+0)]
-vmul.u32 Q4, Q4, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q4, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q4, Q6, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q6, Q6, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q4, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q4, Q4, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q4, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q4, Q1, Q2
-vsub.s32 Q3, Q6, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q6, Q7
-vqrdmlah.s32 Q4, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q6, Q5, Q4
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q4
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q6, Q7
-vldrw.s32 Q4, [r0, #(64+0)]
-vmul.u32 Q6, Q6, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q6, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q6, Q4, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q6, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vmul.u32 Q6, Q6, Q5 -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vst41.s32 {Q0,Q1,Q2,Q3}, [r0] -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vst43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-sub r0, r0, #1024
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q0, Q2, Q3
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vadd.s32 Q2, Q2, Q3
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 12)]
-vqrdmulh.s32 Q3, Q0, r8
-vsub.s32 Q1, Q4, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q4, Q4, Q5
-vqrdmlah.s32 Q3, Q0, r12
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q2, Q4
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q5, Q1, r12
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 16)]
-vqrdmulh.s32 Q4, Q0, r10
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q4, Q0, r12
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[16]: Already loaded as Q6
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vadd.s32 Q6, Q6, Q7
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q6, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[36]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 36)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[32]: Already loaded as Q4
-// input[36]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vadd.s32 Q4, Q4, Q5
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r8
-vsub.s32 Q1, Q2, Q6
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q6
-vqrdmlah.s32 Q5, Q0, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q6, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q6, Q1, r12
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q5, Q6
-vmul.u32 Q0, Q0, r9
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vadd.s32 Q5, Q5, Q6
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q6, Q1, r10
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q5, [r0,#(144)]
-// Release input[36] from Q5
-vqrdmlah.s32 Q6, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q3
-// input[52]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vadd.s32 Q3, Q3, Q7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q6, [r0,#(176)]
-// Release input[44] from Q6
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(208)] -// Release input[52] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q5 -// input[68]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.s32 Q5, Q5, Q6 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(240)] -// Release input[60] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(256)] -// Release input[64] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(272)] -// Release input[68] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[80]: Already loaded as Q4 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.s32 Q4, Q4, Q7 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 
-vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[96]: Already loaded as Q3 -// input[100]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q3, Q3, Q6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(400)] -// Release input[100] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vadd.s32 Q5, Q5, Q7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -120)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(464)]
-// Release input[116] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q4
-// input[132]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vadd.s32 Q4, Q4, Q6
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -8)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 0)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 16)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(0)] -// Release input[0] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Already loaded as Q4 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q4, Q4, Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Already loaded as Q3 -// input[24]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q3, Q3, Q6 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(208)] -// Release input[52] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, 
Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(96)] -// Release input[24] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[12]: Already loaded as Q5 -// input[28]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q5, Q5, Q7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(224)] -// Release input[56] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 64)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(112)] -// Release input[28] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q4 -// input[80]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vadd.s32 Q4, Q4, Q6 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 
* 84)] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(256)] -// Release input[64] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(320)] -// Release input[80] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -// input[68]: Already loaded as Q3 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q3, Q3, Q7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(448)] -// Release input[112] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[72]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 72)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(400)] -// Release input[100] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -// input[72]: Already loaded as Q5 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q5, Q5, Q6 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vadd.s32 
Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(288)] -// Release input[72] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(352)] -// Release input[88] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[76]: Already loaded as Q4 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q4, Q4, Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -108)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(368)] -// Release input[92] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q3 -// input[144]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q3, Q3, Q6 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(496)] -// Release input[124] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[132]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -120)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 
Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-432)] -// Release input[144] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[132]: Already loaded as Q5 -// input[148]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q5, Q5, Q7 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-304)] -// Release input[176] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-480)] -// Release input[132] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-416)] -// Release input[148] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -// input[136]: Already loaded as Q4 -// input[152]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q4, Q4, Q6 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmulh.s32 Q2, Q0, r10 
-vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-400)] -// Release input[152] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -// input[140]: Already loaded as Q3 -// input[156]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q7 -// input[188]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[192]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -60)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[208]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -44)] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-384)] -// Release input[156] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Already loaded as Q5 -// input[208]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vadd.s32 Q5, Q5, Q6 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(-256)] -// Release input[188] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 
-vqrdmlah.s32 Q3, Q1, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-240)] -// Release input[192] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-176)] -// Release input[208] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[196]: Already loaded as Q4 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.s32 Q4, Q4, Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -36)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-160)] -// Release input[212] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[200]: Already loaded as Q3 -// input[216]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vadd.s32 Q3, Q3, Q6 -// input[248]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r14,#(-32)] -// Release input[244] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 
-vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[204]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -48)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[220]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -32)] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-144)] -// Release input[216] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[204]: Already loaded as Q5 -// input[220]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -16)] -vadd.s32 Q5, Q5, Q7 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-16)] -// Release input[248] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[0]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 0)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 64)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-192)] -// Release input[204] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-128)] -// Release input[220] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -// Release input[0] from Q4 -// Release input[64] from Q6 -mov r10, #0 -.equ const_barrett, 63 -movw r9, #:lower16:const_barrett -movt r9, #:upper16:const_barrett -vidup.u32 Q0, r10, #1 -vshl.u32 Q0, Q0, #6 -vldrw.32 Q1, [r0, Q0, UXTW #2] -vqrdmulh.s32 Q2, Q1, r9 -neg r12, r12 -vmla.s32 Q1, Q2, r12 -neg r12, r12 -vstrw.32 Q1, [r0, Q0, UXTW #2] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, 
r5, [r11], #+8 -mov r11, #0 -.equ q_half, 16778496 -movw r4, #:lower16:q_half -movt r4, #:upper16:q_half -.equ pow_2_n_mod_q, 50334209 -movw r3, #:lower16:pow_2_n_mod_q -movt r3, #:upper16:pow_2_n_mod_q -.equ pow_2_n_mod_q_twisted, 4278190079 -movw r2, #:lower16:pow_2_n_mod_q_twisted -movt r2, #:upper16:pow_2_n_mod_q_twisted -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vsub.s32 Q2, Q0, Q1 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vadd.s32 Q0, Q0, Q1 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q2, r8 -vsub.s32 Q5, Q3, Q4 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q4 -vqrdmlah.s32 Q1, Q2, r12 -vqrdmulh.s32 Q4, Q5, r6 -vsub.s32 Q2, Q0, Q3 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q4, Q5, r12 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q2, r10 -vsub.s32 Q6, Q1, Q4 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q3, Q2, r12 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q0, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q0, Q0, r2 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vqrdmlah.s32 Q2, Q0, r12 -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q1, Q1, r2 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(0)] -vqrdmulh.s32 Q4, Q6, r10 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q1 -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[132]: Already loaded as Q3 -vqrdmlah.s32 Q4, Q6, r12 -vadd.s32 Q5, Q5, Q7 -// input[196]: 
Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q2, Q3, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q4, Q4, #1 -vpt.s32 LT, Q4, r11 -vaddt.s32 Q4, Q4, r12 -vpt.s32 GE, Q4, r4 -vsubt.s32 Q4, Q4, r12 -vstrw.u32 Q4, [r14,#(-240)] -// Release input[192] from Q4 -vqrdmulh.s32 Q1, Q2, r6 -vsub.s32 Q0, Q5, Q3 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q2, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q3, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q3, Q0, r12 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 72)] -vqrdmulh.s32 Q0, Q5, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q5, Q5, r2 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vqrdmlah.s32 Q0, Q5, r12 -// Release input[4] from Q5 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(16)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q7 -// input[8]: Already loaded as Q2 -// input[72]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[136]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 
-vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 76)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-464)] -// Release input[136] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(32)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q6 -// input[12]: Already loaded as Q1 -// input[76]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[140]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 
-vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(48)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q7 -// input[16]: Already loaded as Q3 -// input[80]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[144]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vqrdmlah.s32 
Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(64)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q6 -// input[20]: Already loaded as Q2 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[148]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(80)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q7 -// input[24]: Already loaded as Q1 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[152]: Already loaded as Q5 
-vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 
-vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 100)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, 
#1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// 
input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 108)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already 
loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 
-vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: 
Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 120)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 124)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 
-vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, 
[r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3170 -// Instruction count: 2670 \ No newline at end of file diff --git a/tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s b/tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s deleted file mode 100644 index 9b20445..0000000 --- a/tests/ntt_n256/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,2526 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots_inv: -.word 57730785 -.word 3752846111 -.word 42601623 -.word 2096617833 -.word 43352521 -.word 3690485815 -.word 59392861 -.word 348348067 -.word 65052633 -.word 2878986791 -.word 58217677 -.word 4056132915 -.word 57130935 -.word 1821992521 -.word 14439459 -.word 3133213149 -.word 30030779 -.word 2105479749 -.word 3784291 -.word 1619664797 -.word 48646815 -.word 736619361 -.word 15892551 -.word 1112819129 -.word 50479773 -.word 2420367203 -.word 20532335 -.word 3597076881 -.word 46242673 -.word 523030159 -.word 58797193 -.word 3703057783 -.word 34903951 -.word 1308294769 -.word 48022295 -.word 1841701609 -.word 62080381 -.word 439327875 -.word 55892463 -.word 2714926097 -.word 5286953 -.word 1617227223 -.word 40872659 -.word 2110821165 -.word 42133307 -.word 3044632261 -.word 54343827 -.word 2777449837 -.word 6014597 -.word 2607901563 -.word 25291403 -.word 2170258293 -.word 14166063 -.word 3026038225 -.word 31380141 -.word 2294804307 -.word 31709009 -.word 3537982127 -.word 12550399 -.word 749630721 -.word 21111903 -.word 890081697 -.word 65984707 -.word 3797730621 -.word 52266271 -.word 2046406881 -.word 12778219 -.word 1732129557 -.word 39517177 -.word 3090726407 -.word 12656259 -.word 1564737405 -.word 56722355 -.word 4158093901 -.word 27185869 -.word 1015623475 -.word 14750755 -.word 3929819613 -.word 65797823 -.word 1198225217 -.word 13164949 -.word 3956469867 -.word 1145583 -.word 526375697 -.word 12271567 -.word 2565264945 -.word 22449375 -.word 789457185 -.word 31982975 -.word 4273139841 -.word 35394733 -.word 622767443 -.word 23998611 -.word 2970324333 -.word 62038423 -.word 3718333545 -.word 32686385 -.word 3430230223 -.word 58757463 -.word 3257980073 -.word 41196349 -.word 2848442051 -.word 2430825 -.word 1203831447 -.word 26613973 -.word 3660462379 -.word 7832335 -.word 785060593 -.word 62228979 -.word 1321333773 -.word 12542317 -.word 209379475 -.word 18302687 -.word 244412193 -.word 48515911 -.word 1716550329 -.word 21796399 
-.word 1211449297 -.word 27114239 -.word 840186625 -.word 33696409 -.word 1239666535 -.word 23796181 -.word 3361945643 -.word 52637069 -.word 1938838643 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_n256_u32_33556993_28678040_incomplete, %function -.global inv_ntt_n256_u32_33556993_28678040_incomplete -inv_ntt_n256_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q0, Q2, Q3 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vadd.s32 Q2, Q2, Q3 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vqrdmulh.s32 Q3, Q0, r8 -vsub.s32 Q1, Q4, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q4, Q4, Q5 -vqrdmlah.s32 Q3, Q0, r12 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q2, Q4 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q5, Q1, r12 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmulh.s32 Q4, Q0, r10 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q4, Q0, r12 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[16]: Already loaded as Q6 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vadd.s32 Q6, Q6, Q7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, 
Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q6, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[32]: Already loaded as Q4 -// input[36]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q4, Q4, Q5 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r8 -vsub.s32 Q1, Q2, Q6 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q6 -vqrdmlah.s32 Q5, Q0, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q6, Q1, r12 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q5, Q6 -vmul.u32 Q0, Q0, r9 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vadd.s32 Q5, Q5, Q6 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q6, Q1, r10 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q5, [r0,#(144)] -// Release input[36] from Q5 -vqrdmlah.s32 Q6, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q3 -// input[52]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 
56)] -vadd.s32 Q3, Q3, Q7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q6, [r0,#(176)] -// Release input[44] from Q6 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(208)] -// Release input[52] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q5 -// input[68]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.s32 Q5, Q5, Q6 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(240)] -// Release input[60] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(256)] -// Release input[64] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(272)] -// Release input[68] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[80]: 
Already loaded as Q4 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.s32 Q4, Q4, Q7 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[96]: Already loaded as Q3 -// input[100]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q3, Q3, Q6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(400)] -// 
Release input[100] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vadd.s32 Q5, Q5, Q7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -124)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(448)] -// Release input[112] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(464)] -// Release input[116] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q4 -// input[132]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vadd.s32 Q4, Q4, Q6 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] 
-// Release input[136] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-496)] -// Release input[128] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Already loaded as Q3 -// input[148]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vadd.s32 Q3, Q3, Q7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-416)] -// Release input[148] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[160]: Already loaded as Q5 -// input[164]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q5, Q5, Q6 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vqrdmulh.s32 Q2, Q0, r10 
-vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[180]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -72)] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-352)] -// Release input[164] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q4 -// input[180]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vadd.s32 Q4, Q4, Q7 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[196]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-304)] -// Release input[176] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-288)] -// Release input[180] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Already loaded as Q3 -// input[196]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vadd.s32 Q3, Q3, Q6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmulh.s32 Q4, Q1, 
r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-224)] -// Release input[196] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[208]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.s32 Q5, Q5, Q7 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -28)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -24)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-176)] -// Release input[208] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-160)] -// Release input[212] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[224]: Already loaded as Q4 -// input[228]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vadd.s32 Q4, Q4, Q6 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, 
Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[244]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -8)] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-96)] -// Release input[228] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q3 -// input[244]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vadd.s32 Q3, Q3, Q7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-64)] -// Release input[236] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[0]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 0)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-32)] -// Release input[244] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Already loaded as Q5 -// input[16]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// 
input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q5, Q5, Q6 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(0)] -// Release input[252] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(0)] -// Release input[0] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Already loaded as Q4 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q4, Q4, Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Already loaded as Q3 -// input[24]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[40]: Load as Q2 -vldrw.u32 Q2, 
[r0, #(4 * 40)]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 12)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 28)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 72)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 104)]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 108)]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -108)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -120)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -96)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -44)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -36)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -48)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -32)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 0)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 64)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 33551871
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 4227858433
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1, Q2, r12
-vqrdmulh.s32 Q4, Q5, r6
-vsub.s32 Q2, Q0, Q3
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q4, Q5, r12
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q2, r10
-vsub.s32 Q6, Q1, Q4
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q3, Q2, r12
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q0, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q0, Q0, r2
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmlah.s32 Q2, Q0, r12
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q1, Q1, r2
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(0)]
-vqrdmulh.s32 Q4, Q6, r10
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q1
-// input[4]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[132]: Already loaded as Q3
-vqrdmlah.s32 Q4, Q6, r12
-vadd.s32 Q5, Q5, Q7
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q2, Q3, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q4, Q4, #1
-vpt.s32 LT, Q4, r11
-vaddt.s32 Q4, Q4, r12
-vpt.s32 GE, Q4, r4
-vsubt.s32 Q4, Q4, r12
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vqrdmulh.s32 Q1, Q2, r6
-vsub.s32 Q0, Q5, Q3
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q2, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q3, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q3, Q0, r12
-// input[72]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 72)]
-vqrdmulh.s32 Q0, Q5, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q5, Q5, r2
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmlah.s32 Q0, Q5, r12
-// Release input[4] from Q5
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(16)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q7
-// input[8]: Already loaded as Q2
-// input[72]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[136]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-464)]
-// Release input[136] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(32)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q6
-// input[12]: Already loaded as Q1
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[140]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(48)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q7
-// input[16]: Already loaded as Q3
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[144]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-432)]
-// Release input[144] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[16] from Q3
-vqrdmulh.s32 Q3, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vqrdmlah.s32 Q3, Q6, r12
-vstrw.u32 Q0, [r0,#(64)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q6
-// input[20]: Already loaded as Q2
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q2, Q7
-// input[148]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q7
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-416)]
-// Release input[148] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[152]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -100)]
-vqrdmlah.s32 Q2, Q7, r12
-vstrw.u32 Q0, [r0,#(80)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(336)]
-// Release input[84] from Q7
-// input[24]: Already loaded as Q1
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q1, Q6
-// input[152]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q6
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-400)]
-// Release input[152] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q6, r12
-vstrw.u32 Q0, [r0,#(96)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(352)]
-// Release input[88] from Q6
-// input[28]: Already loaded as Q3
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[156]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q7
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[96]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 96)]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-384)]
-// Release input[156] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[28] from Q3
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(112)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q7
-// input[32]: Already loaded as Q2
-// input[96]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[160]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[100]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 100)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[164]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -88)]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(128)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q6
-// input[36]: Already loaded as Q1
-// input[100]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[164]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[104]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 104)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-352)]
-// Release input[164] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[168]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(144)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(400)]
-// Release input[100] from Q7
-// input[40]: Already loaded as Q3
-// input[104]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[168]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[108]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 108)]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-336)]
-// Release input[168] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[40] from Q3
-vqrdmulh.s32 Q3, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vqrdmlah.s32 Q3, Q6, r12
-vstrw.u32 Q0, [r0,#(160)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q6
-// input[44]: Already loaded as Q2
-// input[108]: Already loaded as Q7
-vsub.s32 Q0, Q2, Q7
-// input[172]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q7
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-80)]
-// Release input[232] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[112]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 112)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-320)]
-// Release input[172] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[176]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -76)]
-vqrdmlah.s32 Q2, Q7, r12
-vstrw.u32 Q0, [r0,#(176)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q7
-// input[48]: Already loaded as Q1
-// input[112]: Already loaded as Q6
-vsub.s32 Q0, Q1, Q6
-// input[176]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q6
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-304)]
-// Release input[176] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[180]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q6, r12
-vstrw.u32 Q0, [r0,#(192)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q6
-// input[52]: Already loaded as Q3
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[180]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q7
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[120]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 120)]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-288)]
-// Release input[180] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[52] from Q3
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(208)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q7
-// input[56]: Already loaded as Q2
-// input[120]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[184]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[124]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 124)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(224)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q6
-// input[60]: Already loaded as Q1
-// input[124]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[188]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q3, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-//
Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2302 -// Instruction count: 1802 \ No newline at end of file diff --git a/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_complete.s b/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index 9fbf5a9..0000000 --- a/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,2889 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 -.word 3280343807 -.word 14476917 -.word 2356128651 -.word 43317805 -.word 933021651 -.word 18598075 -.word 2578416965 -.word 39999747 -.word 3454780669 -.word 45317587 -.word 3083517997 -.word 4885007 -.word 2973633521 -.word 48811299 -.word 4050555101 -.word 54571669 -.word 4085587819 -.word 64683161 -.word 3091135847 -.word 59281651 -.word 3509906701 -.word 40500013 -.word 634504915 -.word 34427601 -.word 864737071 -.word 25917637 -.word 1446525243 -.word 8356523 -.word 1036987221 -.word 31719253 -.word 3672199851 -.word 5075563 -.word 576633749 -.word 43115375 -.word 1324642961 -.word 54842419 -.word 1729702349 -.word 35131011 -.word 21827453 -.word 44664611 -.word 3505510109 -.word 1316163 -.word 3096742077 -.word 65968403 -.word 3768591597 -.word 53949037 -.word 338497427 -.word 10391631 -.word 136873393 -.word 52363231 -.word 365147681 -.word 39928117 -.word 3279343819 -.word 54335767 -.word 2562837737 -.word 54457727 -.word 2730229889 -.word 27596809 -.word 1204240887 -.word 46002083 -.word 3404885597 -.word 14847715 -.word 2248560413 -.word 1129279 -.word 497236673 -.word 35733845 -.word 2000162987 -.word 54563587 -.word 3545336573 -.word 35404977 -.word 756985167 -.word 61099389 -.word 1687065731 -.word 52947923 -.word 1268929069 -.word 41822583 -.word 2124709001 -.word 26241327 -.word 2184146129 -.word 12770159 -.word 1517517457 
-.word 24980679 -.word 1250335033 -.word 5033605 -.word 3855639419 -.word 61827033 -.word 2677740071 -.word 11221523 -.word 1580041197 -.word 8316793 -.word 591909511 -.word 19091691 -.word 2453265685 -.word 32210035 -.word 2986672525 -.word 16634213 -.word 1874600091 -.word 20871313 -.word 3771937135 -.word 46581651 -.word 697890413 -.word 63329695 -.word 2675302497 -.word 51221435 -.word 3182148165 -.word 18467171 -.word 3558347933 -.word 9983051 -.word 2472974773 -.word 37083207 -.word 2189487545 -.word 52674527 -.word 1161754145 -.word 7721125 -.word 3946619227 -.word 8896309 -.word 238834379 -.word 2061353 -.word 1415980503 -.word 9383201 -.word 542121183 -.word 23761465 -.word 604481479 -.word 24512363 -.word 2198349461 -.word 13704133 -.word 41177999 -.word 26703739 -.word 65289035 -.word 1666225723 -.word 2599633521 -.word 2869384837 -.word 1260434101 -.word 50326315 -.word 37746191 -.word 49080301 -.word 34232193 -.word 1835254485 -.word 360751089 -.word 1200511507 -.word 553431679 -.word 22955837 -.word 31411079 -.word 492607 -.word 22217509 -.word 5481609 -.word 12552175 -.word 54494203 -.word 32704019 -.word 949335415 -.word 3610496529 -.word 1474054661 -.word 2061350893 -.word 48767307 -.word 39600285 -.word 31654617 -.word 4736231 -.word 2602093749 -.word 3705004387 -.word 427128615 -.word 237814041 -.word 18965555 -.word 50771049 -.word 8794671 -.word 59508707 -.word 43973433 -.word 14453865 -.word 14937153 -.word 39701997 -.word 720191175 -.word 3181088151 -.word 116563391 -.word 3642323987 -.word 53455571 -.word 35877127 -.word 681755 -.word 63245537 -.word 4245721901 -.word 2676675833 -.word 3480266469 -.word 1356315935 -.word 11718751 -.word 41885553 -.word 54210213 -.word 16838301 -.word 40841465 -.word 3577749 -.word 33845545 -.word 19555165 -.word 3459680519 -.word 495008363 -.word 1885546711 -.word 3630382755 -.word 62758213 -.word 8005843 -.word 51922779 -.word 7245689 -.word 124982459 -.word 2964460845 -.word 1042630309 -.word 3756534407 
-.word 30225471 -.word 44151511 -.word 64890121 -.word 65259669 -.word 12974361 -.word 41807515 -.word 56379967 -.word 13380915 -.word 1194393831 -.word 1648893797 -.word 753806273 -.word 4010528973 -.word 16772797 -.word 58675875 -.word 59974505 -.word 33980107 -.word 2122281795 -.word 2886667101 -.word 3771397783 -.word 1168207669 -.word 28448893 -.word 24378249 -.word 62687027 -.word 65645595 -.word 52771617 -.word 23396495 -.word 51483005 -.word 11487943 -.word 2185629407 -.word 1858377073 -.word 432623747 -.word 2290121529 -.word 63287737 -.word 56338313 -.word 19445427 -.word 29167561 -.word 1659340871 -.word 1504424567 -.word 3591259981 -.word 4032612919 -.word 7740335 -.word 23515783 -.word 33583453 -.word 60337403 -.word 35192755 -.word 36544119 -.word 6787663 -.word 63484749 -.word 3019374157 -.word 2777089929 -.word 443777969 -.word 723799731 -.word 61997615 -.word 4479011 -.word 38089877 -.word 16590903 -.word 201839569 -.word 998311389 -.word 1502911851 -.word 1931017673 -.word 43852787 -.word 24597857 -.word 43936833 -.word 15636061 -.word 55869129 -.word 16038683 -.word 43560065 -.word 25949329 -.word 2098944823 -.word 634278629 -.word 2076204415 -.word 2002629999 -.word 6591765 -.word 1696249 -.word 21795289 -.word 17734591 -.word 3812244715 -.word 1467340807 -.word 1570891815 -.word 1349179969 -.word 66853037 -.word 24930199 -.word 54854635 -.word 39952565 -.word 5623923 -.word 38701067 -.word 18571677 -.word 14491707 -.word 182627725 -.word 4172670453 -.word 1902166115 -.word 4183371205 -.word 17941849 -.word 12982967 -.word 8061707 -.word 17774995 -.word 4091524263 -.word 2462649161 -.word 2874632949 -.word 2009367661 -.word 61107981 -.word 38975641 -.word 40352225 -.word 49569327 -.word 26799603 -.word 33463463 -.word 39332725 -.word 61125067 -.word 583438349 -.word 1692658009 -.word 1738958475 -.word 2248227893 -.word 40014327 -.word 562885 -.word 51009393 -.word 51995259 -.word 2564101129 -.word 2196183867 -.word 2252083855 -.word 4038290309 
-.word 24330211 -.word 7682101 -.word 7401943 -.word 41757453 -.word 65375453 -.word 40797001 -.word 59835311 -.word 32875577 -.word 4014413091 -.word 3224262327 -.word 741855825 -.word 2318439879 -.word 10045293 -.word 53076657 -.word 17896617 -.word 58413331 -.word 3080518291 -.word 3700229967 -.word 297370967 -.word 2151902445 -.word 19472551 -.word 6043561 -.word 20934449 -.word 37620445 -.word 12921459 -.word 63769677 -.word 61505033 -.word 65692461 -.word 1006064525 -.word 2459563443 -.word 2747128823 -.word 2288082643 -.word 20171011 -.word 36495001 -.word 62685175 -.word 664745 -.word 1031427325 -.word 2764118887 -.word 583476745 -.word 2371908951 -.word 56713759 -.word 59594509 -.word 41235703 -.word 11581499 -.word 23458751 -.word 9406759 -.word 33711991 -.word 32167773 -.word 1501790785 -.word 2911894745 -.word 1905016457 -.word 204130979 -.word 26043621 -.word 51942461 -.word 14401009 -.word 60574133 -.word 1827638555 -.word 3437088195 -.word 2892737551 -.word 3197159499 -.word 16031087 -.word 25566271 -.word 54040269 -.word 36895029 -.word 41803191 -.word 19377381 -.word 9664027 -.word 55794235 -.word 2460960841 -.word 1411728667 -.word 1300076517 -.word 3978752965 -.word 19675339 -.word 21359151 -.word 63140729 -.word 23160723 -.word 398439733 -.word 897838033 -.word 494618247 -.word 3040761453 -.word 9258847 -.word 4669959 -.word 41266143 -.word 61464071 -.word 43355169 -.word 5591977 -.word 40694335 -.word 25071607 -.word 1107279327 -.word 552289879 -.word 879592385 -.word 2040862217 -.word 34737117 -.word 45994147 -.word 42273719 -.word 60428681 -.word 303076899 -.word 3854339421 -.word 3799259721 -.word 1636911223 -.word 26028927 -.word 64083527 -.word 60382541 -.word 31337387 -.word 27553395 -.word 7648471 -.word 689375 -.word 46555773 -.word 1673531277 -.word 1889513769 -.word 1477062945 -.word 2252242819 -.word 15797163 -.word 40170027 -.word 10866061 -.word 56298001 -.word 683123285 -.word 2755967957 -.word 273527923 -.word 644194287 -.word 
50400667 -.word 33861863 -.word 53736885 -.word 31774129 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040, %function -.global ntt_n256_u32_33556993_28678040 -ntt_n256_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 
-vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 
-vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, 
Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, 
[r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 
Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 
-vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 
-ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load 
as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, 
#(4 * 96)] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 
88)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-108)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, 
[r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] 
-vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] 
-// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, 
[r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 
36)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q0, Q0, r9 
-// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load 
as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// 
input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] 
from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] 
-// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 
Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, 
r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vldrw.s32 Q5, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q5 -vldrw.s32 Q6, [r11, #-64] -vmul.u32 Q3, Q3, Q6 -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q5, Q2, Q5 -vsub.s32 Q7, Q1, Q4 -vmul.u32 Q2, Q2, Q6 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q5 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q5 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q5, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q6, Q5, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q7, Q5 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q7, Q7, Q6 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q5, Q7, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q7, Q4, Q5
-vstrw.s32 Q7, [r0, #-80]
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vqrdmulh.s32 Q6, Q4, Q6
-vmul.u32 Q4, Q4, Q7
-vqrdmlah.s32 Q6, Q4, r12
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-16]
-vadd.s32 Q5, Q5, Q6
-vstrw.s32 Q5, [r0, #-32]
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 2857
-// Instruction count: 2421
\ No newline at end of file
diff --git a/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete.s b/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete.s
deleted file mode 100644
index 2ba06e0..0000000
--- a/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete.s
+++ /dev/null
@@ -1,2025 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 29095681
-.word 3280343807
-.word 14476917
-.word 2356128651
-.word 43317805
-.word 933021651
-.word 18598075
-.word 2578416965
-.word 39999747
-.word 3454780669
-.word 45317587
-.word 3083517997
-.word 4885007
-.word 2973633521
-.word 48811299
-.word 4050555101
-.word 54571669
-.word 4085587819
-.word 64683161
-.word 3091135847
-.word 59281651
-.word 3509906701
-.word 40500013
-.word 634504915
-.word 34427601
-.word 864737071
-.word 25917637
-.word 1446525243
-.word 8356523
-.word 1036987221
-.word 31719253
-.word 3672199851
-.word 5075563
-.word 576633749
-.word 43115375
-.word 1324642961
-.word 54842419
-.word 1729702349
-.word 35131011
-.word 21827453
-.word 44664611
-.word 3505510109
-.word 1316163
-.word 3096742077
-.word 65968403
-.word 3768591597
-.word 53949037
-.word 338497427
-.word 10391631
-.word 136873393
-.word 52363231
-.word 365147681
-.word 39928117
-.word 3279343819
-.word 54335767
-.word 2562837737
-.word 54457727
-.word 2730229889
-.word 27596809
-.word 1204240887
-.word 46002083
-.word 3404885597
-.word 14847715
-.word 2248560413
-.word 1129279
-.word 497236673
-.word 35733845
-.word 2000162987
-.word 54563587
-.word 3545336573
-.word 35404977
-.word 756985167
-.word 61099389
-.word 1687065731
-.word 52947923
-.word 1268929069
-.word 41822583
-.word 2124709001
-.word 26241327
-.word 2184146129
-.word 12770159
-.word 1517517457
-.word 24980679
-.word 1250335033
-.word 5033605
-.word 3855639419
-.word 61827033
-.word 2677740071
-.word 11221523
-.word 1580041197
-.word 8316793
-.word 591909511
-.word 19091691
-.word 2453265685
-.word 32210035
-.word 2986672525
-.word 16634213
-.word 1874600091
-.word 20871313
-.word 3771937135
-.word 46581651
-.word 697890413
-.word 63329695
-.word 2675302497
-.word 51221435
-.word 3182148165
-.word 18467171
-.word 3558347933
-.word 9983051
-.word 2472974773
-.word 37083207
-.word 2189487545
-.word 52674527
-.word 1161754145
-.word 7721125
-.word 3946619227
-.word 8896309
-.word 238834379
-.word 2061353
-.word 1415980503
-.word 9383201
-.word 542121183
-.word 23761465
-.word 604481479
-.word 24512363
-.word 2198349461
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_n256_u32_33556993_28678040_incomplete, %function
-.global ntt_n256_u32_33556993_28678040_incomplete
-ntt_n256_u32_33556993_28678040_incomplete:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r10
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r12
-vqrdmulh.s32 Q4, Q2, r10
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r12
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vmul.u32 Q4, Q4, r9
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r10
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r12
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r9
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q0, Q0, r9
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r9
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r9
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r9
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r9
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmul.u32 Q2, Q2, r9
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q1, Q1, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vmul.u32 Q2, Q2, r9
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q1, Q1, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[68]: Load as Q1
-vldrw.u32 Q1,
[r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 
-vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 
-vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 
-vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, 
r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// 
input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 
-vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, 
Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as 
Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 
-vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] 
-// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// 
input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 
-vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, 
[r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1993 -// Instruction count: 1557 \ No newline at end of file diff --git a/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s b/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s deleted file mode 100644 index a44e1d0..0000000 --- a/tests/ntt_n256/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s +++ /dev/null @@ -1,2316 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, 
sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 -.word 3280343807 -.word 14476917 -.word 2356128651 -.word 43317805 -.word 933021651 -.word 18598075 -.word 2578416965 -.word 39999747 -.word 3454780669 -.word 45317587 -.word 3083517997 -.word 4885007 -.word 2973633521 -.word 48811299 -.word 4050555101 -.word 54571669 -.word 4085587819 -.word 64683161 -.word 3091135847 -.word 59281651 -.word 3509906701 -.word 40500013 -.word 634504915 -.word 34427601 -.word 864737071 -.word 25917637 -.word 1446525243 -.word 8356523 -.word 1036987221 -.word 31719253 -.word 3672199851 -.word 5075563 -.word 576633749 -.word 43115375 -.word 1324642961 -.word 54842419 -.word 1729702349 -.word 35131011 -.word 21827453 -.word 44664611 -.word 3505510109 -.word 1316163 -.word 3096742077 -.word 65968403 -.word 3768591597 -.word 53949037 -.word 338497427 -.word 10391631 -.word 136873393 -.word 52363231 -.word 365147681 -.word 39928117 -.word 3279343819 -.word 54335767 -.word 2562837737 -.word 54457727 -.word 2730229889 -.word 27596809 -.word 1204240887 -.word 46002083 -.word 3404885597 -.word 14847715 -.word 2248560413 
-.word 1129279 -.word 497236673 -.word 35733845 -.word 2000162987 -.word 54563587 -.word 3545336573 -.word 35404977 -.word 756985167 -.word 61099389 -.word 1687065731 -.word 52947923 -.word 1268929069 -.word 41822583 -.word 2124709001 -.word 26241327 -.word 2184146129 -.word 12770159 -.word 1517517457 -.word 24980679 -.word 1250335033 -.word 5033605 -.word 3855639419 -.word 61827033 -.word 2677740071 -.word 11221523 -.word 1580041197 -.word 8316793 -.word 591909511 -.word 19091691 -.word 2453265685 -.word 32210035 -.word 2986672525 -.word 16634213 -.word 1874600091 -.word 20871313 -.word 3771937135 -.word 46581651 -.word 697890413 -.word 63329695 -.word 2675302497 -.word 51221435 -.word 3182148165 -.word 18467171 -.word 3558347933 -.word 9983051 -.word 2472974773 -.word 37083207 -.word 2189487545 -.word 52674527 -.word 1161754145 -.word 7721125 -.word 3946619227 -.word 8896309 -.word 238834379 -.word 2061353 -.word 1415980503 -.word 9383201 -.word 542121183 -.word 23761465 -.word 604481479 -.word 24512363 -.word 2198349461 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040_incomplete_double, %function -.global ntt_n256_u32_33556993_28678040_incomplete_double -ntt_n256_u32_33556993_28678040_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, 
#(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 
-vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 
-vldrw.u32 Q1, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, 
Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, 
[r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 
Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 
-vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, 
Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, 
[r0, #(4 * 44)] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] 
-vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] 
-vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, 
#(4 * -104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 
-vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] 
-vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] 
-// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q1, [r1,#(96)] -vqrdmulh.s32 Q7, Q1, r6 -vadd.s32 Q3, Q3, Q5 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q3, [r1,#(64)] -vqrdmlah.s32 Q7, Q1, r12 -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(112)] -vqrdmulh.s32 Q7, Q3, r6 -vsub.s32 Q4, Q2, Q6 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q4, [r1,#(32)] -vqrdmlah.s32 Q7, Q3, r12 -vstrw.u32 Q7, [r1,#(80)] -// Release input[8] from Q3 -vqrdmulh.s32 Q7, Q4, r8 -vadd.s32 Q2, Q2, Q6 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q2, [r1,#(0)]! 
-vqrdmlah.s32 Q7, Q4, r12 -vneg.s32 Q7, Q7 -// Release input[4] from Q4 -vqrdmulh.s32 Q1, Q2, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q2, Q2, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[0] from Q2 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vstrw.u32 Q0, [r1,#(224)] -vqrdmulh.s32 Q7, Q0, r6 -vadd.s32 Q2, Q2, Q5 -vmul.u32 Q0, Q0, r5 -vstrw.u32 Q2, [r1,#(192)] -vqrdmlah.s32 Q7, Q0, r12 -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q2, r6 -vsub.s32 Q3, Q1, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(160)] -vqrdmlah.s32 Q7, Q2, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[24] from Q2 -vqrdmulh.s32 Q7, Q3, r8 -vadd.s32 Q1, Q1, Q6 -vmul.u32 Q3, Q3, r7 -vstrw.u32 Q1, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q3, r12 -vneg.s32 Q7, Q7 -// Release input[20] from Q3 -vqrdmulh.s32 Q0, Q1, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q1, Q1, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q0, [r1,#(16)] -// Release input[16] from Q1 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r9 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[44] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[40] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[32] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[60] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[48] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[64] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vmul.u32 Q3, Q3, r9 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[92] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[88] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[84] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[80] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q4, Q4, r9 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[108] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[104] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[100] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[96] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r9 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[124] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[120] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[116] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[112] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[140] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[136] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[132] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[128] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r9 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[156] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[152] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[148] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[144] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q4, Q4, r9 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[172] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[168] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[164] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[160] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vmul.u32 Q3, Q3, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[180] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[176] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q4, Q4, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[204] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[196] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[192] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r9 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[220] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[216] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[208] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[236] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[224] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vmul.u32 Q3, Q3, r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q6, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q6, Q3, r12 -// Release input[252] from Q3 -vqrdmlah.s32 Q4, Q2, r12 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r1,#(240)] -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q6, Q1, r12 -vstrw.u32 Q6, [r1,#(208)] -// Release input[248] from Q1 -vqrdmulh.s32 Q6, Q2, r8 -vadd.s32 Q0, Q0, Q4 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q6, Q6
-// Release input[244] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q6, [r1,#(48)]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#(16)]
-// Release input[240] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-.equ modulus_inv, 3919317503
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 2284
-// Instruction count: 1848
\ No newline at end of file
diff --git a/tests/ntt_n256/manual/intt_n256_l6_s32_twiddles.s b/tests/ntt_n256/manual/intt_n256_l6_s32_twiddles.s
deleted file mode 100644
index 29cd7a9..0000000
--- a/tests/ntt_n256/manual/intt_n256_l6_s32_twiddles.s
+++ /dev/null
@@ -1,150 +0,0 @@
-
-///
-/// Copyright (c) 2022 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-.word -8471290
-.word -542121183
-.word -794839
-.word -50865814
-.word -9445744
-.word -604481480
-.word 5443354
-.word 348348069
-.word 11430609
-.word 731503145
-.word -3732072
-.word -238834379
-.word -5086187
-.word -325491125
-.word 15403199
-.word 985729501
-.word -656361
-.word -42003898
-.word -8247799
-.word -527818851
-.word 11510556
-.word 736619362
-.word -16167867
-.word -1034664519
-.word 4264131
-.word 272883557
-.word -10905370
-.word -697890414
-.word 8172970
-.word 523030160
-.word -9249292
-.word -591909511
-.word -13113327
-.word -839188878
-.word -4778209
-.word -305782038
-.word 6865022
-.word 439327877
-.word 8866965
-.word 567442451
-.word -8285889
-.word -530256425
-.word -572895
-.word -36662482
-.word 14019017
-.word 897148614
-.word 9843973
-.word 629966191
-.word 7194579
-.word 460417915
-.word 355881
-.word 22774646
-.word 13728463
-.word 878554577
-.word 2302061
-.word 147320660
-.word -11828796
-.word -756985168
-.word 11713874
-.word 749630721
-.word 13908588
-.word 890081698
-.word -7769916
-.word -497236673
-.word -1579445
-.word -101076765
-.word -6490403
-.word -415354091
-.word 14739293
-.word 943242760
-.word -9106105
-.word -582746243
-.word -2138810
-.word -136873393
-.word 15870328
-.word 1015623476
-.word -5705868
-.word -365147683
-.word -14833295
-.word -949258429
-.word -5289426
-.word -338497429
-.word 8225248
-.word 526375697
-.word 6528331
-.word 417781297
-.word 12336210
-.word 789457186
-.word -341080
-.word -21827454
-.word 9731484
-.word 622767444
-.word 12857867
-.word 822840686
-.word -9010590
-.word -576633749
-.word -13512548
-.word -864737072
-.word -16204162
-.word -1036987221
-.word 10953311
-.word 700958404
-.word -14745691
-.word -943652201
-.word -9914896
-.word -634504916
-.word 12267508
-.word 785060593
-.word -12909577
-.word -826149873
-.word 3271804
-.word 209379475
-.word 3819232
-.word 244412194
-.word -6733847
-.word -430933318
-.word -14626653
-.word -936034350
-.word 13128918
-.word 840186626
-.word 15854702
-.word 1014623488
-.word -14579576
-.word -933021652
-.word -3260327
-.word -208645003
\ No newline at end of file
diff --git a/tests/ntt_n256/manual/intt_n256_l8_s32_twiddles.s b/tests/ntt_n256/manual/intt_n256_l8_s32_twiddles.s
deleted file mode 100644
index a482ca5..0000000
--- a/tests/ntt_n256/manual/intt_n256_l8_s32_twiddles.s
+++ /dev/null
@@ -1,534 +0,0 @@
-
-///
-/// Copyright (c) 2022 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-/// - -.word -1636987 -.word 10476123 -.word 4031087 -.word 7406071 -.word -104759172 -.word 670420703 -.word 257969879 -.word 473952370 -.word -16598340 -.word 1129445 -.word -14853570 -.word 7083547 -.word -1062212688 -.word 72278963 -.word -950555930 -.word 453312409 -.word -10066304 -.word -4274200 -.word -9508293 -.word -10674616 -.word -644194289 -.word -273527923 -.word -608484310 -.word -683123285 -.word 1666087 -.word -13744680 -.word -8630188 -.word 16254433 -.word 106621430 -.word -879592386 -.word -552289879 -.word 1040204320 -.word 5494682 -.word 5776166 -.word 8175281 -.word -6957373 -.word 351632810 -.word 369646411 -.word 523178053 -.word -445237890 -.word 7978303 -.word 7746022 -.word 6885336 -.word -4735938 -.word 510572423 -.word 495707574 -.word 440627874 -.word -303076900 -.word 4941226 -.word 13241747 -.word 11497049 -.word -4898455 -.word 316214329 -.word 847407131 -.word 735754980 -.word -313477194 -.word 14084755 -.word 14381089 -.word 7454249 -.word 1801935 -.word 901355525 -.word 920319454 -.word 477035527 -.word 115315039 -.word -13958531 -.word -7729000 -.word -14029790 -.word -6226096 -.word -893277806 -.word -494618249 -.word -897838034 -.word -398439734 -.word -3189790 -.word 3788839 -.word -11944835 -.word 10089721 -.word -204130980 -.word 242467190 -.word -764411097 -.word 645692862 -.word -5461508 -.word -11690281 -.word 14426854 -.word -1005359 -.word -349509836 -.word -748120885 -.word 923248190 -.word -64338065 -.word -16402437 -.word -11645481 -.word 13405384 -.word 4997961 -.word -1049675853 -.word -745253903 -.word 857879099 -.word 319845092 -.word -2197027 -.word -9370171 -.word -4876619 -.word -15720958 -.word -140598997 -.word -599645177 -.word -312079797 -.word -1006064525 -.word 9055324 -.word -10257619 -.word 7663005 -.word 13542586 -.word 579496507 -.word -656437514 -.word 490394891 -.word 866659357 -.word -3506913 -.word -9117520 -.word -9635661 -.word -16117282 -.word -224425303 -.word -583476747 -.word -616635240 
-.word -1031427326 -.word -2671395 -.word -11592382 -.word 16731042 -.word 4383994 -.word -170956232 -.word -741855827 -.word 1070704968 -.word 280554203 -.word -12021234 -.word 11547014 -.word 11063747 -.word 293505 -.word -769300260 -.word 738952496 -.word 708025769 -.word 18782886 -.word -69049 -.word -4646776 -.word 9293480 -.word -14579779 -.word -4418799 -.word -297370968 -.word 594737327 -.word -933034643 -.word -1574249 -.word 6383693 -.word 7107193 -.word -9116920 -.word -100744247 -.word 408525172 -.word 454825638 -.word -583438350 -.word 3640250 -.word 2353891 -.word -4465852 -.word -2195232 -.word 232958220 -.word 150637527 -.word -285792715 -.word -140484126 -.word 4010884 -.word -1634503 -.word -760999 -.word -6510145 -.word 256676985 -.word -104600209 -.word -48700219 -.word -416617482 -.word 1743822 -.word 3833379 -.word 1911034 -.word -2853776 -.word 111596091 -.word 245317532 -.word 122296842 -.word -182627725 -.word -13728247 -.word -6544948 -.word 2327468 -.word 5867892 -.word -878540754 -.word -418844704 -.word 148946584 -.word 375516427 -.word 2158227 -.word -11362575 -.word -4924837 -.word 3179040 -.word 138115986 -.word -727149301 -.word -315165513 -.word 203443032 -.word 2263511 -.word 1113823 -.word -9911360 -.word 758477 -.word 144853648 -.word 71279232 -.word -634278629 -.word 48538823 -.word 1784515 -.word 2861106 -.word 5878727 -.word 11692247 -.word 114200244 -.word 183096809 -.word 376209814 -.word 748246699 -.word 12474447 -.word 9009935 -.word 10628043 -.word 7543116 -.word 798303678 -.word 576591832 -.word 680142841 -.word 482722581 -.word -11310234 -.word -6934560 -.word -9838349 -.word -13624329 -.word -723799733 -.word -443777969 -.word -629606282 -.word -871890510 -.word -5928435 -.word 13050733 -.word 2102990 -.word -13598070 -.word -379390883 -.word 835183168 -.word 134581088 -.word -870210062 -.word 3382539 -.word 10072203 -.word -15599806 -.word -3153984 -.word 216465975 -.word 644571796 -.word -998311389 -.word -201839571 
-.word -2228887 -.word -6760262 -.word 4517635 -.word -596073 -.word -142637881 -.word -432623749 -.word 289106574 -.word -38145761 -.word 12816206 -.word -15755637 -.word -11558208 -.word 16185834 -.word 820174585 -.word -1008283812 -.word -739668858 -.word 1035814319 -.word 4099600 -.word 10996266 -.word 10048565 -.word 7627813 -.word 262354376 -.word 703707314 -.word 643059079 -.word 488142775 -.word 4444688 -.word -11779122 -.word 7791061 -.word 14893165 -.word 284438323 -.word -753806275 -.word 498589850 -.word 953089817 -.word -4263629 -.word -10387700 -.word 299677 -.word -9551359 -.word -272851431 -.word -664762063 -.word 19177864 -.word -611240324 -.word 15302355 -.word 8181398 -.word -11550623 -.word 393809 -.word 979275978 -.word 523569511 -.word -739183455 -.word 25201853 -.word 10384926 -.word 4093077 -.word -7735096 -.word 13052352 -.word 664584540 -.word 261936936 -.word -495008363 -.word 835286776 -.word -15137961 -.word 5226683 -.word 2000332 -.word -12486848 -.word -968755565 -.word 334482183 -.word 128011478 -.word -799097282 -.word 8413656 -.word -16292342 -.word -12766243 -.word -1953000 -.word 538432889 -.word -1042630311 -.word -816977197 -.word -124982461 -.word 10198330 -.word -1821442 -.word -16151303 -.word -11253846 -.word 652643308 -.word -116563391 -.word -1033604503 -.word -720191176 -.word 11275919 -.word 6780152 -.word 2466706 -.word 5156339 -.word 721603740 -.word 433896611 -.word 157857136 -.word 329980511 -.word 12362939 -.word 12730672 -.word -8269259 -.word 769518 -.word 791167711 -.word 814700827 -.word -529192186 -.word 49245393 -.word 1345927 -.word 10523131 -.word 10695672 -.word -14834498 -.word 86132754 -.word 673428985 -.word 684470767 -.word -949335415 -.word 8890190 -.word -6574213 -.word 16514902 -.word 14979600 -.word 568928737 -.word -420717521 -.word 1056873063 -.word 958621234 -.word -3716128 -.word -6674394 -.word 9218874 -.word -7103825 -.word -237814041 -.word -427128616 -.word 589962908 -.word -454610102 
-.word 13861207 -.word -11280567 -.word -7065381 -.word 7520229 -.word 887049545 -.word -721901190 -.word -452149874 -.word 481257925 -.word 11444654 -.word -14819378 -.word -7430689 -.word -7232147 -.word 732401956 -.word -948367809 -.word -475527802 -.word -462822084 -.word -8648030 -.word 14797569 -.word -5637166 -.word 4878953 -.word -553431680 -.word 946972140 -.word -360751090 -.word 312229162 -.word -8471290 -.word -542121183 -.word -794839 -.word -50865814 -.word -9445744 -.word -604481480 -.word 5443354 -.word 348348069 -.word 11430609 -.word 731503145 -.word -3732072 -.word -238834379 -.word -5086187 -.word -325491125 -.word 15403199 -.word 985729501 -.word -656361 -.word -42003898 -.word -8247799 -.word -527818851 -.word 11510556 -.word 736619362 -.word -16167867 -.word -1034664519 -.word 4264131 -.word 272883557 -.word -10905370 -.word -697890414 -.word 8172970 -.word 523030160 -.word -9249292 -.word -591909511 -.word -13113327 -.word -839188878 -.word -4778209 -.word -305782038 -.word 6865022 -.word 439327877 -.word 8866965 -.word 567442451 -.word -8285889 -.word -530256425 -.word -572895 -.word -36662482 -.word 14019017 -.word 897148614 -.word 9843973 -.word 629966191 -.word 7194579 -.word 460417915 -.word 355881 -.word 22774646 -.word 13728463 -.word 878554577 -.word 2302061 -.word 147320660 -.word -11828796 -.word -756985168 -.word 11713874 -.word 749630721 -.word 13908588 -.word 890081698 -.word -7769916 -.word -497236673 -.word -1579445 -.word -101076765 -.word -6490403 -.word -415354091 -.word 14739293 -.word 943242760 -.word -9106105 -.word -582746243 -.word -2138810 -.word -136873393 -.word 15870328 -.word 1015623476 -.word -5705868 -.word -365147683 -.word -14833295 -.word -949258429 -.word -5289426 -.word -338497429 -.word 8225248 -.word 526375697 -.word 6528331 -.word 417781297 -.word 12336210 -.word 789457186 -.word -341080 -.word -21827454 -.word 9731484 -.word 622767444 -.word 12857867 -.word 822840686 -.word -9010590 -.word -576633749 
-.word -13512548 -.word -864737072 -.word -16204162 -.word -1036987221 -.word 10953311 -.word 700958404 -.word -14745691 -.word -943652201 -.word -9914896 -.word -634504916 -.word 12267508 -.word 785060593 -.word -12909577 -.word -826149873 -.word 3271804 -.word 209379475 -.word 3819232 -.word 244412194 -.word -6733847 -.word -430933318 -.word -14626653 -.word -936034350 -.word 13128918 -.word 840186626 -.word 15854702 -.word 1014623488 -.word -14579576 -.word -933021652 -.word -3260327 -.word -208645003 \ No newline at end of file diff --git a/tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_complete.s b/tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index 6db150d..0000000 --- a/tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,1352 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -#define INVNTT_REDUCE_AFTER_L56 -#define INVNTT_REDUCE_AFTER_L34 - -.data -roots_inv: -// block -.word 20558213 // zeta^510 * 2^31 = 28678040^510 * 2^31 -.word 66424611 // zeta^382 * 2^31 = 28678040^382 * 2^31 -.word 59465515 // zeta^446 * 2^31 = 28678040^446 * 2^31 -.word 39560591 // zeta^318 * 2^31 = 28678040^318 * 2^31 -// block twisted -.word 2042724475 // zeta^510 * (q^(-1) mod 2^32) * 2^31 = 28678040^510 * 375649793 * 2^31 -.word 2817904349 // zeta^382 * (q^(-1) mod 2^32) * 2^31 = 28678040^382 * 375649793 * 2^31 -.word 2405453525 // zeta^446 * (q^(-1) mod 2^32) * 2^31 = 28678040^446 * 375649793 * 2^31 -.word 2621436017 // zeta^318 * (q^(-1) mod 2^32) * 2^31 = 28678040^318 * 375649793 * 2^31 -// block -.word 35339857 // zeta^511 * 2^31 = 28678040^511 * 2^31 -.word 13377101 // zeta^447 * 2^31 = 28678040^447 * 2^31 -.word 33252123 // zeta^479 * 2^31 = 28678040^479 * 2^31 -.word 16713319 // zeta^415 * 2^31 = 28678040^415 * 2^31 -// block twisted -.word 3232754607 // zeta^511 * (q^(-1) mod 2^32) * 2^31 = 28678040^511 * 375649793 * 2^31 -.word 2219762611 // zeta^447 * (q^(-1) mod 2^32) * 2^31 = 28678040^447 * 375649793 * 2^31 -.word 3344411365 // zeta^479 * (q^(-1) mod 2^32) * 2^31 = 28678040^479 * 375649793 * 2^31 -.word 2600796057 // zeta^415 * (q^(-1) mod 2^32) * 2^31 = 28678040^415 * 375649793 * 2^31 -// block -.word 10815985 // zeta^383 * 2^31 = 28678040^383 * 2^31 -.word 56247925 // zeta^319 * 2^31 = 28678040^319 * 2^31 -.word 26943959 // zeta^351 * 2^31 = 28678040^351 * 2^31 -.word 51316823 // zeta^287 * 2^31 = 28678040^287 * 2^31 -// block twisted -.word 3650773007 // zeta^383 * (q^(-1) mod 2^32) * 2^31 = 28678040^383 * 375649793 * 2^31 -.word 4021439371 // zeta^319 * (q^(-1) 
mod 2^32) * 2^31 = 28678040^319 * 375649793 * 2^31 -.word 1538999337 // zeta^351 * (q^(-1) mod 2^32) * 2^31 = 28678040^351 * 375649793 * 2^31 -.word 3611844009 // zeta^287 * (q^(-1) mod 2^32) * 2^31 = 28678040^287 * 375649793 * 2^31 -// block -.word 42042379 // zeta^478 * 2^31 = 28678040^478 * 2^31 -.word 26419651 // zeta^350 * 2^31 = 28678040^350 * 2^31 -.word 61522009 // zeta^414 * 2^31 = 28678040^414 * 2^31 -.word 23758817 // zeta^286 * 2^31 = 28678040^286 * 2^31 -// block twisted -.word 2254105077 // zeta^478 * (q^(-1) mod 2^32) * 2^31 = 28678040^478 * 375649793 * 2^31 -.word 3415374909 // zeta^350 * (q^(-1) mod 2^32) * 2^31 = 28678040^350 * 375649793 * 2^31 -.word 3742677415 // zeta^414 * (q^(-1) mod 2^32) * 2^31 = 28678040^414 * 375649793 * 2^31 -.word 3187687967 // zeta^286 * (q^(-1) mod 2^32) * 2^31 = 28678040^286 * 375649793 * 2^31 -// block -.word 35776599 // zeta^495 * 2^31 = 28678040^495 * 2^31 -.word 6731445 // zeta^431 * 2^31 = 28678040^431 * 2^31 -.word 3030459 // zeta^463 * 2^31 = 28678040^463 * 2^31 -.word 41085059 // zeta^399 * 2^31 = 28678040^399 * 2^31 -// block twisted -.word 351632809 // zeta^495 * (q^(-1) mod 2^32) * 2^31 = 28678040^495 * 375649793 * 2^31 -.word 369646411 // zeta^431 * (q^(-1) mod 2^32) * 2^31 = 28678040^431 * 375649793 * 2^31 -.word 2670661701 // zeta^463 * (q^(-1) mod 2^32) * 2^31 = 28678040^463 * 375649793 * 2^31 -.word 1702245757 // zeta^399 * (q^(-1) mod 2^32) * 2^31 = 28678040^399 * 375649793 * 2^31 -// block -.word 6685305 // zeta^367 * 2^31 = 28678040^367 * 2^31 -.word 24840267 // zeta^303 * 2^31 = 28678040^303 * 2^31 -.word 21119839 // zeta^335 * 2^31 = 28678040^335 * 2^31 -.word 32376869 // zeta^271 * 2^31 = 28678040^271 * 2^31 -// block twisted -.word 2658056071 // zeta^367 * (q^(-1) mod 2^32) * 2^31 = 28678040^367 * 375649793 * 2^31 -.word 495707573 // zeta^303 * (q^(-1) mod 2^32) * 2^31 = 28678040^303 * 375649793 * 2^31 -.word 440627873 // zeta^335 * (q^(-1) mod 2^32) * 2^31 = 28678040^335 * 375649793 * 2^31 
-.word 3991890395 // zeta^271 * (q^(-1) mod 2^32) * 2^31 = 28678040^271 * 375649793 * 2^31 -// block -.word 11319751 // zeta^494 * 2^31 = 28678040^494 * 2^31 -.word 57449959 // zeta^366 * 2^31 = 28678040^366 * 2^31 -.word 47736605 // zeta^430 * 2^31 = 28678040^430 * 2^31 -.word 25310795 // zeta^302 * 2^31 = 28678040^302 * 2^31 -// block twisted -.word 316214329 // zeta^494 * (q^(-1) mod 2^32) * 2^31 = 28678040^494 * 375649793 * 2^31 -.word 2994890777 // zeta^366 * (q^(-1) mod 2^32) * 2^31 = 28678040^366 * 375649793 * 2^31 -.word 2883238627 // zeta^430 * (q^(-1) mod 2^32) * 2^31 = 28678040^430 * 375649793 * 2^31 -.word 1834006453 // zeta^302 * (q^(-1) mod 2^32) * 2^31 = 28678040^302 * 375649793 * 2^31 -// block -.word 5649915 // zeta^503 * 2^31 = 28678040^503 * 2^31 -.word 25847843 // zeta^439 * 2^31 = 28678040^439 * 2^31 -.word 62444027 // zeta^471 * 2^31 = 28678040^471 * 2^31 -.word 57855139 // zeta^407 * 2^31 = 28678040^407 * 2^31 -// block twisted -.word 3048839173 // zeta^503 * (q^(-1) mod 2^32) * 2^31 = 28678040^503 * 375649793 * 2^31 -.word 3067803101 // zeta^439 * (q^(-1) mod 2^32) * 2^31 = 28678040^439 * 375649793 * 2^31 -.word 2624519173 // zeta^471 * (q^(-1) mod 2^32) * 2^31 = 28678040^471 * 375649793 * 2^31 -.word 2262798685 // zeta^407 * (q^(-1) mod 2^32) * 2^31 = 28678040^407 * 375649793 * 2^31 -// block -.word 43953263 // zeta^375 * 2^31 = 28678040^375 * 2^31 -.word 3973257 // zeta^311 * 2^31 = 28678040^311 * 2^31 -.word 45754835 // zeta^343 * 2^31 = 28678040^343 * 2^31 -.word 47438647 // zeta^279 * 2^31 = 28678040^279 * 2^31 -// block twisted -.word 1254205841 // zeta^375 * (q^(-1) mod 2^32) * 2^31 = 28678040^375 * 375649793 * 2^31 -.word 3800349047 // zeta^311 * (q^(-1) mod 2^32) * 2^31 = 28678040^311 * 375649793 * 2^31 -.word 3397129261 // zeta^343 * (q^(-1) mod 2^32) * 2^31 = 28678040^343 * 375649793 * 2^31 -.word 3896527561 // zeta^279 * (q^(-1) mod 2^32) * 2^31 = 28678040^279 * 375649793 * 2^31 -// block -.word 34946213 // zeta^462 * 2^31 = 
28678040^462 * 2^31 -.word 33401995 // zeta^334 * 2^31 = 28678040^334 * 2^31 -.word 57707227 // zeta^398 * 2^31 = 28678040^398 * 2^31 -.word 43655235 // zeta^270 * 2^31 = 28678040^270 * 2^31 -// block twisted -.word 4090836315 // zeta^462 * (q^(-1) mod 2^32) * 2^31 = 28678040^462 * 375649793 * 2^31 -.word 2389950837 // zeta^334 * (q^(-1) mod 2^32) * 2^31 = 28678040^334 * 375649793 * 2^31 -.word 1383072549 // zeta^398 * (q^(-1) mod 2^32) * 2^31 = 28678040^398 * 375649793 * 2^31 -.word 2793176509 // zeta^270 * (q^(-1) mod 2^32) * 2^31 = 28678040^270 * 375649793 * 2^31 -// block -.word 30218957 // zeta^487 * 2^31 = 28678040^487 * 2^31 -.word 13073717 // zeta^423 * 2^31 = 28678040^423 * 2^31 -.word 41547715 // zeta^455 * 2^31 = 28678040^455 * 2^31 -.word 51082899 // zeta^391 * 2^31 = 28678040^391 * 2^31 -// block twisted -.word 3945457459 // zeta^487 * (q^(-1) mod 2^32) * 2^31 = 28678040^487 * 375649793 * 2^31 -.word 1399362763 // zeta^423 * (q^(-1) mod 2^32) * 2^31 = 28678040^423 * 375649793 * 2^31 -.word 923248189 // zeta^455 * (q^(-1) mod 2^32) * 2^31 = 28678040^455 * 375649793 * 2^31 -.word 2083145581 // zeta^391 * (q^(-1) mod 2^32) * 2^31 = 28678040^391 * 375649793 * 2^31 -// block -.word 6539853 // zeta^359 * 2^31 = 28678040^359 * 2^31 -.word 52712977 // zeta^295 * 2^31 = 28678040^295 * 2^31 -.word 15171525 // zeta^327 * 2^31 = 28678040^327 * 2^31 -.word 41070365 // zeta^263 * 2^31 = 28678040^263 * 2^31 -// block twisted -.word 1097807795 // zeta^359 * (q^(-1) mod 2^32) * 2^31 = 28678040^359 * 375649793 * 2^31 -.word 1402229743 // zeta^295 * (q^(-1) mod 2^32) * 2^31 = 28678040^295 * 375649793 * 2^31 -.word 857879099 // zeta^327 * (q^(-1) mod 2^32) * 2^31 = 28678040^327 * 375649793 * 2^31 -.word 2467328739 // zeta^263 * (q^(-1) mod 2^32) * 2^31 = 28678040^263 * 375649793 * 2^31 -// block -.word 1421525 // zeta^502 * 2^31 = 28678040^502 * 2^31 -.word 5608953 // zeta^374 * 2^31 = 28678040^374 * 2^31 -.word 3344309 // zeta^438 * 2^31 = 28678040^438 * 2^31 -.word 
54192527 // zeta^310 * 2^31 = 28678040^310 * 2^31 -// block twisted -.word 2006884651 // zeta^502 * (q^(-1) mod 2^32) * 2^31 = 28678040^502 * 375649793 * 2^31 -.word 1547838471 // zeta^374 * (q^(-1) mod 2^32) * 2^31 = 28678040^374 * 375649793 * 2^31 -.word 1835403851 // zeta^438 * (q^(-1) mod 2^32) * 2^31 = 28678040^438 * 375649793 * 2^31 -.word 3288902769 // zeta^310 * (q^(-1) mod 2^32) * 2^31 = 28678040^310 * 375649793 * 2^31 -// block -.word 55532487 // zeta^507 * 2^31 = 28678040^507 * 2^31 -.word 25878283 // zeta^443 * 2^31 = 28678040^443 * 2^31 -.word 7519477 // zeta^475 * 2^31 = 28678040^475 * 2^31 -.word 10400227 // zeta^411 * 2^31 = 28678040^411 * 2^31 -// block twisted -.word 579496505 // zeta^507 * (q^(-1) mod 2^32) * 2^31 = 28678040^507 * 375649793 * 2^31 -.word 1491046133 // zeta^443 * (q^(-1) mod 2^32) * 2^31 = 28678040^443 * 375649793 * 2^31 -.word 2637878539 // zeta^475 * (q^(-1) mod 2^32) * 2^31 = 28678040^475 * 375649793 * 2^31 -.word 866659357 // zeta^411 * (q^(-1) mod 2^32) * 2^31 = 28678040^411 * 375649793 * 2^31 -// block -.word 66449241 // zeta^379 * 2^31 = 28678040^379 * 2^31 -.word 4428811 // zeta^315 * 2^31 = 28678040^315 * 2^31 -.word 30618985 // zeta^347 * 2^31 = 28678040^347 * 2^31 -.word 46942975 // zeta^283 * 2^31 = 28678040^283 * 2^31 -// block twisted -.word 1923058343 // zeta^379 * (q^(-1) mod 2^32) * 2^31 = 28678040^379 * 375649793 * 2^31 -.word 3711490549 // zeta^315 * (q^(-1) mod 2^32) * 2^31 = 28678040^315 * 375649793 * 2^31 -.word 1530848407 // zeta^347 * (q^(-1) mod 2^32) * 2^31 = 28678040^347 * 375649793 * 2^31 -.word 3263539969 // zeta^283 * (q^(-1) mod 2^32) * 2^31 = 28678040^283 * 375649793 * 2^31 -// block -.word 34238409 // zeta^470 * 2^31 = 28678040^470 * 2^31 -.word 7278675 // zeta^342 * 2^31 = 28678040^342 * 2^31 -.word 26316985 // zeta^406 * 2^31 = 28678040^406 * 2^31 -.word 1738533 // zeta^278 * 2^31 = 28678040^278 * 2^31 -// block twisted -.word 1976527415 // zeta^470 * (q^(-1) mod 2^32) * 2^31 = 28678040^470 * 
375649793 * 2^31 -.word 3553111469 // zeta^342 * (q^(-1) mod 2^32) * 2^31 = 28678040^342 * 375649793 * 2^31 -.word 1070704967 // zeta^406 * (q^(-1) mod 2^32) * 2^31 = 28678040^406 * 375649793 * 2^31 -.word 280554203 // zeta^278 * (q^(-1) mod 2^32) * 2^31 = 28678040^278 * 375649793 * 2^31 -// block -.word 29493541 // zeta^491 * 2^31 = 28678040^491 * 2^31 -.word 46179537 // zeta^427 * 2^31 = 28678040^427 * 2^31 -.word 61070425 // zeta^459 * 2^31 = 28678040^459 * 2^31 -.word 47641435 // zeta^395 * 2^31 = 28678040^395 * 2^31 -// block twisted -.word 3525667035 // zeta^491 * (q^(-1) mod 2^32) * 2^31 = 28678040^491 * 375649793 * 2^31 -.word 738952495 // zeta^427 * (q^(-1) mod 2^32) * 2^31 = 28678040^427 * 375649793 * 2^31 -.word 2855509415 // zeta^459 * (q^(-1) mod 2^32) * 2^31 = 28678040^459 * 375649793 * 2^31 -.word 2166266533 // zeta^395 * (q^(-1) mod 2^32) * 2^31 = 28678040^395 * 375649793 * 2^31 -// block -.word 8700655 // zeta^363 * 2^31 = 28678040^363 * 2^31 -.word 49217369 // zeta^299 * 2^31 = 28678040^299 * 2^31 -.word 14037329 // zeta^331 * 2^31 = 28678040^331 * 2^31 -.word 57068693 // zeta^267 * 2^31 = 28678040^267 * 2^31 -// block twisted -.word 2143064849 // zeta^363 * (q^(-1) mod 2^32) * 2^31 = 28678040^363 * 375649793 * 2^31 -.word 3997596327 // zeta^299 * (q^(-1) mod 2^32) * 2^31 = 28678040^299 * 375649793 * 2^31 -.word 594737327 // zeta^331 * (q^(-1) mod 2^32) * 2^31 = 28678040^331 * 375649793 * 2^31 -.word 1214449003 // zeta^267 * (q^(-1) mod 2^32) * 2^31 = 28678040^267 * 375649793 * 2^31 -// block -.word 5988919 // zeta^486 * 2^31 = 28678040^486 * 2^31 -.word 27781261 // zeta^358 * 2^31 = 28678040^358 * 2^31 -.word 33650523 // zeta^422 * 2^31 = 28678040^422 * 2^31 -.word 40314383 // zeta^294 * 2^31 = 28678040^294 * 2^31 -// block twisted -.word 2046739401 // zeta^486 * (q^(-1) mod 2^32) * 2^31 = 28678040^486 * 375649793 * 2^31 -.word 2556008819 // zeta^358 * (q^(-1) mod 2^32) * 2^31 = 28678040^358 * 375649793 * 2^31 -.word 2602309285 // zeta^422 * 
(q^(-1) mod 2^32) * 2^31 = 28678040^422 * 375649793 * 2^31 -.word 3711528945 // zeta^294 * (q^(-1) mod 2^32) * 2^31 = 28678040^294 * 375649793 * 2^31 -// block -.word 25356533 // zeta^499 * 2^31 = 28678040^499 * 2^31 -.word 59712043 // zeta^435 * 2^31 = 28678040^435 * 2^31 -.word 59431885 // zeta^467 * 2^31 = 28678040^467 * 2^31 -.word 42783775 // zeta^403 * 2^31 = 28678040^403 * 2^31 -// block twisted -.word 232958219 // zeta^499 * (q^(-1) mod 2^32) * 2^31 = 28678040^499 * 375649793 * 2^31 -.word 2298121173 // zeta^435 * (q^(-1) mod 2^32) * 2^31 = 28678040^435 * 375649793 * 2^31 -.word 4009174579 // zeta^467 * (q^(-1) mod 2^32) * 2^31 = 28678040^467 * 375649793 * 2^31 -.word 4154483169 // zeta^403 * (q^(-1) mod 2^32) * 2^31 = 28678040^403 * 375649793 * 2^31 -// block -.word 15118727 // zeta^371 * 2^31 = 28678040^371 * 2^31 -.word 16104593 // zeta^307 * 2^31 = 28678040^307 * 2^31 -.word 66551101 // zeta^339 * 2^31 = 28678040^339 * 2^31 -.word 27099659 // zeta^275 * 2^31 = 28678040^275 * 2^31 -// block twisted -.word 256676985 // zeta^371 * (q^(-1) mod 2^32) * 2^31 = 28678040^371 * 375649793 * 2^31 -.word 2042883439 // zeta^307 * (q^(-1) mod 2^32) * 2^31 = 28678040^307 * 375649793 * 2^31 -.word 2098783427 // zeta^339 * (q^(-1) mod 2^32) * 2^31 = 28678040^339 * 375649793 * 2^31 -.word 1730866165 // zeta^275 * (q^(-1) mod 2^32) * 2^31 = 28678040^275 * 375649793 * 2^31 -// block -.word 52622279 // zeta^454 * 2^31 = 28678040^454 * 2^31 -.word 48542309 // zeta^326 * 2^31 = 28678040^326 * 2^31 -.word 28412919 // zeta^390 * 2^31 = 28678040^390 * 2^31 -.word 61490063 // zeta^262 * 2^31 = 28678040^262 * 2^31 -// block twisted -.word 111596089 // zeta^454 * (q^(-1) mod 2^32) * 2^31 = 28678040^454 * 375649793 * 2^31 -.word 2392801179 // zeta^326 * (q^(-1) mod 2^32) * 2^31 = 28678040^326 * 375649793 * 2^31 -.word 122296841 // zeta^390 * (q^(-1) mod 2^32) * 2^31 = 28678040^390 * 375649793 * 2^31 -.word 4112339569 // zeta^262 * (q^(-1) mod 2^32) * 2^31 = 28678040^262 * 375649793 
* 2^31 -// block -.word 17544659 // zeta^483 * 2^31 = 28678040^483 * 2^31 -.word 26761761 // zeta^419 * 2^31 = 28678040^419 * 2^31 -.word 28138345 // zeta^451 * 2^31 = 28678040^451 * 2^31 -.word 6006005 // zeta^387 * 2^31 = 28678040^387 * 2^31 -// block twisted -.word 1268942893 // zeta^483 * (q^(-1) mod 2^32) * 2^31 = 28678040^483 * 375649793 * 2^31 -.word 3876122591 // zeta^419 * (q^(-1) mod 2^32) * 2^31 = 28678040^419 * 375649793 * 2^31 -.word 148946583 // zeta^451 * (q^(-1) mod 2^32) * 2^31 = 28678040^451 * 375649793 * 2^31 -.word 375516427 // zeta^387 * (q^(-1) mod 2^32) * 2^31 = 28678040^387 * 375649793 * 2^31 -// block -.word 49338991 // zeta^355 * 2^31 = 28678040^355 * 2^31 -.word 59052279 // zeta^291 * 2^31 = 28678040^291 * 2^31 -.word 54131019 // zeta^323 * 2^31 = 28678040^323 * 2^31 -.word 49172137 // zeta^259 * 2^31 = 28678040^259 * 2^31 -// block twisted -.word 2285599633 // zeta^355 * (q^(-1) mod 2^32) * 2^31 = 28678040^355 * 375649793 * 2^31 -.word 1420334345 // zeta^291 * (q^(-1) mod 2^32) * 2^31 = 28678040^291 * 375649793 * 2^31 -.word 1832318133 // zeta^323 * (q^(-1) mod 2^32) * 2^31 = 28678040^323 * 375649793 * 2^31 -.word 203443031 // zeta^259 * (q^(-1) mod 2^32) * 2^31 = 28678040^259 * 375649793 * 2^31 -// block -.word 41164657 // zeta^506 * 2^31 = 28678040^506 * 2^31 -.word 23553921 // zeta^378 * 2^31 = 28678040^378 * 2^31 -.word 51075303 // zeta^442 * 2^31 = 28678040^442 * 2^31 -.word 11244857 // zeta^314 * 2^31 = 28678040^314 * 2^31 -// block twisted -.word 2292337295 // zeta^506 * (q^(-1) mod 2^32) * 2^31 = 28678040^506 * 375649793 * 2^31 -.word 2218762879 // zeta^378 * (q^(-1) mod 2^32) * 2^31 = 28678040^378 * 375649793 * 2^31 -.word 3660688665 // zeta^442 * (q^(-1) mod 2^32) * 2^31 = 28678040^442 * 375649793 * 2^31 -.word 2196022471 // zeta^314 * (q^(-1) mod 2^32) * 2^31 = 28678040^314 * 375649793 * 2^31 -// block -.word 27161421 // zeta^509 * 2^31 = 28678040^509 * 2^31 -.word 12259351 // zeta^445 * 2^31 = 28678040^445 * 2^31 -.word 
42183787 // zeta^477 * 2^31 = 28678040^477 * 2^31 -.word 260949 // zeta^413 * 2^31 = 28678040^413 * 2^31 -// block twisted -.word 2261683891 // zeta^509 * (q^(-1) mod 2^32) * 2^31 = 28678040^509 * 375649793 * 2^31 -.word 183096809 // zeta^445 * (q^(-1) mod 2^32) * 2^31 = 28678040^445 * 375649793 * 2^31 -.word 2523693461 // zeta^477 * (q^(-1) mod 2^32) * 2^31 = 28678040^477 * 375649793 * 2^31 -.word 2895730347 // zeta^413 * (q^(-1) mod 2^32) * 2^31 = 28678040^413 * 375649793 * 2^31 -// block -.word 49379395 // zeta^381 * 2^31 = 28678040^381 * 2^31 -.word 45318697 // zeta^317 * 2^31 = 28678040^317 * 2^31 -.word 65417737 // zeta^349 * 2^31 = 28678040^349 * 2^31 -.word 60522221 // zeta^285 * 2^31 = 28678040^285 * 2^31 -// block twisted -.word 2945787325 // zeta^381 * (q^(-1) mod 2^32) * 2^31 = 28678040^381 * 375649793 * 2^31 -.word 2724075479 // zeta^317 * (q^(-1) mod 2^32) * 2^31 = 28678040^317 * 375649793 * 2^31 -.word 2827626487 // zeta^349 * (q^(-1) mod 2^32) * 2^31 = 28678040^349 * 375649793 * 2^31 -.word 482722579 // zeta^285 * (q^(-1) mod 2^32) * 2^31 = 28678040^285 * 375649793 * 2^31 -// block -.word 3629237 // zeta^474 * 2^31 = 28678040^474 * 2^31 -.word 60326323 // zeta^346 * 2^31 = 28678040^346 * 2^31 -.word 30569867 // zeta^410 * 2^31 = 28678040^410 * 2^31 -.word 31921231 // zeta^282 * 2^31 = 28678040^282 * 2^31 -// block twisted -.word 3571167563 // zeta^474 * (q^(-1) mod 2^32) * 2^31 = 28678040^474 * 375649793 * 2^31 -.word 3851189325 // zeta^346 * (q^(-1) mod 2^32) * 2^31 = 28678040^346 * 375649793 * 2^31 -.word 1517877365 // zeta^410 * (q^(-1) mod 2^32) * 2^31 = 28678040^410 * 375649793 * 2^31 -.word 1275593137 // zeta^282 * (q^(-1) mod 2^32) * 2^31 = 28678040^282 * 375649793 * 2^31 -// block -.word 51477925 // zeta^493 * 2^31 = 28678040^493 * 2^31 -.word 23177153 // zeta^429 * 2^31 = 28678040^429 * 2^31 -.word 42516129 // zeta^461 * 2^31 = 28678040^461 * 2^31 -.word 23261199 // zeta^397 * 2^31 = 28678040^397 * 2^31 -// block twisted -.word 1768092763 
// zeta^493 * (q^(-1) mod 2^32) * 2^31 = 28678040^493 * 375649793 * 2^31 -.word 2982666815 // zeta^429 * (q^(-1) mod 2^32) * 2^31 = 28678040^429 * 375649793 * 2^31 -.word 134581087 // zeta^461 * (q^(-1) mod 2^32) * 2^31 = 28678040^461 * 375649793 * 2^31 -.word 3424757233 // zeta^397 * (q^(-1) mod 2^32) * 2^31 = 28678040^397 * 375649793 * 2^31 -// block -.word 50523083 // zeta^365 * 2^31 = 28678040^365 * 2^31 -.word 29024109 // zeta^301 * 2^31 = 28678040^301 * 2^31 -.word 62634975 // zeta^333 * 2^31 = 28678040^333 * 2^31 -.word 5116371 // zeta^269 * 2^31 = 28678040^269 * 2^31 -// block twisted -.word 2363949621 // zeta^365 * (q^(-1) mod 2^32) * 2^31 = 28678040^365 * 375649793 * 2^31 -.word 2792055443 // zeta^301 * (q^(-1) mod 2^32) * 2^31 = 28678040^301 * 375649793 * 2^31 -.word 3296655905 // zeta^333 * (q^(-1) mod 2^32) * 2^31 = 28678040^333 * 375649793 * 2^31 -.word 4093127725 // zeta^269 * (q^(-1) mod 2^32) * 2^31 = 28678040^269 * 375649793 * 2^31 -// block -.word 55626043 // zeta^490 * 2^31 = 28678040^490 * 2^31 -.word 15630981 // zeta^362 * 2^31 = 28678040^362 * 2^31 -.word 43717491 // zeta^426 * 2^31 = 28678040^426 * 2^31 -.word 14342369 // zeta^298 * 2^31 = 28678040^298 * 2^31 -// block twisted -.word 2004845765 // zeta^490 * (q^(-1) mod 2^32) * 2^31 = 28678040^490 * 375649793 * 2^31 -.word 3862343547 // zeta^362 * (q^(-1) mod 2^32) * 2^31 = 28678040^362 * 375649793 * 2^31 -.word 2436590221 // zeta^426 * (q^(-1) mod 2^32) * 2^31 = 28678040^426 * 375649793 * 2^31 -.word 2109337887 // zeta^298 * (q^(-1) mod 2^32) * 2^31 = 28678040^298 * 375649793 * 2^31 -// block -.word 6776583 // zeta^501 * 2^31 = 28678040^501 * 2^31 -.word 33530533 // zeta^437 * 2^31 = 28678040^437 * 2^31 -.word 43598203 // zeta^469 * 2^31 = 28678040^469 * 2^31 -.word 59373651 // zeta^405 * 2^31 = 28678040^405 * 2^31 -// block twisted -.word 820174585 // zeta^501 * (q^(-1) mod 2^32) * 2^31 = 28678040^501 * 375649793 * 2^31 -.word 1139199835 // zeta^437 * (q^(-1) mod 2^32) * 2^31 = 
28678040^437 * 375649793 * 2^31 -.word 3555298437 // zeta^469 * (q^(-1) mod 2^32) * 2^31 = 28678040^469 * 375649793 * 2^31 -.word 1035814317 // zeta^405 * (q^(-1) mod 2^32) * 2^31 = 28678040^405 * 375649793 * 2^31 -// block -.word 37946425 // zeta^373 * 2^31 = 28678040^373 * 2^31 -.word 47668559 // zeta^309 * 2^31 = 28678040^309 * 2^31 -.word 10775673 // zeta^341 * 2^31 = 28678040^341 * 2^31 -.word 3826249 // zeta^277 * 2^31 = 28678040^277 * 2^31 -// block twisted -.word 262354375 // zeta^373 * (q^(-1) mod 2^32) * 2^31 = 28678040^373 * 375649793 * 2^31 -.word 703707313 // zeta^309 * (q^(-1) mod 2^32) * 2^31 = 28678040^309 * 375649793 * 2^31 -.word 2790542727 // zeta^341 * (q^(-1) mod 2^32) * 2^31 = 28678040^341 * 375649793 * 2^31 -.word 2635626423 // zeta^277 * (q^(-1) mod 2^32) * 2^31 = 28678040^277 * 375649793 * 2^31 -// block -.word 53733071 // zeta^458 * 2^31 = 28678040^458 * 2^31 -.word 10734019 // zeta^330 * 2^31 = 28678040^330 * 2^31 -.word 25306471 // zeta^394 * 2^31 = 28678040^394 * 2^31 -.word 54139625 // zeta^266 * 2^31 = 28678040^266 * 2^31 -// block twisted -.word 284438321 // zeta^458 * (q^(-1) mod 2^32) * 2^31 = 28678040^458 * 375649793 * 2^31 -.word 3541161021 // zeta^330 * (q^(-1) mod 2^32) * 2^31 = 28678040^330 * 375649793 * 2^31 -.word 2646073497 // zeta^394 * (q^(-1) mod 2^32) * 2^31 = 28678040^394 * 375649793 * 2^31 -.word 3100573463 // zeta^266 * (q^(-1) mod 2^32) * 2^31 = 28678040^266 * 375649793 * 2^31 -// block -.word 1468391 // zeta^485 * 2^31 = 28678040^485 * 2^31 -.word 4426959 // zeta^421 * 2^31 = 28678040^421 * 2^31 -.word 42735737 // zeta^453 * 2^31 = 28678040^453 * 2^31 -.word 38665093 // zeta^389 * 2^31 = 28678040^389 * 2^31 -// block twisted -.word 1874632217 // zeta^485 * (q^(-1) mod 2^32) * 2^31 = 28678040^485 * 375649793 * 2^31 -.word 3630205233 // zeta^421 * (q^(-1) mod 2^32) * 2^31 = 28678040^421 * 375649793 * 2^31 -.word 2166661511 // zeta^453 * (q^(-1) mod 2^32) * 2^31 = 28678040^453 * 375649793 * 2^31 -.word 1536243323 // 
zeta^389 * (q^(-1) mod 2^32) * 2^31 = 28678040^389 * 375649793 * 2^31 -// block -.word 33133879 // zeta^357 * 2^31 = 28678040^357 * 2^31 -.word 7139481 // zeta^293 * 2^31 = 28678040^293 * 2^31 -.word 8438111 // zeta^325 * 2^31 = 28678040^325 * 2^31 -.word 50341189 // zeta^261 * 2^31 = 28678040^261 * 2^31 -// block twisted -.word 3126759625 // zeta^357 * (q^(-1) mod 2^32) * 2^31 = 28678040^357 * 375649793 * 2^31 -.word 523569511 // zeta^293 * (q^(-1) mod 2^32) * 2^31 = 28678040^293 * 375649793 * 2^31 -.word 1408300193 // zeta^325 * (q^(-1) mod 2^32) * 2^31 = 28678040^325 * 375649793 * 2^31 -.word 2172685499 // zeta^261 * (q^(-1) mod 2^32) * 2^31 = 28678040^261 * 375649793 * 2^31 -// block -.word 47558821 // zeta^498 * 2^31 = 28678040^498 * 2^31 -.word 33268441 // zeta^370 * 2^31 = 28678040^370 * 2^31 -.word 63536237 // zeta^434 * 2^31 = 28678040^434 * 2^31 -.word 26272521 // zeta^306 * 2^31 = 28678040^306 * 2^31 -// block twisted -.word 664584539 // zeta^498 * (q^(-1) mod 2^32) * 2^31 = 28678040^498 * 375649793 * 2^31 -.word 2409420583 // zeta^370 * (q^(-1) mod 2^32) * 2^31 = 28678040^370 * 375649793 * 2^31 -.word 3799958931 // zeta^434 * (q^(-1) mod 2^32) * 2^31 = 28678040^434 * 375649793 * 2^31 -.word 835286775 // zeta^306 * (q^(-1) mod 2^32) * 2^31 = 28678040^306 * 375649793 * 2^31 -// block -.word 1854317 // zeta^505 * 2^31 = 28678040^505 * 2^31 -.word 2223865 // zeta^441 * 2^31 = 28678040^441 * 2^31 -.word 22962475 // zeta^473 * 2^31 = 28678040^473 * 2^31 -.word 36888515 // zeta^409 * 2^31 = 28678040^409 * 2^31 -// block twisted -.word 1178728083 // zeta^505 * (q^(-1) mod 2^32) * 2^31 = 28678040^505 * 375649793 * 2^31 -.word 2481965831 // zeta^441 * (q^(-1) mod 2^32) * 2^31 = 28678040^441 * 375649793 * 2^31 -.word 128011477 // zeta^473 * (q^(-1) mod 2^32) * 2^31 = 28678040^473 * 375649793 * 2^31 -.word 3495870013 // zeta^409 * (q^(-1) mod 2^32) * 2^31 = 28678040^409 * 375649793 * 2^31 -// block -.word 59868297 // zeta^377 * 2^31 = 28678040^377 * 2^31 -.word 
15191207 // zeta^313 * 2^31 = 28678040^313 * 2^31 -.word 59108143 // zeta^345 * 2^31 = 28678040^345 * 2^31 -.word 4355773 // zeta^281 * 2^31 = 28678040^281 * 2^31 -// block twisted -.word 538432887 // zeta^377 * (q^(-1) mod 2^32) * 2^31 = 28678040^377 * 375649793 * 2^31 -.word 3252336985 // zeta^313 * (q^(-1) mod 2^32) * 2^31 = 28678040^313 * 375649793 * 2^31 -.word 1330506449 // zeta^345 * (q^(-1) mod 2^32) * 2^31 = 28678040^345 * 375649793 * 2^31 -.word 4169984835 // zeta^281 * (q^(-1) mod 2^32) * 2^31 = 28678040^281 * 375649793 * 2^31 -// block -.word 27411989 // zeta^466 * 2^31 = 28678040^466 * 2^31 -.word 52176833 // zeta^338 * 2^31 = 28678040^338 * 2^31 -.word 52660121 // zeta^402 * 2^31 = 28678040^402 * 2^31 -.word 23140553 // zeta^274 * 2^31 = 28678040^274 * 2^31 -// block twisted -.word 652643307 // zeta^466 * (q^(-1) mod 2^32) * 2^31 = 28678040^466 * 375649793 * 2^31 -.word 4178403903 // zeta^338 * (q^(-1) mod 2^32) * 2^31 = 28678040^338 * 375649793 * 2^31 -.word 1113879143 // zeta^402 * (q^(-1) mod 2^32) * 2^31 = 28678040^402 * 375649793 * 2^31 -.word 3574776119 // zeta^274 * (q^(-1) mod 2^32) * 2^31 = 28678040^274 * 375649793 * 2^31 -// block -.word 50275685 // zeta^489 * 2^31 = 28678040^489 * 2^31 -.word 12903773 // zeta^425 * 2^31 = 28678040^425 * 2^31 -.word 25228433 // zeta^457 * 2^31 = 28678040^457 * 2^31 -.word 55395235 // zeta^393 * 2^31 = 28678040^393 * 2^31 -// block twisted -.word 2869087387 // zeta^489 * (q^(-1) mod 2^32) * 2^31 = 28678040^489 * 375649793 * 2^31 -.word 433896611 // zeta^425 * (q^(-1) mod 2^32) * 2^31 = 28678040^425 * 375649793 * 2^31 -.word 157857135 // zeta^457 * (q^(-1) mod 2^32) * 2^31 = 28678040^457 * 375649793 * 2^31 -.word 2477464157 // zeta^393 * (q^(-1) mod 2^32) * 2^31 = 28678040^393 * 375649793 * 2^31 -// block -.word 3868449 // zeta^361 * 2^31 = 28678040^361 * 2^31 -.word 66432231 // zeta^297 * 2^31 = 28678040^297 * 2^31 -.word 31236859 // zeta^329 * 2^31 = 28678040^329 * 2^31 -.word 13658415 // zeta^265 * 2^31 = 
28678040^265 * 2^31
-// block twisted
-.word 2938651359 // zeta^361 * (q^(-1) mod 2^32) * 2^31 = 28678040^361 * 375649793 * 2^31
-.word 814700825 // zeta^297 * (q^(-1) mod 2^32) * 2^31 = 28678040^297 * 375649793 * 2^31
-.word 1618291461 // zeta^329 * (q^(-1) mod 2^32) * 2^31 = 28678040^329 * 375649793 * 2^31
-.word 49245393 // zeta^265 * (q^(-1) mod 2^32) * 2^31 = 28678040^265 * 375649793 * 2^31
-// block
-.word 34409967 // zeta^482 * 2^31 = 28678040^482 * 2^31
-.word 12619783 // zeta^354 * 2^31 = 28678040^354 * 2^31
-.word 54561811 // zeta^418 * 2^31 = 28678040^418 * 2^31
-.word 61632377 // zeta^290 * 2^31 = 28678040^290 * 2^31
-// block twisted
-.word 2233616401 // zeta^482 * (q^(-1) mod 2^32) * 2^31 = 28678040^482 * 375649793 * 2^31
-.word 2820912633 // zeta^354 * (q^(-1) mod 2^32) * 2^31 = 28678040^354 * 375649793 * 2^31
-.word 684470765 // zeta^418 * (q^(-1) mod 2^32) * 2^31 = 28678040^418 * 375649793 * 2^31
-.word 3345631879 // zeta^290 * (q^(-1) mod 2^32) * 2^31 = 28678040^290 * 375649793 * 2^31
-// block
-.word 7605279 // zeta^497 * 2^31 = 28678040^497 * 2^31
-.word 58319315 // zeta^433 * 2^31 = 28678040^433 * 2^31
-.word 16342937 // zeta^465 * 2^31 = 28678040^465 * 2^31
-.word 48148431 // zeta^401 * 2^31 = 28678040^401 * 2^31
-// block twisted
-.word 568928737 // zeta^497 * (q^(-1) mod 2^32) * 2^31 = 28678040^497 * 375649793 * 2^31
-.word 1726766125 // zeta^433 * (q^(-1) mod 2^32) * 2^31 = 28678040^433 * 375649793 * 2^31
-.word 1056873063 // zeta^465 * (q^(-1) mod 2^32) * 2^31 = 28678040^465 * 375649793 * 2^31
-.word 958621233 // zeta^401 * (q^(-1) mod 2^32) * 2^31 = 28678040^401 * 375649793 * 2^31
-// block
-.word 62377755 // zeta^369 * 2^31 = 28678040^369 * 2^31
-.word 35459369 // zeta^305 * 2^31 = 28678040^305 * 2^31
-.word 27513701 // zeta^337 * 2^31 = 28678040^337 * 2^31
-.word 18346679 // zeta^273 * 2^31 = 28678040^273 * 2^31
-// block twisted
-.word 4057153253 // zeta^369 * (q^(-1) mod 2^32) * 2^31 = 28678040^369 * 375649793 * 2^31
-.word 3867838679 // zeta^305 * (q^(-1) mod 2^32) * 2^31 = 28678040^305 * 375649793 * 2^31
-.word 589962907 // zeta^337 * (q^(-1) mod 2^32) * 2^31 = 28678040^337 * 375649793 * 2^31
-.word 1692873545 // zeta^273 * (q^(-1) mod 2^32) * 2^31 = 28678040^273 * 375649793 * 2^31
-// block
-.word 1824951 // zeta^450 * 2^31 = 28678040^450 * 2^31
-.word 40410247 // zeta^322 * 2^31 = 28678040^322 * 2^31
-.word 25935987 // zeta^386 * 2^31 = 28678040^386 * 2^31
-.word 53409853 // zeta^258 * 2^31 = 28678040^258 * 2^31
-// block twisted
-.word 3034533193 // zeta^450 * (q^(-1) mod 2^32) * 2^31 = 28678040^450 * 375649793 * 2^31
-.word 1425582457 // zeta^322 * (q^(-1) mod 2^32) * 2^31 = 28678040^322 * 375649793 * 2^31
-.word 1695333773 // zeta^386 * (q^(-1) mod 2^32) * 2^31 = 28678040^386 * 375649793 * 2^31
-.word 2628741571 // zeta^258 * (q^(-1) mod 2^32) * 2^31 = 28678040^258 * 375649793 * 2^31
-// block
-.word 44896477 // zeta^481 * 2^31 = 28678040^481 * 2^31
-.word 66621379 // zeta^417 * 2^31 = 28678040^417 * 2^31
-.word 35702907 // zeta^449 * 2^31 = 28678040^449 * 2^31
-.word 44158149 // zeta^385 * 2^31 = 28678040^385 * 2^31
-// block twisted
-.word 732401955 // zeta^481 * (q^(-1) mod 2^32) * 2^31 = 28678040^481 * 375649793 * 2^31
-.word 3346599485 // zeta^417 * (q^(-1) mod 2^32) * 2^31 = 28678040^417 * 375649793 * 2^31
-.word 1671955845 // zeta^449 * (q^(-1) mod 2^32) * 2^31 = 28678040^449 * 375649793 * 2^31
-.word 1684661563 // zeta^385 * (q^(-1) mod 2^32) * 2^31 = 28678040^385 * 375649793 * 2^31
-// block
-.word 32881793 // zeta^353 * 2^31 = 28678040^353 * 2^31
-.word 18033685 // zeta^289 * 2^31 = 28678040^289 * 2^31
-.word 29367795 // zeta^321 * 2^31 = 28678040^321 * 2^31
-.word 16787671 // zeta^257 * 2^31 = 28678040^257 * 2^31
-// block twisted
-.word 3741535615 // zeta^353 * (q^(-1) mod 2^32) * 2^31 = 28678040^353 * 375649793 * 2^31
-.word 3094455787 // zeta^289 * (q^(-1) mod 2^32) * 2^31 = 28678040^289 * 375649793 * 2^31
-.word 3934216205 // zeta^321 * (q^(-1) mod 2^32) * 2^31 = 28678040^321 * 375649793 * 2^31
-.word 2459712809 // zeta^257 * (q^(-1) mod 2^32) * 2^31 = 28678040^257 * 375649793 * 2^31
-.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31
-.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31
-.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31
-.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31
-.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31
-.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31
-.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31
-.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31
-.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31
-.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31
-.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31
-.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31
-.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31
-.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31
-.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31
-.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31
-.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31
-.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31
-.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31
-.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31
-.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31
-.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31
-.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31
-.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31
-.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31
-.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31
-.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31
-.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31
-.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31
-.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31
-.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31
-.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31
-.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31
-.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31
-.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31
-.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31
-.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31
-.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31
-.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31
-.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31
-.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31
-.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31
-.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31
-.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31
-.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31
-.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31
-.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31
-.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31
-.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31
-.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31
-.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31
-.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31
-.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31
-.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31
-.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31
-.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31
-.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31
-.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31
-.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31
-.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31
-.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31
-.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31
-.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31
-.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31
-.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31
-.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31
-.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31
-.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31
-.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31
-.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31
-.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31
-.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31
-.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31
-.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31
-.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31
-.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31
-.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31
-.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31
-.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31
-.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31
-.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31
-.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31
-.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31
-.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31
-.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31
-.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31
-.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31
-.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31
-.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31
-.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31
-.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31
-.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31
-.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31
-.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31
-.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31
-.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31
-.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31
-.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31
-.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31
-.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31
-.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31
-.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31
-.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31
-.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31
-.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31
-.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31
-.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31
-.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31
-.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31
-.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31
-.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31
-.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31
-.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31
-.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31
-.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31
-.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31
-.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31
-.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31
-.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31
-.word 38018305 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31
-.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31
-.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31
-.text
-
-// Montgomery multiplication via rounding
-.macro mulmod dst, src, const, const_twisted
- vqrdmulh.s32 \dst, \src, \const
- vmul.u32 \src, \src, \const_twisted
- vqrdmlah.s32 \dst, \src, modulus
-.endm
-
-.macro gs_butterfly a, b, root, root_twisted
- vsub.u32 tmp, \a, \b
- vadd.u32 \a, \a, \b
- mulmod \b, tmp, \root, \root_twisted
-.endm
-
-.align 4
-roots_addr: .word roots_inv
-.syntax unified
-.type invntt_n256_u32_33556993_28678040_complete_manual, %function
-.global invntt_n256_u32_33556993_28678040_complete_manual
-invntt_n256_u32_33556993_28678040_complete_manual:
-
- push {r4-r11,lr}
- // Save MVE vector registers
- vpush {d8-d15}
-
- modulus .req r12
- root_ptr .req r11
-
- .equ modulus_const, 33556993
- movw modulus, #:lower16:modulus_const
- movt modulus, #:upper16:modulus_const
- ldr root_ptr, roots_addr
-
- in .req r0
-
- data0 .req q0
- data1 .req q1
- data2 .req q2
- data3 .req q3
-
- root0 .req q5
- root0_twisted .req q6
- root1 .req q5
- root1_twisted .req q6
- root2 .req q5
- root2_twisted .req q6
-
-
- tmp .req q4
-
- /* Layers 7,8 */
-
- mov lr, #16
- vldrw.u32 q2, [in, #48] // *....
- nop // ....*
- vldrw.u32 q3, [in, #32] // ..*..
- nop // ...*.
- vldrw.u32 q6, [root_ptr, #64] // .*...
-
- // original source code
- // vldrw.u32 q2, [in, #48] // *....
- // vldrw.u32 q6, [root_ptr, #64] // ....*
- // vldrw.u32 q3, [in, #32] // ..*..
- // nop // ...*.
- // nop // .*...
-
- sub lr, lr, #1
- wls lr, lr, layer78_loop_end
-layer78_loop:
- vsub.u32 q7, q3, q2 // .............*....................
- vqrdmulh.s32 q6, q7, q6 // ...............*..................
- vldrw.u32 q0, [in, #16] // .*................................
- vadd.u32 q1, q3, q2 // ..............*...................
- vldrw.u32 q4, [in] // *.................................
- vsub.u32 q3, q4, q0 // ......*...........................
- vldrw.u32 q2, [root_ptr, #32] // ....*.............................
- vadd.u32 q4, q4, q0 // .......*..........................
- vqrdmulh.s32 q2, q3, q2 // ........*.........................
- vsub.u32 q0, q4, q1 // ....................*.............
- vldrw.u32 q5, [root_ptr, #80] // ............*.....................
- vmul.u32 q7, q7, q5 // ................*.................
- vldrw.u32 q5, [root_ptr] , #96 // ..................*...............
- vqrdmlah.s32 q6, q7, modulus // .................*................
- vadd.u32 q7, q4, q1 // .....................*............
- vqrdmulh.s32 q4, q0, q5 // ......................*...........
- vldrw.u32 q1, [root_ptr, #-48] // .....*............................
- vmul.u32 q3, q3, q1 // .........*........................
- vldrw.u32 q1, [root_ptr, #-80] // ...................*..............
- vqrdmlah.s32 q2, q3, modulus // ..........*.......................
- vstrw.u32 q7, [in] , #64 // ..............................*...
- vadd.u32 q7, q2, q6 // ..........................*.......
- vstrw.u32 q7, [in, #-48] // ...............................*..
- vsub.u32 q7, q2, q6 // .........................*........
- vmul.u32 q3, q7, q1 // ............................*.....
- vldrw.u32 q2, [in, #48] // ...e..............................
- vqrdmulh.s32 q7, q7, q5 // ...........................*......
- vldrw.u32 q6, [root_ptr, #64] // ...........e......................
- vqrdmlah.s32 q7, q3, modulus // .............................*....
- vldrw.u32 q3, [in, #32] // ..e...............................
- vmul.u32 q5, q0, q1 // .......................*..........
- vstrw.u32 q7, [in, #-16] // .................................*
- vqrdmlah.s32 q4, q5, modulus // ........................*.........
- vstrw.u32 q4, [in, #-32] // ................................*.
-
- // original source code
- // vldrw.u32 data0, [in] // .............*.............................
- // vldrw.u32 data1, [in, #16] // ...........*...............................
- // vldrw.u32 data2, [in, #32] // ....e......................................
- // vldrw.u32 data3, [in, #48] // e..........................................
- // vldrw.u32 root1, [root_ptr, #32] // ...............*...........................
- // vldrw.u32 root1_twisted, [root_ptr, #48] // .........................*.................
- // vsub.u32 tmp, data0, data1 // ..............*............................
- // vadd.u32 data0, data0, data1 // ................*..........................
- // vqrdmulh.s32 data1, tmp, root1 // .................*.........................
- // vmul.u32 tmp, tmp, root1_twisted // ..........................*................
- // vqrdmlah.s32 data1, tmp, modulus // ............................*..............
- // vldrw.u32 root2, [root_ptr, #64] // ..e........................................
- // vldrw.u32 root2_twisted, [root_ptr, #80] // ...................*.......................
- // vsub.u32 tmp, data2, data3 // .........*.................................
- // vadd.u32 data2, data2, data3 // ............*..............................
- // vqrdmulh.s32 data3, tmp, root2 // ..........*................................
- // vmul.u32 tmp, tmp, root2_twisted // ....................*......................
- // vqrdmlah.s32 data3, tmp, modulus // ......................*....................
- // vldrw.u32 root0, [root_ptr] , #96 // .....................*.....................
- // vldrw.u32 root0_twisted, [root_ptr, #-80] // ...........................*...............
- // vsub.u32 tmp, data0, data2 // ..................*........................
- // vadd.u32 data0, data0, data2 // .......................*...................
- // vqrdmulh.s32 data2, tmp, root0 // ........................*..................
- // vmul.u32 tmp, tmp, root0_twisted // .......................................*...
- // vqrdmlah.s32 data2, tmp, modulus // .........................................*.
- // vsub.u32 tmp, data1, data3 // ................................*..........
- // vadd.u32 data1, data1, data3 // ..............................*............
- // vqrdmulh.s32 data3, tmp, root0 // ...................................*.......
- // vmul.u32 tmp, tmp, root0_twisted // .................................*.........
- // vqrdmlah.s32 data3, tmp, modulus // .....................................*.....
- // vstrw.u32 data0, [in] , #64 // .............................*.............
- // vstrw.u32 data1, [in, #-48] // ...............................*...........
- // vstrw.u32 data2, [in, #-32] // ..........................................*
- // vstrw.u32 data3, [in, #-16] // ........................................*..
-
- le lr, layer78_loop
-layer78_loop_end:
- vadd.u32 q4, q3, q2 // ...*...........................
- vldrw.u32 q7, [in, #16] // ..*............................
- vsub.u32 q1, q3, q2 // *..............................
- vldrw.u32 q5, [in] // ....*..........................
- vsub.u32 q3, q5, q7 // .....*.........................
- vldrw.u32 q0, [root_ptr] , #96 // ............*..................
- vadd.u32 q7, q5, q7 // .......*.......................
- vqrdmulh.s32 q6, q1, q6 // .*.............................
- vldrw.u32 q5, [root_ptr, #-16] // ..........*....................
- vmul.u32 q1, q1, q5 // ...........*...................
- vsub.u32 q5, q7, q4 // .........*.....................
- vqrdmlah.s32 q6, q1, modulus // .............*.................
- vadd.u32 q4, q7, q4 // ..............*................
- vqrdmulh.s32 q1, q5, q0 // ...............*...............
- vldrw.u32 q2, [root_ptr, #-64] // ......*........................
- vqrdmulh.s32 q2, q3, q2 // ........*......................
- vldrw.u32 q7, [root_ptr, #-48] // ................*..............
- vmul.u32 q3, q3, q7 // .................*.............
- vldrw.u32 q7, [root_ptr, #-80] // ..................*............
- vqrdmlah.s32 q2, q3, modulus // ...................*...........
- vstrw.u32 q4, [in] , #64 // ....................*..........
- vmul.u32 q5, q5, q7 // ...........................*...
- vadd.u32 q4, q2, q6 // .....................*.........
- vqrdmlah.s32 q1, q5, modulus // .............................*.
- vsub.u32 q5, q2, q6 // .......................*.......
- vmul.u32 q3, q5, q7 // ........................*......
- vstrw.u32 q4, [in, #-48] // ......................*........
- vqrdmulh.s32 q4, q5, q0 // .........................*.....
- vstrw.u32 q1, [in, #-32] // ..............................*
- vqrdmlah.s32 q4, q3, modulus // ..........................*....
- vstrw.u32 q4, [in, #-16] // ............................*..
-
- // original source code
- // vsub.u32 q7, q3, q2 // ..*............................
- // vqrdmulh.s32 q6, q7, q6 // .......*.......................
- // vldrw.u32 q0, [in, #16] // .*.............................
- // vadd.u32 q1, q3, q2 // *..............................
- // vldrw.u32 q4, [in] // ...*...........................
- // vsub.u32 q3, q4, q0 // ....*..........................
- // vldrw.u32 q2, [root_ptr, #32] // ..............*................
- // vadd.u32 q4, q4, q0 // ......*........................
- // vqrdmulh.s32 q2, q3, q2 // ...............*...............
- // vsub.u32 q0, q4, q1 // ..........*....................
- // vldrw.u32 q5, [root_ptr, #80] // ........*......................
- // vmul.u32 q7, q7, q5 // .........*.....................
- // vldrw.u32 q5, [root_ptr] , #96 // .....*.........................
- // vqrdmlah.s32 q6, q7, modulus // ...........*...................
- // vadd.u32 q7, q4, q1 // ............*..................
- // vqrdmulh.s32 q4, q0, q5 // .............*.................
- // vldrw.u32 q1, [root_ptr, #-48] // ................*..............
- // vmul.u32 q3, q3, q1 // .................*.............
- // vldrw.u32 q1, [root_ptr, #-80] // ..................*............
- // vqrdmlah.s32 q2, q3, modulus // ...................*...........
- // vstrw.u32 q7, [in] , #64 // ....................*..........
- // vadd.u32 q7, q2, q6 // ......................*........
- // vstrw.u32 q7, [in, #-48] // ..........................*....
- // vsub.u32 q7, q2, q6 // ........................*......
- // vmul.u32 q3, q7, q1 // .........................*.....
- // vqrdmulh.s32 q7, q7, q5 // ...........................*...
- // vqrdmlah.s32 q7, q3, modulus // .............................*.
- // vmul.u32 q5, q0, q1 // .....................*.........
- // vstrw.u32 q7, [in, #-16] // ..............................*
- // vqrdmlah.s32 q4, q5, modulus // .......................*.......
- // vstrw.u32 q4, [in, #-32] // ............................*..
-
-
-
- sub in, in, #(4*256)
-
- .unreq root0
- .unreq root0_twisted
- .unreq root1
- .unreq root1_twisted
- .unreq root2
- .unreq root2_twisted
-
- root0 .req r2
- root0_twisted .req r3
- root1 .req r4
- root1_twisted .req r5
- root2 .req r6
- root2_twisted .req r7
-
- /* Layers 5,6 */
-
- mov lr, #16
- ldrd r10, r6, [root_ptr, #16] // *.......
- vld40.u32 {q0,q1,q2,q3}, [in] // .*......
- nop // .......*
- vld41.u32 {q0,q1,q2,q3}, [in] // ..*.....
- nop // ......*.
- vld42.u32 {q0,q1,q2,q3}, [in] // ...*....
- nop // .....*..
- vld43.u32 {q0,q1,q2,q3}, [in] // ....*...
-
- // original source code
- // ldrd r10, r6, [root_ptr, #16] // *.......
- // vld40.u32 {q0,q1,q2,q3}, [in] // .*......
- // vld41.u32 {q0,q1,q2,q3}, [in] // ...*....
- // vld42.u32 {q0,q1,q2,q3}, [in] // .....*..
- // vld43.u32 {q0,q1,q2,q3}, [in] // .......*
- // nop // ......*.
- // nop // ....*...
- // nop // ..*.....
-
- sub lr, lr, #1
- wls lr, lr, layer56_loop_end
-layer56_loop:
- vsub.u32 q6, q2, q3 // ............*..................
- vmul.u32 q5, q6, r6 // ...............*...............
- ldrd r9, r4, [root_ptr, #8] // .*.............................
- vqrdmulh.s32 q6, q6, r10 // ..............*................
- vsub.u32 q4, q0, q1 // .......*.......................
- vmul.u32 q7, q4, r4 // ..........*....................
- ldrd r8, r5, [root_ptr] , #24 // *..............................
- vqrdmlah.s32 q6, q5, modulus // ................*..............
- vadd.u32 q0, q0, q1 // ........*......................
- vqrdmulh.s32 q5, q4, r9 // .........*.....................
- vadd.u32 q4, q2, q3 // .............*.................
- vqrdmlah.s32 q5, q7, modulus // ...........*...................
- ldrd r10, r6, [root_ptr, #16] // ..e............................
- vadd.u32 q7, q5, q6 // .......................*.......
- vstrw.u32 q7, [in, #16] // ............................*..
- vadd.u32 q7, q0, q4 // ..................*............
- vstrw.u32 q7, [in] , #64 // ...........................*...
- vsub.u32 q7, q0, q4 // .................*.............
- vqrdmulh.s32 q4, q7, r8 // ...................*...........
- vld40.u32 {q0,q1,q2,q3}, [in] // ...e...........................
- vmul.u32 q7, q7, r5 // ....................*..........
- vld41.u32 {q0,q1,q2,q3}, [in] // ....e..........................
- vqrdmlah.s32 q4, q7, modulus // .....................*.........
- vstrw.u32 q4, [in, #-32] // .............................*.
- vsub.u32 q4, q5, q6 // ......................*........
- vqrdmulh.s32 q7, q4, r8 // ........................*......
- vld42.u32 {q0,q1,q2,q3}, [in] // .....e.........................
- vmul.u32 q4, q4, r5 // .........................*.....
- vld43.u32 {q0,q1,q2,q3}, [in] // ......e........................
- vqrdmlah.s32 q7, q4, modulus // ..........................*....
- vstrw.u32 q7, [in, #-16] // ..............................*
-
- // original source code
- // ldrd root0, root0_twisted, [root_ptr] , #24 // .........................*........................
- // ldrd root1, root1_twisted, [root_ptr, #-16] // .....................*............................
- // ldrd root2, root2_twisted, [root_ptr, #-8] // e.................................................
- // vld40.u32 {data0,data1,data2,data3}, [in] // .......e..........................................
- // vld41.u32 {data0,data1,data2,data3}, [in] // .........e........................................
- // vld42.u32 {data0,data1,data2,data3}, [in] // ..............e...................................
- // vld43.u32 {data0,data1,data2,data3}, [in] // ................e.................................
- // vsub.u32 tmp, data0, data1 // .......................*..........................
- // vadd.u32 data0, data0, data1 // ...........................*......................
- // vqrdmulh.s32 data1, tmp, root1 // ............................*.....................
- // vmul.u32 tmp, tmp, root1_twisted // ........................*.........................
- // vqrdmlah.s32 data1, tmp, modulus // ..............................*...................
- // vsub.u32 tmp, data2, data3 // ...................*..............................
- // vadd.u32 data2, data2, data3 // .............................*....................
- // vqrdmulh.s32 data3, tmp, root2 // ......................*...........................
- // vmul.u32 tmp, tmp, root2_twisted // ....................*.............................
- // vqrdmlah.s32 data3, tmp, modulus // ..........................*.......................
- // vsub.u32 tmp, data0, data2 // ....................................*.............
- // vadd.u32 data0, data0, data2 // ..................................*...............
- // vqrdmulh.s32 data2, tmp, root0 // .....................................*............
- // vmul.u32 tmp, tmp, root0_twisted // .......................................*..........
- // vqrdmlah.s32 data2, tmp, modulus // .........................................*........
- // vsub.u32 tmp, data1, data3 // ...........................................*......
- // vadd.u32 data1, data1, data3 // ................................*.................
- // vqrdmulh.s32 data3, tmp, root0 // ............................................*.....
- // vmul.u32 tmp, tmp, root0_twisted // ..............................................*...
- // vqrdmlah.s32 data3, tmp, modulus // ................................................*.
- // vstrw.u32 data0, [in] , #64 // ...................................*..............
- // vstrw.u32 data1, [in, #-48] // .................................*................
- // vstrw.u32 data2, [in, #-32] // ..........................................*.......
- // vstrw.u32 data3, [in, #-16] // .................................................*
-
- le lr, layer56_loop
-layer56_loop_end:
- vsub.u32 q4, q2, q3 // *...........................
- vmul.u32 q6, q4, r6 // .*..........................
- ldrd r8, r5, [root_ptr, #8] // ..*.........................
- vqrdmulh.s32 q7, q4, r10 // ...*........................
- vsub.u32 q4, q0, q1 // ....*.......................
- vqrdmulh.s32 q5, q4, r8 // .........*..................
- ldrd r9, r4, [root_ptr] , #24 // ......*.....................
- vmul.u32 q4, q4, r5 // .....*......................
- vadd.u32 q0, q0, q1 // ........*...................
- vqrdmlah.s32 q7, q6, modulus // .......*....................
- nop // ...........................*
- vqrdmlah.s32 q5, q4, modulus // ...........*................
- nop // ..........................*.
- vadd.u32 q6, q5, q7 // ............*...............
- vstrw.u32 q6, [in, #16] // .............*..............
- vsub.u32 q6, q5, q7 // .....................*......
- vmul.u32 q5, q6, r4 // .......................*....
- vadd.u32 q4, q2, q3 // ..........*.................
- vqrdmulh.s32 q7, q6, r9 // ......................*.....
- vadd.u32 q6, q0, q4 // ..............*.............
- vqrdmlah.s32 q7, q5, modulus // ........................*...
- vsub.u32 q4, q0, q4 // ................*...........
- vqrdmulh.s32 q5, q4, r9 // .................*..........
- vstrw.u32 q6, [in] , #64 // ...............*............
- vmul.u32 q4, q4, r4 // ..................*.........
- vstrw.u32 q7, [in, #-16] // .........................*..
- vqrdmlah.s32 q5, q4, modulus // ...................*........
- vstrw.u32 q5, [in, #-32] // ....................*.......
-
- // original source code
- // vsub.u32 q6, q2, q3 // *...........................
- // vmul.u32 q5, q6, r6 // .*..........................
- // ldrd r9, r4, [root_ptr, #8] // ..*.........................
- // vqrdmulh.s32 q6, q6, r10 // ...*........................
- // vsub.u32 q4, q0, q1 // ....*.......................
- // vmul.u32 q7, q4, r4 // .......*....................
- // ldrd r8, r5, [root_ptr] , #24 // ......*.....................
- // vqrdmlah.s32 q6, q5, modulus // .........*..................
- // vadd.u32 q0, q0, q1 // ........*...................
- // vqrdmulh.s32 q5, q4, r9 // .....*......................
- // vadd.u32 q4, q2, q3 // .................*..........
- // vqrdmlah.s32 q5, q7, modulus // ...........*................
- // vadd.u32 q7, q5, q6 // .............*..............
- // vstrw.u32 q7, [in, #16] // ..............*.............
- // vadd.u32 q7, q0, q4 // ...................*........
- // vstrw.u32 q7, [in] , #64 // .......................*....
- // vsub.u32 q7, q0, q4 // .....................*......
- // vqrdmulh.s32 q4, q7, r8 // ......................*.....
- // vmul.u32 q7, q7, r5 // ........................*...
- // vqrdmlah.s32 q4, q7, modulus // ..........................*.
- // vstrw.u32 q4, [in, #-32] // ...........................*
- // vsub.u32 q4, q5, q6 // ...............*............
- // vqrdmulh.s32 q7, q4, r8 // ..................*.........
- // vmul.u32 q4, q4, r5 // ................*...........
- // vqrdmlah.s32 q7, q4, modulus // ....................*.......
- // vstrw.u32 q7, [in, #-16] // .........................*..
- // nop // ............*...............
- // nop // ..........*.................
-
-
-
- sub in, in, #(4*256)
-
- // TEMPORARY: Barrett reduction
- modulus_neg .req r10
- neg modulus_neg, modulus
- barrett_const .req r1
- .equ const_barrett, 63
- movw barrett_const, #:lower16:const_barrett
- movt barrett_const, #:upper16:const_barrett
- mov lr, #64
- wls lr, lr, 2f
-1:
- vldrw.u32 data0, [in]
- vqrdmulh.s32 tmp, data0, barrett_const
- vmla.s32 data0, tmp, modulus_neg
- vstrw.u32 data0, [in], #16
- le lr, 1b
-2:
- sub in, in, #(4*256)
- .unreq barrett_const
- .unreq modulus_neg
-
- /* Layers 3,4 */
-
- // 4 butterfly blocks per root config, 4 root configs
- // loop over root configs
-
- count .req r1
- mov count, #4
-
-out_start:
- ldrd root0, root0_twisted, [root_ptr], #+8
- ldrd root1, root1_twisted, [root_ptr], #+8
- ldrd root2, root2_twisted, [root_ptr], #+8
-
- mov lr, #4
- vldrw.u32 q3, [in, #192] // .*.
- nop // ..*
- vldrw.u32 q4, [in, #128] // *..
-
- // original source code
- // vldrw.u32 q4, [in, #128] // ..*
- // vldrw.u32 q3, [in, #192] // *..
- // nop // .*.
-
- sub lr, lr, #1
- wls lr, lr, layer34_loop_end
-layer34_loop:
- vsub.u32 q5, q4, q3 // .........*..................
- vqrdmulh.s32 q6, q5, root2 // ...........*................
- vadd.u32 q3, q4, q3 // ..........*.................
- vmul.u32 q7, q5, root2_twisted // ............*...............
- vldrw.u32 q0, [in] // *...........................
- vqrdmlah.s32 q6, q7, modulus // .............*..............
- vldrw.u32 q7, [in, #64] // .*..........................
- vsub.u32 q4, q0, q7 // ....*.......................
- vqrdmulh.s32 q2, q4, root1 // ......*.....................
- vadd.u32 q5, q0, q7 // .....*......................
- vmul.u32 q7, q4, root1_twisted // .......*....................
- vadd.u32 q4, q5, q3 // ...............*............
- vqrdmlah.s32 q2, q7, modulus // ........*...................
- vstrw.u32 q4, [in] , #16 // ........................*...
- vsub.u32 q1, q2, q6 // ...................*........
- vmul.u32 q4, q1, root0_twisted // ......................*.....
- vadd.u32 q2, q2, q6 // ....................*.......
- vqrdmulh.s32 q7, q1, root0 // .....................*......
- vstrw.u32 q2, [in, #48] // .........................*..
- vqrdmlah.s32 q7, q4, modulus // .......................*....
- vstrw.u32 q7, [in, #176] // ...........................*
- vsub.u32 q7, q5, q3 // ..............*.............
- vqrdmulh.s32 q6, q7, root0 // ................*...........
- vldrw.u32 q4, [in, #128] // ..e.........................
- vmul.u32 q5, q7, root0_twisted // .................*..........
- vldrw.u32 q3, [in, #192] // ...e........................
- vqrdmlah.s32 q6, q5, modulus // ..................*.........
- vstrw.u32 q6, [in, #112] // ..........................*.
-
- // original source code
- // vldrw.u32 data0, [in] // .........*.......................
- // vldrw.u32 data1, [in, #64] // ...........*.....................
- // vldrw.u32 data2, [in, #128] // e................................
- // vldrw.u32 data3, [in, #192] // ..e..............................
- // vsub.u32 tmp, data0, data1 // ............*....................
- // vadd.u32 data0, data0, data1 // ..............*..................
- // vqrdmulh.s32 data1, tmp, root1 // .............*...................
- // vmul.u32 tmp, tmp, root1_twisted // ...............*.................
- // vqrdmlah.s32 data1, tmp, modulus // .................*...............
- // vsub.u32 tmp, data2, data3 // .....*...........................
- // vadd.u32 data2, data2, data3 // .......*.........................
- // vqrdmulh.s32 data3, tmp, root2 // ......*..........................
- // vmul.u32 tmp, tmp, root2_twisted // ........*........................
- // vqrdmlah.s32 data3, tmp, modulus // ..........*......................
- // vsub.u32 tmp, data0, data2 // ..........................*......
- // vadd.u32 data0, data0, data2 // ................*................
- // vqrdmulh.s32 data2, tmp, root0 // ...........................*.....
- // vmul.u32 tmp, tmp, root0_twisted // .............................*...
- // vqrdmlah.s32 data2, tmp, modulus // ...............................*.
- // vsub.u32 tmp, data1, data3 // ...................*.............
- // vadd.u32 data1, data1, data3 // .....................*...........
- // vqrdmulh.s32 data3, tmp, root0 // ......................*..........
- // vmul.u32 tmp, tmp, root0_twisted // ....................*............
- // vqrdmlah.s32 data3, tmp, modulus // ........................*........
- // vstrw.u32 data0, [in] , #16 // ..................*..............
- // vstrw.u32 data1, [in, #48] // .......................*.........
- // vstrw.u32 data2, [in, #112] // ................................*
- // vstrw.u32 data3, [in, #176] // .........................*......
-
- le lr, layer34_loop
-layer34_loop_end:
- vsub.u32 q5, q4, q3 // *.........................
- vqrdmulh.s32 q2, q5, root2 // .*........................
- vadd.u32 q3, q4, q3 // ..*.......................
- vldrw.u32 q7, [in] // ....*.....................
- vmul.u32 q5, q5, root2_twisted // ...*......................
- vldrw.u32 q0, [in, #64] // ......*...................
- vqrdmlah.s32 q2, q5, modulus // .....*....................
- vsub.u32 q4, q7, q0 // .......*..................
- vqrdmulh.s32 q6, q4, root1 // ........*.................
- vadd.u32 q5, q7, q0 // .........*................
- vmul.u32 q0, q4, root1_twisted // ..........*...............
- vadd.u32 q4, q5, q3 // ...........*..............
- vqrdmlah.s32 q6, q0, modulus // ............*.............
- vsub.u32 q1, q5, q3 // .....................*....
- vqrdmulh.s32 q0, q1, root0 // ......................*...
- vadd.u32 q7, q6, q2 // ................*.........
- vmul.u32 q5, q1, root0_twisted // .......................*..
- vstrw.u32 q7, [in, #64] // ..................*.......
- vqrdmlah.s32 q0, q5, modulus // ........................*.
- vsub.u32 q1, q6, q2 // ..............*...........
- vqrdmulh.s32 q7, q1, root0 // .................*........
- vstrw.u32 q4, [in] , #16 // .............*............
- vmul.u32 q4, q1, root0_twisted // ...............*..........
- vstrw.u32 q0, [in, #112] // .........................*
- vqrdmlah.s32 q7, q4, modulus // ...................*......
- vstrw.u32 q7, [in, #176] // ....................*.....
-
- // original source code
- // vsub.u32 q5, q4, q3 // *.........................
- // vqrdmulh.s32 q6, q5, root2 // .*........................
- // vadd.u32 q3, q4, q3 // ..*.......................
- // vmul.u32 q7, q5, root2_twisted // ....*.....................
- // vldrw.u32 q0, [in] // ...*......................
- // vqrdmlah.s32 q6, q7, modulus // ......*...................
- // vldrw.u32 q7, [in, #64] // .....*....................
- // vsub.u32 q4, q0, q7 // .......*..................
- // vqrdmulh.s32 q2, q4, root1 // ........*.................
- // vadd.u32 q5, q0, q7 // .........*................
- // vmul.u32 q7, q4, root1_twisted // ..........*...............
- // vadd.u32 q4, q5, q3 // ...........*..............
- // vqrdmlah.s32 q2, q7, modulus // ............*.............
- // vstrw.u32 q4, [in] , #16 // .....................*....
- // vsub.u32 q1, q2, q6 // ...................*......
- // vmul.u32 q4, q1, root0_twisted // ......................*...
- // vadd.u32 q2, q2, q6 // ...............*..........
- // vqrdmulh.s32 q7, q1, root0 // ....................*.....
- // vstrw.u32 q2, [in, #48] // .................*........ - // vqrdmlah.s32 q7, q4, modulus // ........................*. - // vstrw.u32 q7, [in, #176] // .........................* - // vsub.u32 q7, q5, q3 // .............*............ - // vqrdmulh.s32 q6, q7, root0 // ..............*........... - // vmul.u32 q5, q7, root0_twisted // ................*......... - // vqrdmlah.s32 q6, q5, modulus // ..................*....... - // vstrw.u32 q6, [in, #112] // .......................*.. - - - add in, in, #(4*64 - 4*16) - - subs count, count, #1 - bne out_start - - sub in, in, #(4*256) - - // TEMPORARY: Barrett reduction - modulus_neg .req r10 - neg modulus_neg, modulus - barrett_const .req r1 - .equ const_barrett, 63 - movw barrett_const, #:lower16:const_barrett - movt barrett_const, #:upper16:const_barrett - mov lr, #64 - wls lr, lr, 2f -1: - vldrw.u32 data0, [in] - vqrdmulh.s32 tmp, data0, barrett_const - vmla.s32 data0, tmp, modulus_neg - vstrw.u32 data0, [in], #16 - le lr, 1b -2: - sub in, in, #(4*256) - .unreq barrett_const - .unreq modulus_neg - - in_low .req r0 - in_high .req r1 - add in_high, in_low, #(4*128) - - /* Layers 1,2 */ - - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #16 - vldrw.u32 q4, [in_high] // *.. - nop // ..* - vldrw.u32 q3, [in_high, #256] // .*. - - // original source code - // vldrw.u32 q4, [in_high] // *.. - // vldrw.u32 q3, [in_high, #256] // ..* - // nop // .*. - - sub lr, lr, #1 - wls lr, lr, layer12_loop_end -layer12_loop: - vsub.u32 q5, q4, q3 // .........*.................. - vqrdmulh.s32 q6, q5, root2 // ...........*................ - vadd.u32 q3, q4, q3 // ..........*................. - vmul.u32 q7, q5, root2_twisted // ............*............... - vldrw.u32 q0, [in_low] // *........................... - vqrdmlah.s32 q6, q7, modulus // .............*.............. - vldrw.u32 q7, [in_low, #256] // .*.......................... 
- vsub.u32 q4, q0, q7 // ....*....................... - vqrdmulh.s32 q2, q4, root1 // ......*..................... - vadd.u32 q5, q0, q7 // .....*...................... - vmul.u32 q7, q4, root1_twisted // .......*.................... - vadd.u32 q4, q5, q3 // ...............*............ - vqrdmlah.s32 q2, q7, modulus // ........*................... - vstrw.u32 q4, [in_low] , #16 // ........................*... - vsub.u32 q1, q2, q6 // ...................*........ - vmul.u32 q4, q1, root0_twisted // ......................*..... - vadd.u32 q2, q2, q6 // ....................*....... - vqrdmulh.s32 q7, q1, root0 // .....................*...... - vstrw.u32 q2, [in_low, #240] // .........................*.. - vqrdmlah.s32 q7, q4, modulus // .......................*.... - vstrw.u32 q7, [in_high, #256] // ...........................* - vsub.u32 q7, q5, q3 // ..............*............. - vqrdmulh.s32 q6, q7, root0 // ................*........... - vldrw.u32 q4, [in_high, #16] // ..e......................... - vmul.u32 q5, q7, root0_twisted // .................*.......... - vldrw.u32 q3, [in_high, #272] // ...e........................ - vqrdmlah.s32 q6, q5, modulus // ..................*......... - vstrw.u32 q6, [in_high] , #16 // ..........................*. - - // original source code - // vldrw.u32 data0, [in_low] // .........*....................... - // vldrw.u32 data1, [in_low, #256] // ...........*..................... - // vldrw.u32 data2, [in_high] // e................................ - // vldrw.u32 data3, [in_high, #256] // ..e.............................. - // vsub.u32 tmp, data0, data1 // ............*.................... - // vadd.u32 data0, data0, data1 // ..............*.................. - // vqrdmulh.s32 data1, tmp, root1 // .............*................... - // vmul.u32 tmp, tmp, root1_twisted // ...............*................. - // vqrdmlah.s32 data1, tmp, modulus // .................*............... 
- // vsub.u32 tmp, data2, data3 // .....*........................... - // vadd.u32 data2, data2, data3 // .......*......................... - // vqrdmulh.s32 data3, tmp, root2 // ......*.......................... - // vmul.u32 tmp, tmp, root2_twisted // ........*........................ - // vqrdmlah.s32 data3, tmp, modulus // ..........*...................... - // vsub.u32 tmp, data0, data2 // ..........................*...... - // vadd.u32 data0, data0, data2 // ................*................ - // vqrdmulh.s32 data2, tmp, root0 // ...........................*..... - // vmul.u32 tmp, tmp, root0_twisted // .............................*... - // vqrdmlah.s32 data2, tmp, modulus // ...............................*. - // vsub.u32 tmp, data1, data3 // ...................*............. - // vadd.u32 data1, data1, data3 // .....................*........... - // vqrdmulh.s32 data3, tmp, root0 // ......................*.......... - // vmul.u32 tmp, tmp, root0_twisted // ....................*............ - // vqrdmlah.s32 data3, tmp, modulus // ........................*........ - // vstrw.u32 data0, [in_low] , #16 // ..................*.............. - // vstrw.u32 data1, [in_low, #240] // .......................*......... - // vstrw.u32 data2, [in_high] , #16 // ................................* - // vstrw.u32 data3, [in_high, #240] // .........................*....... - - le lr, layer12_loop -layer12_loop_end: - vsub.u32 q7, q4, q3 // *......................... - vqrdmulh.s32 q2, q7, root2 // .*........................ - vldrw.u32 q1, [in_low, #256] // ......*................... - vmul.u32 q0, q7, root2_twisted // ...*...................... - vadd.u32 q3, q4, q3 // ..*....................... - vqrdmlah.s32 q2, q0, modulus // .....*.................... - vldrw.u32 q7, [in_low] // ....*..................... - vsub.u32 q4, q7, q1 // .......*.................. - vmul.u32 q5, q4, root1_twisted // ..........*............... - vadd.u32 q1, q7, q1 // .........*................ 
- vqrdmulh.s32 q6, q4, root1 // ........*................. - vadd.u32 q4, q1, q3 // ...........*.............. - vqrdmlah.s32 q6, q5, modulus // ............*............. - vsub.u32 q1, q1, q3 // .....................*.... - vqrdmulh.s32 q0, q1, root0 // ......................*... - vsub.u32 q5, q6, q2 // ..............*........... - vqrdmulh.s32 q7, q5, root0 // .................*........ - vstrw.u32 q4, [in_low] , #16 // .............*............ - vmul.u32 q4, q5, root0_twisted // ...............*.......... - vadd.u32 q6, q6, q2 // ................*......... - vqrdmlah.s32 q7, q4, modulus // ...................*...... - vstrw.u32 q6, [in_low, #240] // ..................*....... - vmul.u32 q6, q1, root0_twisted // .......................*.. - vstrw.u32 q7, [in_high, #256] // ....................*..... - vqrdmlah.s32 q0, q6, modulus // ........................*. - vstrw.u32 q0, [in_high] , #16 // .........................* - - // original source code - // vsub.u32 q5, q4, q3 // *......................... - // vqrdmulh.s32 q6, q5, root2 // .*........................ - // vadd.u32 q3, q4, q3 // ....*..................... - // vmul.u32 q7, q5, root2_twisted // ...*...................... - // vldrw.u32 q0, [in_low] // ......*................... - // vqrdmlah.s32 q6, q7, modulus // .....*.................... - // vldrw.u32 q7, [in_low, #256] // ..*....................... - // vsub.u32 q4, q0, q7 // .......*.................. - // vqrdmulh.s32 q2, q4, root1 // ..........*............... - // vadd.u32 q5, q0, q7 // .........*................ - // vmul.u32 q7, q4, root1_twisted // ........*................. - // vadd.u32 q4, q5, q3 // ...........*.............. - // vqrdmlah.s32 q2, q7, modulus // ............*............. - // vstrw.u32 q4, [in_low] , #16 // .................*........ - // vsub.u32 q1, q2, q6 // ...............*.......... - // vmul.u32 q4, q1, root0_twisted // ..................*....... - // vadd.u32 q2, q2, q6 // ...................*...... 
- // vqrdmulh.s32 q7, q1, root0 // ................*......... - // vstrw.u32 q2, [in_low, #240] // .....................*.... - // vqrdmlah.s32 q7, q4, modulus // ....................*..... - // vstrw.u32 q7, [in_high, #256] // .......................*.. - // vsub.u32 q7, q5, q3 // .............*............ - // vqrdmulh.s32 q6, q7, root0 // ..............*........... - // vmul.u32 q5, q7, root0_twisted // ......................*... - // vqrdmlah.s32 q6, q5, modulus // ........................*. - // vstrw.u32 q6, [in_high] , #16 // .........................* - - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_incomplete.s b/tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_incomplete.s deleted file mode 100644 index e3c6679..0000000 --- a/tests/ntt_n256/manual/invntt_n256_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,685 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.data -roots_inv: -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 
* 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 
2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 
2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 
2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 
2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31 -.word 38018305 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.text - -// Montgomery multiplication via rounding -.macro mulmod dst, src, const, const_twisted - vqrdmulh.s32 \dst, \src, \const - vmul.u32 \src, \src, \const_twisted - vqrdmlah.s32 \dst, \src, modulus -.endm - -.macro gs_butterfly a, b, root, root_twisted - vsub.u32 tmp, \a, \b - vadd.u32 \a, \a, \b - mulmod \b, tmp, \root, \root_twisted -.endm - -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type invntt_n256_u32_33556993_28678040_incomplete_manual, %function -.global invntt_n256_u32_33556993_28678040_incomplete_manual -invntt_n256_u32_33556993_28678040_incomplete_manual: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, 33556993 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr root_ptr, roots_addr - - in .req r0 - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 5,6 */ - - mov lr, #16 - ldrd r10, r9, [root_ptr, #16] // ..*. - vldrw.u32 q3, [in, #48] // *... 
- nop // ...* - vldrw.u32 q7, [in, #32] // .*.. - - // original source code - // vldrw.u32 q3, [in, #48] // .*.. - // vldrw.u32 q7, [in, #32] // ...* - // ldrd r10, r9, [root_ptr, #16] // *... - // nop // ..*. - - sub lr, lr, #1 - wls lr, lr, layer56_loop_end -layer56_loop: - vsub.u32 q2, q7, q3 // ............*.................. - vmul.u32 q1, q2, r9 // ...............*............... - vldrw.u32 q4, [in, #16] // ....*.......................... - vqrdmulh.s32 q6, q2, r10 // ..............*................ - ldrd r9, r4, [root_ptr, #8] // .*............................. - vldrw.u32 q0, [in] // ...*........................... - vsub.u32 q5, q0, q4 // .......*....................... - vqrdmulh.s32 q2, q5, r9 // .........*..................... - vadd.u32 q0, q0, q4 // ........*...................... - vmul.u32 q5, q5, r4 // ..........*.................... - vadd.u32 q7, q7, q3 // .............*................. - vqrdmlah.s32 q2, q5, modulus // ...........*................... - vadd.u32 q5, q0, q7 // ..................*............ - vldrw.u32 q3, [in, #112] // ......e........................ - vqrdmlah.s32 q6, q1, modulus // ................*.............. - ldrd r8, r5, [root_ptr] , #24 // *.............................. - vsub.u32 q1, q2, q6 // ......................*........ - vqrdmulh.s32 q4, q1, r8 // ........................*...... - vadd.u32 q6, q2, q6 // .......................*....... - vstrw.u32 q6, [in, #16] // ............................*.. - vmul.u32 q6, q1, r5 // .........................*..... - vsub.u32 q1, q0, q7 // .................*............. - vmul.u32 q0, q1, r5 // ....................*.......... - vldrw.u32 q7, [in, #96] // .....e......................... - vqrdmulh.s32 q2, q1, r8 // ...................*........... - ldrd r10, r9, [root_ptr, #16] // ..e............................ - vstrw.u32 q5, [in] , #64 // ...........................*... - vqrdmlah.s32 q4, q6, modulus // ..........................*.... 
- vstrw.u32 q4, [in, #-16] // ..............................* - vqrdmlah.s32 q2, q0, modulus // .....................*......... - vstrw.u32 q2, [in, #-32] // .............................*. - - // original source code - // ldrd root0, root0_twisted, [root_ptr] , #24 // .................................*............... - // ldrd root1, root1_twisted, [root_ptr, #-16] // ......................*.......................... - // ldrd root2, root2_twisted, [root_ptr, #-8] // ............e.................................... - // vldrw.u32 data0, [in] // .......................*......................... - // vldrw.u32 data1, [in, #16] // ....................*............................ - // vldrw.u32 data2, [in, #32] // ..........e...................................... - // vldrw.u32 data3, [in, #48] // e................................................ - // vsub.u32 tmp, data0, data1 // ........................*........................ - // vadd.u32 data0, data0, data1 // ..........................*...................... - // vqrdmulh.s32 data1, tmp, root1 // .........................*....................... - // vmul.u32 tmp, tmp, root1_twisted // ...........................*..................... - // vqrdmlah.s32 data1, tmp, modulus // .............................*................... - // vsub.u32 tmp, data2, data3 // ..................*.............................. - // vadd.u32 data2, data2, data3 // ............................*.................... - // vqrdmulh.s32 data3, tmp, root2 // .....................*........................... - // vmul.u32 tmp, tmp, root2_twisted // ...................*............................. - // vqrdmlah.s32 data3, tmp, modulus // ................................*................ - // vsub.u32 tmp, data0, data2 // .......................................*......... - // vadd.u32 data0, data0, data2 // ..............................*.................. 
- // vqrdmulh.s32 data2, tmp, root0 // ..........................................*...... - // vmul.u32 tmp, tmp, root0_twisted // ........................................*........ - // vqrdmlah.s32 data2, tmp, modulus // ...............................................*. - // vsub.u32 tmp, data1, data3 // ..................................*.............. - // vadd.u32 data1, data1, data3 // ....................................*............ - // vqrdmulh.s32 data3, tmp, root0 // ...................................*............. - // vmul.u32 tmp, tmp, root0_twisted // ......................................*.......... - // vqrdmlah.s32 data3, tmp, modulus // .............................................*... - // vstrw.u32 data0, [in] , #64 // ............................................*.... - // vstrw.u32 data1, [in, #-48] // .....................................*........... - // vstrw.u32 data2, [in, #-32] // ................................................* - // vstrw.u32 data3, [in, #-16] // ..............................................*.. - - le lr, layer56_loop -layer56_loop_end: - vldrw.u32 q1, [in] // .....*...................... - vsub.u32 q6, q7, q3 // *........................... - vmul.u32 q4, q6, r9 // .*.......................... - vldrw.u32 q5, [in, #16] // ..*......................... - vqrdmulh.s32 q2, q6, r10 // ...*........................ - vadd.u32 q0, q1, q5 // ........*................... - ldrd r9, r8, [root_ptr, #8] // ....*....................... - vsub.u32 q6, q1, q5 // ......*..................... - vmul.u32 q5, q6, r8 // .........*.................. - vadd.u32 q7, q7, q3 // ..........*................. - vqrdmulh.s32 q6, q6, r9 // .......*.................... - vadd.u32 q1, q0, q7 // ............*............... - vqrdmlah.s32 q6, q5, modulus // ...........*................ - vsub.u32 q5, q0, q7 // ....................*....... - vqrdmlah.s32 q2, q4, modulus // .............*.............. 
- ldrd r5, r4, [root_ptr] , #24 // ..............*............. - vqrdmulh.s32 q4, q5, r5 // ......................*..... - vsub.u32 q0, q6, q2 // ...............*............ - vmul.u32 q5, q5, r4 // .....................*...... - vadd.u32 q2, q6, q2 // .................*.......... - vqrdmlah.s32 q4, q5, modulus // ..........................*. - vstrw.u32 q1, [in] , #64 // .......................*.... - vqrdmulh.s32 q6, q0, r5 // ................*........... - vstrw.u32 q2, [in, #-48] // ..................*......... - vmul.u32 q0, q0, r4 // ...................*........ - vstrw.u32 q4, [in, #-32] // ...........................* - vqrdmlah.s32 q6, q0, modulus // ........................*... - vstrw.u32 q6, [in, #-16] // .........................*.. - - // original source code - // vsub.u32 q2, q7, q3 // .*.......................... - // vmul.u32 q1, q2, r9 // ..*......................... - // vldrw.u32 q4, [in, #16] // ...*........................ - // vqrdmulh.s32 q6, q2, r10 // ....*....................... - // ldrd r9, r4, [root_ptr, #8] // ......*..................... - // vldrw.u32 q0, [in] // *........................... - // vsub.u32 q5, q0, q4 // .......*.................... - // vqrdmulh.s32 q2, q5, r9 // ..........*................. - // vadd.u32 q0, q0, q4 // .....*...................... - // vmul.u32 q5, q5, r4 // ........*................... - // vadd.u32 q7, q7, q3 // .........*.................. - // vqrdmlah.s32 q2, q5, modulus // ............*............... - // vadd.u32 q5, q0, q7 // ...........*................ - // vqrdmlah.s32 q6, q1, modulus // ..............*............. - // ldrd r8, r5, [root_ptr] , #24 // ...............*............ - // vsub.u32 q1, q2, q6 // .................*.......... - // vqrdmulh.s32 q4, q1, r8 // ......................*..... - // vadd.u32 q6, q2, q6 // ...................*........ - // vstrw.u32 q6, [in, #16] // .......................*.... - // vmul.u32 q6, q1, r5 // ........................*... 
- // vsub.u32 q1, q0, q7 // .............*.............. - // vmul.u32 q0, q1, r5 // ..................*......... - // vqrdmulh.s32 q2, q1, r8 // ................*........... - // vstrw.u32 q5, [in] , #64 // .....................*...... - // vqrdmlah.s32 q4, q6, modulus // ..........................*. - // vstrw.u32 q4, [in, #-16] // ...........................* - // vqrdmlah.s32 q2, q0, modulus // ....................*....... - // vstrw.u32 q2, [in, #-32] // .........................*.. - - - - sub in, in, #(4*256) - - // TEMPORARY: Barrett reduction - modulus_neg .req r10 - neg modulus_neg, modulus - barrett_const .req r1 - .equ const_barrett, 63 - movw barrett_const, #:lower16:const_barrett - movt barrett_const, #:upper16:const_barrett - mov lr, #64 - wls lr, lr, 2f -1: - vldrw.u32 data0, [in] - vqrdmulh.s32 tmp, data0, barrett_const - vmla.s32 data0, tmp, modulus_neg - vstrw.u32 data0, [in], #16 - le lr, 1b -2: - sub in, in, #(4*256) - .unreq barrett_const - .unreq modulus_neg - - /* Layers 3,4 */ - - // 4 butterfly blocks per root config, 4 root configs - // loop over root configs - - count .req r1 - mov count, #4 - -out_start: - ldrd root0, root0_twisted, [root_ptr], #+24 - ldrd root1, root1_twisted, [root_ptr, #-16] - ldrd root2, root2_twisted, [root_ptr, #-8] - - mov lr, #4 - vldrw.u32 q3, [in, #192] // .*. - nop // ..* - vldrw.u32 q4, [in, #128] // *.. - - // original source code - // vldrw.u32 q4, [in, #128] // ..* - // vldrw.u32 q3, [in, #192] // *.. - // nop // .*. - - sub lr, lr, #1 - wls lr, lr, layer34_loop_end -layer34_loop: - vsub.u32 q5, q4, q3 // .........*.................. - vqrdmulh.s32 q6, q5, root2 // ...........*................ - vadd.u32 q3, q4, q3 // ..........*................. - vmul.u32 q7, q5, root2_twisted // ............*............... - vldrw.u32 q0, [in] // *........................... - vqrdmlah.s32 q6, q7, modulus // .............*.............. - vldrw.u32 q7, [in, #64] // .*.......................... 
- vsub.u32 q4, q0, q7 // ....*....................... - vqrdmulh.s32 q2, q4, root1 // ......*..................... - vadd.u32 q5, q0, q7 // .....*...................... - vmul.u32 q7, q4, root1_twisted // .......*.................... - vadd.u32 q4, q5, q3 // ...............*............ - vqrdmlah.s32 q2, q7, modulus // ........*................... - vstrw.u32 q4, [in] , #16 // ........................*... - vsub.u32 q1, q2, q6 // ...................*........ - vmul.u32 q4, q1, root0_twisted // ......................*..... - vadd.u32 q2, q2, q6 // ....................*....... - vqrdmulh.s32 q7, q1, root0 // .....................*...... - vstrw.u32 q2, [in, #48] // .........................*.. - vqrdmlah.s32 q7, q4, modulus // .......................*.... - vstrw.u32 q7, [in, #176] // ...........................* - vsub.u32 q7, q5, q3 // ..............*............. - vqrdmulh.s32 q6, q7, root0 // ................*........... - vldrw.u32 q4, [in, #128] // ..e......................... - vmul.u32 q5, q7, root0_twisted // .................*.......... - vldrw.u32 q3, [in, #192] // ...e........................ - vqrdmlah.s32 q6, q5, modulus // ..................*......... - vstrw.u32 q6, [in, #112] // ..........................*. - - // original source code - // vldrw.u32 data0, [in] // .........*....................... - // vldrw.u32 data1, [in, #64] // ...........*..................... - // vldrw.u32 data2, [in, #128] // e................................ - // vldrw.u32 data3, [in, #192] // ..e.............................. - // vsub.u32 tmp, data0, data1 // ............*.................... - // vadd.u32 data0, data0, data1 // ..............*.................. - // vqrdmulh.s32 data1, tmp, root1 // .............*................... - // vmul.u32 tmp, tmp, root1_twisted // ...............*................. - // vqrdmlah.s32 data1, tmp, modulus // .................*............... - // vsub.u32 tmp, data2, data3 // .....*........................... 
- // vadd.u32 data2, data2, data3 // .......*......................... - // vqrdmulh.s32 data3, tmp, root2 // ......*.......................... - // vmul.u32 tmp, tmp, root2_twisted // ........*........................ - // vqrdmlah.s32 data3, tmp, modulus // ..........*...................... - // vsub.u32 tmp, data0, data2 // ..........................*...... - // vadd.u32 data0, data0, data2 // ................*................ - // vqrdmulh.s32 data2, tmp, root0 // ...........................*..... - // vmul.u32 tmp, tmp, root0_twisted // .............................*... - // vqrdmlah.s32 data2, tmp, modulus // ...............................*. - // vsub.u32 tmp, data1, data3 // ...................*............. - // vadd.u32 data1, data1, data3 // .....................*........... - // vqrdmulh.s32 data3, tmp, root0 // ......................*.......... - // vmul.u32 tmp, tmp, root0_twisted // ....................*............ - // vqrdmlah.s32 data3, tmp, modulus // ........................*........ - // vstrw.u32 data0, [in] , #16 // ..................*.............. - // vstrw.u32 data1, [in, #48] // .......................*......... - // vstrw.u32 data2, [in, #112] // ................................* - // vstrw.u32 data3, [in, #176] // .........................*....... - - le lr, layer34_loop -layer34_loop_end: - vsub.u32 q5, q4, q3 // *......................... - vqrdmulh.s32 q2, q5, root2 // .*........................ - vadd.u32 q3, q4, q3 // ..*....................... - vldrw.u32 q7, [in] // ....*..................... - vmul.u32 q5, q5, root2_twisted // ...*...................... - vldrw.u32 q0, [in, #64] // ......*................... - vqrdmlah.s32 q2, q5, modulus // .....*.................... - vsub.u32 q4, q7, q0 // .......*.................. - vqrdmulh.s32 q6, q4, root1 // ........*................. - vadd.u32 q5, q7, q0 // .........*................ - vmul.u32 q0, q4, root1_twisted // ..........*............... 
- vadd.u32 q4, q5, q3 // ...........*.............. - vqrdmlah.s32 q6, q0, modulus // ............*............. - vsub.u32 q1, q5, q3 // .....................*.... - vqrdmulh.s32 q0, q1, root0 // ......................*... - vadd.u32 q7, q6, q2 // ................*......... - vmul.u32 q5, q1, root0_twisted // .......................*.. - vstrw.u32 q7, [in, #64] // ..................*....... - vqrdmlah.s32 q0, q5, modulus // ........................*. - vsub.u32 q1, q6, q2 // ..............*........... - vqrdmulh.s32 q7, q1, root0 // .................*........ - vstrw.u32 q4, [in] , #16 // .............*............ - vmul.u32 q4, q1, root0_twisted // ...............*.......... - vstrw.u32 q0, [in, #112] // .........................* - vqrdmlah.s32 q7, q4, modulus // ...................*...... - vstrw.u32 q7, [in, #176] // ....................*..... - - // original source code - // vsub.u32 q5, q4, q3 // *......................... - // vqrdmulh.s32 q6, q5, root2 // .*........................ - // vadd.u32 q3, q4, q3 // ..*....................... - // vmul.u32 q7, q5, root2_twisted // ....*..................... - // vldrw.u32 q0, [in] // ...*...................... - // vqrdmlah.s32 q6, q7, modulus // ......*................... - // vldrw.u32 q7, [in, #64] // .....*.................... - // vsub.u32 q4, q0, q7 // .......*.................. - // vqrdmulh.s32 q2, q4, root1 // ........*................. - // vadd.u32 q5, q0, q7 // .........*................ - // vmul.u32 q7, q4, root1_twisted // ..........*............... - // vadd.u32 q4, q5, q3 // ...........*.............. - // vqrdmlah.s32 q2, q7, modulus // ............*............. - // vstrw.u32 q4, [in] , #16 // .....................*.... - // vsub.u32 q1, q2, q6 // ...................*...... - // vmul.u32 q4, q1, root0_twisted // ......................*... - // vadd.u32 q2, q2, q6 // ...............*.......... - // vqrdmulh.s32 q7, q1, root0 // ....................*..... 
- // vstrw.u32 q2, [in, #48] // .................*........ - // vqrdmlah.s32 q7, q4, modulus // ........................*. - // vstrw.u32 q7, [in, #176] // .........................* - // vsub.u32 q7, q5, q3 // .............*............ - // vqrdmulh.s32 q6, q7, root0 // ..............*........... - // vmul.u32 q5, q7, root0_twisted // ................*......... - // vqrdmlah.s32 q6, q5, modulus // ..................*....... - // vstrw.u32 q6, [in, #112] // .......................*.. - - - add in, in, #(4*64 - 4*16) - - subs count, count, #1 - bne out_start - - sub in, in, #(4*256) - - // TEMPORARY: Barrett reduction - modulus_neg .req r10 - neg modulus_neg, modulus - barrett_const .req r1 - .equ const_barrett, 63 - movw barrett_const, #:lower16:const_barrett - movt barrett_const, #:upper16:const_barrett - mov lr, #64 - wls lr, lr, 2f -1: - vldrw.u32 data0, [in] - vqrdmulh.s32 tmp, data0, barrett_const - vmla.s32 data0, tmp, modulus_neg - vstrw.u32 data0, [in], #16 - le lr, 1b -2: - sub in, in, #(4*256) - .unreq barrett_const - .unreq modulus_neg - - in_low .req r0 - in_high .req r1 - add in_high, in_low, #(4*128) - - /* Layers 1,2 */ - - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #16 - vldrw.u32 q3, [in_high, #256] // .*. - nop // ..* - vldrw.u32 q4, [in_high] // *.. - - // original source code - // vldrw.u32 q4, [in_high] // ..* - // vldrw.u32 q3, [in_high, #256] // *.. - // nop // .*. - - sub lr, lr, #1 - wls lr, lr, layer12_loop_end -layer12_loop: - vsub.u32 q5, q4, q3 // .........*.................. - vqrdmulh.s32 q6, q5, root2 // ...........*................ - vadd.u32 q3, q4, q3 // ..........*................. - vmul.u32 q7, q5, root2_twisted // ............*............... - vldrw.u32 q0, [in_low] // *........................... - vqrdmlah.s32 q6, q7, modulus // .............*.............. - vldrw.u32 q7, [in_low, #256] // .*.......................... 
- vsub.u32 q4, q0, q7 // ....*....................... - vqrdmulh.s32 q2, q4, root1 // ......*..................... - vadd.u32 q5, q0, q7 // .....*...................... - vmul.u32 q7, q4, root1_twisted // .......*.................... - vadd.u32 q4, q5, q3 // ...............*............ - vqrdmlah.s32 q2, q7, modulus // ........*................... - vstrw.u32 q4, [in_low] , #16 // ........................*... - vsub.u32 q1, q2, q6 // ...................*........ - vmul.u32 q4, q1, root0_twisted // ......................*..... - vadd.u32 q2, q2, q6 // ....................*....... - vqrdmulh.s32 q7, q1, root0 // .....................*...... - vstrw.u32 q2, [in_low, #240] // .........................*.. - vqrdmlah.s32 q7, q4, modulus // .......................*.... - vstrw.u32 q7, [in_high, #256] // ...........................* - vsub.u32 q7, q5, q3 // ..............*............. - vqrdmulh.s32 q6, q7, root0 // ................*........... - vldrw.u32 q4, [in_high, #16] // ..e......................... - vmul.u32 q5, q7, root0_twisted // .................*.......... - vldrw.u32 q3, [in_high, #272] // ...e........................ - vqrdmlah.s32 q6, q5, modulus // ..................*......... - vstrw.u32 q6, [in_high] , #16 // ..........................*. - - // original source code - // vldrw.u32 data0, [in_low] // .........*....................... - // vldrw.u32 data1, [in_low, #256] // ...........*..................... - // vldrw.u32 data2, [in_high] // e................................ - // vldrw.u32 data3, [in_high, #256] // ..e.............................. - // vsub.u32 tmp, data0, data1 // ............*.................... - // vadd.u32 data0, data0, data1 // ..............*.................. - // vqrdmulh.s32 data1, tmp, root1 // .............*................... - // vmul.u32 tmp, tmp, root1_twisted // ...............*................. - // vqrdmlah.s32 data1, tmp, modulus // .................*............... 
- // vsub.u32 tmp, data2, data3 // .....*........................... - // vadd.u32 data2, data2, data3 // .......*......................... - // vqrdmulh.s32 data3, tmp, root2 // ......*.......................... - // vmul.u32 tmp, tmp, root2_twisted // ........*........................ - // vqrdmlah.s32 data3, tmp, modulus // ..........*...................... - // vsub.u32 tmp, data0, data2 // ..........................*...... - // vadd.u32 data0, data0, data2 // ................*................ - // vqrdmulh.s32 data2, tmp, root0 // ...........................*..... - // vmul.u32 tmp, tmp, root0_twisted // .............................*... - // vqrdmlah.s32 data2, tmp, modulus // ...............................*. - // vsub.u32 tmp, data1, data3 // ...................*............. - // vadd.u32 data1, data1, data3 // .....................*........... - // vqrdmulh.s32 data3, tmp, root0 // ......................*.......... - // vmul.u32 tmp, tmp, root0_twisted // ....................*............ - // vqrdmlah.s32 data3, tmp, modulus // ........................*........ - // vstrw.u32 data0, [in_low] , #16 // ..................*.............. - // vstrw.u32 data1, [in_low, #240] // .......................*......... - // vstrw.u32 data2, [in_high] , #16 // ................................* - // vstrw.u32 data3, [in_high, #240] // .........................*....... - - le lr, layer12_loop -layer12_loop_end: - vldrw.u32 q7, [in_low] // ....*..................... - vsub.u32 q0, q4, q3 // *......................... - vmul.u32 q6, q0, root2_twisted // ...*...................... - vldrw.u32 q1, [in_low, #256] // ......*................... - vqrdmulh.s32 q2, q0, root2 // .*........................ - vadd.u32 q3, q4, q3 // ..*....................... - vqrdmlah.s32 q2, q6, modulus // .....*.................... - vsub.u32 q4, q7, q1 // .......*.................. - vmul.u32 q5, q4, root1_twisted // ..........*............... - vadd.u32 q0, q7, q1 // .........*................ 
- vqrdmulh.s32 q6, q4, root1 // ........*................. - vsub.u32 q7, q0, q3 // .....................*.... - vmul.u32 q1, q7, root0_twisted // .......................*.. - vadd.u32 q4, q0, q3 // ...........*.............. - vqrdmlah.s32 q6, q5, modulus // ............*............. - vstrw.u32 q4, [in_low] , #16 // .............*............ - vqrdmulh.s32 q0, q7, root0 // ......................*... - vsub.u32 q5, q6, q2 // ..............*........... - vmul.u32 q4, q5, root0_twisted // ...............*.......... - vadd.u32 q7, q6, q2 // ................*......... - vqrdmlah.s32 q0, q1, modulus // ........................*. - vstrw.u32 q0, [in_high] , #16 // .........................* - vqrdmulh.s32 q6, q5, root0 // .................*........ - vstrw.u32 q7, [in_low, #240] // ..................*....... - vqrdmlah.s32 q6, q4, modulus // ...................*...... - vstrw.u32 q6, [in_high, #240] // ....................*..... - - // original source code - // vsub.u32 q5, q4, q3 // .*........................ - // vqrdmulh.s32 q6, q5, root2 // ....*..................... - // vadd.u32 q3, q4, q3 // .....*.................... - // vmul.u32 q7, q5, root2_twisted // ..*....................... - // vldrw.u32 q0, [in_low] // *......................... - // vqrdmlah.s32 q6, q7, modulus // ......*................... - // vldrw.u32 q7, [in_low, #256] // ...*...................... - // vsub.u32 q4, q0, q7 // .......*.................. - // vqrdmulh.s32 q2, q4, root1 // ..........*............... - // vadd.u32 q5, q0, q7 // .........*................ - // vmul.u32 q7, q4, root1_twisted // ........*................. - // vadd.u32 q4, q5, q3 // .............*............ - // vqrdmlah.s32 q2, q7, modulus // ..............*........... - // vstrw.u32 q4, [in_low] , #16 // ...............*.......... - // vsub.u32 q1, q2, q6 // .................*........ - // vmul.u32 q4, q1, root0_twisted // ..................*....... - // vadd.u32 q2, q2, q6 // ...................*...... 
- // vqrdmulh.s32 q7, q1, root0 // ......................*... - // vstrw.u32 q2, [in_low, #240] // .......................*.. - // vqrdmlah.s32 q7, q4, modulus // ........................*. - // vstrw.u32 q7, [in_high, #256] // .........................* - // vsub.u32 q7, q5, q3 // ...........*.............. - // vqrdmulh.s32 q6, q7, root0 // ................*......... - // vmul.u32 q5, q7, root0_twisted // ............*............. - // vqrdmlah.s32 q6, q5, modulus // ....................*..... - // vstrw.u32 q6, [in_high] , #16 // .....................*.... - - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_n256/manual/ntt_n256_l6_s32_twiddles.s b/tests/ntt_n256/manual/ntt_n256_l6_s32_twiddles.s deleted file mode 100644 index 8f54664..0000000 --- a/tests/ntt_n256/manual/ntt_n256_l6_s32_twiddles.s +++ /dev/null @@ -1,150 +0,0 @@ - -/// -/// Copyright (c) 2022 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - -.word -15854702 -.word -1014623488 -.word 3260327 -.word 208645003 -.word 14579576 -.word 933021652 -.word 6733847 -.word 430933318 -.word -13128918 -.word -840186626 -.word 14626653 -.word 936034350 -.word 12909577 -.word 826149873 -.word -3819232 -.word -244412194 -.word -3271804 -.word -209379475 -.word 14745691 -.word 943652201 -.word -12267508 -.word -785060593 -.word 9914896 -.word 634504916 -.word 13512548 -.word 864737072 -.word -10953311 -.word -700958404 -.word 16204162 -.word 1036987221 -.word -9731484 -.word -622767444 -.word 9010590 -.word 576633749 -.word -12857867 -.word -822840686 -.word -6528331 -.word -417781297 -.word 341080 -.word 21827454 -.word -12336210 -.word -789457186 -.word 14833295 -.word 949258429 -.word -8225248 -.word -526375697 -.word 5289426 -.word 338497429 -.word 2138810 -.word 136873393 -.word 5705868 -.word 365147683 -.word -15870328 -.word -1015623476 -.word 6490403 -.word 415354091 -.word 9106105 -.word 582746243 -.word -14739293 -.word -943242760 -.word -13908588 -.word -890081698 -.word 1579445 -.word 101076765 -.word 7769916 -.word 497236673 -.word -2302061 -.word -147320660 -.word -11713874 -.word -749630721 -.word 11828796 -.word 756985168 -.word -7194579 -.word -460417915 -.word -13728463 -.word -878554577 -.word -355881 -.word -22774646 -.word 572895 -.word 36662482 -.word -9843973 -.word -629966191 -.word -14019017 -.word -897148614 -.word -6865022 -.word -439327877 -.word 8285889 -.word 530256425 -.word -8866965 -.word -567442451 -.word 9249292 -.word 591909511 -.word 4778209 -.word 305782038 -.word 13113327 -.word 839188878 -.word -4264131 -.word -272883557 -.word -8172970 -.word -523030160 -.word 10905370 -.word 697890414 -.word 
8247799 -.word 527818851 -.word 16167867 -.word 1034664519 -.word -11510556 -.word -736619362 -.word 5086187 -.word 325491125 -.word 656361 -.word 42003898 -.word -15403199 -.word -985729501 -.word -5443354 -.word -348348069 -.word 3732072 -.word 238834379 -.word -11430609 -.word -731503145 -.word 8471290 -.word 542121183 -.word 9445744 -.word 604481480 -.word 794839 -.word 50865814 \ No newline at end of file diff --git a/tests/ntt_n256/manual/ntt_n256_l8_s32_twiddles.s b/tests/ntt_n256/manual/ntt_n256_l8_s32_twiddles.s deleted file mode 100644 index 8fd6e71..0000000 --- a/tests/ntt_n256/manual/ntt_n256_l8_s32_twiddles.s +++ /dev/null @@ -1,534 +0,0 @@ - -/// -/// Copyright (c) 2022 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - -.word -15854702 -.word -1014623488 -.word 3260327 -.word 208645003 -.word 14579576 -.word 933021652 -.word 6733847 -.word 430933318 -.word -13128918 -.word -840186626 -.word 14626653 -.word 936034350 -.word 12909577 -.word 826149873 -.word -3819232 -.word -244412194 -.word -3271804 -.word -209379475 -.word 14745691 -.word 943652201 -.word -12267508 -.word -785060593 -.word 9914896 -.word 634504916 -.word 13512548 -.word 864737072 -.word -10953311 -.word -700958404 -.word 16204162 -.word 1036987221 -.word -9731484 -.word -622767444 -.word 9010590 -.word 576633749 -.word -12857867 -.word -822840686 -.word -6528331 -.word -417781297 -.word 341080 -.word 21827454 -.word -12336210 -.word -789457186 -.word 14833295 -.word 949258429 -.word -8225248 -.word -526375697 -.word 5289426 -.word 338497429 -.word 2138810 -.word 136873393 -.word 5705868 -.word 365147683 -.word -15870328 -.word -1015623476 -.word 6490403 -.word 415354091 -.word 9106105 -.word 582746243 -.word -14739293 -.word -943242760 -.word -13908588 -.word -890081698 -.word 1579445 -.word 101076765 -.word 7769916 -.word 497236673 -.word -2302061 -.word -147320660 -.word -11713874 -.word -749630721 -.word 11828796 -.word 756985168 -.word -7194579 -.word -460417915 -.word -13728463 -.word -878554577 -.word -355881 -.word -22774646 -.word 572895 -.word 36662482 -.word -9843973 -.word -629966191 -.word -14019017 -.word -897148614 -.word -6865022 -.word -439327877 -.word 8285889 -.word 530256425 -.word -8866965 -.word -567442451 -.word 9249292 -.word 591909511 -.word 4778209 -.word 305782038 -.word 13113327 -.word 839188878 -.word -4264131 -.word -272883557 -.word -8172970 -.word -523030160 -.word 10905370 -.word 697890414 -.word 8247799 -.word 527818851 -.word 16167867 -.word 1034664519 -.word -11510556 -.word -736619362 -.word 5086187 -.word 325491125 -.word 656361 -.word 42003898 -.word -15403199 -.word -985729501 -.word -5443354 -.word -348348069 -.word 3732072 -.word 238834379 -.word -11430609 -.word 
-731503145 -.word 8471290 -.word 542121183 -.word 9445744 -.word 604481480 -.word 794839 -.word 50865814 -.word -7520229 -.word 7065381 -.word 11280567 -.word -13861207 -.word -481257925 -.word 452149874 -.word 721901190 -.word -887049545 -.word -4878953 -.word 5637166 -.word -14797569 -.word 8648030 -.word -312229162 -.word 360751090 -.word -946972140 -.word 553431680 -.word 7232147 -.word 7430689 -.word 14819378 -.word -11444654 -.word 462822084 -.word 475527802 -.word 948367809 -.word -732401956 -.word 14834498 -.word -10695672 -.word -10523131 -.word -1345927 -.word 949335415 -.word -684470767 -.word -673428985 -.word -86132754 -.word 7103825 -.word -9218874 -.word 6674394 -.word 3716128 -.word 454610102 -.word -589962908 -.word 427128616 -.word 237814041 -.word -14979600 -.word -16514902 -.word 6574213 -.word -8890190 -.word -958621234 -.word -1056873063 -.word 420717521 -.word -568928737 -.word 11253846 -.word 16151303 -.word 1821442 -.word -10198330 -.word 720191176 -.word 1033604503 -.word 116563391 -.word -652643308 -.word -769518 -.word 8269259 -.word -12730672 -.word -12362939 -.word -49245393 -.word 529192186 -.word -814700827 -.word -791167711 -.word -5156339 -.word -2466706 -.word -6780152 -.word -11275919 -.word -329980511 -.word -157857136 -.word -433896611 -.word -721603740 -.word -13052352 -.word 7735096 -.word -4093077 -.word -10384926 -.word -835286776 -.word 495008363 -.word -261936936 -.word -664584540 -.word 1953000 -.word 12766243 -.word 16292342 -.word -8413656 -.word 124982461 -.word 816977197 -.word 1042630311 -.word -538432889 -.word 12486848 -.word -2000332 -.word -5226683 -.word 15137961 -.word 799097282 -.word -128011478 -.word -334482183 -.word 968755565 -.word -14893165 -.word -7791061 -.word 11779122 -.word -4444688 -.word -953089817 -.word -498589850 -.word 753806275 -.word -284438323 -.word -393809 -.word 11550623 -.word -8181398 -.word -15302355 -.word -25201853 -.word 739183455 -.word -523569511 -.word -979275978 -.word 9551359 
-.word -299677 -.word 10387700 -.word 4263629 -.word 611240324 -.word -19177864 -.word 664762063 -.word 272851431 -.word 596073 -.word -4517635 -.word 6760262 -.word 2228887 -.word 38145761 -.word -289106574 -.word 432623749 -.word 142637881 -.word -7627813 -.word -10048565 -.word -10996266 -.word -4099600 -.word -488142775 -.word -643059079 -.word -703707314 -.word -262354376 -.word -16185834 -.word 11558208 -.word 15755637 -.word -12816206 -.word -1035814319 -.word 739668858 -.word 1008283812 -.word -820174585 -.word 13624329 -.word 9838349 -.word 6934560 -.word 11310234 -.word 871890510 -.word 629606282 -.word 443777969 -.word 723799733 -.word 3153984 -.word 15599806 -.word -10072203 -.word -3382539 -.word 201839571 -.word 998311389 -.word -644571796 -.word -216465975 -.word 13598070 -.word -2102990 -.word -13050733 -.word 5928435 -.word 870210062 -.word -134581088 -.word -835183168 -.word 379390883 -.word -758477 -.word 9911360 -.word -1113823 -.word -2263511 -.word -48538823 -.word 634278629 -.word -71279232 -.word -144853648 -.word -7543116 -.word -10628043 -.word -9009935 -.word -12474447 -.word -482722581 -.word -680142841 -.word -576591832 -.word -798303678 -.word -11692247 -.word -5878727 -.word -2861106 -.word -1784515 -.word -748246699 -.word -376209814 -.word -183096809 -.word -114200244 -.word 2853776 -.word -1911034 -.word -3833379 -.word -1743822 -.word 182627725 -.word -122296842 -.word -245317532 -.word -111596091 -.word -3179040 -.word 4924837 -.word 11362575 -.word -2158227 -.word -203443032 -.word 315165513 -.word 727149301 -.word -138115986 -.word -5867892 -.word -2327468 -.word 6544948 -.word 13728247 -.word -375516427 -.word -148946584 -.word 418844704 -.word 878540754 -.word 9116920 -.word -7107193 -.word -6383693 -.word 1574249 -.word 583438350 -.word -454825638 -.word -408525172 -.word 100744247 -.word 6510145 -.word 760999 -.word 1634503 -.word -4010884 -.word 416617482 -.word 48700219 -.word 104600209 -.word -256676985 -.word 2195232 
-.word 4465852 -.word -2353891 -.word -3640250 -.word 140484126 -.word 285792715 -.word -150637527 -.word -232958220 -.word -4383994 -.word -16731042 -.word 11592382 -.word 2671395 -.word -280554203 -.word -1070704968 -.word 741855827 -.word 170956232 -.word 14579779 -.word -9293480 -.word 4646776 -.word 69049 -.word 933034643 -.word -594737327 -.word 297370968 -.word 4418799 -.word -293505 -.word -11063747 -.word -11547014 -.word 12021234 -.word -18782886 -.word -708025769 -.word -738952496 -.word 769300260 -.word 15720958 -.word 4876619 -.word 9370171 -.word 2197027 -.word 1006064525 -.word 312079797 -.word 599645177 -.word 140598997 -.word 16117282 -.word 9635661 -.word 9117520 -.word 3506913 -.word 1031427326 -.word 616635240 -.word 583476747 -.word 224425303 -.word -13542586 -.word -7663005 -.word 10257619 -.word -9055324 -.word -866659357 -.word -490394891 -.word 656437514 -.word -579496507 -.word -10089721 -.word 11944835 -.word -3788839 -.word 3189790 -.word -645692862 -.word 764411097 -.word -242467190 -.word 204130980 -.word -4997961 -.word -13405384 -.word 11645481 -.word 16402437 -.word -319845092 -.word -857879099 -.word 745253903 -.word 1049675853 -.word 1005359 -.word -14426854 -.word 11690281 -.word 5461508 -.word 64338065 -.word -923248190 -.word 748120885 -.word 349509836 -.word 4898455 -.word -11497049 -.word -13241747 -.word -4941226 -.word 313477194 -.word -735754980 -.word -847407131 -.word -316214329 -.word 6226096 -.word 14029790 -.word 7729000 -.word 13958531 -.word 398439734 -.word 897838034 -.word 494618249 -.word 893277806 -.word -1801935 -.word -7454249 -.word -14381089 -.word -14084755 -.word -115315039 -.word -477035527 -.word -920319454 -.word -901355525 -.word -16254433 -.word 8630188 -.word 13744680 -.word -1666087 -.word -1040204320 -.word 552289879 -.word 879592386 -.word -106621430 -.word 4735938 -.word -6885336 -.word -7746022 -.word -7978303 -.word 303076900 -.word -440627874 -.word -495707574 -.word -510572423 -.word 6957373 
-.word -8175281 -.word -5776166 -.word -5494682 -.word 445237890 -.word -523178053 -.word -369646411 -.word -351632810 -.word -7406071 -.word -4031087 -.word -10476123 -.word 1636987 -.word -473952370 -.word -257969879 -.word -670420703 -.word 104759172 -.word 10674616 -.word 9508293 -.word 4274200 -.word 10066304 -.word 683123285 -.word 608484310 -.word 273527923 -.word 644194289 -.word -7083547 -.word 14853570 -.word -1129445 -.word 16598340 -.word -453312409 -.word 950555930 -.word -72278963 -.word 1062212688 \ No newline at end of file diff --git a/tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_complete.s b/tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index ca10604..0000000 --- a/tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,1226 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 
96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) 
mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 
2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 
2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 
2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.word 13704133 // zeta^ 2 * 2^31 = 28678040^ 2 * 2^31 -.word 41177999 // zeta^130 * 2^31 = 28678040^130 * 2^31 -.word 26703739 // zeta^ 66 * 2^31 = 28678040^ 66 * 2^31 -.word 65289035 // zeta^194 * 2^31 = 28678040^194 * 2^31 -.word 1666225723 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 2 * 375649793 * 2^31 -.word 2599633521 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 28678040^130 * 375649793 * 2^31 -.word 2869384837 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 66 * 375649793 * 2^31 -.word 1260434101 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 28678040^194 * 375649793 * 2^31 -.word 50326315 // zeta^ 1 * 2^31 = 28678040^ 1 * 2^31 -.word 37746191 // zeta^ 65 * 2^31 = 28678040^ 65 * 2^31 -.word 49080301 // zeta^ 33 * 2^31 = 28678040^ 33 * 2^31 -.word 34232193 // zeta^ 97 * 2^31 = 28678040^ 97 * 2^31 -.word 1835254485 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 1 * 375649793 * 2^31 -.word 360751089 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 65 * 375649793 * 2^31 -.word 1200511507 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 33 * 375649793 * 2^31 -.word 553431679 // zeta^ 97 * (q^(-1) mod 
2^32) * 2^31 = 28678040^ 97 * 375649793 * 2^31 -.word 22955837 // zeta^129 * 2^31 = 28678040^129 * 2^31 -.word 31411079 // zeta^193 * 2^31 = 28678040^193 * 2^31 -.word 492607 // zeta^161 * 2^31 = 28678040^161 * 2^31 -.word 22217509 // zeta^225 * 2^31 = 28678040^225 * 2^31 -.word 2610305731 // zeta^129 * (q^(-1) mod 2^32) * 2^31 = 28678040^129 * 375649793 * 2^31 -.word 2623011449 // zeta^193 * (q^(-1) mod 2^32) * 2^31 = 28678040^193 * 375649793 * 2^31 -.word 948367809 // zeta^161 * (q^(-1) mod 2^32) * 2^31 = 28678040^161 * 375649793 * 2^31 -.word 3562565339 // zeta^225 * (q^(-1) mod 2^32) * 2^31 = 28678040^225 * 375649793 * 2^31 -.word 5481609 // zeta^ 34 * 2^31 = 28678040^ 34 * 2^31 -.word 12552175 // zeta^162 * 2^31 = 28678040^162 * 2^31 -.word 54494203 // zeta^ 98 * 2^31 = 28678040^ 98 * 2^31 -.word 32704019 // zeta^226 * 2^31 = 28678040^226 * 2^31 -.word 949335415 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 34 * 375649793 * 2^31 -.word 3610496529 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 28678040^162 * 375649793 * 2^31 -.word 1474054661 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 98 * 375649793 * 2^31 -.word 2061350893 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 28678040^226 * 375649793 * 2^31 -.word 48767307 // zeta^ 17 * 2^31 = 28678040^ 17 * 2^31 -.word 39600285 // zeta^ 81 * 2^31 = 28678040^ 81 * 2^31 -.word 31654617 // zeta^ 49 * 2^31 = 28678040^ 49 * 2^31 -.word 4736231 // zeta^113 * 2^31 = 28678040^113 * 2^31 -.word 2602093749 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 17 * 375649793 * 2^31 -.word 3705004387 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 81 * 375649793 * 2^31 -.word 427128615 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 49 * 375649793 * 2^31 -.word 237814041 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 28678040^113 * 375649793 * 2^31 -.word 18965555 // zeta^145 * 2^31 = 28678040^145 * 2^31 -.word 50771049 // zeta^209 * 2^31 = 28678040^209 * 2^31 -.word 8794671 // zeta^177 * 2^31 = 28678040^177 * 2^31 -.word 
59508707 // zeta^241 * 2^31 = 28678040^241 * 2^31 -.word 3336346061 // zeta^145 * (q^(-1) mod 2^32) * 2^31 = 28678040^145 * 375649793 * 2^31 -.word 3238094231 // zeta^209 * (q^(-1) mod 2^32) * 2^31 = 28678040^209 * 375649793 * 2^31 -.word 2568201169 // zeta^177 * (q^(-1) mod 2^32) * 2^31 = 28678040^177 * 375649793 * 2^31 -.word 3726038557 // zeta^241 * (q^(-1) mod 2^32) * 2^31 = 28678040^241 * 375649793 * 2^31 -.word 43973433 // zeta^ 18 * 2^31 = 28678040^ 18 * 2^31 -.word 14453865 // zeta^146 * 2^31 = 28678040^146 * 2^31 -.word 14937153 // zeta^ 82 * 2^31 = 28678040^ 82 * 2^31 -.word 39701997 // zeta^210 * 2^31 = 28678040^210 * 2^31 -.word 720191175 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 18 * 375649793 * 2^31 -.word 3181088151 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 28678040^146 * 375649793 * 2^31 -.word 116563391 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 82 * 375649793 * 2^31 -.word 3642323987 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 28678040^210 * 375649793 * 2^31 -.word 53455571 // zeta^ 9 * 2^31 = 28678040^ 9 * 2^31 -.word 35877127 // zeta^ 73 * 2^31 = 28678040^ 73 * 2^31 -.word 681755 // zeta^ 41 * 2^31 = 28678040^ 41 * 2^31 -.word 63245537 // zeta^105 * 2^31 = 28678040^105 * 2^31 -.word 4245721901 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 9 * 375649793 * 2^31 -.word 2676675833 // zeta^ 73 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 73 * 375649793 * 2^31 -.word 3480266469 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 41 * 375649793 * 2^31 -.word 1356315935 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 28678040^105 * 375649793 * 2^31 -.word 11718751 // zeta^137 * 2^31 = 28678040^137 * 2^31 -.word 41885553 // zeta^201 * 2^31 = 28678040^201 * 2^31 -.word 54210213 // zeta^169 * 2^31 = 28678040^169 * 2^31 -.word 16838301 // zeta^233 * 2^31 = 28678040^233 * 2^31 -.word 1817503137 // zeta^137 * (q^(-1) mod 2^32) * 2^31 = 28678040^137 * 375649793 * 2^31 -.word 4137110159 // zeta^201 * (q^(-1) mod 2^32) * 2^31 = 28678040^201 * 375649793 * 2^31 
-.word 3861070683 // zeta^169 * (q^(-1) mod 2^32) * 2^31 = 28678040^169 * 375649793 * 2^31 -.word 1425879907 // zeta^233 * (q^(-1) mod 2^32) * 2^31 = 28678040^233 * 375649793 * 2^31 -.word 40841465 // zeta^ 50 * 2^31 = 28678040^ 50 * 2^31 -.word 3577749 // zeta^178 * 2^31 = 28678040^178 * 2^31 -.word 33845545 // zeta^114 * 2^31 = 28678040^114 * 2^31 -.word 19555165 // zeta^242 * 2^31 = 28678040^242 * 2^31 -.word 3459680519 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 50 * 375649793 * 2^31 -.word 495008363 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 28678040^178 * 375649793 * 2^31 -.word 1885546711 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 28678040^114 * 375649793 * 2^31 -.word 3630382755 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 28678040^242 * 375649793 * 2^31 -.word 62758213 // zeta^ 25 * 2^31 = 28678040^ 25 * 2^31 -.word 8005843 // zeta^ 89 * 2^31 = 28678040^ 89 * 2^31 -.word 51922779 // zeta^ 57 * 2^31 = 28678040^ 57 * 2^31 -.word 7245689 // zeta^121 * 2^31 = 28678040^121 * 2^31 -.word 124982459 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 25 * 375649793 * 2^31 -.word 2964460845 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 89 * 375649793 * 2^31 -.word 1042630309 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 57 * 375649793 * 2^31 -.word 3756534407 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 28678040^121 * 375649793 * 2^31 -.word 30225471 // zeta^153 * 2^31 = 28678040^153 * 2^31 -.word 44151511 // zeta^217 * 2^31 = 28678040^217 * 2^31 -.word 64890121 // zeta^185 * 2^31 = 28678040^185 * 2^31 -.word 65259669 // zeta^249 * 2^31 = 28678040^249 * 2^31 -.word 799097281 // zeta^153 * (q^(-1) mod 2^32) * 2^31 = 28678040^153 * 375649793 * 2^31 -.word 4166955817 // zeta^217 * (q^(-1) mod 2^32) * 2^31 = 28678040^217 * 375649793 * 2^31 -.word 1813001463 // zeta^185 * (q^(-1) mod 2^32) * 2^31 = 28678040^185 * 375649793 * 2^31 -.word 3116239211 // zeta^249 * (q^(-1) mod 2^32) * 2^31 = 28678040^249 * 375649793 * 2^31 -.word 12974361 // zeta^ 10 * 2^31 = 28678040^ 
10 * 2^31 -.word 41807515 // zeta^138 * 2^31 = 28678040^138 * 2^31 -.word 56379967 // zeta^ 74 * 2^31 = 28678040^ 74 * 2^31 -.word 13380915 // zeta^202 * 2^31 = 28678040^202 * 2^31 -.word 1194393831 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 10 * 375649793 * 2^31 -.word 1648893797 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 28678040^138 * 375649793 * 2^31 -.word 753806273 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 74 * 375649793 * 2^31 -.word 4010528973 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 28678040^202 * 375649793 * 2^31 -.word 16772797 // zeta^ 5 * 2^31 = 28678040^ 5 * 2^31 -.word 58675875 // zeta^ 69 * 2^31 = 28678040^ 69 * 2^31 -.word 59974505 // zeta^ 37 * 2^31 = 28678040^ 37 * 2^31 -.word 33980107 // zeta^101 * 2^31 = 28678040^101 * 2^31 -.word 2122281795 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 5 * 375649793 * 2^31 -.word 2886667101 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 69 * 375649793 * 2^31 -.word 3771397783 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 37 * 375649793 * 2^31 -.word 1168207669 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 28678040^101 * 375649793 * 2^31 -.word 28448893 // zeta^133 * 2^31 = 28678040^133 * 2^31 -.word 24378249 // zeta^197 * 2^31 = 28678040^197 * 2^31 -.word 62687027 // zeta^165 * 2^31 = 28678040^165 * 2^31 -.word 65645595 // zeta^229 * 2^31 = 28678040^229 * 2^31 -.word 2758723971 // zeta^133 * (q^(-1) mod 2^32) * 2^31 = 28678040^133 * 375649793 * 2^31 -.word 2128305783 // zeta^197 * (q^(-1) mod 2^32) * 2^31 = 28678040^197 * 375649793 * 2^31 -.word 664762061 // zeta^165 * (q^(-1) mod 2^32) * 2^31 = 28678040^165 * 375649793 * 2^31 -.word 2420335077 // zeta^229 * (q^(-1) mod 2^32) * 2^31 = 28678040^229 * 375649793 * 2^31 -.word 52771617 // zeta^ 42 * 2^31 = 28678040^ 42 * 2^31 -.word 23396495 // zeta^170 * 2^31 = 28678040^170 * 2^31 -.word 51483005 // zeta^106 * 2^31 = 28678040^106 * 2^31 -.word 11487943 // zeta^234 * 2^31 = 28678040^234 * 2^31 -.word 2185629407 // zeta^ 42 * (q^(-1) mod 
2^32) * 2^31 = 28678040^ 42 * 375649793 * 2^31 -.word 1858377073 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 28678040^170 * 375649793 * 2^31 -.word 432623747 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 28678040^106 * 375649793 * 2^31 -.word 2290121529 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 28678040^234 * 375649793 * 2^31 -.word 63287737 // zeta^ 21 * 2^31 = 28678040^ 21 * 2^31 -.word 56338313 // zeta^ 85 * 2^31 = 28678040^ 85 * 2^31 -.word 19445427 // zeta^ 53 * 2^31 = 28678040^ 53 * 2^31 -.word 29167561 // zeta^117 * 2^31 = 28678040^117 * 2^31 -.word 1659340871 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 21 * 375649793 * 2^31 -.word 1504424567 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 85 * 375649793 * 2^31 -.word 3591259981 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 53 * 375649793 * 2^31 -.word 4032612919 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 28678040^117 * 375649793 * 2^31 -.word 7740335 // zeta^149 * 2^31 = 28678040^149 * 2^31 -.word 23515783 // zeta^213 * 2^31 = 28678040^213 * 2^31 -.word 33583453 // zeta^181 * 2^31 = 28678040^181 * 2^31 -.word 60337403 // zeta^245 * 2^31 = 28678040^245 * 2^31 -.word 3259152977 // zeta^149 * (q^(-1) mod 2^32) * 2^31 = 28678040^149 * 375649793 * 2^31 -.word 739668857 // zeta^213 * (q^(-1) mod 2^32) * 2^31 = 28678040^213 * 375649793 * 2^31 -.word 3155767459 // zeta^181 * (q^(-1) mod 2^32) * 2^31 = 28678040^181 * 375649793 * 2^31 -.word 3474792709 // zeta^245 * (q^(-1) mod 2^32) * 2^31 = 28678040^245 * 375649793 * 2^31 -.word 35192755 // zeta^ 26 * 2^31 = 28678040^ 26 * 2^31 -.word 36544119 // zeta^154 * 2^31 = 28678040^154 * 2^31 -.word 6787663 // zeta^ 90 * 2^31 = 28678040^ 90 * 2^31 -.word 63484749 // zeta^218 * 2^31 = 28678040^218 * 2^31 -.word 3019374157 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 26 * 375649793 * 2^31 -.word 2777089929 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 28678040^154 * 375649793 * 2^31 -.word 443777969 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 90 * 375649793 * 2^31 
-.word 723799731 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 28678040^218 * 375649793 * 2^31 -.word 61997615 // zeta^ 13 * 2^31 = 28678040^ 13 * 2^31 -.word 4479011 // zeta^ 77 * 2^31 = 28678040^ 77 * 2^31 -.word 38089877 // zeta^ 45 * 2^31 = 28678040^ 45 * 2^31 -.word 16590903 // zeta^109 * 2^31 = 28678040^109 * 2^31 -.word 201839569 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 13 * 375649793 * 2^31 -.word 998311389 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 77 * 375649793 * 2^31 -.word 1502911851 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 45 * 375649793 * 2^31 -.word 1931017673 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 28678040^109 * 375649793 * 2^31 -.word 43852787 // zeta^141 * 2^31 = 28678040^141 * 2^31 -.word 24597857 // zeta^205 * 2^31 = 28678040^205 * 2^31 -.word 43936833 // zeta^173 * 2^31 = 28678040^173 * 2^31 -.word 15636061 // zeta^237 * 2^31 = 28678040^237 * 2^31 -.word 870210061 // zeta^141 * (q^(-1) mod 2^32) * 2^31 = 28678040^141 * 375649793 * 2^31 -.word 4160386207 // zeta^205 * (q^(-1) mod 2^32) * 2^31 = 28678040^205 * 375649793 * 2^31 -.word 1312300479 // zeta^173 * (q^(-1) mod 2^32) * 2^31 = 28678040^173 * 375649793 * 2^31 -.word 2526874531 // zeta^237 * (q^(-1) mod 2^32) * 2^31 = 28678040^237 * 375649793 * 2^31 -.word 55869129 // zeta^ 58 * 2^31 = 28678040^ 58 * 2^31 -.word 16038683 // zeta^186 * 2^31 = 28678040^186 * 2^31 -.word 43560065 // zeta^122 * 2^31 = 28678040^122 * 2^31 -.word 25949329 // zeta^250 * 2^31 = 28678040^250 * 2^31 -.word 2098944823 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 58 * 375649793 * 2^31 -.word 634278629 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 28678040^186 * 375649793 * 2^31 -.word 2076204415 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 28678040^122 * 375649793 * 2^31 -.word 2002629999 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 28678040^250 * 375649793 * 2^31 -.word 6591765 // zeta^ 29 * 2^31 = 28678040^ 29 * 2^31 -.word 1696249 // zeta^ 93 * 2^31 = 28678040^ 93 * 2^31 -.word 21795289 // zeta^ 
61 * 2^31 = 28678040^ 61 * 2^31 -.word 17734591 // zeta^125 * 2^31 = 28678040^125 * 2^31 -.word 3812244715 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 29 * 375649793 * 2^31 -.word 1467340807 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 93 * 375649793 * 2^31 -.word 1570891815 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 61 * 375649793 * 2^31 -.word 1349179969 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 28678040^125 * 375649793 * 2^31 -.word 66853037 // zeta^157 * 2^31 = 28678040^157 * 2^31 -.word 24930199 // zeta^221 * 2^31 = 28678040^221 * 2^31 -.word 54854635 // zeta^189 * 2^31 = 28678040^189 * 2^31 -.word 39952565 // zeta^253 * 2^31 = 28678040^253 * 2^31 -.word 1399236947 // zeta^157 * (q^(-1) mod 2^32) * 2^31 = 28678040^157 * 375649793 * 2^31 -.word 1771273833 // zeta^221 * (q^(-1) mod 2^32) * 2^31 = 28678040^221 * 375649793 * 2^31 -.word 4111870485 // zeta^189 * (q^(-1) mod 2^32) * 2^31 = 28678040^189 * 375649793 * 2^31 -.word 2033283403 // zeta^253 * (q^(-1) mod 2^32) * 2^31 = 28678040^253 * 375649793 * 2^31 -.word 5623923 // zeta^ 6 * 2^31 = 28678040^ 6 * 2^31 -.word 38701067 // zeta^134 * 2^31 = 28678040^134 * 2^31 -.word 18571677 // zeta^ 70 * 2^31 = 28678040^ 70 * 2^31 -.word 14491707 // zeta^198 * 2^31 = 28678040^198 * 2^31 -.word 182627725 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 6 * 375649793 * 2^31 -.word 4172670453 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 28678040^134 * 375649793 * 2^31 -.word 1902166115 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 70 * 375649793 * 2^31 -.word 4183371205 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 28678040^198 * 375649793 * 2^31 -.word 17941849 // zeta^ 3 * 2^31 = 28678040^ 3 * 2^31 -.word 12982967 // zeta^ 67 * 2^31 = 28678040^ 67 * 2^31 -.word 8061707 // zeta^ 35 * 2^31 = 28678040^ 35 * 2^31 -.word 17774995 // zeta^ 99 * 2^31 = 28678040^ 99 * 2^31 -.word 4091524263 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 3 * 375649793 * 2^31 -.word 2462649161 // zeta^ 67 * (q^(-1) mod 2^32) * 
2^31 = 28678040^ 67 * 375649793 * 2^31 -.word 2874632949 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 35 * 375649793 * 2^31 -.word 2009367661 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 99 * 375649793 * 2^31 -.word 61107981 // zeta^131 * 2^31 = 28678040^131 * 2^31 -.word 38975641 // zeta^195 * 2^31 = 28678040^195 * 2^31 -.word 40352225 // zeta^163 * 2^31 = 28678040^163 * 2^31 -.word 49569327 // zeta^227 * 2^31 = 28678040^227 * 2^31 -.word 3919450867 // zeta^131 * (q^(-1) mod 2^32) * 2^31 = 28678040^131 * 375649793 * 2^31 -.word 4146020711 // zeta^195 * (q^(-1) mod 2^32) * 2^31 = 28678040^195 * 375649793 * 2^31 -.word 418844703 // zeta^163 * (q^(-1) mod 2^32) * 2^31 = 28678040^163 * 375649793 * 2^31 -.word 3026024401 // zeta^227 * (q^(-1) mod 2^32) * 2^31 = 28678040^227 * 375649793 * 2^31 -.word 26799603 // zeta^ 38 * 2^31 = 28678040^ 38 * 2^31 -.word 33463463 // zeta^166 * 2^31 = 28678040^166 * 2^31 -.word 39332725 // zeta^102 * 2^31 = 28678040^102 * 2^31 -.word 61125067 // zeta^230 * 2^31 = 28678040^230 * 2^31 -.word 583438349 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 38 * 375649793 * 2^31 -.word 1692658009 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 28678040^166 * 375649793 * 2^31 -.word 1738958475 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 28678040^102 * 375649793 * 2^31 -.word 2248227893 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 28678040^230 * 375649793 * 2^31 -.word 40014327 // zeta^ 19 * 2^31 = 28678040^ 19 * 2^31 -.word 562885 // zeta^ 83 * 2^31 = 28678040^ 83 * 2^31 -.word 51009393 // zeta^ 51 * 2^31 = 28678040^ 51 * 2^31 -.word 51995259 // zeta^115 * 2^31 = 28678040^115 * 2^31 -.word 2564101129 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 19 * 375649793 * 2^31 -.word 2196183867 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 83 * 375649793 * 2^31 -.word 2252083855 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 51 * 375649793 * 2^31 -.word 4038290309 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 28678040^115 * 375649793 * 2^31 -.word 
24330211 // zeta^147 * 2^31 = 28678040^147 * 2^31 -.word 7682101 // zeta^211 * 2^31 = 28678040^211 * 2^31 -.word 7401943 // zeta^179 * 2^31 = 28678040^179 * 2^31 -.word 41757453 // zeta^243 * 2^31 = 28678040^243 * 2^31 -.word 140484125 // zeta^147 * (q^(-1) mod 2^32) * 2^31 = 28678040^147 * 375649793 * 2^31 -.word 285792715 // zeta^211 * (q^(-1) mod 2^32) * 2^31 = 28678040^211 * 375649793 * 2^31 -.word 1996846121 // zeta^179 * (q^(-1) mod 2^32) * 2^31 = 28678040^179 * 375649793 * 2^31 -.word 4062009075 // zeta^243 * (q^(-1) mod 2^32) * 2^31 = 28678040^243 * 375649793 * 2^31 -.word 65375453 // zeta^ 22 * 2^31 = 28678040^ 22 * 2^31 -.word 40797001 // zeta^150 * 2^31 = 28678040^150 * 2^31 -.word 59835311 // zeta^ 86 * 2^31 = 28678040^ 86 * 2^31 -.word 32875577 // zeta^214 * 2^31 = 28678040^214 * 2^31 -.word 4014413091 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 22 * 375649793 * 2^31 -.word 3224262327 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 28678040^150 * 375649793 * 2^31 -.word 741855825 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 86 * 375649793 * 2^31 -.word 2318439879 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 28678040^214 * 375649793 * 2^31 -.word 10045293 // zeta^ 11 * 2^31 = 28678040^ 11 * 2^31 -.word 53076657 // zeta^ 75 * 2^31 = 28678040^ 75 * 2^31 -.word 17896617 // zeta^ 43 * 2^31 = 28678040^ 43 * 2^31 -.word 58413331 // zeta^107 * 2^31 = 28678040^107 * 2^31 -.word 3080518291 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 11 * 375649793 * 2^31 -.word 3700229967 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 75 * 375649793 * 2^31 -.word 297370967 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 43 * 375649793 * 2^31 -.word 2151902445 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 28678040^107 * 375649793 * 2^31 -.word 19472551 // zeta^139 * 2^31 = 28678040^139 * 2^31 -.word 6043561 // zeta^203 * 2^31 = 28678040^203 * 2^31 -.word 20934449 // zeta^171 * 2^31 = 28678040^171 * 2^31 -.word 37620445 // zeta^235 * 2^31 = 28678040^235 * 2^31 -.word 
2128700761 // zeta^139 * (q^(-1) mod 2^32) * 2^31 = 28678040^139 * 375649793 * 2^31 -.word 1439457879 // zeta^203 * (q^(-1) mod 2^32) * 2^31 = 28678040^203 * 375649793 * 2^31 -.word 3556014799 // zeta^171 * (q^(-1) mod 2^32) * 2^31 = 28678040^171 * 375649793 * 2^31 -.word 769300259 // zeta^235 * (q^(-1) mod 2^32) * 2^31 = 28678040^235 * 375649793 * 2^31 -.word 12921459 // zeta^ 54 * 2^31 = 28678040^ 54 * 2^31 -.word 63769677 // zeta^182 * 2^31 = 28678040^182 * 2^31 -.word 61505033 // zeta^118 * 2^31 = 28678040^118 * 2^31 -.word 65692461 // zeta^246 * 2^31 = 28678040^246 * 2^31 -.word 1006064525 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 54 * 375649793 * 2^31 -.word 2459563443 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 28678040^182 * 375649793 * 2^31 -.word 2747128823 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 28678040^118 * 375649793 * 2^31 -.word 2288082643 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 28678040^246 * 375649793 * 2^31 -.word 20171011 // zeta^ 27 * 2^31 = 28678040^ 27 * 2^31 -.word 36495001 // zeta^ 91 * 2^31 = 28678040^ 91 * 2^31 -.word 62685175 // zeta^ 59 * 2^31 = 28678040^ 59 * 2^31 -.word 664745 // zeta^123 * 2^31 = 28678040^123 * 2^31 -.word 1031427325 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 27 * 375649793 * 2^31 -.word 2764118887 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 91 * 375649793 * 2^31 -.word 583476745 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 59 * 375649793 * 2^31 -.word 2371908951 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 28678040^123 * 375649793 * 2^31 -.word 56713759 // zeta^155 * 2^31 = 28678040^155 * 2^31 -.word 59594509 // zeta^219 * 2^31 = 28678040^219 * 2^31 -.word 41235703 // zeta^187 * 2^31 = 28678040^187 * 2^31 -.word 11581499 // zeta^251 * 2^31 = 28678040^251 * 2^31 -.word 3428307937 // zeta^155 * (q^(-1) mod 2^32) * 2^31 = 28678040^155 * 375649793 * 2^31 -.word 1657088755 // zeta^219 * (q^(-1) mod 2^32) * 2^31 = 28678040^219 * 375649793 * 2^31 -.word 2803921161 // zeta^187 * (q^(-1) mod 2^32) * 
2^31 = 28678040^187 * 375649793 * 2^31 -.word 3715470789 // zeta^251 * (q^(-1) mod 2^32) * 2^31 = 28678040^251 * 375649793 * 2^31 -.word 23458751 // zeta^ 14 * 2^31 = 28678040^ 14 * 2^31 -.word 9406759 // zeta^142 * 2^31 = 28678040^142 * 2^31 -.word 33711991 // zeta^ 78 * 2^31 = 28678040^ 78 * 2^31 -.word 32167773 // zeta^206 * 2^31 = 28678040^206 * 2^31 -.word 1501790785 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 14 * 375649793 * 2^31 -.word 2911894745 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 28678040^142 * 375649793 * 2^31 -.word 1905016457 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 78 * 375649793 * 2^31 -.word 204130979 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 28678040^206 * 375649793 * 2^31 -.word 26043621 // zeta^ 7 * 2^31 = 28678040^ 7 * 2^31 -.word 51942461 // zeta^ 71 * 2^31 = 28678040^ 71 * 2^31 -.word 14401009 // zeta^ 39 * 2^31 = 28678040^ 39 * 2^31 -.word 60574133 // zeta^103 * 2^31 = 28678040^103 * 2^31 -.word 1827638555 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 7 * 375649793 * 2^31 -.word 3437088195 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 71 * 375649793 * 2^31 -.word 2892737551 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 39 * 375649793 * 2^31 -.word 3197159499 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 28678040^103 * 375649793 * 2^31 -.word 16031087 // zeta^135 * 2^31 = 28678040^135 * 2^31 -.word 25566271 // zeta^199 * 2^31 = 28678040^199 * 2^31 -.word 54040269 // zeta^167 * 2^31 = 28678040^167 * 2^31 -.word 36895029 // zeta^231 * 2^31 = 28678040^231 * 2^31 -.word 2211821713 // zeta^135 * (q^(-1) mod 2^32) * 2^31 = 28678040^135 * 375649793 * 2^31 -.word 3371719105 // zeta^199 * (q^(-1) mod 2^32) * 2^31 = 28678040^199 * 375649793 * 2^31 -.word 2895604531 // zeta^167 * (q^(-1) mod 2^32) * 2^31 = 28678040^167 * 375649793 * 2^31 -.word 349509835 // zeta^231 * (q^(-1) mod 2^32) * 2^31 = 28678040^231 * 375649793 * 2^31 -.word 41803191 // zeta^ 46 * 2^31 = 28678040^ 46 * 2^31 -.word 19377381 // zeta^174 * 2^31 = 
28678040^174 * 2^31 -.word 9664027 // zeta^110 * 2^31 = 28678040^110 * 2^31 -.word 55794235 // zeta^238 * 2^31 = 28678040^238 * 2^31 -.word 2460960841 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 46 * 375649793 * 2^31 -.word 1411728667 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 28678040^174 * 375649793 * 2^31 -.word 1300076517 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 28678040^110 * 375649793 * 2^31 -.word 3978752965 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 28678040^238 * 375649793 * 2^31 -.word 19675339 // zeta^ 23 * 2^31 = 28678040^ 23 * 2^31 -.word 21359151 // zeta^ 87 * 2^31 = 28678040^ 87 * 2^31 -.word 63140729 // zeta^ 55 * 2^31 = 28678040^ 55 * 2^31 -.word 23160723 // zeta^119 * 2^31 = 28678040^119 * 2^31 -.word 398439733 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 23 * 375649793 * 2^31 -.word 897838033 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 87 * 375649793 * 2^31 -.word 494618247 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 55 * 375649793 * 2^31 -.word 3040761453 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 28678040^119 * 375649793 * 2^31 -.word 9258847 // zeta^151 * 2^31 = 28678040^151 * 2^31 -.word 4669959 // zeta^215 * 2^31 = 28678040^215 * 2^31 -.word 41266143 // zeta^183 * 2^31 = 28678040^183 * 2^31 -.word 61464071 // zeta^247 * 2^31 = 28678040^247 * 2^31 -.word 2032168609 // zeta^151 * (q^(-1) mod 2^32) * 2^31 = 28678040^151 * 375649793 * 2^31 -.word 1670448121 // zeta^215 * (q^(-1) mod 2^32) * 2^31 = 28678040^215 * 375649793 * 2^31 -.word 1227164193 // zeta^183 * (q^(-1) mod 2^32) * 2^31 = 28678040^183 * 375649793 * 2^31 -.word 1246128121 // zeta^247 * (q^(-1) mod 2^32) * 2^31 = 28678040^247 * 375649793 * 2^31 -.word 43355169 // zeta^ 30 * 2^31 = 28678040^ 30 * 2^31 -.word 5591977 // zeta^158 * 2^31 = 28678040^158 * 2^31 -.word 40694335 // zeta^ 94 * 2^31 = 28678040^ 94 * 2^31 -.word 25071607 // zeta^222 * 2^31 = 28678040^222 * 2^31 -.word 1107279327 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 30 * 375649793 * 2^31 
-.word 552289879 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 28678040^158 * 375649793 * 2^31
-.word 879592385 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 94 * 375649793 * 2^31
-.word 2040862217 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 28678040^222 * 375649793 * 2^31
-.word 34737117 // zeta^ 15 * 2^31 = 28678040^ 15 * 2^31
-.word 45994147 // zeta^ 79 * 2^31 = 28678040^ 79 * 2^31
-.word 42273719 // zeta^ 47 * 2^31 = 28678040^ 47 * 2^31
-.word 60428681 // zeta^111 * 2^31 = 28678040^111 * 2^31
-.word 303076899 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 15 * 375649793 * 2^31
-.word 3854339421 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 79 * 375649793 * 2^31
-.word 3799259721 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 47 * 375649793 * 2^31
-.word 1636911223 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 28678040^111 * 375649793 * 2^31
-.word 26028927 // zeta^143 * 2^31 = 28678040^143 * 2^31
-.word 64083527 // zeta^207 * 2^31 = 28678040^207 * 2^31
-.word 60382541 // zeta^175 * 2^31 = 28678040^175 * 2^31
-.word 31337387 // zeta^239 * 2^31 = 28678040^239 * 2^31
-.word 2592721537 // zeta^143 * (q^(-1) mod 2^32) * 2^31 = 28678040^143 * 375649793 * 2^31
-.word 1624305593 // zeta^207 * (q^(-1) mod 2^32) * 2^31 = 28678040^207 * 375649793 * 2^31
-.word 3925320883 // zeta^175 * (q^(-1) mod 2^32) * 2^31 = 28678040^175 * 375649793 * 2^31
-.word 3943334485 // zeta^239 * (q^(-1) mod 2^32) * 2^31 = 28678040^239 * 375649793 * 2^31
-.word 27553395 // zeta^ 62 * 2^31 = 28678040^ 62 * 2^31
-.word 7648471 // zeta^190 * 2^31 = 28678040^190 * 2^31
-.word 689375 // zeta^126 * 2^31 = 28678040^126 * 2^31
-.word 46555773 // zeta^254 * 2^31 = 28678040^254 * 2^31
-.word 1673531277 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 62 * 375649793 * 2^31
-.word 1889513769 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 28678040^190 * 375649793 * 2^31
-.word 1477062945 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 28678040^126 * 375649793 * 2^31
-.word 2252242819 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 28678040^254 * 375649793 * 2^31
-.word 15797163 // zeta^ 31 * 2^31 = 28678040^ 31 * 2^31
-.word 40170027 // zeta^ 95 * 2^31 = 28678040^ 95 * 2^31
-.word 10866061 // zeta^ 63 * 2^31 = 28678040^ 63 * 2^31
-.word 56298001 // zeta^127 * 2^31 = 28678040^127 * 2^31
-.word 683123285 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 31 * 375649793 * 2^31
-.word 2755967957 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 95 * 375649793 * 2^31
-.word 273527923 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 63 * 375649793 * 2^31
-.word 644194287 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 28678040^127 * 375649793 * 2^31
-.word 50400667 // zeta^159 * 2^31 = 28678040^159 * 2^31
-.word 33861863 // zeta^223 * 2^31 = 28678040^223 * 2^31
-.word 53736885 // zeta^191 * 2^31 = 28678040^191 * 2^31
-.word 31774129 // zeta^255 * 2^31 = 28678040^255 * 2^31
-.word 1694171237 // zeta^159 * (q^(-1) mod 2^32) * 2^31 = 28678040^159 * 375649793 * 2^31
-.word 950555929 // zeta^223 * (q^(-1) mod 2^32) * 2^31 = 28678040^223 * 375649793 * 2^31
-.word 2075204683 // zeta^191 * (q^(-1) mod 2^32) * 2^31 = 28678040^191 * 375649793 * 2^31
-.word 1062212687 // zeta^255 * (q^(-1) mod 2^32) * 2^31 = 28678040^255 * 375649793 * 2^31
-.text
-
-// Montgomery multiplication via rounding
-.macro mulmod dst, src, const, const_twisted
- vqrdmulh.s32 \dst, \src, \const
- vmul.u32 \src, \src, \const_twisted
- vqrdmlah.s32 \dst, \src, modulus
-.endm
-
-.macro ct_butterfly a, b, root, root_twisted
- mulmod tmp, \b, \root, \root_twisted
- vsub.u32 \b, \a, tmp
- vadd.u32 \a, \a, tmp
-.endm
-
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_n256_u32_33556993_28678040_complete_manual, %function
-.global ntt_n256_u32_33556993_28678040_complete_manual
-ntt_n256_u32_33556993_28678040_complete_manual:
-
- push {r4-r11,lr}
- // Save MVE vector registers
- vpush {d8-d15}
-
- modulus .req r12
- root_ptr .req r11
-
- .equ modulus_const, 33556993
- movw modulus, #:lower16:modulus_const
- movt modulus, #:upper16:modulus_const
- ldr root_ptr, roots_addr
-
- in_low .req r0
- in_high .req r1
-
- add in_high, in_low, #(4*128)
-
- root0 .req r2
- root0_twisted .req r3
- root1 .req r4
- root1_twisted .req r5
- root2 .req r6
- root2_twisted .req r7
-
- data0 .req q0
- data1 .req q1
- data2 .req q2
- data3 .req q3
-
- tmp .req q4
-
- /* Layers 1-2 */
-
- ldrd root0, root0_twisted, [root_ptr], #+8
- ldrd root1, root1_twisted, [root_ptr], #+8
- ldrd root2, root2_twisted, [root_ptr], #+8
-
- mov lr, #16
- vldrw.u32 q6, [in_high, #256] // .*....
- vmul.u32 q2, q6, root0_twisted // ...*..
- vldrw.u32 q0, [in_low, #256] // *.....
- vqrdmulh.s32 q6, q6, root0 // ..*...
- nop // .....*
- vqrdmlah.s32 q6, q2, modulus // ....*.
-
- // original source code
- // vldrw.u32 q0, [in_low, #256] // ..*...
- // vldrw.u32 q3, [in_high, #256] // *.....
- // vqrdmulh.s32 q6, q3, root0 // ...*..
- // vmul.u32 q4, q3, root0_twisted // .*....
- // vqrdmlah.s32 q6, q4, modulus // .....*
- // nop // ....*.
-
- sub lr, lr, #1
- wls lr, lr, layer12_loop_end
-layer12_loop:
- vsub.u32 q5, q0, q6 // ............*...............
- vqrdmulh.s32 q7, q5, root2 // ...................*........
- vldrw.u32 q4, [in_high] // ..*.........................
- vqrdmulh.s32 q2, q4, root0 // ....*.......................
- vadd.u32 q3, q0, q6 // .............*..............
- vmul.u32 q0, q4, root0_twisted // .....*......................
- vldrw.u32 q6, [in_low] // *...........................
- vqrdmlah.s32 q2, q0, modulus // ......*.....................
- vldrw.u32 q0, [in_low, #272] // .e..........................
- vqrdmulh.s32 q4, q3, root1 // ..............*.............
- vsub.u32 q1, q6, q2 // .......*....................
- vmul.u32 q3, q3, root1_twisted // ...............*............
- vadd.u32 q2, q6, q2 // ........*...................
- vqrdmlah.s32 q4, q3, modulus // ................*...........
- vldrw.u32 q3, [in_high, #272] // ...e........................
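The `mulmod` macro deleted above performs Montgomery multiplication via rounding: `vqrdmulh.s32` yields the rounded high half of the doubled product, `vmul.u32` forms a low product against a precomputed twisted constant, and `vqrdmlah.s32` folds in a rounded multiple of the modulus so that the two rounding errors cancel. A minimal Python model of the trick (illustration only, not part of this patch; it assumes the twisted constant is `const * (-q^-1) mod 2^32`, the sign convention under which the cancellation works out in this model, and it ignores the saturation of the real instructions):

```python
# Model of the `mulmod` macro from the deleted assembly (illustration only).
# Assumption: the twisted constant is b * (-q^-1) mod 2^32; saturation is ignored.

Q = 33556993  # modulus of ntt_n256_u32_33556993_28678040_complete_manual

def s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def vqrdmulh(a, b):
    """Rounding doubling multiply, high half: round(2*a*b / 2^32)."""
    return (2 * a * b + (1 << 31)) >> 32

def twist(b):
    """Twisted constant paired with b in the root table (sign as assumed above)."""
    return s32(b * -pow(Q, -1, 1 << 32))

def mulmod(src, const, const_twisted):
    dst = vqrdmulh(src, const)        # vqrdmulh.s32 dst, src, const
    lo = s32(src * const_twisted)     # vmul.u32 src, src, const_twisted
    return dst + vqrdmulh(lo, Q)      # vqrdmlah.s32 dst, src, modulus

if __name__ == "__main__":
    import random
    random.seed(0)
    for _ in range(1000):
        a = random.randrange(-Q + 1, Q)
        b = random.randrange(1, Q)
        if (a * b) % (1 << 31) == 1 << 30:
            continue  # rounding tie; the cancellation argument excludes this case
        # the result is congruent to a * b * 2^-31 modulo Q
        assert (mulmod(a, b, twist(b)) - a * b * pow(2, -31, Q)) % Q == 0
```

The two roundings cancel because the low product makes `a*b + lo*Q` a multiple of 2^32, so the sum of the two rounded terms is exactly that multiple divided by 2^31, which is congruent to `a*b*2^-31` mod q.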
- vmul.u32 q6, q5, root2_twisted // ....................*....... - vadd.u32 q5, q2, q4 // ..................*......... - vqrdmlah.s32 q7, q6, modulus // .....................*...... - vstrw.u32 q5, [in_low] , #16 // ........................*... - vadd.u32 q5, q1, q7 // .......................*.... - vstrw.u32 q5, [in_high] , #16 // ..........................*. - vqrdmulh.s32 q6, q3, root0 // .........e.................. - vsub.u32 q2, q2, q4 // .................*.......... - vmul.u32 q4, q3, root0_twisted // ..........e................. - vstrw.u32 q2, [in_low, #240] // .........................*.. - vsub.u32 q3, q1, q7 // ......................*..... - vqrdmlah.s32 q6, q4, modulus // ...........e................ - vstrw.u32 q3, [in_high, #240] // ...........................* - - // original source code - // vldrw.u32 data0, [in_low] // ..........................*..................... - // vldrw.u32 data1, [in_low, #256] // e............................................... - // vldrw.u32 data2, [in_high] // ......................*......................... - // vldrw.u32 data3, [in_high, #256] // ......e......................................... - // vqrdmulh.s32 tmp, data2, root0 // .......................*........................ - // vmul.u32 data2, data2, root0_twisted // .........................*...................... - // vqrdmlah.s32 tmp, data2, modulus // ...........................*.................... - // vsub.u32 data2, data0, tmp // ..............................*................. - // vadd.u32 data0, data0, tmp // ................................*............... - // vqrdmulh.s32 tmp, data3, root0 // .............e.................................. - // vmul.u32 data3, data3, root0_twisted // ...............e................................ - // vqrdmlah.s32 tmp, data3, modulus // ..................e............................. - // vsub.u32 data3, data1, tmp // ....................*........................... 
- // vadd.u32 data1, data1, tmp // ........................*....................... - // vqrdmulh.s32 tmp, data1, root1 // .............................*.................. - // vmul.u32 data1, data1, root1_twisted // ...............................*................ - // vqrdmlah.s32 tmp, data1, modulus // .................................*.............. - // vsub.u32 data1, data0, tmp // ..........................................*..... - // vadd.u32 data0, data0, tmp // ....................................*........... - // vqrdmulh.s32 tmp, data3, root2 // .....................*.......................... - // vmul.u32 data3, data3, root2_twisted // ...................................*............ - // vqrdmlah.s32 tmp, data3, modulus // .....................................*.......... - // vsub.u32 data3, data2, tmp // .............................................*.. - // vadd.u32 data2, data2, tmp // .......................................*........ - // vstrw.u32 data0, [in_low] , #16 // ......................................*......... - // vstrw.u32 data1, [in_low, #240] // ............................................*... - // vstrw.u32 data2, [in_high] , #16 // ........................................*....... - // vstrw.u32 data3, [in_high, #240] // ...............................................* - - le lr, layer12_loop -layer12_loop_end: - vldrw.u32 q5, [in_high] // ..*..................... - vqrdmulh.s32 q2, q5, root0 // ...*.................... - vadd.u32 q7, q0, q6 // ....*................... - vqrdmulh.s32 q4, q7, root1 // ........*............... - vsub.u32 q3, q0, q6 // *....................... - vmul.u32 q0, q5, root0_twisted // .....*.................. - vldrw.u32 q6, [in_low] // ......*................. - vqrdmlah.s32 q2, q0, modulus // .......*................ - nop // .......................* - vmul.u32 q5, q7, root1_twisted // ..........*............. - vsub.u32 q1, q6, q2 // .........*.............. 
- vqrdmlah.s32 q4, q5, modulus // ............*........... - vadd.u32 q7, q6, q2 // ...........*............ - vqrdmulh.s32 q2, q3, root2 // .*...................... - vsub.u32 q5, q7, q4 // ...................*.... - vmul.u32 q6, q3, root2_twisted // .............*.......... - vstrw.u32 q5, [in_low, #256] // ....................*... - vadd.u32 q4, q7, q4 // ..............*......... - vqrdmlah.s32 q2, q6, modulus // ...............*........ - vstrw.u32 q4, [in_low] , #16 // ................*....... - vsub.u32 q4, q1, q2 // .....................*.. - vstrw.u32 q4, [in_high, #256] // ......................*. - vadd.u32 q2, q1, q2 // .................*...... - vstrw.u32 q2, [in_high] , #16 // ..................*..... - - // original source code - // vsub.u32 q5, q0, q6 // ....*................... - // vqrdmulh.s32 q7, q5, root2 // .............*.......... - // vldrw.u32 q4, [in_high] // *....................... - // vqrdmulh.s32 q2, q4, root0 // .*...................... - // vadd.u32 q3, q0, q6 // ..*..................... - // vmul.u32 q0, q4, root0_twisted // .....*.................. - // vldrw.u32 q6, [in_low] // ......*................. - // vqrdmlah.s32 q2, q0, modulus // .......*................ - // vqrdmulh.s32 q4, q3, root1 // ...*.................... - // vsub.u32 q1, q6, q2 // ..........*............. - // vmul.u32 q3, q3, root1_twisted // .........*.............. - // vadd.u32 q2, q6, q2 // ............*........... - // vqrdmlah.s32 q4, q3, modulus // ...........*............ - // vmul.u32 q6, q5, root2_twisted // ...............*........ - // vadd.u32 q5, q2, q4 // .................*...... - // vqrdmlah.s32 q7, q6, modulus // ..................*..... - // vstrw.u32 q5, [in_low] , #16 // ...................*.... - // vadd.u32 q5, q1, q7 // ......................*. - // vstrw.u32 q5, [in_high] , #16 // .......................* - // vsub.u32 q2, q2, q4 // ..............*......... - // vstrw.u32 q2, [in_low, #240] // ................*....... 
- // vsub.u32 q3, q1, q7 // ....................*...
- // vstrw.u32 q3, [in_high, #240] // .....................*..
- // nop // ........*...............
-
-
- .unreq in_high
- .unreq in_low
-
- in .req r0
- sub in, in, #(64*4)
-
- /* Layers 3,4 */
-
- // 4 butterfly blocks per root config, 4 root configs
- // loop over root configs
-
- count .req r1
- mov count, #4
-
-out_start:
- ldrd root0, root0_twisted, [root_ptr], #+8
- ldrd root1, root1_twisted, [root_ptr], #+8
- ldrd root2, root2_twisted, [root_ptr], #+8
-
- mov lr, #4
- vldrw.u32 q6, [in, #192] // .*....
- vmul.u32 q2, q6, root0_twisted // ...*..
- vldrw.u32 q0, [in, #64] // *.....
- vqrdmulh.s32 q6, q6, root0 // ..*...
- nop // .....*
- vqrdmlah.s32 q6, q2, modulus // ....*.
-
- // original source code
- // vldrw.u32 q0, [in, #64] // ..*...
- // vldrw.u32 q3, [in, #192] // *.....
- // vqrdmulh.s32 q6, q3, root0 // ...*..
- // vmul.u32 q4, q3, root0_twisted // .*....
- // vqrdmlah.s32 q6, q4, modulus // .....*
- // nop // ....*.
-
- sub lr, lr, #1
- wls lr, lr, layer34_loop_end
-layer34_loop:
- vsub.u32 q5, q0, q6 // ............*...............
- vqrdmulh.s32 q7, q5, root2 // ...................*........
- vldrw.u32 q4, [in, #128] // ..*.........................
- vqrdmulh.s32 q2, q4, root0 // ....*.......................
- vadd.u32 q3, q0, q6 // .............*..............
- vmul.u32 q0, q4, root0_twisted // .....*......................
- vldrw.u32 q6, [in] // *...........................
- vqrdmlah.s32 q2, q0, modulus // ......*.....................
- vldrw.u32 q0, [in, #80] // .e..........................
- vqrdmulh.s32 q4, q3, root1 // ..............*.............
- vsub.u32 q1, q6, q2 // .......*....................
- vmul.u32 q3, q3, root1_twisted // ...............*............
- vadd.u32 q2, q6, q2 // ........*...................
- vqrdmlah.s32 q4, q3, modulus // ................*...........
- vldrw.u32 q3, [in, #208] // ...e........................
- vmul.u32 q6, q5, root2_twisted // ....................*....... - vadd.u32 q5, q2, q4 // ..................*......... - vqrdmlah.s32 q7, q6, modulus // .....................*...... - vstrw.u32 q5, [in] , #16 // ........................*... - vadd.u32 q5, q1, q7 // .......................*.... - vstrw.u32 q5, [in, #112] // ..........................*. - vqrdmulh.s32 q6, q3, root0 // .........e.................. - vsub.u32 q2, q2, q4 // .................*.......... - vmul.u32 q4, q3, root0_twisted // ..........e................. - vstrw.u32 q2, [in, #48] // .........................*.. - vsub.u32 q3, q1, q7 // ......................*..... - vqrdmlah.s32 q6, q4, modulus // ...........e................ - vstrw.u32 q3, [in, #176] // ...........................* - - // original source code - // vldrw.u32 data0, [in] // ..........................*..................... - // vldrw.u32 data1, [in, #64] // e............................................... - // vldrw.u32 data2, [in, #128] // ......................*......................... - // vldrw.u32 data3, [in, #192] // ......e......................................... - // vqrdmulh.s32 tmp, data2, root0 // .......................*........................ - // vmul.u32 data2, data2, root0_twisted // .........................*...................... - // vqrdmlah.s32 tmp, data2, modulus // ...........................*.................... - // vsub.u32 data2, data0, tmp // ..............................*................. - // vadd.u32 data0, data0, tmp // ................................*............... - // vqrdmulh.s32 tmp, data3, root0 // .............e.................................. - // vmul.u32 data3, data3, root0_twisted // ...............e................................ - // vqrdmlah.s32 tmp, data3, modulus // ..................e............................. - // vsub.u32 data3, data1, tmp // ....................*........................... 
- // vadd.u32 data1, data1, tmp // ........................*....................... - // vqrdmulh.s32 tmp, data1, root1 // .............................*.................. - // vmul.u32 data1, data1, root1_twisted // ...............................*................ - // vqrdmlah.s32 tmp, data1, modulus // .................................*.............. - // vsub.u32 data1, data0, tmp // ..........................................*..... - // vadd.u32 data0, data0, tmp // ....................................*........... - // vqrdmulh.s32 tmp, data3, root2 // .....................*.......................... - // vmul.u32 data3, data3, root2_twisted // ...................................*............ - // vqrdmlah.s32 tmp, data3, modulus // .....................................*.......... - // vsub.u32 data3, data2, tmp // .............................................*.. - // vadd.u32 data2, data2, tmp // .......................................*........ - // vstrw.u32 data0, [in] , #16 // ......................................*......... - // vstrw.u32 data1, [in, #48] // ............................................*... - // vstrw.u32 data2, [in, #112] // ........................................*....... - // vstrw.u32 data3, [in, #176] // ...............................................* - - le lr, layer34_loop -layer34_loop_end: - vldrw.u32 q5, [in, #128] // ..*..................... - vqrdmulh.s32 q2, q5, root0 // ...*.................... - vadd.u32 q7, q0, q6 // ....*................... - vqrdmulh.s32 q4, q7, root1 // ........*............... - vsub.u32 q3, q0, q6 // *....................... - vmul.u32 q0, q5, root0_twisted // .....*.................. - vldrw.u32 q6, [in] // ......*................. - vqrdmlah.s32 q2, q0, modulus // .......*................ - nop // .......................* - vmul.u32 q5, q7, root1_twisted // ..........*............. - vsub.u32 q1, q6, q2 // .........*.............. - vqrdmlah.s32 q4, q5, modulus // ............*........... 
- vadd.u32 q7, q6, q2 // ...........*............ - vqrdmulh.s32 q2, q3, root2 // .*...................... - vsub.u32 q5, q7, q4 // ...................*.... - vmul.u32 q6, q3, root2_twisted // .............*.......... - vstrw.u32 q5, [in, #64] // ....................*... - vadd.u32 q4, q7, q4 // ..............*......... - vqrdmlah.s32 q2, q6, modulus // ...............*........ - vstrw.u32 q4, [in] , #16 // ................*....... - vsub.u32 q4, q1, q2 // .....................*.. - vstrw.u32 q4, [in, #176] // ......................*. - vadd.u32 q2, q1, q2 // .................*...... - vstrw.u32 q2, [in, #112] // ..................*..... - - // original source code - // vsub.u32 q5, q0, q6 // ....*................... - // vqrdmulh.s32 q7, q5, root2 // .............*.......... - // vldrw.u32 q4, [in, #128] // *....................... - // vqrdmulh.s32 q2, q4, root0 // .*...................... - // vadd.u32 q3, q0, q6 // ..*..................... - // vmul.u32 q0, q4, root0_twisted // .....*.................. - // vldrw.u32 q6, [in] // ......*................. - // vqrdmlah.s32 q2, q0, modulus // .......*................ - // vqrdmulh.s32 q4, q3, root1 // ...*.................... - // vsub.u32 q1, q6, q2 // ..........*............. - // vmul.u32 q3, q3, root1_twisted // .........*.............. - // vadd.u32 q2, q6, q2 // ............*........... - // vqrdmlah.s32 q4, q3, modulus // ...........*............ - // vmul.u32 q6, q5, root2_twisted // ...............*........ - // vadd.u32 q5, q2, q4 // .................*...... - // vqrdmlah.s32 q7, q6, modulus // ..................*..... - // vstrw.u32 q5, [in] , #16 // ...................*.... - // vadd.u32 q5, q1, q7 // ......................*. - // vstrw.u32 q5, [in, #112] // .......................* - // vsub.u32 q2, q2, q4 // ..............*......... - // vstrw.u32 q2, [in, #48] // ................*....... - // vsub.u32 q3, q1, q7 // ....................*... - // vstrw.u32 q3, [in, #176] // .....................*.. 
- // nop // ........*...............
-
-
-
- add in, in, #(4*64 - 4*16)
- subs count, count, #1
- bne out_start
-
- sub in, in, #(4*256)
-
- /* Layers 5,6 */
-
- // 1 butterfly block per root config, 16 root configs
- // loop over root configs
-
- mov lr, #16
- ldrd r8, r5, [root_ptr] , #24 // ...*...
- vldrw.u32 q5, [in, #48] // ..*....
- vqrdmulh.s32 q2, q5, r8 // ....*..
- ldrd r4, r9, [root_ptr, #-8] // .*.....
- vmul.u32 q5, q5, r5 // .....*.
- vldrw.u32 q1, [in, #16] // *......
- vqrdmlah.s32 q2, q5, modulus // ......*
-
- // original source code
- // vldrw.u32 q1, [in, #16] // .....*.
- // ldrd r4, r9, [root_ptr, #16] // ...*...
- // vldrw.u32 q3, [in, #48] // .*.....
- // ldrd r8, r5, [root_ptr] , #24 // *......
- // vqrdmulh.s32 q2, q3, r8 // ..*....
- // vmul.u32 q0, q3, r5 // ....*..
- // vqrdmlah.s32 q2, q0, modulus // ......*
-
- sub lr, lr, #1
- wls lr, lr, layer56_loop_end
-layer56_loop:
- vsub.u32 q5, q1, q2 // ...............*...............
- vqrdmulh.s32 q0, q5, r4 // ......................*........
- vldrw.u32 q4, [in, #32] // .....*.........................
- vqrdmulh.s32 q6, q4, r8 // .......*.......................
- vldrw.u32 q3, [in] // ...*...........................
- vmul.u32 q4, q4, r5 // ........*......................
- vadd.u32 q7, q1, q2 // ................*..............
- vqrdmlah.s32 q6, q4, modulus // .........*.....................
- ldrd r8, r4, [root_ptr, #-16] // .*.............................
- vqrdmulh.s32 q2, q7, r8 // .................*.............
- vldrw.u32 q1, [in, #80] // ....e..........................
- vmul.u32 q7, q7, r4 // ..................*............
- vadd.u32 q4, q3, q6 // ...........*...................
- vqrdmlah.s32 q2, q7, modulus // ...................*...........
- vsub.u32 q6, q3, q6 // ..........*....................
- vmul.u32 q7, q5, r9 // .......................*.......
- vsub.u32 q5, q4, q2 // ....................*..........
- ldrd r4, r9, [root_ptr, #16] // ..e............................
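As the "original source code" comments in the deleted loops spell out, each iteration processes one merged two-layer block on four vectors: root0 drives the first layer on the pairs (data0, data2) and (data1, data3), then root1 and root2 drive the second layer on (data0, data1) and (data2, data3). A plain modular-arithmetic sketch of one such block (my reconstruction for illustration; exact mod-q products stand in for the `mulmod` Montgomery multiplications):

```python
# Reference model of one merged two-layer CT butterfly block, as in
# layer12_loop / layer34_loop / layer56_loop (illustration, not from the patch).

Q = 33556993  # modulus of ntt_n256_u32_33556993_28678040_complete_manual

def ct_butterfly(a, b, root):
    """Cooley-Tukey butterfly: (a, b) -> (a + root*b, a - root*b) mod Q."""
    t = root * b % Q
    return (a + t) % Q, (a - t) % Q

def merged_block(d0, d1, d2, d3, root0, root1, root2):
    """Two NTT layers on four coefficients, fused into one pass."""
    d0, d2 = ct_butterfly(d0, d2, root0)  # layer k: both pairs share root0
    d1, d3 = ct_butterfly(d1, d3, root0)
    d0, d1 = ct_butterfly(d0, d1, root1)  # layer k+1: one root per half
    d2, d3 = ct_butterfly(d2, d3, root2)
    return d0, d1, d2, d3

# A unit impulse in d0 spreads to all four outputs with coefficient 1,
# and an impulse in d1 picks up the second-layer roots:
assert merged_block(1, 0, 0, 0, 5, 7, 11) == (1, 1, 1, 1)
assert merged_block(0, 1, 0, 0, 5, 7, 11) == (7, Q - 7, 11, Q - 11)
```

Fusing the two layers this way lets the assembly keep all four vectors in registers across both butterfly levels, halving the number of load/store passes over the 256-coefficient array.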
- vadd.u32 q4, q4, q2 // .....................*......... - vqrdmlah.s32 q0, q7, modulus // ........................*...... - vldrw.u32 q3, [in, #112] // ......e........................ - vsub.u32 q7, q6, q0 // .........................*..... - ldrd r8, r5, [root_ptr] , #24 // e.............................. - vadd.u32 q6, q6, q0 // ..........................*.... - vst40.u32 {q4,q5,q6,q7}, [in] // ...........................*... - vqrdmulh.s32 q2, q3, r8 // ............e.................. - vst41.u32 {q4,q5,q6,q7}, [in] // ............................*.. - vmul.u32 q0, q3, r5 // .............e................. - vst42.u32 {q4,q5,q6,q7}, [in] // .............................*. - vqrdmlah.s32 q2, q0, modulus // ..............e................ - vst43.u32 {q4,q5,q6,q7}, [in]! // ..............................* - - // original source code - // ldrd root0, root0_twisted, [root_ptr] , #24 // ............e....................................... - // ldrd root1, root1_twisted, [root_ptr, #-16] // .............................*...................... - // ldrd root2, root2_twisted, [root_ptr, #-8] // .......e............................................ - // vldrw.u32 data0, [in] // .........................*.......................... - // vldrw.u32 data1, [in, #16] // e................................................... - // vldrw.u32 data2, [in, #32] // .......................*............................ - // vldrw.u32 data3, [in, #48] // ..........e......................................... - // vqrdmulh.s32 tmp, data2, root0 // ........................*........................... - // vmul.u32 data2, data2, root0_twisted // ..........................*......................... - // vqrdmlah.s32 tmp, data2, modulus // ............................*....................... - // vsub.u32 data2, data0, tmp // ...................................*................ - // vadd.u32 data0, data0, tmp // .................................*.................. 
- // vqrdmulh.s32 tmp, data3, root0 // ...............e.................................... - // vmul.u32 data3, data3, root0_twisted // .................e.................................. - // vqrdmlah.s32 tmp, data3, modulus // ...................e................................ - // vsub.u32 data3, data1, tmp // .....................*.............................. - // vadd.u32 data1, data1, tmp // ...........................*........................ - // vqrdmulh.s32 tmp, data1, root1 // ..............................*..................... - // vmul.u32 data1, data1, root1_twisted // ................................*................... - // vqrdmlah.s32 tmp, data1, modulus // ..................................*................. - // vsub.u32 data1, data0, tmp // .....................................*.............. - // vadd.u32 data0, data0, tmp // .......................................*............ - // vqrdmulh.s32 tmp, data3, root2 // ......................*............................. - // vmul.u32 data3, data3, root2_twisted // ....................................*............... - // vqrdmlah.s32 tmp, data3, modulus // ........................................*........... - // vsub.u32 data3, data2, tmp // ..........................................*......... - // vadd.u32 data2, data2, tmp // ............................................*....... - // vst40.u32 {data0,data1,data2,data3}, [in] // .............................................*...... - // vst41.u32 {data0,data1,data2,data3}, [in] // ...............................................*.... - // vst42.u32 {data0,data1,data2,data3}, [in] // .................................................*.. - // vst43.u32 {data0,data1,data2,data3}, [in]! // ...................................................* - - le lr, layer56_loop -layer56_loop_end: - vldrw.u32 q0, [in, #32] // ..*.......................... - vmul.u32 q3, q0, r5 // .....*....................... 
- vsub.u32 q6, q1, q2 // *............................ - vqrdmulh.s32 q7, q0, r8 // ...*......................... - vldrw.u32 q4, [in] // ....*........................ - vqrdmlah.s32 q7, q3, modulus // .......*..................... - vadd.u32 q3, q1, q2 // ......*...................... - vqrdmulh.s32 q5, q6, r4 // .*........................... - vadd.u32 q0, q4, q7 // ...........*................. - vmul.u32 q6, q6, r9 // ..............*.............. - vsub.u32 q4, q4, q7 // .............*............... - vqrdmlah.s32 q5, q6, modulus // .................*........... - ldrd r8, r4, [root_ptr, #-16] // ........*.................... - vmul.u32 q7, q3, r4 // ..........*.................. - vsub.u32 q6, q4, q5 // ..................*.......... - vqrdmulh.s32 q2, q3, r8 // .........*................... - vadd.u32 q5, q4, q5 // ...................*......... - vqrdmlah.s32 q2, q7, modulus // ............*................ - nop // ..........................*.. - vsub.u32 q4, q0, q2 // ...............*............. - nop // .........................*... - vadd.u32 q3, q0, q2 // ................*............ - vst40.u32 {q3,q4,q5,q6}, [in] // ....................*........ - nop // ........................*.... - vst41.u32 {q3,q4,q5,q6}, [in] // .....................*....... - nop // ............................* - vst42.u32 {q3,q4,q5,q6}, [in] // ......................*...... - nop // ...........................*. - vst43.u32 {q3,q4,q5,q6}, [in]! // .......................*..... - - // original source code - // vsub.u32 q5, q1, q2 // ..*.......................... - // vqrdmulh.s32 q0, q5, r4 // .......*..................... - // vldrw.u32 q4, [in, #32] // *............................ - // vqrdmulh.s32 q6, q4, r8 // ...*......................... - // vldrw.u32 q3, [in] // ....*........................ - // vmul.u32 q4, q4, r5 // .*........................... - // vadd.u32 q7, q1, q2 // ......*...................... 
- // vqrdmlah.s32 q6, q4, modulus // .....*.......................
- // ldrd r8, r4, [root_ptr, #-16] // ............*................
- // vqrdmulh.s32 q2, q7, r8 // ...............*.............
- // vmul.u32 q7, q7, r4 // .............*...............
- // vadd.u32 q4, q3, q6 // ........*....................
- // vqrdmlah.s32 q2, q7, modulus // .................*...........
- // vsub.u32 q6, q3, q6 // ..........*..................
- // vmul.u32 q7, q5, r9 // .........*...................
- // vsub.u32 q5, q4, q2 // ...................*.........
- // vadd.u32 q4, q4, q2 // .....................*.......
- // vqrdmlah.s32 q0, q7, modulus // ...........*.................
- // vsub.u32 q7, q6, q0 // ..............*..............
- // vadd.u32 q6, q6, q0 // ................*............
- // vst40.u32 {q4,q5,q6,q7}, [in] // ......................*......
- // vst41.u32 {q4,q5,q6,q7}, [in] // ........................*....
- // vst42.u32 {q4,q5,q6,q7}, [in] // ..........................*..
- // vst43.u32 {q4,q5,q6,q7}, [in]! // ............................*
- // nop // .......................*.....
- // nop // ....................*........
- // nop // ..................*..........
- // nop // ...........................*.
- // nop // .........................*...
-
-
-
- sub in, in, #(4*256)
-
- /* Layers 7,8 */
-
- .unreq root0
- .unreq root0_twisted
- .unreq root1
- .unreq root1_twisted
- .unreq root2
- .unreq root2_twisted
-
- root0 .req q5
- root0_twisted .req q6
- root1 .req q5
- root1_twisted .req q6
- root2 .req q5
- root2_twisted .req q6
-
- mov lr, #16
- vldrw.u32 q3, [root_ptr, #16] // *........
- nop // ........*
- vldrw.u32 q6, [in, #48] // .*.......
- vmul.u32 q4, q6, q3 // ..*......
- vldrw.u32 q1, [root_ptr] , #96 // ...*.....
- vqrdmulh.s32 q6, q6, q1 // ....*....
- vldrw.u32 q7, [root_ptr, #-16] // .....*...
- vqrdmlah.s32 q6, q4, modulus // ......*..
- vldrw.u32 q4, [in, #16] // .......*.
- - // original source code - // vldrw.u32 q3, [root_ptr, #16] // *........ - // vldrw.u32 q6, [in, #48] // ..*...... - // vmul.u32 q4, q6, q3 // ...*..... - // vldrw.u32 q1, [root_ptr] , #96 // ....*.... - // vqrdmulh.s32 q6, q6, q1 // .....*... - // vldrw.u32 q7, [root_ptr, #-16] // ......*.. - // vqrdmlah.s32 q6, q4, modulus // .......*. - // vldrw.u32 q4, [in, #16] // ........* - // nop // .*....... - - sub lr, lr, #1 - wls lr, lr, layer78_loop_end -layer78_loop: - vsub.u32 q5, q4, q6 // ..............*................... - vmul.u32 q0, q5, q7 // ..........................*....... - vldrw.u32 q7, [in, #32] // ..*............................... - vmul.u32 q3, q7, q3 // .......*.......................... - vadd.u32 q4, q4, q6 // ...............*.................. - vqrdmulh.s32 q2, q7, q1 // ......*........................... - vldrw.u32 q6, [in] , #64 // *................................. - vqrdmlah.s32 q2, q3, modulus // ........*......................... - vldrw.u32 q3, [root_ptr, #-32] // .......................*.......... - vqrdmulh.s32 q7, q5, q3 // .........................*........ - vsub.u32 q5, q6, q2 // .........*........................ - vqrdmlah.s32 q7, q0, modulus // ...........................*...... - vldrw.u32 q1, [root_ptr, #-64] // ................*................. - vadd.u32 q3, q5, q7 // .............................*.... - vstrw.u32 q3, [in, #-32] // ................................*. - vsub.u32 q0, q5, q7 // ............................*..... - vstrw.u32 q0, [in, #-16] // .................................* - vadd.u32 q0, q6, q2 // ..........*....................... - vqrdmulh.s32 q2, q4, q1 // ..................*............... - vldrw.u32 q3, [root_ptr, #-48] // .................*................ - vmul.u32 q5, q4, q3 // ...................*.............. - vldrw.u32 q3, [root_ptr, #16] // .....e............................ - vqrdmlah.s32 q2, q5, modulus // ....................*............. 
- vldrw.u32 q6, [in, #48] // ...e.............................. - vmul.u32 q4, q6, q3 // ............e..................... - vldrw.u32 q1, [root_ptr] , #96 // ....e............................. - vqrdmulh.s32 q6, q6, q1 // ...........e...................... - vldrw.u32 q7, [root_ptr, #-16] // ........................e......... - vqrdmlah.s32 q6, q4, modulus // .............e.................... - vldrw.u32 q4, [in, #16] // .e................................ - vadd.u32 q5, q0, q2 // ......................*........... - vstrw.u32 q5, [in, #-64] // ..............................*... - vsub.u32 q2, q0, q2 // .....................*............ - vstrw.u32 q2, [in, #-48] // ...............................*.. - - // original source code - // vldrw.u32 data0, [in] , #64 // ...................*........................... - // vldrw.u32 data1, [in, #-48] // ........e...................................... - // vldrw.u32 data2, [in, #-32] // ...............*............................... - // vldrw.u32 data3, [in, #-16] // ..e............................................ - // vldrw.u32 root0, [root_ptr] , #96 // ....e.......................................... - // vldrw.u32 root0_twisted, [root_ptr, #-80] // e.............................................. - // vqrdmulh.s32 tmp, data2, root0 // ..................*............................ - // vmul.u32 data2, data2, root0_twisted // ................*.............................. - // vqrdmlah.s32 tmp, data2, modulus // ....................*.......................... - // vsub.u32 data2, data0, tmp // .......................*....................... - // vadd.u32 data0, data0, tmp // ..............................*................ - // vqrdmulh.s32 tmp, data3, root0 // .....e......................................... - // vmul.u32 data3, data3, root0_twisted // ...e........................................... - // vqrdmlah.s32 tmp, data3, modulus // .......e....................................... 
- // vsub.u32 data3, data1, tmp // .............*................................. - // vadd.u32 data1, data1, tmp // .................*............................. - // vldrw.u32 root1, [root_ptr, #-64] // .........................*..................... - // vldrw.u32 root1_twisted, [root_ptr, #-48] // ................................*.............. - // vqrdmulh.s32 tmp, data1, root1 // ...............................*............... - // vmul.u32 data1, data1, root1_twisted // .................................*............. - // vqrdmlah.s32 tmp, data1, modulus // ...................................*........... - // vsub.u32 data1, data0, tmp // .............................................*. - // vadd.u32 data0, data0, tmp // ...........................................*... - // vldrw.u32 root2, [root_ptr, #-32] // .....................*......................... - // vldrw.u32 root2_twisted, [root_ptr, #-16] // ......e........................................ - // vqrdmulh.s32 tmp, data3, root2 // ......................*........................ - // vmul.u32 data3, data3, root2_twisted // ..............*................................ - // vqrdmlah.s32 tmp, data3, modulus // ........................*...................... - // vsub.u32 data3, data2, tmp // ............................*.................. - // vadd.u32 data2, data2, tmp // ..........................*.................... - // vstrw.u32 data0, [in, #-64] // ............................................*.. - // vstrw.u32 data1, [in, #-48] // ..............................................* - // vstrw.u32 data2, [in, #-32] // ...........................*................... - // vstrw.u32 data3, [in, #-16] // .............................*................. - - le lr, layer78_loop -layer78_loop_end: - vsub.u32 q2, q4, q6 // *......................... - vmul.u32 q5, q2, q7 // .*........................ - vldrw.u32 q7, [in, #32] // ..*....................... 
- vmul.u32 q3, q7, q3 // ...*...................... - vadd.u32 q4, q4, q6 // ....*..................... - vqrdmulh.s32 q0, q7, q1 // .....*.................... - vldrw.u32 q6, [in] , #64 // ......*................... - vqrdmlah.s32 q0, q3, modulus // .......*.................. - vldrw.u32 q3, [root_ptr, #-32] // ........*................. - vqrdmulh.s32 q7, q2, q3 // .........*................ - vldrw.u32 q3, [root_ptr, #-48] // ...................*...... - vqrdmlah.s32 q7, q5, modulus // ...........*.............. - vsub.u32 q2, q6, q0 // ..........*............... - vmul.u32 q5, q4, q3 // ....................*..... - vadd.u32 q3, q2, q7 // .............*............ - vldrw.u32 q1, [root_ptr, #-64] // ............*............. - vsub.u32 q2, q2, q7 // ...............*.......... - vstrw.u32 q2, [in, #-16] // ................*......... - vqrdmulh.s32 q2, q4, q1 // ..................*....... - vadd.u32 q0, q6, q0 // .................*........ - vqrdmlah.s32 q2, q5, modulus // .....................*.... - vstrw.u32 q3, [in, #-32] // ..............*........... - vsub.u32 q5, q0, q2 // ........................*. - vstrw.u32 q5, [in, #-48] // .........................* - vadd.u32 q2, q0, q2 // ......................*... - vstrw.u32 q2, [in, #-64] // .......................*.. - - // original source code - // vsub.u32 q5, q4, q6 // *......................... - // vmul.u32 q0, q5, q7 // .*........................ - // vldrw.u32 q7, [in, #32] // ..*....................... - // vmul.u32 q3, q7, q3 // ...*...................... - // vadd.u32 q4, q4, q6 // ....*..................... - // vqrdmulh.s32 q2, q7, q1 // .....*.................... - // vldrw.u32 q6, [in] , #64 // ......*................... - // vqrdmlah.s32 q2, q3, modulus // .......*.................. - // vldrw.u32 q3, [root_ptr, #-32] // ........*................. - // vqrdmulh.s32 q7, q5, q3 // .........*................ - // vsub.u32 q5, q6, q2 // ............*............. 
- // vqrdmlah.s32 q7, q0, modulus // ...........*.............. - // vldrw.u32 q1, [root_ptr, #-64] // ...............*.......... - // vadd.u32 q3, q5, q7 // ..............*........... - // vstrw.u32 q3, [in, #-32] // .....................*.... - // vsub.u32 q0, q5, q7 // ................*......... - // vstrw.u32 q0, [in, #-16] // .................*........ - // vadd.u32 q0, q6, q2 // ...................*...... - // vqrdmulh.s32 q2, q4, q1 // ..................*....... - // vldrw.u32 q3, [root_ptr, #-48] // ..........*............... - // vmul.u32 q5, q4, q3 // .............*............ - // vqrdmlah.s32 q2, q5, modulus // ....................*..... - // vadd.u32 q5, q0, q2 // ........................*. - // vstrw.u32 q5, [in, #-64] // .........................* - // vsub.u32 q2, q0, q2 // ......................*... - // vstrw.u32 q2, [in, #-48] // .......................*.. - - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_incomplete.s b/tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_incomplete.s deleted file mode 100644 index a90c26b..0000000 --- a/tests/ntt_n256/manual/ntt_n256_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,659 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the 
Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 
28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 
2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 
7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 
24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 
18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.text - -// Montgomery multiplication via rounding -.macro mulmod dst, src, const, const_twisted - vqrdmulh.s32 \dst, \src, \const - vmul.u32 \src, \src, \const_twisted - vqrdmlah.s32 \dst, \src, modulus -.endm - -.macro ct_butterfly a, b, root, root_twisted - mulmod tmp, \b, \root, \root_twisted - vsub.u32 \b, \a, tmp - vadd.u32 \a, \a, tmp -.endm - -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040_incomplete_manual, %function -.global ntt_n256_u32_33556993_28678040_incomplete_manual -ntt_n256_u32_33556993_28678040_incomplete_manual: - - push {r4-r11,lr} - // Save MVE vector registers - vpush {d8-d15} - - modulus .req r12 - root_ptr .req r11 - - .equ modulus_const, 33556993 - movw modulus, #:lower16:modulus_const - movt modulus, #:upper16:modulus_const - ldr root_ptr, roots_addr - - in_low .req r0 - in_high .req r1 - - add in_high, in_low, 
#(4*128) - - root0 .req r2 - root0_twisted .req r3 - root1 .req r4 - root1_twisted .req r5 - root2 .req r6 - root2_twisted .req r7 - - data0 .req q0 - data1 .req q1 - data2 .req q2 - data3 .req q3 - - tmp .req q4 - - /* Layers 1-2 */ - - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #16 - vldrw.u32 q6, [in_high, #256] // .*.... - vmul.u32 q2, q6, root0_twisted // ...*.. - vldrw.u32 q0, [in_low, #256] // *..... - vqrdmulh.s32 q6, q6, root0 // ..*... - nop // .....* - vqrdmlah.s32 q6, q2, modulus // ....*. - - // original source code - // vldrw.u32 q0, [in_low, #256] // ..*... - // vldrw.u32 q3, [in_high, #256] // *..... - // vqrdmulh.s32 q6, q3, root0 // ...*.. - // vmul.u32 q4, q3, root0_twisted // .*.... - // vqrdmlah.s32 q6, q4, modulus // .....* - // nop // ....*. - - sub lr, lr, #1 - wls lr, lr, layer12_loop_end -layer12_loop: - vsub.u32 q5, q0, q6 // ............*............... - vqrdmulh.s32 q7, q5, root2 // ...................*........ - vldrw.u32 q4, [in_high] // ..*......................... - vqrdmulh.s32 q2, q4, root0 // ....*....................... - vadd.u32 q3, q0, q6 // .............*.............. - vmul.u32 q0, q4, root0_twisted // .....*...................... - vldrw.u32 q6, [in_low] // *........................... - vqrdmlah.s32 q2, q0, modulus // ......*..................... - vldrw.u32 q0, [in_low, #272] // .e.......................... - vqrdmulh.s32 q4, q3, root1 // ..............*............. - vsub.u32 q1, q6, q2 // .......*.................... - vmul.u32 q3, q3, root1_twisted // ...............*............ - vadd.u32 q2, q6, q2 // ........*................... - vqrdmlah.s32 q4, q3, modulus // ................*........... - vldrw.u32 q3, [in_high, #272] // ...e........................ - vmul.u32 q6, q5, root2_twisted // ....................*....... - vadd.u32 q5, q2, q4 // ..................*......... 
- vqrdmlah.s32 q7, q6, modulus // .....................*...... - vstrw.u32 q5, [in_low] , #16 // ........................*... - vadd.u32 q5, q1, q7 // .......................*.... - vstrw.u32 q5, [in_high] , #16 // ..........................*. - vqrdmulh.s32 q6, q3, root0 // .........e.................. - vsub.u32 q2, q2, q4 // .................*.......... - vmul.u32 q4, q3, root0_twisted // ..........e................. - vstrw.u32 q2, [in_low, #240] // .........................*.. - vsub.u32 q3, q1, q7 // ......................*..... - vqrdmlah.s32 q6, q4, modulus // ...........e................ - vstrw.u32 q3, [in_high, #240] // ...........................* - - // original source code - // vldrw.u32 data0, [in_low] // ..........................*..................... - // vldrw.u32 data1, [in_low, #256] // e............................................... - // vldrw.u32 data2, [in_high] // ......................*......................... - // vldrw.u32 data3, [in_high, #256] // ......e......................................... - // vqrdmulh.s32 tmp, data2, root0 // .......................*........................ - // vmul.u32 data2, data2, root0_twisted // .........................*...................... - // vqrdmlah.s32 tmp, data2, modulus // ...........................*.................... - // vsub.u32 data2, data0, tmp // ..............................*................. - // vadd.u32 data0, data0, tmp // ................................*............... - // vqrdmulh.s32 tmp, data3, root0 // .............e.................................. - // vmul.u32 data3, data3, root0_twisted // ...............e................................ - // vqrdmlah.s32 tmp, data3, modulus // ..................e............................. - // vsub.u32 data3, data1, tmp // ....................*........................... - // vadd.u32 data1, data1, tmp // ........................*....................... 
- // vqrdmulh.s32 tmp, data1, root1 // .............................*.................. - // vmul.u32 data1, data1, root1_twisted // ...............................*................ - // vqrdmlah.s32 tmp, data1, modulus // .................................*.............. - // vsub.u32 data1, data0, tmp // ..........................................*..... - // vadd.u32 data0, data0, tmp // ....................................*........... - // vqrdmulh.s32 tmp, data3, root2 // .....................*.......................... - // vmul.u32 data3, data3, root2_twisted // ...................................*............ - // vqrdmlah.s32 tmp, data3, modulus // .....................................*.......... - // vsub.u32 data3, data2, tmp // .............................................*.. - // vadd.u32 data2, data2, tmp // .......................................*........ - // vstrw.u32 data0, [in_low] , #16 // ......................................*......... - // vstrw.u32 data1, [in_low, #240] // ............................................*... - // vstrw.u32 data2, [in_high] , #16 // ........................................*....... - // vstrw.u32 data3, [in_high, #240] // ...............................................* - - le lr, layer12_loop -layer12_loop_end: - vldrw.u32 q5, [in_high] // ..*..................... - vqrdmulh.s32 q2, q5, root0 // ...*.................... - vadd.u32 q7, q0, q6 // ....*................... - vqrdmulh.s32 q4, q7, root1 // ........*............... - vsub.u32 q3, q0, q6 // *....................... - vmul.u32 q0, q5, root0_twisted // .....*.................. - vldrw.u32 q6, [in_low] // ......*................. - vqrdmlah.s32 q2, q0, modulus // .......*................ - nop // .......................* - vmul.u32 q5, q7, root1_twisted // ..........*............. - vsub.u32 q1, q6, q2 // .........*.............. - vqrdmlah.s32 q4, q5, modulus // ............*........... - vadd.u32 q7, q6, q2 // ...........*............ 
- vqrdmulh.s32 q2, q3, root2 // .*...................... - vsub.u32 q5, q7, q4 // ...................*.... - vmul.u32 q6, q3, root2_twisted // .............*.......... - vstrw.u32 q5, [in_low, #256] // ....................*... - vadd.u32 q4, q7, q4 // ..............*......... - vqrdmlah.s32 q2, q6, modulus // ...............*........ - vstrw.u32 q4, [in_low] , #16 // ................*....... - vsub.u32 q4, q1, q2 // .....................*.. - vstrw.u32 q4, [in_high, #256] // ......................*. - vadd.u32 q2, q1, q2 // .................*...... - vstrw.u32 q2, [in_high] , #16 // ..................*..... - - // original source code - // vsub.u32 q5, q0, q6 // ....*................... - // vqrdmulh.s32 q7, q5, root2 // .............*.......... - // vldrw.u32 q4, [in_high] // *....................... - // vqrdmulh.s32 q2, q4, root0 // .*...................... - // vadd.u32 q3, q0, q6 // ..*..................... - // vmul.u32 q0, q4, root0_twisted // .....*.................. - // vldrw.u32 q6, [in_low] // ......*................. - // vqrdmlah.s32 q2, q0, modulus // .......*................ - // vqrdmulh.s32 q4, q3, root1 // ...*.................... - // vsub.u32 q1, q6, q2 // ..........*............. - // vmul.u32 q3, q3, root1_twisted // .........*.............. - // vadd.u32 q2, q6, q2 // ............*........... - // vqrdmlah.s32 q4, q3, modulus // ...........*............ - // vmul.u32 q6, q5, root2_twisted // ...............*........ - // vadd.u32 q5, q2, q4 // .................*...... - // vqrdmlah.s32 q7, q6, modulus // ..................*..... - // vstrw.u32 q5, [in_low] , #16 // ...................*.... - // vadd.u32 q5, q1, q7 // ......................*. - // vstrw.u32 q5, [in_high] , #16 // .......................* - // vsub.u32 q2, q2, q4 // ..............*......... - // vstrw.u32 q2, [in_low, #240] // ................*....... - // vsub.u32 q3, q1, q7 // ....................*... - // vstrw.u32 q3, [in_high, #240] // .....................*.. 
- // nop // ........*............... - - - .unreq in_high - .unreq in_low - - in .req r0 - sub in, in, #(64*4) - - /* Layers 3,4 */ - - // 4 butterfly blocks per root config, 4 root configs - // loop over root configs - - count .req r1 - mov count, #4 - -out_start: - ldrd root0, root0_twisted, [root_ptr], #+8 - ldrd root1, root1_twisted, [root_ptr], #+8 - ldrd root2, root2_twisted, [root_ptr], #+8 - - mov lr, #4 - vldrw.u32 q6, [in, #192] // .*.... - vmul.u32 q2, q6, root0_twisted // ...*.. - vldrw.u32 q0, [in, #64] // *..... - vqrdmulh.s32 q6, q6, root0 // ..*... - nop // .....* - vqrdmlah.s32 q6, q2, modulus // ....*. - - // original source code - // vldrw.u32 q0, [in, #64] // ..*... - // vldrw.u32 q3, [in, #192] // *..... - // vqrdmulh.s32 q6, q3, root0 // ...*.. - // vmul.u32 q4, q3, root0_twisted // .*.... - // vqrdmlah.s32 q6, q4, modulus // .....* - // nop // ....*. - - sub lr, lr, #1 - wls lr, lr, layer34_loop_end -layer34_loop: - vsub.u32 q5, q0, q6 // ............*............... - vqrdmulh.s32 q7, q5, root2 // ...................*........ - vldrw.u32 q4, [in, #128] // ..*......................... - vqrdmulh.s32 q2, q4, root0 // ....*....................... - vadd.u32 q3, q0, q6 // .............*.............. - vmul.u32 q0, q4, root0_twisted // .....*...................... - vldrw.u32 q6, [in] // *........................... - vqrdmlah.s32 q2, q0, modulus // ......*..................... - vldrw.u32 q0, [in, #80] // .e.......................... - vqrdmulh.s32 q4, q3, root1 // ..............*............. - vsub.u32 q1, q6, q2 // .......*.................... - vmul.u32 q3, q3, root1_twisted // ...............*............ - vadd.u32 q2, q6, q2 // ........*................... - vqrdmlah.s32 q4, q3, modulus // ................*........... - vldrw.u32 q3, [in, #208] // ...e........................ - vmul.u32 q6, q5, root2_twisted // ....................*....... - vadd.u32 q5, q2, q4 // ..................*......... 
- vqrdmlah.s32 q7, q6, modulus // .....................*...... - vstrw.u32 q5, [in] , #16 // ........................*... - vadd.u32 q5, q1, q7 // .......................*.... - vstrw.u32 q5, [in, #112] // ..........................*. - vqrdmulh.s32 q6, q3, root0 // .........e.................. - vsub.u32 q2, q2, q4 // .................*.......... - vmul.u32 q4, q3, root0_twisted // ..........e................. - vstrw.u32 q2, [in, #48] // .........................*.. - vsub.u32 q3, q1, q7 // ......................*..... - vqrdmlah.s32 q6, q4, modulus // ...........e................ - vstrw.u32 q3, [in, #176] // ...........................* - - // original source code - // vldrw.u32 data0, [in] // ..........................*..................... - // vldrw.u32 data1, [in, #64] // e............................................... - // vldrw.u32 data2, [in, #128] // ......................*......................... - // vldrw.u32 data3, [in, #192] // ......e......................................... - // vqrdmulh.s32 tmp, data2, root0 // .......................*........................ - // vmul.u32 data2, data2, root0_twisted // .........................*...................... - // vqrdmlah.s32 tmp, data2, modulus // ...........................*.................... - // vsub.u32 data2, data0, tmp // ..............................*................. - // vadd.u32 data0, data0, tmp // ................................*............... - // vqrdmulh.s32 tmp, data3, root0 // .............e.................................. - // vmul.u32 data3, data3, root0_twisted // ...............e................................ - // vqrdmlah.s32 tmp, data3, modulus // ..................e............................. - // vsub.u32 data3, data1, tmp // ....................*........................... - // vadd.u32 data1, data1, tmp // ........................*....................... - // vqrdmulh.s32 tmp, data1, root1 // .............................*.................. 
- // vmul.u32 data1, data1, root1_twisted // ...............................*................ - // vqrdmlah.s32 tmp, data1, modulus // .................................*.............. - // vsub.u32 data1, data0, tmp // ..........................................*..... - // vadd.u32 data0, data0, tmp // ....................................*........... - // vqrdmulh.s32 tmp, data3, root2 // .....................*.......................... - // vmul.u32 data3, data3, root2_twisted // ...................................*............ - // vqrdmlah.s32 tmp, data3, modulus // .....................................*.......... - // vsub.u32 data3, data2, tmp // .............................................*.. - // vadd.u32 data2, data2, tmp // .......................................*........ - // vstrw.u32 data0, [in] , #16 // ......................................*......... - // vstrw.u32 data1, [in, #48] // ............................................*... - // vstrw.u32 data2, [in, #112] // ........................................*....... - // vstrw.u32 data3, [in, #176] // ...............................................* - - le lr, layer34_loop -layer34_loop_end: - vldrw.u32 q5, [in, #128] // ..*..................... - vqrdmulh.s32 q2, q5, root0 // ...*.................... - vadd.u32 q4, q0, q6 // ....*................... - vqrdmulh.s32 q7, q4, root1 // ........*............... - vsub.u32 q3, q0, q6 // *....................... - vmul.u32 q0, q5, root0_twisted // .....*.................. - vldrw.u32 q6, [in] // ......*................. - vqrdmlah.s32 q2, q0, modulus // .......*................ - nop // .......................* - vmul.u32 q4, q4, root1_twisted // ..........*............. - vsub.u32 q1, q6, q2 // .........*.............. - vqrdmlah.s32 q7, q4, modulus // ............*........... - vadd.u32 q5, q6, q2 // ...........*............ - vqrdmulh.s32 q2, q3, root2 // .*...................... - vadd.u32 q4, q5, q7 // ..............*......... 
- vstrw.u32 q4, [in] , #16 // ................*....... - vmul.u32 q6, q3, root2_twisted // .............*.......... - vsub.u32 q4, q5, q7 // ...................*.... - vqrdmlah.s32 q2, q6, modulus // ...............*........ - vstrw.u32 q4, [in, #48] // ....................*... - vsub.u32 q4, q1, q2 // .....................*.. - vstrw.u32 q4, [in, #176] // ......................*. - vadd.u32 q4, q1, q2 // .................*...... - vstrw.u32 q4, [in, #112] // ..................*..... - - // original source code - // vsub.u32 q5, q0, q6 // ....*................... - // vqrdmulh.s32 q7, q5, root2 // .............*.......... - // vldrw.u32 q4, [in, #128] // *....................... - // vqrdmulh.s32 q2, q4, root0 // .*...................... - // vadd.u32 q3, q0, q6 // ..*..................... - // vmul.u32 q0, q4, root0_twisted // .....*.................. - // vldrw.u32 q6, [in] // ......*................. - // vqrdmlah.s32 q2, q0, modulus // .......*................ - // vqrdmulh.s32 q4, q3, root1 // ...*.................... - // vsub.u32 q1, q6, q2 // ..........*............. - // vmul.u32 q3, q3, root1_twisted // .........*.............. - // vadd.u32 q2, q6, q2 // ............*........... - // vqrdmlah.s32 q4, q3, modulus // ...........*............ - // vmul.u32 q6, q5, root2_twisted // ................*....... - // vadd.u32 q5, q2, q4 // ..............*......... - // vqrdmlah.s32 q7, q6, modulus // ..................*..... - // vstrw.u32 q5, [in] , #16 // ...............*........ - // vadd.u32 q5, q1, q7 // ......................*. - // vstrw.u32 q5, [in, #112] // .......................* - // vsub.u32 q2, q2, q4 // .................*...... - // vstrw.u32 q2, [in, #48] // ...................*.... - // vsub.u32 q3, q1, q7 // ....................*... - // vstrw.u32 q3, [in, #176] // .....................*.. - // nop // ........*............... 
- - - - add in, in, #(4*64 - 4*16) - subs count, count, #1 - bne out_start - - sub in, in, #(4*256) - - /* Layers 5,6 */ - - // 1 butterfly blocks per root config, 16 root configs - // loop over root configs - - mov lr, #16 - ldrd r8, r5, [root_ptr] , #24 // .*..... - vldrw.u32 q6, [in, #48] // *...... - vmul.u32 q2, q6, r5 // ...*... - vldrw.u32 q4, [in, #16] // ....*.. - vqrdmulh.s32 q6, q6, r8 // .....*. - vldrw.u32 q5, [in, #32] // ..*.... - vqrdmlah.s32 q6, q2, modulus // ......* - - // original source code - // vldrw.u32 q7, [in, #48] // .*..... - // ldrd r8, r5, [root_ptr] , #24 // *...... - // vldrw.u32 q5, [in, #32] // .....*. - // vmul.u32 q3, q7, r5 // ..*.... - // vldrw.u32 q4, [in, #16] // ...*... - // vqrdmulh.s32 q6, q7, r8 // ....*.. - // vqrdmlah.s32 q6, q3, modulus // ......* - - sub lr, lr, #1 - wls lr, lr, layer56_loop_end -layer56_loop: - vsub.u32 q3, q4, q6 // ...............*............... - vqrdmulh.s32 q2, q5, r8 // .......*....................... - vadd.u32 q6, q4, q6 // ................*.............. - vmul.u32 q5, q5, r5 // ........*...................... - ldrd r9, r4, [root_ptr, #-16] // .*............................. - vqrdmlah.s32 q2, q5, modulus // .........*..................... - vldrw.u32 q0, [in] // ...*........................... - vqrdmulh.s32 q4, q6, r9 // .................*............. - vsub.u32 q1, q0, q2 // ..........*.................... - vmul.u32 q5, q6, r4 // ..................*............ - ldrd r10, r6, [root_ptr, #-8] // ..*............................ - vqrdmlah.s32 q4, q5, modulus // ...................*........... - vadd.u32 q0, q0, q2 // ...........*................... - vmul.u32 q6, q3, r6 // .......................*....... - vldrw.u32 q7, [in, #112] // ......e........................ - vqrdmulh.s32 q2, q3, r10 // ......................*........ - ldrd r8, r5, [root_ptr] , #24 // e.............................. - vqrdmlah.s32 q2, q6, modulus // ........................*...... 
- vldrw.u32 q5, [in, #96] // .....e......................... - vsub.u32 q3, q1, q2 // .........................*..... - vstrw.u32 q3, [in, #48] // ..............................* - vsub.u32 q6, q0, q4 // ....................*.......... - vstrw.u32 q6, [in, #16] // ............................*.. - vadd.u32 q0, q0, q4 // .....................*......... - vmul.u32 q3, q7, r5 // .............e................. - vldrw.u32 q4, [in, #80] // ....e.......................... - vqrdmulh.s32 q6, q7, r8 // ............e.................. - vstrw.u32 q0, [in] , #64 // ...........................*... - vqrdmlah.s32 q6, q3, modulus // ..............e................ - vadd.u32 q3, q1, q2 // ..........................*.... - vstrw.u32 q3, [in, #-32] // .............................*. - - // original source code - // ldrd root0, root0_twisted, [root_ptr] , #24 // ..e............................................. - // ldrd root1, root1_twisted, [root_ptr, #-16] // .....................*.......................... - // ldrd root2, root2_twisted, [root_ptr, #-8] // ...........................*.................... - // vldrw.u32 data0, [in] // .......................*........................ - // vldrw.u32 data1, [in, #16] // ...........e.................................... - // vldrw.u32 data2, [in, #32] // ....e........................................... - // vldrw.u32 data3, [in, #48] // e............................................... - // vqrdmulh.s32 tmp, data2, root0 // ..................*............................. - // vmul.u32 data2, data2, root0_twisted // ....................*........................... - // vqrdmlah.s32 tmp, data2, modulus // ......................*......................... - // vsub.u32 data2, data0, tmp // .........................*...................... - // vadd.u32 data0, data0, tmp // .............................*.................. - // vqrdmulh.s32 tmp, data3, root0 // ............e................................... 
- // vmul.u32 data3, data3, root0_twisted // ..........e..................................... - // vqrdmlah.s32 tmp, data3, modulus // ..............e................................. - // vsub.u32 data3, data1, tmp // .................*.............................. - // vadd.u32 data1, data1, tmp // ...................*............................ - // vqrdmulh.s32 tmp, data1, root1 // ........................*....................... - // vmul.u32 data1, data1, root1_twisted // ..........................*..................... - // vqrdmlah.s32 tmp, data1, modulus // ............................*................... - // vsub.u32 data1, data0, tmp // ......................................*......... - // vadd.u32 data0, data0, tmp // ........................................*....... - // vqrdmulh.s32 tmp, data3, root2 // ................................*............... - // vmul.u32 data3, data3, root2_twisted // ..............................*................. - // vqrdmlah.s32 tmp, data3, modulus // ..................................*............. - // vsub.u32 data3, data2, tmp // ....................................*........... - // vadd.u32 data2, data2, tmp // ..............................................*. - // vstrw.u32 data0, [in] , #64 // ............................................*... - // vstrw.u32 data1, [in, #-48] // .......................................*........ - // vstrw.u32 data2, [in, #-32] // ...............................................* - // vstrw.u32 data3, [in, #-16] // .....................................*.......... - - le lr, layer56_loop -layer56_loop_end: - vqrdmulh.s32 q2, q5, r8 // .*...................... - vsub.u32 q0, q4, q6 // *....................... - vmul.u32 q5, q5, r5 // ...*.................... - vadd.u32 q6, q4, q6 // ..*..................... - vqrdmlah.s32 q2, q5, modulus // .....*.................. - ldrd r3, r9, [root_ptr, #-16] // ....*................... - vqrdmulh.s32 q4, q6, r3 // .......*................ 
- vldrw.u32 q7, [in] // ......*................. - vmul.u32 q5, q6, r9 // .........*.............. - vsub.u32 q3, q7, q2 // ........*............... - vqrdmlah.s32 q4, q5, modulus // ...........*............ - ldrd r4, r9, [root_ptr, #-8] // ..........*............. - vadd.u32 q7, q7, q2 // ............*........... - vqrdmulh.s32 q2, q0, r4 // ..............*......... - vsub.u32 q6, q7, q4 // ..................*..... - vstrw.u32 q6, [in, #16] // ...................*.... - vmul.u32 q6, q0, r9 // .............*.......... - vadd.u32 q7, q7, q4 // ....................*... - vqrdmlah.s32 q2, q6, modulus // ...............*........ - vstrw.u32 q7, [in] , #64 // .....................*.. - vadd.u32 q7, q3, q2 // ......................*. - vstrw.u32 q7, [in, #-32] // .......................* - vsub.u32 q7, q3, q2 // ................*....... - vstrw.u32 q7, [in, #-16] // .................*...... - - // original source code - // vsub.u32 q3, q4, q6 // .*...................... - // vqrdmulh.s32 q2, q5, r8 // *....................... - // vadd.u32 q6, q4, q6 // ...*.................... - // vmul.u32 q5, q5, r5 // ..*..................... - // ldrd r9, r4, [root_ptr, #-16] // .....*.................. - // vqrdmlah.s32 q2, q5, modulus // ....*................... - // vldrw.u32 q0, [in] // .......*................ - // vqrdmulh.s32 q4, q6, r9 // ......*................. - // vsub.u32 q1, q0, q2 // .........*.............. - // vmul.u32 q5, q6, r4 // ........*............... - // ldrd r10, r6, [root_ptr, #-8] // ...........*............ - // vqrdmlah.s32 q4, q5, modulus // ..........*............. - // vadd.u32 q0, q0, q2 // ............*........... - // vmul.u32 q6, q3, r6 // ................*....... - // vqrdmulh.s32 q2, q3, r10 // .............*.......... - // vqrdmlah.s32 q2, q6, modulus // ..................*..... - // vsub.u32 q3, q1, q2 // ......................*. - // vstrw.u32 q3, [in, #48] // .......................* - // vsub.u32 q6, q0, q4 // ..............*......... 
- // vstrw.u32 q6, [in, #16] // ...............*........ - // vadd.u32 q0, q0, q4 // .................*...... - // vstrw.u32 q0, [in] , #64 // ...................*.... - // vadd.u32 q3, q1, q2 // ....................*... - // vstrw.u32 q3, [in, #-32] // .....................*.. - - - - // Restore MVE vector registers - vpop {d8-d15} - // Restore GPRs - pop {r4-r11,lr} - bx lr \ No newline at end of file diff --git a/tests/permute/main.c b/tests/permute/main.c index b37ef6d..ed69ced 100644 --- a/tests/permute/main.c +++ b/tests/permute/main.c @@ -125,5 +125,6 @@ int main (void) if( ret != 0 ) return( 1 ); + debug_printf( "ALL GOOD!\n" ); return( 0 ); } diff --git a/tests/permute/misc.c b/tests/permute/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/permute/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/permute/misc.h b/tests/permute/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/permute/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/permute/permute.mk b/tests/permute/permute.mk new file mode 100644 index 0000000..8277946 --- /dev/null +++ b/tests/permute/permute.mk @@ -0,0 +1,18 @@ +# Test name - needs to match the directory name +TESTS += permute + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +PERMUTE_PLATFORMS += m55-an547 +PERMUTE_PLATFORMS += m85-an555 + +# C sources required for this test +PERMUTE_SOURCES += main.c +PERMUTE_SOURCES += misc.c + +# Assembly sources required for this test +PERMUTE_ASM_DIR = ../../asm/auto/permute +PERMUTE_ASMS += $(PERMUTE_ASM_DIR)/permutation_test_u32.s +PERMUTE_ASMS += $(PERMUTE_ASM_DIR)/permutation_test_u16.s +PERMUTE_ASMS += $(PERMUTE_ASM_DIR)/permutation_test_u8.s diff --git a/tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_complete.s 
b/tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index d955338..0000000 --- a/tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,3394 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots_inv: -.word 20558213 // zeta^510 * 2^31 = 28678040^510 * 2^31 -.word 66424611 // zeta^382 * 2^31 = 28678040^382 * 2^31 -.word 59465515 // zeta^446 * 2^31 = 28678040^446 * 2^31 -.word 39560591 // zeta^318 * 2^31 = 28678040^318 * 2^31 -.word 2042724475 // zeta^510 * (q^(-1) mod 2^32) * 2^31 = 28678040^510 * 375649793 * 2^31 -.word 2817904349 // zeta^382 * (q^(-1) mod 2^32) * 2^31 = 28678040^382 * 375649793 * 2^31 -.word 2405453525 // zeta^446 * (q^(-1) mod 2^32) * 2^31 = 28678040^446 * 375649793 * 2^31 -.word 2621436017 // zeta^318 * (q^(-1) mod 2^32) * 2^31 = 28678040^318 * 375649793 * 2^31 -.word 35339857 // zeta^511 * 2^31 = 28678040^511 * 2^31 -.word 13377101 // zeta^447 * 2^31 = 28678040^447 * 2^31 -.word 33252123 // zeta^479 * 2^31 = 28678040^479 * 2^31 -.word 16713319 // zeta^415 * 2^31 = 28678040^415 * 2^31 -.word 10815985 // zeta^383 * 2^31 = 28678040^383 * 2^31 -.word 56247925 // zeta^319 * 2^31 = 28678040^319 * 2^31 -.word 26943959 // zeta^351 * 2^31 = 28678040^351 * 2^31 -.word 51316823 // zeta^287 * 2^31 = 28678040^287 * 2^31 -.word 3650773007 // zeta^383 * (q^(-1) mod 2^32) * 2^31 = 28678040^383 * 375649793 * 2^31 -.word 4021439371 // zeta^319 * (q^(-1) mod 2^32) * 2^31 = 28678040^319 * 375649793 * 2^31 -.word 1538999337 // zeta^351 * (q^(-1) mod 2^32) * 2^31 = 28678040^351 * 375649793 * 2^31 -.word 3611844009 // zeta^287 * (q^(-1) mod 2^32) * 2^31 = 28678040^287 * 375649793 * 2^31 -.word 42042379 // zeta^478 * 2^31 = 28678040^478 * 2^31 -.word 26419651 // zeta^350 * 2^31 = 28678040^350 * 2^31 -.word 61522009 // zeta^414 * 2^31 = 28678040^414 * 2^31 -.word 23758817 // zeta^286 * 2^31 = 28678040^286 * 2^31 -.word 2254105077 // zeta^478 * (q^(-1) mod 2^32) * 2^31 = 28678040^478 * 375649793 * 2^31 -.word 3415374909 // zeta^350 * (q^(-1) mod 2^32) * 2^31 = 28678040^350 * 375649793 * 2^31 -.word 3742677415 // zeta^414 * (q^(-1) mod 2^32) * 2^31 = 28678040^414 * 375649793 * 2^31 -.word 3187687967 // zeta^286 * (q^(-1) mod 2^32) * 2^31 = 
28678040^286 * 375649793 * 2^31 -.word 35776599 // zeta^495 * 2^31 = 28678040^495 * 2^31 -.word 6731445 // zeta^431 * 2^31 = 28678040^431 * 2^31 -.word 3030459 // zeta^463 * 2^31 = 28678040^463 * 2^31 -.word 41085059 // zeta^399 * 2^31 = 28678040^399 * 2^31 -.word 6685305 // zeta^367 * 2^31 = 28678040^367 * 2^31 -.word 24840267 // zeta^303 * 2^31 = 28678040^303 * 2^31 -.word 21119839 // zeta^335 * 2^31 = 28678040^335 * 2^31 -.word 32376869 // zeta^271 * 2^31 = 28678040^271 * 2^31 -.word 2658056071 // zeta^367 * (q^(-1) mod 2^32) * 2^31 = 28678040^367 * 375649793 * 2^31 -.word 495707573 // zeta^303 * (q^(-1) mod 2^32) * 2^31 = 28678040^303 * 375649793 * 2^31 -.word 440627873 // zeta^335 * (q^(-1) mod 2^32) * 2^31 = 28678040^335 * 375649793 * 2^31 -.word 3991890395 // zeta^271 * (q^(-1) mod 2^32) * 2^31 = 28678040^271 * 375649793 * 2^31 -.word 11319751 // zeta^494 * 2^31 = 28678040^494 * 2^31 -.word 57449959 // zeta^366 * 2^31 = 28678040^366 * 2^31 -.word 47736605 // zeta^430 * 2^31 = 28678040^430 * 2^31 -.word 25310795 // zeta^302 * 2^31 = 28678040^302 * 2^31 -.word 316214329 // zeta^494 * (q^(-1) mod 2^32) * 2^31 = 28678040^494 * 375649793 * 2^31 -.word 2994890777 // zeta^366 * (q^(-1) mod 2^32) * 2^31 = 28678040^366 * 375649793 * 2^31 -.word 2883238627 // zeta^430 * (q^(-1) mod 2^32) * 2^31 = 28678040^430 * 375649793 * 2^31 -.word 1834006453 // zeta^302 * (q^(-1) mod 2^32) * 2^31 = 28678040^302 * 375649793 * 2^31 -.word 5649915 // zeta^503 * 2^31 = 28678040^503 * 2^31 -.word 25847843 // zeta^439 * 2^31 = 28678040^439 * 2^31 -.word 62444027 // zeta^471 * 2^31 = 28678040^471 * 2^31 -.word 57855139 // zeta^407 * 2^31 = 28678040^407 * 2^31 -.word 43953263 // zeta^375 * 2^31 = 28678040^375 * 2^31 -.word 3973257 // zeta^311 * 2^31 = 28678040^311 * 2^31 -.word 45754835 // zeta^343 * 2^31 = 28678040^343 * 2^31 -.word 47438647 // zeta^279 * 2^31 = 28678040^279 * 2^31 -.word 1254205841 // zeta^375 * (q^(-1) mod 2^32) * 2^31 = 28678040^375 * 375649793 * 2^31 -.word 
3800349047 // zeta^311 * (q^(-1) mod 2^32) * 2^31 = 28678040^311 * 375649793 * 2^31 -.word 3397129261 // zeta^343 * (q^(-1) mod 2^32) * 2^31 = 28678040^343 * 375649793 * 2^31 -.word 3896527561 // zeta^279 * (q^(-1) mod 2^32) * 2^31 = 28678040^279 * 375649793 * 2^31 -.word 34946213 // zeta^462 * 2^31 = 28678040^462 * 2^31 -.word 33401995 // zeta^334 * 2^31 = 28678040^334 * 2^31 -.word 57707227 // zeta^398 * 2^31 = 28678040^398 * 2^31 -.word 43655235 // zeta^270 * 2^31 = 28678040^270 * 2^31 -.word 4090836315 // zeta^462 * (q^(-1) mod 2^32) * 2^31 = 28678040^462 * 375649793 * 2^31 -.word 2389950837 // zeta^334 * (q^(-1) mod 2^32) * 2^31 = 28678040^334 * 375649793 * 2^31 -.word 1383072549 // zeta^398 * (q^(-1) mod 2^32) * 2^31 = 28678040^398 * 375649793 * 2^31 -.word 2793176509 // zeta^270 * (q^(-1) mod 2^32) * 2^31 = 28678040^270 * 375649793 * 2^31 -.word 30218957 // zeta^487 * 2^31 = 28678040^487 * 2^31 -.word 13073717 // zeta^423 * 2^31 = 28678040^423 * 2^31 -.word 41547715 // zeta^455 * 2^31 = 28678040^455 * 2^31 -.word 51082899 // zeta^391 * 2^31 = 28678040^391 * 2^31 -.word 6539853 // zeta^359 * 2^31 = 28678040^359 * 2^31 -.word 52712977 // zeta^295 * 2^31 = 28678040^295 * 2^31 -.word 15171525 // zeta^327 * 2^31 = 28678040^327 * 2^31 -.word 41070365 // zeta^263 * 2^31 = 28678040^263 * 2^31 -.word 1097807795 // zeta^359 * (q^(-1) mod 2^32) * 2^31 = 28678040^359 * 375649793 * 2^31 -.word 1402229743 // zeta^295 * (q^(-1) mod 2^32) * 2^31 = 28678040^295 * 375649793 * 2^31 -.word 857879099 // zeta^327 * (q^(-1) mod 2^32) * 2^31 = 28678040^327 * 375649793 * 2^31 -.word 2467328739 // zeta^263 * (q^(-1) mod 2^32) * 2^31 = 28678040^263 * 375649793 * 2^31 -.word 1421525 // zeta^502 * 2^31 = 28678040^502 * 2^31 -.word 5608953 // zeta^374 * 2^31 = 28678040^374 * 2^31 -.word 3344309 // zeta^438 * 2^31 = 28678040^438 * 2^31 -.word 54192527 // zeta^310 * 2^31 = 28678040^310 * 2^31 -.word 2006884651 // zeta^502 * (q^(-1) mod 2^32) * 2^31 = 28678040^502 * 375649793 * 2^31 -.word 
1547838471 // zeta^374 * (q^(-1) mod 2^32) * 2^31 = 28678040^374 * 375649793 * 2^31 -.word 1835403851 // zeta^438 * (q^(-1) mod 2^32) * 2^31 = 28678040^438 * 375649793 * 2^31 -.word 3288902769 // zeta^310 * (q^(-1) mod 2^32) * 2^31 = 28678040^310 * 375649793 * 2^31 -.word 55532487 // zeta^507 * 2^31 = 28678040^507 * 2^31 -.word 25878283 // zeta^443 * 2^31 = 28678040^443 * 2^31 -.word 7519477 // zeta^475 * 2^31 = 28678040^475 * 2^31 -.word 10400227 // zeta^411 * 2^31 = 28678040^411 * 2^31 -.word 66449241 // zeta^379 * 2^31 = 28678040^379 * 2^31 -.word 4428811 // zeta^315 * 2^31 = 28678040^315 * 2^31 -.word 30618985 // zeta^347 * 2^31 = 28678040^347 * 2^31 -.word 46942975 // zeta^283 * 2^31 = 28678040^283 * 2^31 -.word 1923058343 // zeta^379 * (q^(-1) mod 2^32) * 2^31 = 28678040^379 * 375649793 * 2^31 -.word 3711490549 // zeta^315 * (q^(-1) mod 2^32) * 2^31 = 28678040^315 * 375649793 * 2^31 -.word 1530848407 // zeta^347 * (q^(-1) mod 2^32) * 2^31 = 28678040^347 * 375649793 * 2^31 -.word 3263539969 // zeta^283 * (q^(-1) mod 2^32) * 2^31 = 28678040^283 * 375649793 * 2^31 -.word 34238409 // zeta^470 * 2^31 = 28678040^470 * 2^31 -.word 7278675 // zeta^342 * 2^31 = 28678040^342 * 2^31 -.word 26316985 // zeta^406 * 2^31 = 28678040^406 * 2^31 -.word 1738533 // zeta^278 * 2^31 = 28678040^278 * 2^31 -.word 1976527415 // zeta^470 * (q^(-1) mod 2^32) * 2^31 = 28678040^470 * 375649793 * 2^31 -.word 3553111469 // zeta^342 * (q^(-1) mod 2^32) * 2^31 = 28678040^342 * 375649793 * 2^31 -.word 1070704967 // zeta^406 * (q^(-1) mod 2^32) * 2^31 = 28678040^406 * 375649793 * 2^31 -.word 280554203 // zeta^278 * (q^(-1) mod 2^32) * 2^31 = 28678040^278 * 375649793 * 2^31 -.word 29493541 // zeta^491 * 2^31 = 28678040^491 * 2^31 -.word 46179537 // zeta^427 * 2^31 = 28678040^427 * 2^31 -.word 61070425 // zeta^459 * 2^31 = 28678040^459 * 2^31 -.word 47641435 // zeta^395 * 2^31 = 28678040^395 * 2^31 -.word 8700655 // zeta^363 * 2^31 = 28678040^363 * 2^31 -.word 49217369 // zeta^299 * 2^31 = 
28678040^299 * 2^31 -.word 14037329 // zeta^331 * 2^31 = 28678040^331 * 2^31 -.word 57068693 // zeta^267 * 2^31 = 28678040^267 * 2^31 -.word 2143064849 // zeta^363 * (q^(-1) mod 2^32) * 2^31 = 28678040^363 * 375649793 * 2^31 -.word 3997596327 // zeta^299 * (q^(-1) mod 2^32) * 2^31 = 28678040^299 * 375649793 * 2^31 -.word 594737327 // zeta^331 * (q^(-1) mod 2^32) * 2^31 = 28678040^331 * 375649793 * 2^31 -.word 1214449003 // zeta^267 * (q^(-1) mod 2^32) * 2^31 = 28678040^267 * 375649793 * 2^31 -.word 5988919 // zeta^486 * 2^31 = 28678040^486 * 2^31 -.word 27781261 // zeta^358 * 2^31 = 28678040^358 * 2^31 -.word 33650523 // zeta^422 * 2^31 = 28678040^422 * 2^31 -.word 40314383 // zeta^294 * 2^31 = 28678040^294 * 2^31 -.word 2046739401 // zeta^486 * (q^(-1) mod 2^32) * 2^31 = 28678040^486 * 375649793 * 2^31 -.word 2556008819 // zeta^358 * (q^(-1) mod 2^32) * 2^31 = 28678040^358 * 375649793 * 2^31 -.word 2602309285 // zeta^422 * (q^(-1) mod 2^32) * 2^31 = 28678040^422 * 375649793 * 2^31 -.word 3711528945 // zeta^294 * (q^(-1) mod 2^32) * 2^31 = 28678040^294 * 375649793 * 2^31 -.word 25356533 // zeta^499 * 2^31 = 28678040^499 * 2^31 -.word 59712043 // zeta^435 * 2^31 = 28678040^435 * 2^31 -.word 59431885 // zeta^467 * 2^31 = 28678040^467 * 2^31 -.word 42783775 // zeta^403 * 2^31 = 28678040^403 * 2^31 -.word 15118727 // zeta^371 * 2^31 = 28678040^371 * 2^31 -.word 16104593 // zeta^307 * 2^31 = 28678040^307 * 2^31 -.word 66551101 // zeta^339 * 2^31 = 28678040^339 * 2^31 -.word 27099659 // zeta^275 * 2^31 = 28678040^275 * 2^31 -.word 256676985 // zeta^371 * (q^(-1) mod 2^32) * 2^31 = 28678040^371 * 375649793 * 2^31 -.word 2042883439 // zeta^307 * (q^(-1) mod 2^32) * 2^31 = 28678040^307 * 375649793 * 2^31 -.word 2098783427 // zeta^339 * (q^(-1) mod 2^32) * 2^31 = 28678040^339 * 375649793 * 2^31 -.word 1730866165 // zeta^275 * (q^(-1) mod 2^32) * 2^31 = 28678040^275 * 375649793 * 2^31 -.word 52622279 // zeta^454 * 2^31 = 28678040^454 * 2^31 -.word 48542309 // zeta^326 * 2^31 
= 28678040^326 * 2^31 -.word 28412919 // zeta^390 * 2^31 = 28678040^390 * 2^31 -.word 61490063 // zeta^262 * 2^31 = 28678040^262 * 2^31 -.word 111596089 // zeta^454 * (q^(-1) mod 2^32) * 2^31 = 28678040^454 * 375649793 * 2^31 -.word 2392801179 // zeta^326 * (q^(-1) mod 2^32) * 2^31 = 28678040^326 * 375649793 * 2^31 -.word 122296841 // zeta^390 * (q^(-1) mod 2^32) * 2^31 = 28678040^390 * 375649793 * 2^31 -.word 4112339569 // zeta^262 * (q^(-1) mod 2^32) * 2^31 = 28678040^262 * 375649793 * 2^31 -.word 17544659 // zeta^483 * 2^31 = 28678040^483 * 2^31 -.word 26761761 // zeta^419 * 2^31 = 28678040^419 * 2^31 -.word 28138345 // zeta^451 * 2^31 = 28678040^451 * 2^31 -.word 6006005 // zeta^387 * 2^31 = 28678040^387 * 2^31 -.word 49338991 // zeta^355 * 2^31 = 28678040^355 * 2^31 -.word 59052279 // zeta^291 * 2^31 = 28678040^291 * 2^31 -.word 54131019 // zeta^323 * 2^31 = 28678040^323 * 2^31 -.word 49172137 // zeta^259 * 2^31 = 28678040^259 * 2^31 -.word 2285599633 // zeta^355 * (q^(-1) mod 2^32) * 2^31 = 28678040^355 * 375649793 * 2^31 -.word 1420334345 // zeta^291 * (q^(-1) mod 2^32) * 2^31 = 28678040^291 * 375649793 * 2^31 -.word 1832318133 // zeta^323 * (q^(-1) mod 2^32) * 2^31 = 28678040^323 * 375649793 * 2^31 -.word 203443031 // zeta^259 * (q^(-1) mod 2^32) * 2^31 = 28678040^259 * 375649793 * 2^31 -.word 41164657 // zeta^506 * 2^31 = 28678040^506 * 2^31 -.word 23553921 // zeta^378 * 2^31 = 28678040^378 * 2^31 -.word 51075303 // zeta^442 * 2^31 = 28678040^442 * 2^31 -.word 11244857 // zeta^314 * 2^31 = 28678040^314 * 2^31 -.word 2292337295 // zeta^506 * (q^(-1) mod 2^32) * 2^31 = 28678040^506 * 375649793 * 2^31 -.word 2218762879 // zeta^378 * (q^(-1) mod 2^32) * 2^31 = 28678040^378 * 375649793 * 2^31 -.word 3660688665 // zeta^442 * (q^(-1) mod 2^32) * 2^31 = 28678040^442 * 375649793 * 2^31 -.word 2196022471 // zeta^314 * (q^(-1) mod 2^32) * 2^31 = 28678040^314 * 375649793 * 2^31 -.word 27161421 // zeta^509 * 2^31 = 28678040^509 * 2^31 -.word 12259351 // zeta^445 * 2^31 
= 28678040^445 * 2^31 -.word 42183787 // zeta^477 * 2^31 = 28678040^477 * 2^31 -.word 260949 // zeta^413 * 2^31 = 28678040^413 * 2^31 -.word 49379395 // zeta^381 * 2^31 = 28678040^381 * 2^31 -.word 45318697 // zeta^317 * 2^31 = 28678040^317 * 2^31 -.word 65417737 // zeta^349 * 2^31 = 28678040^349 * 2^31 -.word 60522221 // zeta^285 * 2^31 = 28678040^285 * 2^31 -.word 2945787325 // zeta^381 * (q^(-1) mod 2^32) * 2^31 = 28678040^381 * 375649793 * 2^31 -.word 2724075479 // zeta^317 * (q^(-1) mod 2^32) * 2^31 = 28678040^317 * 375649793 * 2^31 -.word 2827626487 // zeta^349 * (q^(-1) mod 2^32) * 2^31 = 28678040^349 * 375649793 * 2^31 -.word 482722579 // zeta^285 * (q^(-1) mod 2^32) * 2^31 = 28678040^285 * 375649793 * 2^31 -.word 3629237 // zeta^474 * 2^31 = 28678040^474 * 2^31 -.word 60326323 // zeta^346 * 2^31 = 28678040^346 * 2^31 -.word 30569867 // zeta^410 * 2^31 = 28678040^410 * 2^31 -.word 31921231 // zeta^282 * 2^31 = 28678040^282 * 2^31 -.word 3571167563 // zeta^474 * (q^(-1) mod 2^32) * 2^31 = 28678040^474 * 375649793 * 2^31 -.word 3851189325 // zeta^346 * (q^(-1) mod 2^32) * 2^31 = 28678040^346 * 375649793 * 2^31 -.word 1517877365 // zeta^410 * (q^(-1) mod 2^32) * 2^31 = 28678040^410 * 375649793 * 2^31 -.word 1275593137 // zeta^282 * (q^(-1) mod 2^32) * 2^31 = 28678040^282 * 375649793 * 2^31 -.word 51477925 // zeta^493 * 2^31 = 28678040^493 * 2^31 -.word 23177153 // zeta^429 * 2^31 = 28678040^429 * 2^31 -.word 42516129 // zeta^461 * 2^31 = 28678040^461 * 2^31 -.word 23261199 // zeta^397 * 2^31 = 28678040^397 * 2^31 -.word 50523083 // zeta^365 * 2^31 = 28678040^365 * 2^31 -.word 29024109 // zeta^301 * 2^31 = 28678040^301 * 2^31 -.word 62634975 // zeta^333 * 2^31 = 28678040^333 * 2^31 -.word 5116371 // zeta^269 * 2^31 = 28678040^269 * 2^31 -.word 2363949621 // zeta^365 * (q^(-1) mod 2^32) * 2^31 = 28678040^365 * 375649793 * 2^31 -.word 2792055443 // zeta^301 * (q^(-1) mod 2^32) * 2^31 = 28678040^301 * 375649793 * 2^31 -.word 3296655905 // zeta^333 * (q^(-1) mod 
2^32) * 2^31 = 28678040^333 * 375649793 * 2^31 -.word 4093127725 // zeta^269 * (q^(-1) mod 2^32) * 2^31 = 28678040^269 * 375649793 * 2^31 -.word 55626043 // zeta^490 * 2^31 = 28678040^490 * 2^31 -.word 15630981 // zeta^362 * 2^31 = 28678040^362 * 2^31 -.word 43717491 // zeta^426 * 2^31 = 28678040^426 * 2^31 -.word 14342369 // zeta^298 * 2^31 = 28678040^298 * 2^31 -.word 2004845765 // zeta^490 * (q^(-1) mod 2^32) * 2^31 = 28678040^490 * 375649793 * 2^31 -.word 3862343547 // zeta^362 * (q^(-1) mod 2^32) * 2^31 = 28678040^362 * 375649793 * 2^31 -.word 2436590221 // zeta^426 * (q^(-1) mod 2^32) * 2^31 = 28678040^426 * 375649793 * 2^31 -.word 2109337887 // zeta^298 * (q^(-1) mod 2^32) * 2^31 = 28678040^298 * 375649793 * 2^31 -.word 6776583 // zeta^501 * 2^31 = 28678040^501 * 2^31 -.word 33530533 // zeta^437 * 2^31 = 28678040^437 * 2^31 -.word 43598203 // zeta^469 * 2^31 = 28678040^469 * 2^31 -.word 59373651 // zeta^405 * 2^31 = 28678040^405 * 2^31 -.word 37946425 // zeta^373 * 2^31 = 28678040^373 * 2^31 -.word 47668559 // zeta^309 * 2^31 = 28678040^309 * 2^31 -.word 10775673 // zeta^341 * 2^31 = 28678040^341 * 2^31 -.word 3826249 // zeta^277 * 2^31 = 28678040^277 * 2^31 -.word 262354375 // zeta^373 * (q^(-1) mod 2^32) * 2^31 = 28678040^373 * 375649793 * 2^31 -.word 703707313 // zeta^309 * (q^(-1) mod 2^32) * 2^31 = 28678040^309 * 375649793 * 2^31 -.word 2790542727 // zeta^341 * (q^(-1) mod 2^32) * 2^31 = 28678040^341 * 375649793 * 2^31 -.word 2635626423 // zeta^277 * (q^(-1) mod 2^32) * 2^31 = 28678040^277 * 375649793 * 2^31 -.word 53733071 // zeta^458 * 2^31 = 28678040^458 * 2^31 -.word 10734019 // zeta^330 * 2^31 = 28678040^330 * 2^31 -.word 25306471 // zeta^394 * 2^31 = 28678040^394 * 2^31 -.word 54139625 // zeta^266 * 2^31 = 28678040^266 * 2^31 -.word 284438321 // zeta^458 * (q^(-1) mod 2^32) * 2^31 = 28678040^458 * 375649793 * 2^31 -.word 3541161021 // zeta^330 * (q^(-1) mod 2^32) * 2^31 = 28678040^330 * 375649793 * 2^31 -.word 2646073497 // zeta^394 * (q^(-1) mod 
2^32) * 2^31 = 28678040^394 * 375649793 * 2^31 -.word 3100573463 // zeta^266 * (q^(-1) mod 2^32) * 2^31 = 28678040^266 * 375649793 * 2^31 -.word 1468391 // zeta^485 * 2^31 = 28678040^485 * 2^31 -.word 4426959 // zeta^421 * 2^31 = 28678040^421 * 2^31 -.word 42735737 // zeta^453 * 2^31 = 28678040^453 * 2^31 -.word 38665093 // zeta^389 * 2^31 = 28678040^389 * 2^31 -.word 33133879 // zeta^357 * 2^31 = 28678040^357 * 2^31 -.word 7139481 // zeta^293 * 2^31 = 28678040^293 * 2^31 -.word 8438111 // zeta^325 * 2^31 = 28678040^325 * 2^31 -.word 50341189 // zeta^261 * 2^31 = 28678040^261 * 2^31 -.word 3126759625 // zeta^357 * (q^(-1) mod 2^32) * 2^31 = 28678040^357 * 375649793 * 2^31 -.word 523569511 // zeta^293 * (q^(-1) mod 2^32) * 2^31 = 28678040^293 * 375649793 * 2^31 -.word 1408300193 // zeta^325 * (q^(-1) mod 2^32) * 2^31 = 28678040^325 * 375649793 * 2^31 -.word 2172685499 // zeta^261 * (q^(-1) mod 2^32) * 2^31 = 28678040^261 * 375649793 * 2^31 -.word 47558821 // zeta^498 * 2^31 = 28678040^498 * 2^31 -.word 33268441 // zeta^370 * 2^31 = 28678040^370 * 2^31 -.word 63536237 // zeta^434 * 2^31 = 28678040^434 * 2^31 -.word 26272521 // zeta^306 * 2^31 = 28678040^306 * 2^31 -.word 664584539 // zeta^498 * (q^(-1) mod 2^32) * 2^31 = 28678040^498 * 375649793 * 2^31 -.word 2409420583 // zeta^370 * (q^(-1) mod 2^32) * 2^31 = 28678040^370 * 375649793 * 2^31 -.word 3799958931 // zeta^434 * (q^(-1) mod 2^32) * 2^31 = 28678040^434 * 375649793 * 2^31 -.word 835286775 // zeta^306 * (q^(-1) mod 2^32) * 2^31 = 28678040^306 * 375649793 * 2^31 -.word 1854317 // zeta^505 * 2^31 = 28678040^505 * 2^31 -.word 2223865 // zeta^441 * 2^31 = 28678040^441 * 2^31 -.word 22962475 // zeta^473 * 2^31 = 28678040^473 * 2^31 -.word 36888515 // zeta^409 * 2^31 = 28678040^409 * 2^31 -.word 59868297 // zeta^377 * 2^31 = 28678040^377 * 2^31 -.word 15191207 // zeta^313 * 2^31 = 28678040^313 * 2^31 -.word 59108143 // zeta^345 * 2^31 = 28678040^345 * 2^31 -.word 4355773 // zeta^281 * 2^31 = 28678040^281 * 2^31 
-.word 538432887 // zeta^377 * (q^(-1) mod 2^32) * 2^31 = 28678040^377 * 375649793 * 2^31 -.word 3252336985 // zeta^313 * (q^(-1) mod 2^32) * 2^31 = 28678040^313 * 375649793 * 2^31 -.word 1330506449 // zeta^345 * (q^(-1) mod 2^32) * 2^31 = 28678040^345 * 375649793 * 2^31 -.word 4169984835 // zeta^281 * (q^(-1) mod 2^32) * 2^31 = 28678040^281 * 375649793 * 2^31 -.word 27411989 // zeta^466 * 2^31 = 28678040^466 * 2^31 -.word 52176833 // zeta^338 * 2^31 = 28678040^338 * 2^31 -.word 52660121 // zeta^402 * 2^31 = 28678040^402 * 2^31 -.word 23140553 // zeta^274 * 2^31 = 28678040^274 * 2^31 -.word 652643307 // zeta^466 * (q^(-1) mod 2^32) * 2^31 = 28678040^466 * 375649793 * 2^31 -.word 4178403903 // zeta^338 * (q^(-1) mod 2^32) * 2^31 = 28678040^338 * 375649793 * 2^31 -.word 1113879143 // zeta^402 * (q^(-1) mod 2^32) * 2^31 = 28678040^402 * 375649793 * 2^31 -.word 3574776119 // zeta^274 * (q^(-1) mod 2^32) * 2^31 = 28678040^274 * 375649793 * 2^31 -.word 50275685 // zeta^489 * 2^31 = 28678040^489 * 2^31 -.word 12903773 // zeta^425 * 2^31 = 28678040^425 * 2^31 -.word 25228433 // zeta^457 * 2^31 = 28678040^457 * 2^31 -.word 55395235 // zeta^393 * 2^31 = 28678040^393 * 2^31 -.word 3868449 // zeta^361 * 2^31 = 28678040^361 * 2^31 -.word 66432231 // zeta^297 * 2^31 = 28678040^297 * 2^31 -.word 31236859 // zeta^329 * 2^31 = 28678040^329 * 2^31 -.word 13658415 // zeta^265 * 2^31 = 28678040^265 * 2^31 -.word 2938651359 // zeta^361 * (q^(-1) mod 2^32) * 2^31 = 28678040^361 * 375649793 * 2^31 -.word 814700825 // zeta^297 * (q^(-1) mod 2^32) * 2^31 = 28678040^297 * 375649793 * 2^31 -.word 1618291461 // zeta^329 * (q^(-1) mod 2^32) * 2^31 = 28678040^329 * 375649793 * 2^31 -.word 49245393 // zeta^265 * (q^(-1) mod 2^32) * 2^31 = 28678040^265 * 375649793 * 2^31 -.word 34409967 // zeta^482 * 2^31 = 28678040^482 * 2^31 -.word 12619783 // zeta^354 * 2^31 = 28678040^354 * 2^31 -.word 54561811 // zeta^418 * 2^31 = 28678040^418 * 2^31 -.word 61632377 // zeta^290 * 2^31 = 28678040^290 * 2^31 
-.word 2233616401 // zeta^482 * (q^(-1) mod 2^32) * 2^31 = 28678040^482 * 375649793 * 2^31 -.word 2820912633 // zeta^354 * (q^(-1) mod 2^32) * 2^31 = 28678040^354 * 375649793 * 2^31 -.word 684470765 // zeta^418 * (q^(-1) mod 2^32) * 2^31 = 28678040^418 * 375649793 * 2^31 -.word 3345631879 // zeta^290 * (q^(-1) mod 2^32) * 2^31 = 28678040^290 * 375649793 * 2^31 -.word 7605279 // zeta^497 * 2^31 = 28678040^497 * 2^31 -.word 58319315 // zeta^433 * 2^31 = 28678040^433 * 2^31 -.word 16342937 // zeta^465 * 2^31 = 28678040^465 * 2^31 -.word 48148431 // zeta^401 * 2^31 = 28678040^401 * 2^31 -.word 62377755 // zeta^369 * 2^31 = 28678040^369 * 2^31 -.word 35459369 // zeta^305 * 2^31 = 28678040^305 * 2^31 -.word 27513701 // zeta^337 * 2^31 = 28678040^337 * 2^31 -.word 18346679 // zeta^273 * 2^31 = 28678040^273 * 2^31 -.word 4057153253 // zeta^369 * (q^(-1) mod 2^32) * 2^31 = 28678040^369 * 375649793 * 2^31 -.word 3867838679 // zeta^305 * (q^(-1) mod 2^32) * 2^31 = 28678040^305 * 375649793 * 2^31 -.word 589962907 // zeta^337 * (q^(-1) mod 2^32) * 2^31 = 28678040^337 * 375649793 * 2^31 -.word 1692873545 // zeta^273 * (q^(-1) mod 2^32) * 2^31 = 28678040^273 * 375649793 * 2^31 -.word 1824951 // zeta^450 * 2^31 = 28678040^450 * 2^31 -.word 40410247 // zeta^322 * 2^31 = 28678040^322 * 2^31 -.word 25935987 // zeta^386 * 2^31 = 28678040^386 * 2^31 -.word 53409853 // zeta^258 * 2^31 = 28678040^258 * 2^31 -.word 3034533193 // zeta^450 * (q^(-1) mod 2^32) * 2^31 = 28678040^450 * 375649793 * 2^31 -.word 1425582457 // zeta^322 * (q^(-1) mod 2^32) * 2^31 = 28678040^322 * 375649793 * 2^31 -.word 1695333773 // zeta^386 * (q^(-1) mod 2^32) * 2^31 = 28678040^386 * 375649793 * 2^31 -.word 2628741571 // zeta^258 * (q^(-1) mod 2^32) * 2^31 = 28678040^258 * 375649793 * 2^31 -.word 44896477 // zeta^481 * 2^31 = 28678040^481 * 2^31 -.word 66621379 // zeta^417 * 2^31 = 28678040^417 * 2^31 -.word 35702907 // zeta^449 * 2^31 = 28678040^449 * 2^31 -.word 44158149 // zeta^385 * 2^31 = 28678040^385 * 2^31 
-.word 32881793 // zeta^353 * 2^31 = 28678040^353 * 2^31 -.word 18033685 // zeta^289 * 2^31 = 28678040^289 * 2^31 -.word 29367795 // zeta^321 * 2^31 = 28678040^321 * 2^31 -.word 16787671 // zeta^257 * 2^31 = 28678040^257 * 2^31 -.word 3741535615 // zeta^353 * (q^(-1) mod 2^32) * 2^31 = 28678040^353 * 375649793 * 2^31 -.word 3094455787 // zeta^289 * (q^(-1) mod 2^32) * 2^31 = 28678040^289 * 375649793 * 2^31 -.word 3934216205 // zeta^321 * (q^(-1) mod 2^32) * 2^31 = 28678040^321 * 375649793 * 2^31 -.word 2459712809 // zeta^257 * (q^(-1) mod 2^32) * 2^31 = 28678040^257 * 375649793 * 2^31 -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 
2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 
-.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 
-.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 
-.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 
-.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31 -.word 17514581 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 4460971 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_n256_u32_33556993_28678040, %function -.global inv_ntt_n256_u32_33556993_28678040 -inv_ntt_n256_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -.equ modulus_inv, 3919317503 -movw r4, #:lower16:modulus_inv -movt r4, #:upper16:modulus_inv -vldrw.s32 Q4, [r0, #0] -vldrw.s32 Q5, [r0, #16] -vsub.s32 Q6, Q4, Q5 -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vqrdmulh.s32 Q5, Q6, Q5 -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 
-vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vmul.u32 Q6, Q6, Q5 -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vst41.s32 {Q0,Q1,Q2,Q3}, [r0] -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vst43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-sub r0, r0, #1024 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q0, Q2, Q3 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vadd.s32 Q2, Q2, Q3 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vqrdmulh.s32 Q3, Q0, r8 -vsub.s32 Q1, Q4, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q4, Q4, Q5 -vqrdmlah.s32 Q3, Q0, r12 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q2, Q4 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q5, Q1, r12 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmulh.s32 Q4, Q0, r10 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q4, Q0, r12 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[16]: Already loaded as Q6 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vadd.s32 Q6, Q6, Q7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q6, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] 
from Q6 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[32]: Already loaded as Q4 -// input[36]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q4, Q4, Q5 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r8 -vsub.s32 Q1, Q2, Q6 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q6 -vqrdmlah.s32 Q5, Q0, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q6, Q1, r12 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q5, Q6 -vmul.u32 Q0, Q0, r9 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vadd.s32 Q5, Q5, Q6 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q6, Q1, r10 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q5, [r0,#(144)] -// Release input[36] from Q5 -vqrdmlah.s32 Q6, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q3 -// input[52]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q6, [r0,#(176)] -// Release input[44] from Q6 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(208)]
-// Release input[52] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q5
-// input[68]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q5, Q5, Q6
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(240)]
-// Release input[60] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(256)]
-// Release input[64] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(272)]
-// Release input[68] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[80]: Already loaded as Q4
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q4, Q4, Q7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[96]: Already loaded as Q3
-// input[100]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(400)]
-// Release input[100] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vadd.s32 Q5, Q5, Q7
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -120)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(464)]
-// Release input[116] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q4
-// input[132]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vadd.s32 Q4, Q4, Q6
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -8)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 0)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 16)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(0)]
-// Release input[0] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Already loaded as Q4
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q4, Q4, Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Already loaded as Q3
-// input[24]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 12)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 28)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 72)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 104)]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 108)]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -108)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -120)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -96)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -44)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -36)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -48)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -32)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 0)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 64)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 50631221
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 2147319755
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1, Q2, r12
-vqrdmulh.s32 Q4, Q5, r6
-vsub.s32 Q2, Q0, Q3
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q4, Q5, r12
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q2, r10
-vsub.s32 Q6, Q1, Q4
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q3, Q2, r12
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q0, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q0, Q0, r2
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmlah.s32 Q2, Q0, r12
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q1, Q1, r2
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(0)]
-vqrdmulh.s32 Q4, Q6, r10
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q1
-// input[4]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[132]: Already loaded as Q3
-vqrdmlah.s32 Q4, Q6, r12
-vadd.s32 Q5, Q5, Q7
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q2, Q3, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q4, Q4, #1
-vpt.s32 LT, Q4, r11
-vaddt.s32 Q4, Q4, r12
-vpt.s32 GE, Q4, r4
-vsubt.s32 Q4, Q4, r12
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vqrdmulh.s32 Q1, Q2, r6
-vsub.s32 Q0, Q5, Q3
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q2, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q3, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q3, Q0, r12
-// input[72]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 72)]
-vqrdmulh.s32 Q0, Q5, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q5, Q5, r2
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmlah.s32 Q0, Q5, r12
-// Release input[4] from Q5
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(16)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q7
-// input[8]: Already loaded as Q2
-// input[72]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[136]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-464)]
-// Release input[136] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(32)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q6
-// input[12]: Already loaded as Q1
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[140]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(48)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q7
-// input[16]: Already loaded as Q3
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[144]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-432)]
-// Release input[144] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[16] from Q3
-vqrdmulh.s32 Q3, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vqrdmlah.s32 Q3, Q6, r12
-vstrw.u32 Q0, [r0,#(64)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q6
-// input[20]: Already loaded as Q2
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q2, Q7
-// input[148]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q7
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-416)]
-// Release input[148] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[152]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -100)]
-vqrdmlah.s32 Q2, Q7, r12
-vstrw.u32 Q0, [r0,#(80)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(336)]
-// Release input[84] from Q7
-// input[24]: Already loaded as Q1
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q1, Q6
-// input[152]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q6
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-400)]
-// Release input[152] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q6, r12
-vstrw.u32 Q0, [r0,#(96)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(352)]
-// Release input[88] from Q6
-// input[28]: Already loaded as Q3
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[156]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q7
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 100)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, 
#1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// 
input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 108)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already 
loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 
-vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: 
Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 120)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 124)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 
-vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, 
[r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3170 -// Instruction count: 2670 \ No newline at end of file diff --git a/tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s b/tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s deleted file mode 100644 index 9f7b044..0000000 --- a/tests/poly/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,2526 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots_inv: -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^272 * 375649793 * 2^31
-.word 36501331 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31
-.word 17843885 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31
-.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31
-.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots_inv
-.syntax unified
-.type inv_ntt_n256_u32_33556993_28678040_incomplete, %function
-.global inv_ntt_n256_u32_33556993_28678040_incomplete
-inv_ntt_n256_u32_33556993_28678040_incomplete:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q0, Q2, Q3
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vadd.s32 Q2, Q2, Q3
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 12)]
-vqrdmulh.s32 Q3, Q0, r8
-vsub.s32 Q1, Q4, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q4, Q4, Q5
-vqrdmlah.s32 Q3, Q0, r12
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q2, Q4
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q5, Q1, r12
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 16)]
-vqrdmulh.s32 Q4, Q0, r10
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q4, Q0, r12
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[16]: Already loaded as Q6
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vadd.s32 Q6, Q6, Q7
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q6, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[36]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 36)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[32]: Already loaded as Q4
-// input[36]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vadd.s32 Q4, Q4, Q5
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q0, r8
-vsub.s32 Q1, Q2, Q6
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q6
-vqrdmlah.s32 Q5, Q0, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q6, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q6, Q1, r12
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q5, Q6
-vmul.u32 Q0, Q0, r9
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vadd.s32 Q5, Q5, Q6
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q6, Q1, r10
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q5, [r0,#(144)]
-// Release input[36] from Q5
-vqrdmlah.s32 Q6, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q3
-// input[52]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vadd.s32 Q3, Q3, Q7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q6, [r0,#(176)]
-// Release input[44] from Q6
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(208)]
-// Release input[52] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q5
-// input[68]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q5, Q5, Q6
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(240)]
-// Release input[60] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(256)]
-// Release input[64] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(272)]
-// Release input[68] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[80]: Already loaded as Q4
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q4, Q4, Q7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[96]: Already loaded as Q3
-// input[100]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(400)]
-// Release input[100] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vadd.s32 Q5, Q5, Q7
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 124)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -120)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(464)]
-// Release input[116] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q4
-// input[132]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vadd.s32 Q4, Q4, Q6
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -8)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 0)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 16)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(0)]
-// Release input[0] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Already loaded as Q4
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q4, Q4, Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Already loaded as Q3
-// input[24]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 12)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 28)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 64)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 72)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 104)]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 108)]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -108)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -120)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -84)]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -96)]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -64)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -60)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -44)]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -28)]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -36)]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -48)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -32)]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 0)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 0)]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 64)]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 34739919
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 4294311729
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1, Q2, r12
-vqrdmulh.s32 Q4, Q5, r6
-vsub.s32 Q2, Q0, Q3
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q4, Q5, r12
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q2, r10
-vsub.s32 Q6, Q1, Q4
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q3, Q2, r12
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q0, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q0, Q0, r2
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmlah.s32 Q2, Q0, r12
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q1, Q1, r2
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(0)]
-vqrdmulh.s32 Q4, Q6, r10
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q1
-// input[4]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[132]: Already loaded as Q3
-vqrdmlah.s32 Q4, Q6, r12
-vadd.s32 Q5, Q5, Q7
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -56)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q2, Q3, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q4, Q4, #1
-vpt.s32 LT, Q4, r11
-vaddt.s32 Q4, Q4, r12
-vpt.s32 GE, Q4, r4
-vsubt.s32 Q4, Q4, r12
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vqrdmulh.s32 Q1, Q2, r6
-vsub.s32 Q0, Q5, Q3
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q2, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q3, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q3, Q0, r12
-// input[72]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 72)]
-vqrdmulh.s32 Q0, Q5, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q5, Q5, r2
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmlah.s32 Q0, Q5, r12
-// Release input[4] from Q5
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(16)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q7
-// input[8]: Already loaded as Q2
-// input[72]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[136]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-464)]
-// Release input[136] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(32)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q6
-// input[12]: Already loaded as Q1
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[140]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(48)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q7
-// input[16]: Already loaded as Q3
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[144]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0,
r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(64)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q6 -// input[20]: Already loaded as Q2 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[148]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, 
Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(80)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q7 -// input[24]: Already loaded as Q1 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[152]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release 
input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, 
Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 100)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, 
Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 
-vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 108)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, 
[r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, 
Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 120)] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// 
input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 124)] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 
-vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2302 -// Instruction count: 1802 \ No newline at end of file diff --git a/tests/poly/auto/ntt_n256_u32_33556993_28678040_complete.s b/tests/poly/auto/ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index 785fdaf..0000000 --- a/tests/poly/auto/ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,2889 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// 
furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 
// zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 
68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 
2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 
2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 
= 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.word 13704133 // zeta^ 2 * 2^31 = 28678040^ 2 * 2^31 -.word 41177999 // zeta^130 * 2^31 = 28678040^130 * 2^31 -.word 26703739 // zeta^ 66 * 2^31 = 28678040^ 66 * 2^31 -.word 65289035 // zeta^194 * 2^31 = 28678040^194 * 2^31 -.word 1666225723 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 2 * 375649793 * 2^31 -.word 2599633521 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 28678040^130 * 375649793 * 2^31 -.word 2869384837 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 66 * 375649793 * 2^31 -.word 1260434101 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 28678040^194 * 375649793 * 2^31 -.word 50326315 // zeta^ 1 * 2^31 = 28678040^ 1 * 2^31 -.word 37746191 // zeta^ 65 * 2^31 = 28678040^ 65 * 2^31 -.word 
49080301 // zeta^ 33 * 2^31 = 28678040^ 33 * 2^31 -.word 34232193 // zeta^ 97 * 2^31 = 28678040^ 97 * 2^31 -.word 1835254485 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 1 * 375649793 * 2^31 -.word 360751089 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 65 * 375649793 * 2^31 -.word 1200511507 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 33 * 375649793 * 2^31 -.word 553431679 // zeta^ 97 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 97 * 375649793 * 2^31 -.word 22955837 // zeta^129 * 2^31 = 28678040^129 * 2^31 -.word 31411079 // zeta^193 * 2^31 = 28678040^193 * 2^31 -.word 492607 // zeta^161 * 2^31 = 28678040^161 * 2^31 -.word 22217509 // zeta^225 * 2^31 = 28678040^225 * 2^31 -.word 5481609 // zeta^ 34 * 2^31 = 28678040^ 34 * 2^31 -.word 12552175 // zeta^162 * 2^31 = 28678040^162 * 2^31 -.word 54494203 // zeta^ 98 * 2^31 = 28678040^ 98 * 2^31 -.word 32704019 // zeta^226 * 2^31 = 28678040^226 * 2^31 -.word 949335415 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 34 * 375649793 * 2^31 -.word 3610496529 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 28678040^162 * 375649793 * 2^31 -.word 1474054661 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 98 * 375649793 * 2^31 -.word 2061350893 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 28678040^226 * 375649793 * 2^31 -.word 48767307 // zeta^ 17 * 2^31 = 28678040^ 17 * 2^31 -.word 39600285 // zeta^ 81 * 2^31 = 28678040^ 81 * 2^31 -.word 31654617 // zeta^ 49 * 2^31 = 28678040^ 49 * 2^31 -.word 4736231 // zeta^113 * 2^31 = 28678040^113 * 2^31 -.word 2602093749 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 17 * 375649793 * 2^31 -.word 3705004387 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 81 * 375649793 * 2^31 -.word 427128615 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 49 * 375649793 * 2^31 -.word 237814041 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 28678040^113 * 375649793 * 2^31 -.word 18965555 // zeta^145 * 2^31 = 28678040^145 * 2^31 -.word 50771049 // zeta^209 * 2^31 = 28678040^209 * 2^31 -.word 
8794671 // zeta^177 * 2^31 = 28678040^177 * 2^31 -.word 59508707 // zeta^241 * 2^31 = 28678040^241 * 2^31 -.word 43973433 // zeta^ 18 * 2^31 = 28678040^ 18 * 2^31 -.word 14453865 // zeta^146 * 2^31 = 28678040^146 * 2^31 -.word 14937153 // zeta^ 82 * 2^31 = 28678040^ 82 * 2^31 -.word 39701997 // zeta^210 * 2^31 = 28678040^210 * 2^31 -.word 720191175 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 18 * 375649793 * 2^31 -.word 3181088151 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 28678040^146 * 375649793 * 2^31 -.word 116563391 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 82 * 375649793 * 2^31 -.word 3642323987 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 28678040^210 * 375649793 * 2^31 -.word 53455571 // zeta^ 9 * 2^31 = 28678040^ 9 * 2^31 -.word 35877127 // zeta^ 73 * 2^31 = 28678040^ 73 * 2^31 -.word 681755 // zeta^ 41 * 2^31 = 28678040^ 41 * 2^31 -.word 63245537 // zeta^105 * 2^31 = 28678040^105 * 2^31 -.word 4245721901 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 9 * 375649793 * 2^31 -.word 2676675833 // zeta^ 73 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 73 * 375649793 * 2^31 -.word 3480266469 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 41 * 375649793 * 2^31 -.word 1356315935 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 28678040^105 * 375649793 * 2^31 -.word 11718751 // zeta^137 * 2^31 = 28678040^137 * 2^31 -.word 41885553 // zeta^201 * 2^31 = 28678040^201 * 2^31 -.word 54210213 // zeta^169 * 2^31 = 28678040^169 * 2^31 -.word 16838301 // zeta^233 * 2^31 = 28678040^233 * 2^31 -.word 40841465 // zeta^ 50 * 2^31 = 28678040^ 50 * 2^31 -.word 3577749 // zeta^178 * 2^31 = 28678040^178 * 2^31 -.word 33845545 // zeta^114 * 2^31 = 28678040^114 * 2^31 -.word 19555165 // zeta^242 * 2^31 = 28678040^242 * 2^31 -.word 3459680519 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 50 * 375649793 * 2^31 -.word 495008363 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 28678040^178 * 375649793 * 2^31 -.word 1885546711 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 28678040^114 * 
375649793 * 2^31 -.word 3630382755 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 28678040^242 * 375649793 * 2^31 -.word 62758213 // zeta^ 25 * 2^31 = 28678040^ 25 * 2^31 -.word 8005843 // zeta^ 89 * 2^31 = 28678040^ 89 * 2^31 -.word 51922779 // zeta^ 57 * 2^31 = 28678040^ 57 * 2^31 -.word 7245689 // zeta^121 * 2^31 = 28678040^121 * 2^31 -.word 124982459 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 25 * 375649793 * 2^31 -.word 2964460845 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 89 * 375649793 * 2^31 -.word 1042630309 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 57 * 375649793 * 2^31 -.word 3756534407 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 28678040^121 * 375649793 * 2^31 -.word 30225471 // zeta^153 * 2^31 = 28678040^153 * 2^31 -.word 44151511 // zeta^217 * 2^31 = 28678040^217 * 2^31 -.word 64890121 // zeta^185 * 2^31 = 28678040^185 * 2^31 -.word 65259669 // zeta^249 * 2^31 = 28678040^249 * 2^31 -.word 12974361 // zeta^ 10 * 2^31 = 28678040^ 10 * 2^31 -.word 41807515 // zeta^138 * 2^31 = 28678040^138 * 2^31 -.word 56379967 // zeta^ 74 * 2^31 = 28678040^ 74 * 2^31 -.word 13380915 // zeta^202 * 2^31 = 28678040^202 * 2^31 -.word 1194393831 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 10 * 375649793 * 2^31 -.word 1648893797 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 28678040^138 * 375649793 * 2^31 -.word 753806273 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 74 * 375649793 * 2^31 -.word 4010528973 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 28678040^202 * 375649793 * 2^31 -.word 16772797 // zeta^ 5 * 2^31 = 28678040^ 5 * 2^31 -.word 58675875 // zeta^ 69 * 2^31 = 28678040^ 69 * 2^31 -.word 59974505 // zeta^ 37 * 2^31 = 28678040^ 37 * 2^31 -.word 33980107 // zeta^101 * 2^31 = 28678040^101 * 2^31 -.word 2122281795 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 5 * 375649793 * 2^31 -.word 2886667101 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 69 * 375649793 * 2^31 -.word 3771397783 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 37 * 
375649793 * 2^31 -.word 1168207669 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 28678040^101 * 375649793 * 2^31 -.word 28448893 // zeta^133 * 2^31 = 28678040^133 * 2^31 -.word 24378249 // zeta^197 * 2^31 = 28678040^197 * 2^31 -.word 62687027 // zeta^165 * 2^31 = 28678040^165 * 2^31 -.word 65645595 // zeta^229 * 2^31 = 28678040^229 * 2^31 -.word 52771617 // zeta^ 42 * 2^31 = 28678040^ 42 * 2^31 -.word 23396495 // zeta^170 * 2^31 = 28678040^170 * 2^31 -.word 51483005 // zeta^106 * 2^31 = 28678040^106 * 2^31 -.word 11487943 // zeta^234 * 2^31 = 28678040^234 * 2^31 -.word 2185629407 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 42 * 375649793 * 2^31 -.word 1858377073 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 28678040^170 * 375649793 * 2^31 -.word 432623747 // zeta^106 * (q^(-1) mod 2^32) * 2^31 = 28678040^106 * 375649793 * 2^31 -.word 2290121529 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 28678040^234 * 375649793 * 2^31 -.word 63287737 // zeta^ 21 * 2^31 = 28678040^ 21 * 2^31 -.word 56338313 // zeta^ 85 * 2^31 = 28678040^ 85 * 2^31 -.word 19445427 // zeta^ 53 * 2^31 = 28678040^ 53 * 2^31 -.word 29167561 // zeta^117 * 2^31 = 28678040^117 * 2^31 -.word 1659340871 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 21 * 375649793 * 2^31 -.word 1504424567 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 85 * 375649793 * 2^31 -.word 3591259981 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 53 * 375649793 * 2^31 -.word 4032612919 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 28678040^117 * 375649793 * 2^31 -.word 7740335 // zeta^149 * 2^31 = 28678040^149 * 2^31 -.word 23515783 // zeta^213 * 2^31 = 28678040^213 * 2^31 -.word 33583453 // zeta^181 * 2^31 = 28678040^181 * 2^31 -.word 60337403 // zeta^245 * 2^31 = 28678040^245 * 2^31 -.word 35192755 // zeta^ 26 * 2^31 = 28678040^ 26 * 2^31 -.word 36544119 // zeta^154 * 2^31 = 28678040^154 * 2^31 -.word 6787663 // zeta^ 90 * 2^31 = 28678040^ 90 * 2^31 -.word 63484749 // zeta^218 * 2^31 = 28678040^218 * 2^31 -.word 3019374157 // zeta^ 
26 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 26 * 375649793 * 2^31 -.word 2777089929 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 28678040^154 * 375649793 * 2^31 -.word 443777969 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 90 * 375649793 * 2^31 -.word 723799731 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 28678040^218 * 375649793 * 2^31 -.word 61997615 // zeta^ 13 * 2^31 = 28678040^ 13 * 2^31 -.word 4479011 // zeta^ 77 * 2^31 = 28678040^ 77 * 2^31 -.word 38089877 // zeta^ 45 * 2^31 = 28678040^ 45 * 2^31 -.word 16590903 // zeta^109 * 2^31 = 28678040^109 * 2^31 -.word 201839569 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 13 * 375649793 * 2^31 -.word 998311389 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 77 * 375649793 * 2^31 -.word 1502911851 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 45 * 375649793 * 2^31 -.word 1931017673 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 28678040^109 * 375649793 * 2^31 -.word 43852787 // zeta^141 * 2^31 = 28678040^141 * 2^31 -.word 24597857 // zeta^205 * 2^31 = 28678040^205 * 2^31 -.word 43936833 // zeta^173 * 2^31 = 28678040^173 * 2^31 -.word 15636061 // zeta^237 * 2^31 = 28678040^237 * 2^31 -.word 55869129 // zeta^ 58 * 2^31 = 28678040^ 58 * 2^31 -.word 16038683 // zeta^186 * 2^31 = 28678040^186 * 2^31 -.word 43560065 // zeta^122 * 2^31 = 28678040^122 * 2^31 -.word 25949329 // zeta^250 * 2^31 = 28678040^250 * 2^31 -.word 2098944823 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 58 * 375649793 * 2^31 -.word 634278629 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 28678040^186 * 375649793 * 2^31 -.word 2076204415 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 28678040^122 * 375649793 * 2^31 -.word 2002629999 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 28678040^250 * 375649793 * 2^31 -.word 6591765 // zeta^ 29 * 2^31 = 28678040^ 29 * 2^31 -.word 1696249 // zeta^ 93 * 2^31 = 28678040^ 93 * 2^31 -.word 21795289 // zeta^ 61 * 2^31 = 28678040^ 61 * 2^31 -.word 17734591 // zeta^125 * 2^31 = 28678040^125 * 2^31 -.word 3812244715 // zeta^ 29 
* (q^(-1) mod 2^32) * 2^31 = 28678040^ 29 * 375649793 * 2^31 -.word 1467340807 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 93 * 375649793 * 2^31 -.word 1570891815 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 61 * 375649793 * 2^31 -.word 1349179969 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 28678040^125 * 375649793 * 2^31 -.word 66853037 // zeta^157 * 2^31 = 28678040^157 * 2^31 -.word 24930199 // zeta^221 * 2^31 = 28678040^221 * 2^31 -.word 54854635 // zeta^189 * 2^31 = 28678040^189 * 2^31 -.word 39952565 // zeta^253 * 2^31 = 28678040^253 * 2^31 -.word 5623923 // zeta^ 6 * 2^31 = 28678040^ 6 * 2^31 -.word 38701067 // zeta^134 * 2^31 = 28678040^134 * 2^31 -.word 18571677 // zeta^ 70 * 2^31 = 28678040^ 70 * 2^31 -.word 14491707 // zeta^198 * 2^31 = 28678040^198 * 2^31 -.word 182627725 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 6 * 375649793 * 2^31 -.word 4172670453 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 28678040^134 * 375649793 * 2^31 -.word 1902166115 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 70 * 375649793 * 2^31 -.word 4183371205 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 28678040^198 * 375649793 * 2^31 -.word 17941849 // zeta^ 3 * 2^31 = 28678040^ 3 * 2^31 -.word 12982967 // zeta^ 67 * 2^31 = 28678040^ 67 * 2^31 -.word 8061707 // zeta^ 35 * 2^31 = 28678040^ 35 * 2^31 -.word 17774995 // zeta^ 99 * 2^31 = 28678040^ 99 * 2^31 -.word 4091524263 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 3 * 375649793 * 2^31 -.word 2462649161 // zeta^ 67 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 67 * 375649793 * 2^31 -.word 2874632949 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 35 * 375649793 * 2^31 -.word 2009367661 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 99 * 375649793 * 2^31 -.word 61107981 // zeta^131 * 2^31 = 28678040^131 * 2^31 -.word 38975641 // zeta^195 * 2^31 = 28678040^195 * 2^31 -.word 40352225 // zeta^163 * 2^31 = 28678040^163 * 2^31 -.word 49569327 // zeta^227 * 2^31 = 28678040^227 * 2^31 -.word 26799603 // zeta^ 38 * 2^31 = 
28678040^ 38 * 2^31 -.word 33463463 // zeta^166 * 2^31 = 28678040^166 * 2^31 -.word 39332725 // zeta^102 * 2^31 = 28678040^102 * 2^31 -.word 61125067 // zeta^230 * 2^31 = 28678040^230 * 2^31 -.word 583438349 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 38 * 375649793 * 2^31 -.word 1692658009 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 28678040^166 * 375649793 * 2^31 -.word 1738958475 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 28678040^102 * 375649793 * 2^31 -.word 2248227893 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 28678040^230 * 375649793 * 2^31 -.word 40014327 // zeta^ 19 * 2^31 = 28678040^ 19 * 2^31 -.word 562885 // zeta^ 83 * 2^31 = 28678040^ 83 * 2^31 -.word 51009393 // zeta^ 51 * 2^31 = 28678040^ 51 * 2^31 -.word 51995259 // zeta^115 * 2^31 = 28678040^115 * 2^31 -.word 2564101129 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 19 * 375649793 * 2^31 -.word 2196183867 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 83 * 375649793 * 2^31 -.word 2252083855 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 51 * 375649793 * 2^31 -.word 4038290309 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 28678040^115 * 375649793 * 2^31 -.word 24330211 // zeta^147 * 2^31 = 28678040^147 * 2^31 -.word 7682101 // zeta^211 * 2^31 = 28678040^211 * 2^31 -.word 7401943 // zeta^179 * 2^31 = 28678040^179 * 2^31 -.word 41757453 // zeta^243 * 2^31 = 28678040^243 * 2^31 -.word 65375453 // zeta^ 22 * 2^31 = 28678040^ 22 * 2^31 -.word 40797001 // zeta^150 * 2^31 = 28678040^150 * 2^31 -.word 59835311 // zeta^ 86 * 2^31 = 28678040^ 86 * 2^31 -.word 32875577 // zeta^214 * 2^31 = 28678040^214 * 2^31 -.word 4014413091 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 22 * 375649793 * 2^31 -.word 3224262327 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 28678040^150 * 375649793 * 2^31 -.word 741855825 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 86 * 375649793 * 2^31 -.word 2318439879 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 28678040^214 * 375649793 * 2^31 -.word 10045293 // zeta^ 11 * 2^31 = 
28678040^ 11 * 2^31 -.word 53076657 // zeta^ 75 * 2^31 = 28678040^ 75 * 2^31 -.word 17896617 // zeta^ 43 * 2^31 = 28678040^ 43 * 2^31 -.word 58413331 // zeta^107 * 2^31 = 28678040^107 * 2^31 -.word 3080518291 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 11 * 375649793 * 2^31 -.word 3700229967 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 75 * 375649793 * 2^31 -.word 297370967 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 43 * 375649793 * 2^31 -.word 2151902445 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 28678040^107 * 375649793 * 2^31 -.word 19472551 // zeta^139 * 2^31 = 28678040^139 * 2^31 -.word 6043561 // zeta^203 * 2^31 = 28678040^203 * 2^31 -.word 20934449 // zeta^171 * 2^31 = 28678040^171 * 2^31 -.word 37620445 // zeta^235 * 2^31 = 28678040^235 * 2^31 -.word 12921459 // zeta^ 54 * 2^31 = 28678040^ 54 * 2^31 -.word 63769677 // zeta^182 * 2^31 = 28678040^182 * 2^31 -.word 61505033 // zeta^118 * 2^31 = 28678040^118 * 2^31 -.word 65692461 // zeta^246 * 2^31 = 28678040^246 * 2^31 -.word 1006064525 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 54 * 375649793 * 2^31 -.word 2459563443 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 28678040^182 * 375649793 * 2^31 -.word 2747128823 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 28678040^118 * 375649793 * 2^31 -.word 2288082643 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 28678040^246 * 375649793 * 2^31 -.word 20171011 // zeta^ 27 * 2^31 = 28678040^ 27 * 2^31 -.word 36495001 // zeta^ 91 * 2^31 = 28678040^ 91 * 2^31 -.word 62685175 // zeta^ 59 * 2^31 = 28678040^ 59 * 2^31 -.word 664745 // zeta^123 * 2^31 = 28678040^123 * 2^31 -.word 1031427325 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 27 * 375649793 * 2^31 -.word 2764118887 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 91 * 375649793 * 2^31 -.word 583476745 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 59 * 375649793 * 2^31 -.word 2371908951 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 28678040^123 * 375649793 * 2^31 -.word 56713759 // zeta^155 * 2^31 = 
28678040^155 * 2^31 -.word 59594509 // zeta^219 * 2^31 = 28678040^219 * 2^31 -.word 41235703 // zeta^187 * 2^31 = 28678040^187 * 2^31 -.word 11581499 // zeta^251 * 2^31 = 28678040^251 * 2^31 -.word 23458751 // zeta^ 14 * 2^31 = 28678040^ 14 * 2^31 -.word 9406759 // zeta^142 * 2^31 = 28678040^142 * 2^31 -.word 33711991 // zeta^ 78 * 2^31 = 28678040^ 78 * 2^31 -.word 32167773 // zeta^206 * 2^31 = 28678040^206 * 2^31 -.word 1501790785 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 14 * 375649793 * 2^31 -.word 2911894745 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 28678040^142 * 375649793 * 2^31 -.word 1905016457 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 78 * 375649793 * 2^31 -.word 204130979 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 28678040^206 * 375649793 * 2^31 -.word 26043621 // zeta^ 7 * 2^31 = 28678040^ 7 * 2^31 -.word 51942461 // zeta^ 71 * 2^31 = 28678040^ 71 * 2^31 -.word 14401009 // zeta^ 39 * 2^31 = 28678040^ 39 * 2^31 -.word 60574133 // zeta^103 * 2^31 = 28678040^103 * 2^31 -.word 1827638555 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 7 * 375649793 * 2^31 -.word 3437088195 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 71 * 375649793 * 2^31 -.word 2892737551 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 39 * 375649793 * 2^31 -.word 3197159499 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 28678040^103 * 375649793 * 2^31 -.word 16031087 // zeta^135 * 2^31 = 28678040^135 * 2^31 -.word 25566271 // zeta^199 * 2^31 = 28678040^199 * 2^31 -.word 54040269 // zeta^167 * 2^31 = 28678040^167 * 2^31 -.word 36895029 // zeta^231 * 2^31 = 28678040^231 * 2^31 -.word 41803191 // zeta^ 46 * 2^31 = 28678040^ 46 * 2^31 -.word 19377381 // zeta^174 * 2^31 = 28678040^174 * 2^31 -.word 9664027 // zeta^110 * 2^31 = 28678040^110 * 2^31 -.word 55794235 // zeta^238 * 2^31 = 28678040^238 * 2^31 -.word 2460960841 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 46 * 375649793 * 2^31 -.word 1411728667 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 28678040^174 * 375649793 
* 2^31 -.word 1300076517 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 28678040^110 * 375649793 * 2^31 -.word 3978752965 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 28678040^238 * 375649793 * 2^31 -.word 19675339 // zeta^ 23 * 2^31 = 28678040^ 23 * 2^31 -.word 21359151 // zeta^ 87 * 2^31 = 28678040^ 87 * 2^31 -.word 63140729 // zeta^ 55 * 2^31 = 28678040^ 55 * 2^31 -.word 23160723 // zeta^119 * 2^31 = 28678040^119 * 2^31 -.word 398439733 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 23 * 375649793 * 2^31 -.word 897838033 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 87 * 375649793 * 2^31 -.word 494618247 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 55 * 375649793 * 2^31 -.word 3040761453 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 28678040^119 * 375649793 * 2^31 -.word 9258847 // zeta^151 * 2^31 = 28678040^151 * 2^31 -.word 4669959 // zeta^215 * 2^31 = 28678040^215 * 2^31 -.word 41266143 // zeta^183 * 2^31 = 28678040^183 * 2^31 -.word 61464071 // zeta^247 * 2^31 = 28678040^247 * 2^31 -.word 43355169 // zeta^ 30 * 2^31 = 28678040^ 30 * 2^31 -.word 5591977 // zeta^158 * 2^31 = 28678040^158 * 2^31 -.word 40694335 // zeta^ 94 * 2^31 = 28678040^ 94 * 2^31 -.word 25071607 // zeta^222 * 2^31 = 28678040^222 * 2^31 -.word 1107279327 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 30 * 375649793 * 2^31 -.word 552289879 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 28678040^158 * 375649793 * 2^31 -.word 879592385 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 94 * 375649793 * 2^31 -.word 2040862217 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 28678040^222 * 375649793 * 2^31 -.word 34737117 // zeta^ 15 * 2^31 = 28678040^ 15 * 2^31 -.word 45994147 // zeta^ 79 * 2^31 = 28678040^ 79 * 2^31 -.word 42273719 // zeta^ 47 * 2^31 = 28678040^ 47 * 2^31 -.word 60428681 // zeta^111 * 2^31 = 28678040^111 * 2^31 -.word 303076899 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 15 * 375649793 * 2^31 -.word 3854339421 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 79 * 375649793 * 
2^31 -.word 3799259721 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 47 * 375649793 * 2^31 -.word 1636911223 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 28678040^111 * 375649793 * 2^31 -.word 26028927 // zeta^143 * 2^31 = 28678040^143 * 2^31 -.word 64083527 // zeta^207 * 2^31 = 28678040^207 * 2^31 -.word 60382541 // zeta^175 * 2^31 = 28678040^175 * 2^31 -.word 31337387 // zeta^239 * 2^31 = 28678040^239 * 2^31 -.word 27553395 // zeta^ 62 * 2^31 = 28678040^ 62 * 2^31 -.word 7648471 // zeta^190 * 2^31 = 28678040^190 * 2^31 -.word 689375 // zeta^126 * 2^31 = 28678040^126 * 2^31 -.word 46555773 // zeta^254 * 2^31 = 28678040^254 * 2^31 -.word 1673531277 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 62 * 375649793 * 2^31 -.word 1889513769 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 28678040^190 * 375649793 * 2^31 -.word 1477062945 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 28678040^126 * 375649793 * 2^31 -.word 2252242819 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 28678040^254 * 375649793 * 2^31 -.word 15797163 // zeta^ 31 * 2^31 = 28678040^ 31 * 2^31 -.word 40170027 // zeta^ 95 * 2^31 = 28678040^ 95 * 2^31 -.word 10866061 // zeta^ 63 * 2^31 = 28678040^ 63 * 2^31 -.word 56298001 // zeta^127 * 2^31 = 28678040^127 * 2^31 -.word 683123285 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 31 * 375649793 * 2^31 -.word 2755967957 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 95 * 375649793 * 2^31 -.word 273527923 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 63 * 375649793 * 2^31 -.word 644194287 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 28678040^127 * 375649793 * 2^31 -.word 50400667 // zeta^159 * 2^31 = 28678040^159 * 2^31 -.word 33861863 // zeta^223 * 2^31 = 28678040^223 * 2^31 -.word 53736885 // zeta^191 * 2^31 = 28678040^191 * 2^31 -.word 31774129 // zeta^255 * 2^31 = 28678040^255 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040, %function -.global ntt_n256_u32_33556993_28678040 
-ntt_n256_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 
Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, 
Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r9
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r9
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r9
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r9
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmul.u32 Q2, Q2, r9
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q1, Q1, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vmul.u32 Q2, Q2, r9
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q1, Q1, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[68]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q2, Q2, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(272)]
-// Release input[68] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q1, Q1, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q2, Q2, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q1, Q1, r9
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q2, Q2, r9
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q1, Q1, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q2, Q2, r9
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[12]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vmul.u32 Q1, Q1, r9
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[28]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[44]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q2, Q2, r9
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q1, Q1, r9
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(128)]
-// Release input[32] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[76]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[92]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q2, Q2, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(368)]
-// Release input[92] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[108]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q1, Q1, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(320)]
-// Release input[80] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[140]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q2, Q2, r9
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[156]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -96)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[156]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q1, Q1, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[172]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -80)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-384)]
-// Release input[156] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[172]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-320)]
-// Release input[172] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q2, Q2, r9
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[204]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -52)]
-vmul.u32 Q1, Q1, r9
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -56)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[220]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[220]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -44)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-128)]
-// Release input[220] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 
Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// 
Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vldrw.s32 Q5, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q5 -vldrw.s32 Q6, [r11, #-64] -vmul.u32 Q3, Q3, Q6 -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q5, Q2, Q5 -vsub.s32 Q7, Q1, Q4 -vmul.u32 Q2, Q2, Q6 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q5 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q5 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q5, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q6, Q5, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q7, Q5 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q7, Q7, Q6 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q5, Q7, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q7, Q4, Q5 -vstrw.s32 Q7, [r0, #-80] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vqrdmulh.s32 Q6, Q4, Q6 -vmul.u32 Q4, Q4, Q7 -vqrdmlah.s32 Q6, Q4, r12 -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-16] -vadd.s32 Q5, Q5, Q6 -vstrw.s32 Q5, [r0, #-32] -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2857 -// Instruction count: 2421 \ No newline at end of file diff --git a/tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete.s b/tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete.s deleted file mode 100644 index 5915c64..0000000 
--- a/tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,2025 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040_incomplete, %function -.global ntt_n256_u32_33556993_28678040_incomplete -ntt_n256_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as 
Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -120)]
-vmul.u32 Q4, Q4, r9
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r10
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r12
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vmul.u32 Q1, Q1, r9
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vmul.u32 Q0, Q0, r9
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vmul.u32 Q2, Q2, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -40)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q2, Q2, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q1, Q1, r9
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -24)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -20)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -16)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q0, Q0, r9
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q1, Q1, r9
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -64)]
-vmul.u32 Q0, Q0, r9
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmul.u32 Q2, Q2, r9
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 36)]
-vmul.u32 Q1, Q1, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 44)]
-vmul.u32 Q2, Q2, r9
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vmul.u32 Q1, Q1, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 100)]
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[68]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q2, Q2, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 88)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(272)]
-// Release input[68] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vmul.u32 Q1, Q1, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -88)]
-vmul.u32 Q2, Q2, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -104)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -84)]
-vmul.u32 Q1, Q1, r9
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -100)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -80)]
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vmul.u32 Q2, Q2, r9
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -44)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -24)]
-vmul.u32 Q1, Q1, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -40)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -20)]
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -16)]
-vmul.u32 Q2, Q2, r9
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -32)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[12]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vmul.u32 Q1, Q1, r9
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 4)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[28]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 44)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[44]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 40)]
-vmul.u32 Q2, Q2, r9
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q1, Q1, r9
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(128)]
-// Release input[32] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[76]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 68)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[92]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q2, Q2, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 80)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 108)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(368)]
-// Release input[92] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[108]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q1, Q1, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(320)]
-// Release input[80] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32
Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] 
-vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 
-vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -52)] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, 
[r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmul.u32 Q1, Q1, r9 -// 
input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1993 -// Instruction count: 1557 \ No newline at end of file diff --git a/tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s b/tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s deleted file mode 100644 index eed73db..0000000 --- a/tests/poly/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s +++ /dev/null @@ -1,2316 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice 
shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 
375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 
375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 
375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 
375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040_incomplete_double, %function -.global ntt_n256_u32_33556993_28678040_incomplete_double -ntt_n256_u32_33556993_28678040_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 
-vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -56)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, 
Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 16)] 
-vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -40)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 
-vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * -24)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 
Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -80)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 
Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release 
input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 
-vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 
-// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, 
Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 
Q3, [r14, #(4 * -80)] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -44)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// 
input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -20)] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] 
-vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q1, [r1,#(96)] -vqrdmulh.s32 Q7, Q1, r6 -vadd.s32 Q3, Q3, Q5 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q3, [r1,#(64)] -vqrdmlah.s32 Q7, Q1, r12 -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(112)] -vqrdmulh.s32 Q7, Q3, r6 -vsub.s32 Q4, Q2, Q6 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q4, [r1,#(32)] -vqrdmlah.s32 Q7, Q3, r12 -vstrw.u32 Q7, [r1,#(80)] 
-// Release input[8] from Q3 -vqrdmulh.s32 Q7, Q4, r8 -vadd.s32 Q2, Q2, Q6 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q2, [r1,#(0)]! -vqrdmlah.s32 Q7, Q4, r12 -vneg.s32 Q7, Q7 -// Release input[4] from Q4 -vqrdmulh.s32 Q1, Q2, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q2, Q2, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[0] from Q2 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vstrw.u32 Q0, [r1,#(224)] -vqrdmulh.s32 Q7, Q0, r6 -vadd.s32 Q2, Q2, Q5 -vmul.u32 Q0, Q0, r5 -vstrw.u32 Q2, [r1,#(192)] -vqrdmlah.s32 Q7, Q0, r12 -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q2, r6 -vsub.s32 Q3, Q1, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#(160)] -vqrdmlah.s32 Q7, Q2, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[24] from Q2 -vqrdmulh.s32 Q7, Q3, r8 -vadd.s32 Q1, Q1, Q6 -vmul.u32 Q3, Q3, r7 -vstrw.u32 Q1, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q3, r12 -vneg.s32 Q7, Q7 -// Release input[20] from Q3 -vqrdmulh.s32 Q0, Q1, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q1, Q1, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q0, [r1,#(16)] -// Release input[16] from Q1 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r9 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[44] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[40] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[32] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[60] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[48] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[64] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vmul.u32 Q3, Q3, r9 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[92] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[88] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[84] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[80] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q4, Q4, r9 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[108] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[104] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[100] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[96] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r9 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[124] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[120] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[116] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[112] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[140] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[136] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[132] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[128] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r9 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[156] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[152] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[148] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[144] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q4, Q4, r9 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[172] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[168] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[164] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[160] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vmul.u32 Q3, Q3, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[180] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[176] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q4, Q4, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[204] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[196] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[192] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r9 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -44)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[220] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[216] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[208] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1,#(224)] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[236] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1,#(240)] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1,#(208)] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[224] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vmul.u32 Q3, Q3, r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -12)] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1,#(224)] -vqrdmulh.s32 Q6, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q1, [r1,#(192)] -vqrdmlah.s32 Q6, Q3, r12 -// Release input[252] from Q3 -vqrdmlah.s32 Q4, Q2, r12 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r1,#(240)] -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#(160)] -vqrdmlah.s32 Q6, Q1, r12 -vstrw.u32 Q6, [r1,#(208)] -// Release input[248] from Q1 -vqrdmulh.s32 Q6, Q2, r8 -vadd.s32 Q0, Q0, Q4 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1,#(128)]! 
-vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q6, Q6 -// Release input[244] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q6, [r1,#(48)] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#(16)] -// Release input[240] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2284 -// Instruction count: 1848 \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_16_anticyclic_mve_simd.s b/tests/poly/auto/poly_u16_mul_16_anticyclic_mve_simd.s deleted file mode 100644 index cf128b4..0000000 --- a/tests/poly/auto/poly_u16_mul_16_anticyclic_mve_simd.s +++ /dev/null @@ -1,108 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_16_anticyclic_mve_simd, %function -.global poly_u16_mul_16_anticyclic_mve_simd -poly_u16_mul_16_anticyclic_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -vmov.u16 Q2, #0 -mov r12, #0 -vldrh.u16 Q3, [r2, #(2 * 0)] -vldrh.u16 Q4, [r2, #(2 * 8)] -vneg.s16 Q5, Q3 -ldrd r14, r11, [r1, #8] -ldrd r10, r9, [r1, #24] -vmul.u16 Q0, Q4, r14 -vmla.s16 Q0, Q3, r10 -vmul.u16 Q1, Q4, r10 -vmla.s16 Q1, Q5, r14 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -ldrd r8, r7, [r1, #0] -ldrd r6, r5, [r1, #16] -vmla.s16 Q0, Q4, r7 -vmla.s16 Q0, Q3, r5 -vmla.s16 Q1, Q4, r5 -vmla.s16 Q1, Q5, r7 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vmla.s16 Q0, Q4, r8 -vmla.s16 Q0, Q3, r6 -vmla.s16 Q1, Q4, r6 -vmla.s16 Q1, Q5, r8 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vneg.s16 Q5, Q4 -vmla.s16 Q0, Q3, r11 -vmla.s16 Q0, Q5, r9 -vmla.s16 Q1, Q4, r11 -vmla.s16 Q1, Q3, r9 -vsub.u16 Q0, Q0, Q2 -vmov.u16 Q2, #0 -vshlc Q0, r12, #16 -vshlc Q1, r12, #16 -vshlc Q2, r12, #16 -asrl r14, r11, #16 -asrl r10, r9, #16 -vmla.s16 Q0, Q3, r14 -vmla.s16 Q0, Q5, r10 -vmla.s16 Q1, Q4, r14 -vmla.s16 Q1, Q3, r10 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -asrl r8, r7, #16 -asrl r6, r5, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q0, Q5, r5 -vmla.s16 Q1, Q4, r7 -vmla.s16 Q1, Q3, r5 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vmla.s16 Q0, Q3, r8 -vmla.s16 Q0, Q5, r6 -vmla.s16 Q1, Q4, r8 -vmla.s16 Q1, Q3, r6 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -neg r9, r9 -vmla.s16 Q0, Q3, r9 -vmla.s16 Q0, Q5, r11 -vmla.s16 Q1, Q4, r9 -vmla.s16 
Q1, Q3, r11 -vsub.u16 Q0, Q0, Q2 -vstrh.u16 Q0, [r0,#(0)] -vstrh.u16 Q1, [r0,#(16)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s b/tests/poly/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s deleted file mode 100644 index 4728c69..0000000 --- a/tests/poly/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s +++ /dev/null @@ -1,425 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_16_anticyclic_opt_mve_simd, %function -.global poly_u16_mul_16_anticyclic_opt_mve_simd -poly_u16_mul_16_anticyclic_opt_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -vldrh.u16 Q2, [r0, #(2 * 0)] -vldrh.u16 Q3, [r0, #(2 * 8)] -ldrd r10, r11, [r1, #0] -ldrd r8, r9, [r1, #16] -ldrd r6, r7, [r1, #24] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2, #(2 * 8)] -vmla.s16 Q3, Q1, r6 -ldrd r4, r5, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r9 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r11 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r7 -asrl r4, r5, #16 -asrl r6, r7, #16 -asrl r10, r11, #16 -asrl r8, r9, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r5 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r11 -vmla.s16 Q3, Q0, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r8 -vstrh.u16 Q3, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -mov r14, #30 -wls r14, r14, loop_end -loop_start: -mov r11, r11 -mov r11, r11 -mov r11, r11 -vldrh.u16 Q5, [r0, #(2 * 0)] -vldrh.u16 Q4, [r0, #(2 * 8)] -ldrd r10, r9, 
[r1, #0] -ldrd r8, r7, [r1, #16] -ldrd r6, r5, [r1, #24] -vldrh.u16 Q7, [r2, #(2 * 0)] -vmla.s16 Q5, Q7, r6 -vldrh.u16 Q6, [r2, #(2 * 8)] -vmla.s16 Q4, Q6, r6 -ldrd r4, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q5, Q6, r4 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r4 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r8 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r10 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r5 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r3 -vsub.u16 Q5, Q5, Q3 -vmla.s16 Q4, Q7, r5 -asrl r4, r3, #16 -asrl r6, r5, #16 -asrl r10, r9, #16 -asrl r8, r7, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r5 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r3 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r5 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r4 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r6 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r4 -vmla.s16 Q4, Q7, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r9 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0,#(0)] -vmla.s16 Q4, Q7, r8 -vstrh.u16 Q4, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -vldrh.u16 Q2, [r0, #(2 * 0)] -vldrh.u16 Q3, [r0, #(2 * 8)] -ldrd r10, r9, [r1, #0] -ldrd r8, r7, [r1, #16] -ldrd r6, r5, [r1, #24] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2, #(2 * 8)] -vmla.s16 Q3, Q1, r6 -ldrd r4, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r7 
-vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r5 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r5 -asrl r4, r3, #16 -asrl r6, r5, #16 -asrl r10, r9, #16 -asrl r8, r7, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r5 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r3 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r9 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r8 -vstrh.u16 Q3, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -le r14, loop_start -loop_end: -vldrh.u16 Q5, [r0, #(2 * 0)] -vldrh.u16 Q4, [r0, #(2 * 8)] -ldrd r14, r9, [r1, #0] -ldrd r10, r7, [r1, #16] -ldrd r8, r5, [r1, #24] -vldrh.u16 Q7, [r2, #(2 * 0)] -vmla.s16 Q5, Q7, r8 -vldrh.u16 Q6, [r2, #(2 * 8)] -vmla.s16 Q4, Q6, r8 -ldrd r6, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q5, Q6, r6 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r14 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r5 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r3 -vsub.u16 Q5, Q5, Q3 
-vmla.s16 Q4, Q7, r5 -asrl r6, r3, #16 -asrl r8, r5, #16 -asrl r14, r9, #16 -asrl r10, r7, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r5 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r3 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r5 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r6 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r6 -vmla.s16 Q4, Q7, r8 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r9 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r14 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0,#(0)] -vmla.s16 Q4, Q7, r10 -vstrh.u16 Q4, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -vldrh.u16 Q2, [r0, #(2 * 0)] -vldrh.u16 Q3, [r0, #(2 * 8)] -ldrd r14, r9, [r1, #0] -ldrd r10, r7, [r1, #16] -ldrd r8, r5, [r1, #24] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmla.s16 Q2, Q0, r8 -vldrh.u16 Q1, [r2, #(2 * 8)] -vmla.s16 Q3, Q1, r8 -ldrd r6, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r6 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r14 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r5 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r5 -asrl r6, r3, #16 -asrl r8, r5, #16 -asrl r14, r9, #16 -asrl r10, r7, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r5 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r3 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, 
Q1, r6 -vmla.s16 Q3, Q0, r8 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r9 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r14 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r10 -vstrh.u16 Q3, [r0,#(16)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_16_mve_simd.s b/tests/poly/auto/poly_u16_mul_16_mve_simd.s deleted file mode 100644 index cac2bab..0000000 --- a/tests/poly/auto/poly_u16_mul_16_mve_simd.s +++ /dev/null @@ -1,178 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_16_mve_simd, %function -.global poly_u16_mul_16_mve_simd -poly_u16_mul_16_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmul.u16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vmul.u16 Q1, Q0, r11 -vldrh.u16 Q2, [r2, #(2 * 8)] -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vmul.u16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vmov.u16 Q1, #0 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] 
-vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_256_toom4_mve.s b/tests/poly/auto/poly_u16_mul_256_toom4_mve.s deleted file mode 100644 index 5b5271e..0000000 --- a/tests/poly/auto/poly_u16_mul_256_toom4_mve.s +++ /dev/null @@ -1,1287 +0,0 @@ -.syntax unified -.type poly_u16_mul_64_C, %function -.global poly_u16_mul_64_C -.syntax 
unified -.type poly_u16_mul_256_toom4_mve, %function -.global poly_u16_mul_256_toom4_mve -poly_u16_mul_256_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #1792 -add sp, sp, #504 -add r14, sp, #1008 -add r1, r1, #504 -add r2, r2, #504 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -94)] -vldrw.u32 Q2, [r1, #(4 * -62)] -vldrw.u32 Q3, [r1, #(4 * -30)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(264)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -90)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * -58)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * -26)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-488)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-248)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(280)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -86)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * -54)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * -22)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-472)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-232)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(296)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -82)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * -50)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * -18)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-456)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, 
[sp,#(-216)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(312)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -78)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * -46)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #(4 * -14)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-440)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-200)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(328)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -74)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * -42)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * -10)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-424)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-184)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(344)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -70)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * -38)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * -6)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-408)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-168)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(360)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -66)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * -34)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * 
-2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-392)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-152)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(376)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -94)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * -62)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * -30)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-376)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-136)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(392)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -90)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * -58)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * -26)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-360)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-120)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(408)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -86)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * -54)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * -22)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-344)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-104)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(424)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -82)] 
-vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * -50)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * -18)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-328)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-88)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(440)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -78)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * -46)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * -14)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-312)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-72)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(456)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -74)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * -42)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * -10)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-296)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-56)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(472)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -70)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * -38)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * -6)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-280)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-40)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(488)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, 
Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -66)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * -34)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * -2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-264)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-24)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(248)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(504)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-248)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-264)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-8)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #256 -add r10, r2, #256 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(1024) -add r2, sp, #(1152) -add r0, sp, #(1280) -bl poly_u16_mul_64_C -add r1, sp, #(768) -add r2, sp, #(896) -add r0, sp, #(1024) -bl poly_u16_mul_64_C -add r1, sp, #(512) -add r2, sp, #(640) -add r0, sp, #(768) -bl poly_u16_mul_64_C -add r1, sp, #(256) -add r2, sp, #(384) -add r0, sp, #(512) -bl poly_u16_mul_64_C -add r1, sp, #(0) -add r2, sp, #(128) -add r0, sp, #(256) -bl poly_u16_mul_64_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_64_C -add r1, r11, #(128) -add r2, r10, #(128) -add r0, sp, #(1536) -bl poly_u16_mul_64_C -add sp, sp, #504 -add r14, sp, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vldrw.u32 Q1, [sp, #(4 * 2)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -62)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -122)] -vldrw.u32 Q4, [sp, #(4 * 66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 
-vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [sp, #(4 * 70)] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [sp, #(4 * 6)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [sp,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [sp,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -118)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 74)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -114)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 
-vldrw.u32 Q5, [sp, #(4 * 78)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 82)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 86)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 
-vstrw.u32 Q1, [sp,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 90)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 94)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 98)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -62)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-248)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 102)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 66)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 
-vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -58)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 2)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(8)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 6)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -58)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-232)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 106)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 70)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 6)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(24)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 10)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 
* -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -54)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-216)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 110)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 74)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -114)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 10)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(40)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 14)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -50)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 114)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-184)] -vshr.u16 Q6, Q6, 
#1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 14)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(56)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 18)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -46)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-184)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 118)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 18)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(72)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 22)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, 
Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -42)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-168)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 122)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 22)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(88)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 26)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -38)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-152)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 126)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 
-vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 26)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(104)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 30)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #(4 * -34)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(-136)] -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q3, [sp, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r14,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q1, [sp, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(120)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 34)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(136)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 
-vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -add sp, sp, #1792 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s b/tests/poly/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s deleted file mode 100644 index f0fd03e..0000000 --- a/tests/poly/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s +++ /dev/null @@ -1,773 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: 
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd, %function
-.global poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd
-poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd:
-push {r4-r11,lr}
-vpush {d8-d15}
-sub sp, sp, #224
-vld20.u16 {Q4, Q5}, [r2]
-vld21.u16 {Q4, Q5}, [r2]!
-vld20.u16 {Q6, Q7}, [r2]
-vld21.u16 {Q6, Q7}, [r2]!
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add sp, sp, #224 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s b/tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s deleted file mode 100644 index 3f7bb39..0000000 --- a/tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s +++ /dev/null @@ -1,268 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in 
the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd, %function -.global poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd -poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -nop -nop -nop -nop -nop -nop -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #24] -vmul.u16 Q0, Q4, r11 -ldrd r10, r9, [r1, #56] -vmul.u16 Q1, Q4, r9 -vneg.s16 Q3, Q6 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #16] -vmla.s16 Q1, Q6, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #48] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #8] -vmla.s16 Q1, Q6, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #40] -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #0] -vmla.s16 Q1, Q6, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #32] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q6, r11 -vstrh.u16 Q1, [r0,#(16)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(0)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vld20.u16 {Q2, Q3}, [r1] -vld21.u16 {Q2, Q3}, [r1]! -vst20.u16 {Q1, Q2}, [r1] -vst21.u16 {Q1, Q2}, [r1]! -vst20.u16 {Q3, Q4}, [r1] -vst21.u16 {Q3, Q4}, [r1]! -vadd.u16 Q0, Q0, Q1 -vadd.u16 Q2, Q2, Q3 -vst20.u16 {Q0, Q1}, [r1] -vst21.u16 {Q0, Q1}, [r1]! -vst20.u16 {Q2, Q3}, [r1] -vst21.u16 {Q2, Q3}, [r1]! 
-vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #-104] -vmul.u16 Q0, Q5, r11 -ldrd r10, r9, [r1, #-72] -vmul.u16 Q1, Q5, r9 -vneg.s16 Q3, Q7 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #-112] -vmla.s16 Q1, Q7, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #-80] -vmla.s16 Q1, Q7, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #-120] -vmla.s16 Q1, Q7, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #-88] -vmla.s16 Q1, Q7, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #-128] -vmla.s16 Q1, Q7, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #-96] -vmla.s16 Q1, Q7, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q7, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q7, r11 -vstrh.u16 Q1, [r0,#(48)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(32)] -vadd.u16 Q4, Q4, Q5 -vadd.u16 Q6, Q6, Q7 -vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #-40] -vmul.u16 Q0, Q4, r11 -ldrd r10, r9, [r1, #-8] -vmul.u16 Q1, Q4, r9 -vneg.s16 Q3, Q6 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #-48] -vmla.s16 Q1, Q6, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #-16] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, 
r10, [r1, #-56] -vmla.s16 Q1, Q6, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #-24] -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #-64] -vmla.s16 Q1, Q6, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #-32] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q6, r11 -vstrh.u16 Q1, [r0,#(80)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(64)] -nop -nop -nop -nop -nop -nop -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s b/tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s deleted file mode 100644 index 43aefad..0000000 --- a/tests/poly/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s +++ /dev/null @@ -1,749 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd, %function -.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd -poly_u16_mul_32_anticyclic_karatsuba_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #224 -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc 
Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc 
Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, 
#64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! 
-vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, 
r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)]
-vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)]
-mov r12, #0
-mov r11, sp
-vmov.u16 Q5, #0
-ldrd r10, r9, [r1, #24]
-vmul.u16 Q2, Q4, r9
-ldrd r8, r7, [r1, #56]
-vmul.u16 Q3, Q4, r7
-vneg.s16 Q7, Q6
-vmla.s16 Q2, Q7, r7
-ldrd r6, r5, [r1, #16]
-vmla.s16 Q3, Q6, r9
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r10
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r8
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r8
-ldrd r9, r7, [r1, #48]
-vmla.s16 Q3, Q6, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r5
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r7
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r7
-ldrd r10, r8, [r1, #8]
-vmla.s16 Q3, Q6, r5
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r9
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r9
-ldrd r7, r5, [r1, #40]
-vmla.s16 Q3, Q6, r6
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r8
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r5
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r5
-ldrd r9, r6, [r1, #0]
-vmla.s16 Q3, Q6, r8
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r10
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r7
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r7
-ldrd r8, r5, [r1, #32]
-vmla.s16 Q3, Q6, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r5
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r5
-vmla.s16 Q3, Q6, r6
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r9
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r8
-vshlc Q5, r12, #16
-vmla.s16 Q3, Q6, r9
-vstrh.u16 Q3, [r11,#(144)]
-vsub.u16 Q2, Q2, Q5
-vmla.s16 Q2, Q7, r8
-vstrh.u16 Q2, [r11,#(128)]
-vld20.u16 {Q0, Q1}, [r1]
-vld21.u16 {Q0, Q1}, [r1]!
-vadd.u16 Q0, Q0, Q1
-vst20.u16 {Q1, Q2}, [r11]
-vst21.u16 {Q1, Q2}, [r11]!
-vst20.u16 {Q0, Q1}, [r11]
-vst21.u16 {Q0, Q1}, [r11]!
-vld20.u16 {Q0, Q1}, [r1]
-vld21.u16 {Q0, Q1}, [r1]!
-vadd.u16 Q0, Q0, Q1
-vst20.u16 {Q1, Q2}, [r11]
-vst21.u16 {Q1, Q2}, [r11]!
-vst20.u16 {Q0, Q1}, [r11]
-vst21.u16 {Q0, Q1}, [r11]!
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)]
-vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)]
-vmov.u16 Q5, #0
-ldrd r10, r9, [r11, #-104]
-vmul.u16 Q2, Q0, r10
-ldrd r8, r7, [r11, #-40]
-vmul.u16 Q3, Q0, r8
-vneg.s16 Q7, Q1
-vmla.s16 Q2, Q7, r8
-ldrd r6, r5, [r11, #-112]
-ldrd r4, r3, [r11, #-48]
-vmla.s16 Q3, Q1, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q0, r5
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r3
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r3
-ldrd r10, r8, [r11, #-120]
-vmla.s16 Q3, Q1, r5
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q0, r6
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r4
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r4
-ldrd r5, r3, [r11, #-56]
-vmla.s16 Q3, Q1, r6
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q0, r8
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r3
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r3
-ldrd r6, r4, [r11, #-64]
-vmla.s16 Q3, Q1, r8
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q0, r10
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r5
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r5
-ldrd r8, r3, [r11, #-128]
-vmla.s16 Q3, Q1, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q0, r3
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r4
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r4
-vmla.s16 Q3, Q1, r3
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q0, r8
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r6
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r6
-vmla.s16 Q3, Q1, r8
-vshlc Q2, r12, #16
-neg r7, r7
-vmla.s16 Q2, Q0, r7
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q0, r9
-vshlc Q5, r12, #16
-vmla.s16 Q3, Q1, r7
-vsub.u16 Q2, Q2, Q5
-vmla.s16 Q2, Q7, r9
-vadd.u16 Q4, Q4, Q0
-vldrh.u16 Q5, [r11,#0]
-vadd.u16 Q5, Q5, Q2
-vldrh.u16 Q7, [r11,#16]
-vadd.u16 Q7, Q7, Q3
-vstrh.u16 Q5, [r0, #0]
-vstrh.u16 Q7, [r0, #16]
-vadd.u16 Q6, Q6, Q1
-vneg.s16 Q3, Q3
-vmov.u16 Q0, #0
-mov r12, #0
-ldrd r10, r9, [r11, #-72]
-vmla.s16 Q3, Q4, r9
-ldrd r8, r7, [r11, #-8]
-vmla.s16 Q2, Q4, r7
-vneg.s16 Q1, Q6
-vmla.s16 Q3, Q1, r7
-ldrd r6, r5, [r11, #-80]
-vmla.s16 Q2, Q6, r9
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r8
-vshlc Q0, r12, #16
-vmla.s16 Q3, Q1, r8
-ldrd r9, r7, [r11, #-16]
-vmla.s16 Q2, Q6, r10
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r5
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r7
-vshlc Q0, r12, #16
-vmla.s16 Q3, Q1, r7
-ldrd r10, r8, [r11, #-88]
-vmla.s16 Q2, Q6, r5
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r6
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r9
-vshlc Q0, r12, #16
-vmla.s16 Q3, Q1, r9
-ldrd r7, r5, [r11, #-24]
-vmla.s16 Q2, Q6, r6
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r8
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r5
-vshlc Q0, r12, #16
-vmla.s16 Q3, Q1, r5
-ldrd r9, r6, [r11, #-96]
-vmla.s16 Q2, Q6, r8
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r7
-vshlc Q0, r12, #16
-vmla.s16 Q3, Q1, r7
-ldrd r8, r5, [r11, #-32]
-vmla.s16 Q2, Q6, r10
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r6
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r5
-vshlc Q0, r12, #16
-vmla.s16 Q3, Q1, r5
-vmla.s16 Q2, Q6, r6
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r9
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r8
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q6, r9
-vsub.u16 Q3, Q3, Q0
-vmla.s16 Q3, Q1, r8
-vldrh.u16 Q0, [r11,#0]
-vldrh.u16 Q1, [r11,#16]
-vsub.u16 Q0, Q3, Q0
-vsub.u16 Q1, Q2, Q1
-vstrh.u16 Q0, [r0, #32]
-vstrh.u16 Q1, [r0, #48]
-add sp, sp, #224
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_mul_32_anticyclic_mve_simd.s b/tests/poly/auto/poly_u16_mul_32_anticyclic_mve_simd.s
deleted file mode 100644
index f26a204..0000000
--- a/tests/poly/auto/poly_u16_mul_32_anticyclic_mve_simd.s
+++ /dev/null
@@ -1,274 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_mve_simd, %function
-.global poly_u16_mul_32_anticyclic_mve_simd
-poly_u16_mul_32_anticyclic_mve_simd:
-push {r4-r11,lr}
-vpush {d8-d15}
-nop // XXX
-mov r14, #0x42
-mov r14, #0x3
-vmsr p0, r14
-vldrh.u16 Q0, [r2, #(2 * 0)]
-vldrh.u16 Q1, [r2, #(2 * 8)]
-vldrh.u16 Q2, [r2, #(2 * 16)]
-vldrh.u16 Q3, [r2, #(2 * 24)]
-ldrh r14, [r1, #14]
-ldrh r11, [r1, #30]
-ldrh r10, [r1, #46]
-ldrh r9, [r1, #62]
-vmul.u16 Q4, Q0, r14
-vmul.u16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmul.u16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmul.u16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #12]
-ldrh r11, [r1, #28]
-ldrh r10, [r1, #44]
-ldrh r9, [r1, #60]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #10]
-ldrh r11, [r1, #26]
-ldrh r10, [r1, #42]
-ldrh r9, [r1, #58]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #8]
-ldrh r11, [r1, #24]
-ldrh r10, [r1, #40]
-ldrh r9, [r1, #56]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #6]
-ldrh r11, [r1, #22]
-ldrh r10, [r1, #38]
-ldrh r9, [r1, #54]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #4]
-ldrh r11, [r1, #20]
-ldrh r10, [r1, #36]
-ldrh r9, [r1, #52]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #2]
-ldrh r11, [r1, #18]
-ldrh r10, [r1, #34]
-ldrh r9, [r1, #50]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vshlc Q4, r12, #16
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vshlc Q5, r12, #16
-vmla.s16 Q6, Q3, r9
-vshlc Q6, r12, #16
-vshlc Q7, r12, #16
-neg r12, r12
-vmov.u16 Q4[0], r12
-ldrh r14, [r1, #0]
-ldrh r11, [r1, #16]
-ldrh r10, [r1, #32]
-ldrh r9, [r1, #48]
-vmla.s16 Q4, Q0, r14
-vmla.s16 Q5, Q0, r11
-vmla.s16 Q5, Q1, r14
-vmla.s16 Q6, Q0, r10
-vmla.s16 Q6, Q1, r11
-vmla.s16 Q6, Q2, r14
-vmla.s16 Q7, Q0, r9
-vmla.s16 Q7, Q1, r10
-vmla.s16 Q7, Q2, r11
-vmla.s16 Q7, Q3, r14
-neg r11, r11
-neg r10, r10
-neg r9, r9
-vmla.s16 Q4, Q1, r9
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r11
-vmla.s16 Q5, Q2, r9
-vmla.s16 Q5, Q3, r10
-vmla.s16 Q6, Q3, r9
-neg r12, r12
-vstrh.u16 Q4, [r0,#(0)]
-vstrh.u16 Q5, [r0,#(16)]
-vstrh.u16 Q6, [r0,#(32)]
-vstrh.u16 Q7, [r0,#(48)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s b/tests/poly/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s
deleted file mode 100644
index 6724253..0000000
--- a/tests/poly/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s
+++ /dev/null
@@ -1,274 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_opt_mve_simd, %function
-.global poly_u16_mul_32_anticyclic_opt_mve_simd
-poly_u16_mul_32_anticyclic_opt_mve_simd:
-push {r4-r11,lr}
-vpush {d8-d15}
-ldrh r14, [r1, #30]
-vldrh.u16 Q4, [r2, #(2 * 0)]
-vmul.u16 Q1, Q4, r14
-ldrh r11, [r1, #46]
-vldrh.u16 Q5, [r2, #(2 * 8)]
-vmul.u16 Q2, Q4, r11
-ldrh r10, [r1, #62]
-vldrh.u16 Q6, [r2, #(2 * 16)]
-vmul.u16 Q3, Q4, r10
-ldrh r9, [r1, #14]
-vldrh.u16 Q7, [r2, #(2 * 24)]
-vmla.s16 Q3, Q5, r11
-neg r10, r10
-vmla.s16 Q2, Q5, r14
-neg r11, r11
-vmla.s16 Q3, Q6, r14
-neg r14, r14
-vmla.s16 Q3, Q7, r9
-ldrh r8, [r1, #12]
-vmul.u16 Q0, Q7, r14
-ldrh r7, [r1, #28]
-vmla.s16 Q0, Q6, r11
-ldrh r6, [r1, #44]
-vmla.s16 Q0, Q5, r10
-ldrh r5, [r1, #60]
-vmla.s16 Q0, Q4, r9
-vmla.s16 Q1, Q5, r9
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r11
-vmla.s16 Q1, Q6, r10
-neg r12, r12
-vmla.s16 Q2, Q6, r9
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r10
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r7
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r5
-vmla.s16 Q3, Q5, r6
-neg r5, r5
-vmla.s16 Q2, Q5, r7
-neg r6, r6
-vmla.s16 Q3, Q6, r7
-neg r7, r7
-vmla.s16 Q3, Q7, r8
-ldrh r14, [r1, #10]
-vmla.s16 Q0, Q7, r7
-ldrh r11, [r1, #26]
-vmla.s16 Q0, Q6, r6
-ldrh r10, [r1, #42]
-vmla.s16 Q0, Q5, r5
-ldrh r9, [r1, #58]
-vmla.s16 Q0, Q4, r8
-vmla.s16 Q1, Q5, r8
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r6
-vmla.s16 Q1, Q6, r5
-neg r12, r12
-vmla.s16 Q2, Q6, r8
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r5
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r11
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r10
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r9
-vmla.s16 Q3, Q5, r10
-neg r9, r9
-vmla.s16 Q2, Q5, r11
-neg r10, r10
-vmla.s16 Q3, Q6, r11
-neg r11, r11
-vmla.s16 Q3, Q7, r14
-ldrh r8, [r1, #8]
-vmla.s16 Q0, Q7, r11
-ldrh r7, [r1, #24]
-vmla.s16 Q0, Q6, r10
-ldrh r6, [r1, #40]
-vmla.s16 Q0, Q5, r9
-ldrh r5, [r1, #56]
-vmla.s16 Q0, Q4, r14
-vmla.s16 Q1, Q5, r14
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r10
-vmla.s16 Q1, Q6, r9
-neg r12, r12
-vmla.s16 Q2, Q6, r14
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r9
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r7
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r5
-vmla.s16 Q3, Q5, r6
-neg r5, r5
-vmla.s16 Q2, Q5, r7
-neg r6, r6
-vmla.s16 Q3, Q6, r7
-neg r7, r7
-vmla.s16 Q3, Q7, r8
-ldrh r14, [r1, #6]
-vmla.s16 Q0, Q7, r7
-ldrh r11, [r1, #22]
-vmla.s16 Q0, Q6, r6
-ldrh r10, [r1, #38]
-vmla.s16 Q0, Q5, r5
-ldrh r9, [r1, #54]
-vmla.s16 Q0, Q4, r8
-vmla.s16 Q1, Q5, r8
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r6
-vmla.s16 Q1, Q6, r5
-neg r12, r12
-vmla.s16 Q2, Q6, r8
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r5
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r11
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r10
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r9
-vmla.s16 Q3, Q5, r10
-neg r9, r9
-vmla.s16 Q2, Q5, r11
-neg r10, r10
-vmla.s16 Q3, Q6, r11
-neg r11, r11
-vmla.s16 Q3, Q7, r14
-ldrh r8, [r1, #4]
-vmla.s16 Q0, Q7, r11
-ldrh r7, [r1, #20]
-vmla.s16 Q0, Q6, r10
-ldrh r6, [r1, #36]
-vmla.s16 Q0, Q5, r9
-ldrh r5, [r1, #52]
-vmla.s16 Q0, Q4, r14
-vmla.s16 Q1, Q5, r14
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r10
-vmla.s16 Q1, Q6, r9
-neg r12, r12
-vmla.s16 Q2, Q6, r14
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r9
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r7
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r5
-vmla.s16 Q3, Q5, r6
-neg r5, r5
-vmla.s16 Q2, Q5, r7
-neg r6, r6
-vmla.s16 Q3, Q6, r7
-neg r7, r7
-vmla.s16 Q3, Q7, r8
-ldrh r14, [r1, #2]
-vmla.s16 Q0, Q7, r7
-ldrh r11, [r1, #18]
-vmla.s16 Q0, Q6, r6
-ldrh r10, [r1, #34]
-vmla.s16 Q0, Q5, r5
-ldrh r9, [r1, #50]
-vmla.s16 Q0, Q4, r8
-vmla.s16 Q1, Q5, r8
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r6
-vmla.s16 Q1, Q6, r5
-neg r12, r12
-vmla.s16 Q2, Q6, r8
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r5
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r11
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r10
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r9
-vmla.s16 Q3, Q5, r10
-neg r9, r9
-vmla.s16 Q2, Q5, r11
-neg r10, r10
-vmla.s16 Q3, Q6, r11
-neg r11, r11
-vmla.s16 Q3, Q7, r14
-ldrh r8, [r1, #0]
-vmla.s16 Q0, Q7, r11
-ldrh r7, [r1, #16]
-vmla.s16 Q0, Q6, r10
-ldrh r6, [r1, #32]
-vmla.s16 Q0, Q5, r9
-ldrh r5, [r1, #48]
-vmla.s16 Q0, Q4, r14
-vmla.s16 Q1, Q5, r14
-vshlc Q3, r12, #16
-vmla.s16 Q1, Q7, r10
-vmla.s16 Q1, Q6, r9
-neg r12, r12
-vmla.s16 Q2, Q6, r14
-vshlc Q0, r12, #16
-vmla.s16 Q2, Q7, r9
-vshlc Q1, r12, #16
-vmla.s16 Q1, Q4, r7
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vmov.u16 Q3[0], r12
-vmla.s16 Q3, Q4, r5
-vmla.s16 Q3, Q5, r6
-neg r5, r5
-vmla.s16 Q2, Q5, r7
-neg r6, r6
-vmla.s16 Q3, Q6, r7
-neg r7, r7
-vmla.s16 Q3, Q7, r8
-ldrh r14, [r1, #-2]
-vmla.s16 Q0, Q7, r7
-ldrh r11, [r1, #14]
-vmla.s16 Q0, Q6, r6
-ldrh r10, [r1, #30]
-vmla.s16 Q0, Q5, r5
-ldrh r9, [r1, #46]
-vmla.s16 Q0, Q4, r8
-vmla.s16 Q1, Q5, r8
-vstrh.u16 Q3, [r0,#(48)]
-vmla.s16 Q1, Q7, r6
-vstrh.u16 Q0, [r0,#(0)]
-vmla.s16 Q1, Q6, r5
-neg r12, r12
-vmla.s16 Q2, Q6, r8
-vstrh.u16 Q1, [r0,#(16)]
-vmla.s16 Q2, Q7, r5
-vstrh.u16 Q2, [r0,#(32)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_mul_32_mve_simd.s b/tests/poly/auto/poly_u16_mul_32_mve_simd.s
deleted file mode 100644
index 0d85860..0000000
--- a/tests/poly/auto/poly_u16_mul_32_mve_simd.s
+++ /dev/null
@@ -1,386 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_mve_simd, %function
-.global poly_u16_mul_32_mve_simd
-poly_u16_mul_32_mve_simd:
-push {r4-r11,lr}
-vpush {d8-d15}
-mov r0, r0
-mov r0, r0
-mov r12, #0
-ldrh r14, [r1, #14]
-ldrh r11, [r1, #30]
-ldrh r10, [r1, #46]
-ldrh r9, [r1, #62]
-vldrh.u16 Q0, [r2, #(2 * 0)]
-vmul.u16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vmul.u16 Q1, Q0, r11
-vldrh.u16 Q2, [r2, #(2 * 8)]
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vmul.u16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vldrh.u16 Q3, [r2, #(2 * 16)]
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vmul.u16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vldrh.u16 Q4, [r2, #(2 * 24)]
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vmul.u16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vmul.u16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vmul.u16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vmov.u16 Q1, #0
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #12]
-ldrh r11, [r1, #28]
-ldrh r10, [r1, #44]
-ldrh r9, [r1, #60]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #10]
-ldrh r11, [r1, #26]
-ldrh r10, [r1, #42]
-ldrh r9, [r1, #58]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #8]
-ldrh r11, [r1, #24]
-ldrh r10, [r1, #40]
-ldrh r9, [r1, #56]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #6]
-ldrh r11, [r1, #22]
-ldrh r10, [r1, #38]
-ldrh r9, [r1, #54]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #4]
-ldrh r11, [r1, #20]
-ldrh r10, [r1, #36]
-ldrh r9, [r1, #52]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #2]
-ldrh r11, [r1, #18]
-ldrh r10, [r1, #34]
-ldrh r9, [r1, #50]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vshlc Q1, r12, #16
-vstrh.u16 Q1, [r0,#(112)]
-mov r12, #0
-ldrh r14, [r1, #0]
-ldrh r11, [r1, #16]
-ldrh r10, [r1, #32]
-ldrh r9, [r1, #48]
-vldrh.u16 Q1, [r0, #(2 * 0)]
-vmla.s16 Q1, Q0, r14
-vstrh.u16 Q1, [r0,#(0)]
-vldrh.u16 Q1, [r0, #(2 * 8)]
-vmla.s16 Q1, Q0, r11
-vmla.s16 Q1, Q2, r14
-vstrh.u16 Q1, [r0,#(16)]
-vldrh.u16 Q1, [r0, #(2 * 16)]
-vmla.s16 Q1, Q0, r10
-vmla.s16 Q1, Q2, r11
-vmla.s16 Q1, Q3, r14
-vstrh.u16 Q1, [r0,#(32)]
-vldrh.u16 Q1, [r0, #(2 * 24)]
-vmla.s16 Q1, Q0, r9
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q1, Q3, r11
-vmla.s16 Q1, Q4, r14
-vstrh.u16 Q1, [r0,#(48)]
-vldrh.u16 Q1, [r0, #(2 * 32)]
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q1, Q3, r10
-vmla.s16 Q1, Q4, r11
-vstrh.u16 Q1, [r0,#(64)]
-vldrh.u16 Q1, [r0, #(2 * 40)]
-vmla.s16 Q1, Q3, r9
-vmla.s16 Q1, Q4, r10
-vstrh.u16 Q1, [r0,#(80)]
-vldrh.u16 Q1, [r0, #(2 * 48)]
-vmla.s16 Q1, Q4, r9
-vstrh.u16 Q1, [r0,#(96)]
-vldrh.u16 Q1, [r0, #(2 * 56)]
-vstrh.u16 Q1, [r0,#(112)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_mul_512_toom4_mve.s b/tests/poly/auto/poly_u16_mul_512_toom4_mve.s
deleted file mode 100644
index 87219a7..0000000
--- a/tests/poly/auto/poly_u16_mul_512_toom4_mve.s
+++ /dev/null
@@ -1,2501 +0,0 @@
-.syntax unified
-.type poly_u16_mul_128_C, %function
-.global poly_u16_mul_128_C
-.syntax unified
-.type poly_u16_mul_512_toom4_mve, %function
-.global poly_u16_mul_512_toom4_mve
-poly_u16_mul_512_toom4_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-sub sp, sp, #3584
-add sp, sp, #504
-add r14, sp, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r1, r1, #504
-add r10, r1, #1008
-add r2, r2, #504
-add r9, r2, #1008
-mov r8, #1
-mov r7, #2
-mov r6, #3
-mov r5, #7
-vldrw.u32 Q0, [r1, #(4 * -126)]
-vldrw.u32 Q1, [r1, #(4 * -62)]
-vldrw.u32 Q2, [r1, #(4 * 2)]
-vldrw.u32 Q3, [r1, #(4 * 66)]
-vadd.u16 Q4, Q0, Q2
-vadd.u16 Q5, Q1, Q3
-vsub.u16 Q6, Q4, Q5
-vmla.s16 Q4, Q0, r6
-vstrw.u32 Q6, [r14,#(-488)]
-vmla.s16 Q6, Q5, r7
-vmla.s16 Q5, Q1, r6
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(24)]
-vmla.s16 Q7, Q2, r6
-vmla.s16 Q7, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -122)]
-vshl.u16 Q4, Q4, #1
-vldrw.u32 Q1, [r1, #(4 * -58)]
-vsub.u16 Q6, Q4, Q5
-vldrw.u32 Q2, [r1, #(4 * 6)]
-vadd.u16 Q4, Q4, Q5
-vldrw.u32 Q3, [r1, #(4 * 70)]
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r12,#(-472)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [sp,#(8)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [sp,#(-504)]
-vmla.s16 Q5, Q0, r6
-vstrw.u32 Q4, [r14,#(-472)]
-vmla.s16 Q4, Q7, r7
-vmla.s16 Q7, Q1, r6
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(40)]
-vmla.s16 Q6, Q2, r6
-vmla.s16 Q6, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -118)]
-vshl.u16 Q5, Q5, #1
-vldrw.u32 Q1, [r1, #(4 * -54)]
-vsub.u16 Q4, Q5, Q7
-vldrw.u32 Q2, [r1, #(4 * 10)]
-vadd.u16 Q5, Q5, Q7
-vldrw.u32 Q3, [r1, #(4 * 74)]
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r12,#(-456)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [sp,#(24)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [sp,#(-488)]
-vmla.s16 Q7, Q0, r6
-vstrw.u32 Q5, [r14,#(-456)]
-vmla.s16 Q5, Q6, r7
-vmla.s16 Q6, Q1, r6
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(56)]
-vmla.s16 Q4, Q2, r6
-vmla.s16 Q4, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -114)]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r1, #(4 * -50)]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r1, #(4 * 14)]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r1, #(4 * 78)]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r12,#(-440)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [sp,#(40)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(-472)]
-vmla.s16 Q6, Q0, r6
-vstrw.u32 Q7, [r14,#(-440)]
-vmla.s16 Q7, Q4, r7
-vmla.s16 Q4, Q1, r6
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(72)]
-vmla.s16 Q5, Q2, r6
-vmla.s16 Q5, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -110)]
-vshl.u16 Q6, Q6, #1
-vldrw.u32 Q1, [r1, #(4 * -46)]
-vsub.u16 Q7, Q6, Q4
-vldrw.u32 Q2, [r1, #(4 * 18)]
-vadd.u16 Q6, Q6, Q4
-vldrw.u32 Q3, [r1, #(4 * 82)]
-vadd.u16 Q4, Q0, Q2
-vstrw.u32 Q5, [r12,#(-424)]
-vadd.u16 Q5, Q1, Q3
-vstrw.u32 Q6, [sp,#(56)]
-vsub.u16 Q6, Q4, Q5
-vstrw.u32 Q7, [sp,#(-456)]
-vmla.s16 Q4, Q0, r6
-vstrw.u32 Q6, [r14,#(-424)]
-vmla.s16 Q6, Q5, r7
-vmla.s16 Q5, Q1, r6
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(88)]
-vmla.s16 Q7, Q2, r6
-vmla.s16 Q7, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -106)]
-vshl.u16 Q4, Q4, #1
-vldrw.u32 Q1, [r1, #(4 * -42)]
-vsub.u16 Q6, Q4, Q5
-vldrw.u32 Q2, [r1, #(4 * 22)]
-vadd.u16 Q4, Q4, Q5
-vldrw.u32 Q3, [r1, #(4 * 86)]
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r12,#(-408)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [sp,#(72)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [sp,#(-440)]
-vmla.s16 Q5, Q0, r6
-vstrw.u32 Q4, [r14,#(-408)]
-vmla.s16 Q4, Q7, r7
-vmla.s16 Q7, Q1, r6
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(104)]
-vmla.s16 Q6, Q2, r6
-vmla.s16 Q6, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -102)]
-vshl.u16 Q5, Q5, #1
-vldrw.u32 Q1, [r1, #(4 * -38)]
-vsub.u16 Q4, Q5, Q7
-vldrw.u32 Q2, [r1, #(4 * 26)]
-vadd.u16 Q5, Q5, Q7
-vldrw.u32 Q3, [r1, #(4 * 90)]
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r12,#(-392)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [sp,#(88)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [sp,#(-424)]
-vmla.s16 Q7, Q0, r6
-vstrw.u32 Q5, [r14,#(-392)]
-vmla.s16 Q5, Q6, r7
-vmla.s16 Q6, Q1, r6
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(120)]
-vmla.s16 Q4, Q2, r6
-vmla.s16 Q4, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -98)]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r1, #(4 * -34)]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r1, #(4 * 30)]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r1, #(4 * 94)]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r12,#(-376)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [sp,#(104)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(-408)]
-vmla.s16 Q6, Q0, r6
-vstrw.u32 Q7, [r14,#(-376)]
-vmla.s16 Q7, Q4, r7
-vmla.s16 Q4, Q1, r6
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(136)]
-vmla.s16 Q5, Q2, r6
-vmla.s16 Q5, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -94)]
-vshl.u16 Q6, Q6, #1
-vldrw.u32 Q1, [r1, #(4 * -30)]
-vsub.u16 Q7, Q6, Q4
-vldrw.u32 Q2, [r1, #(4 * 34)]
-vadd.u16 Q6, Q6, Q4
-vldrw.u32 Q3, [r1, #(4 * 98)]
-vadd.u16 Q4, Q0, Q2
-vstrw.u32 Q5, [r12,#(-360)]
-vadd.u16 Q5, Q1, Q3
-vstrw.u32 Q6, [sp,#(120)]
-vsub.u16 Q6, Q4, Q5
-vstrw.u32 Q7, [sp,#(-392)]
-vmla.s16 Q4, Q0, r6
-vstrw.u32 Q6, [r14,#(-360)]
-vmla.s16 Q6, Q5, r7
-vmla.s16 Q5, Q1, r6
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(152)]
-vmla.s16 Q7, Q2, r6
-vmla.s16 Q7, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -90)]
-vshl.u16 Q4, Q4, #1
-vldrw.u32 Q1, [r1, #(4 * -26)]
-vsub.u16 Q6, Q4, Q5
-vldrw.u32 Q2, [r1, #(4 * 38)]
-vadd.u16 Q4, Q4, Q5
-vldrw.u32 Q3, [r1, #(4 * 102)]
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r12,#(-344)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [sp,#(136)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [sp,#(-376)]
-vmla.s16 Q5, Q0, r6
-vstrw.u32 Q4, [r14,#(-344)]
-vmla.s16 Q4, Q7, r7
-vmla.s16 Q7, Q1, r6
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(168)]
-vmla.s16 Q6, Q2, r6
-vmla.s16 Q6, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -86)]
-vshl.u16 Q5, Q5, #1
-vldrw.u32 Q1, [r1, #(4 * -22)]
-vsub.u16 Q4, Q5, Q7
-vldrw.u32 Q2, [r1, #(4 * 42)]
-vadd.u16 Q5, Q5, Q7
-vldrw.u32 Q3, [r1, #(4 * 106)]
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r12,#(-328)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [sp,#(152)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [sp,#(-360)]
-vmla.s16 Q7, Q0, r6
-vstrw.u32 Q5, [r14,#(-328)]
-vmla.s16 Q5, Q6, r7
-vmla.s16 Q6, Q1, r6
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(184)]
-vmla.s16 Q4, Q2, r6
-vmla.s16 Q4, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -82)]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r1, #(4 * -18)]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r1, #(4 * 46)]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r1, #(4 * 110)]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r12,#(-312)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [sp,#(168)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(-344)]
-vmla.s16 Q6, Q0, r6
-vstrw.u32 Q7, [r14,#(-312)]
-vmla.s16 Q7, Q4, r7
-vmla.s16 Q4, Q1, r6
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(200)]
-vmla.s16 Q5, Q2, r6
-vmla.s16 Q5, Q3, r5
-vldrw.u32 Q0, [r1, #(4 * -78)]
-vshl.u16 Q6, Q6, #1
-vldrw.u32 Q1, [r1, #(4 * -14)]
-vsub.u16 Q7, Q6, Q4
-vldrw.u32 Q2, [r1, #(4 * 50)]
-vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #(4 * 114)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-296)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-296)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 54)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * 118)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-280)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(200)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 58)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * 122)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-264)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 62)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * 126)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(232)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-248)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q6, Q6, #1 
-vldrw.u32 Q1, [r2, #(4 * -62)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 2)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * 66)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-232)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -58)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 6)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * 70)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-216)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(264)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-216)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -54)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 10)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * 74)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-200)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -50)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 14)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * 78)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-184)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(296)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-184)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q5, Q2, 
r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -46)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 18)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * 82)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-168)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -42)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 22)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * 86)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-152)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-152)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -38)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 26)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * 90)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-136)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -34)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 30)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * 94)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-120)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-120)] -vmla.s16 Q7, Q4, r7 
-vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -30)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 34)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * 98)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-104)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -90)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -26)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 38)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * 102)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-88)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-88)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -22)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 42)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * 106)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-72)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(440)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -18)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 46)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * 110)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-56)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, 
[sp,#(-88)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-56)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -14)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 50)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * 114)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-40)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 54)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * 118)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-24)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 58)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * 122)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-8)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 62)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * 126)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(8)] -vadd.u16 Q4, Q1, Q3 
-vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(8)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(24)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-8)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(504)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #512 -add r10, r2, #512 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(2048) -add r2, sp, #(2304) -add r0, sp, #(2560) -bl poly_u16_mul_128_C -add r1, sp, #(1536) -add r2, sp, #(1792) -add r0, sp, #(2048) -bl poly_u16_mul_128_C -add r1, sp, #(1024) -add r2, sp, #(1280) -add r0, sp, #(1536) -bl poly_u16_mul_128_C -add r1, sp, #(512) -add r2, sp, #(768) -add r0, sp, #(1024) -bl poly_u16_mul_128_C -add r1, sp, #(0) -add r2, sp, #(256) -add r0, sp, #(512) -bl poly_u16_mul_128_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_128_C -add r1, r11, #(256) -add r2, r10, #(256) -add r0, sp, #(3072) -bl poly_u16_mul_128_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #-64 -mov r9, #45 -mov r8, #-8 -mov r6, #43691 -mov r5, #16 -mov r4, #30 -mov r3, #61167 -mov r2, #-65 -mov r1, #36409 -mov r0, #1 -vldrw.u32 Q0, [r12, #(4 * 10)] -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * 2)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * -118)] -vldrw.u32 Q4, [r14, #(4 * 6)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r11, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r2 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #(4 * 14)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r8 -vldrw.u32 Q5, [r14, #(4 * 10)] -vmla.s16 Q0, Q3, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-488)] 
-vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r5 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(-472)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r1 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r4 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r3 -vstrw.u32 Q2, [sp,#(8)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 14)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 18)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 
Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 22)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 26)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, 
Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 30)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 34)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, 
Q0, r3 -vstrw.u32 Q0, [sp,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 38)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 42)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(168)] -vadd.u16 Q6, Q6, Q1 
-vldrw.u32 Q0, [sp, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 46)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -78)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 50)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -74)] -vsub.u16 Q5, Q5, 
Q2 -vldrw.u32 Q3, [sp, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 54)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 58)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -70)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -62)] -vsub.u16 Q1, 
Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 62)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -66)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 66)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 
Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 70)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 78)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 2)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(8)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 74)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 10)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -114)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 
Q2, [r12, #(4 * -50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 82)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 6)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(24)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 78)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 14)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -114)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 10)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(40)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 82)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 18)] 
-vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -110)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -114)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -106)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 14)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 86)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 22)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -106)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -102)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -46)] -vshr.u16 Q3, Q3, #1 
-vldrw.u32 Q6, [r11, #(4 * -34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 18)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(72)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 90)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 26)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -102)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -98)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 22)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(88)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 94)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 30)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 
-vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -98)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -94)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 26)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(104)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 98)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -94)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -90)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 
-vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 30)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 102)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -90)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -86)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 34)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(136)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 106)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 
-vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -86)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -90)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -82)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 38)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(152)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 110)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -82)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -86)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -78)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 
-vldrw.u32 Q6, [r12, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 42)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(168)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 114)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -78)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -82)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -74)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -10)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -18)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 46)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 118)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -74)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, 
[r12,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -78)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -70)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -6)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 50)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(200)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 122)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 58)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -6)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -70)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -74)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -66)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -10)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 54)] -vadd.u16 Q7, 
Q7, Q3 -vstrw.u32 Q7, [sp,#(216)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 126)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 62)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -2)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -66)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -70)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * 6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * 58)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(232)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r12, #(4 * -122)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 66)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 2)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -62)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -66)] -vadd.u16 Q2, Q2, Q0 
-vstrw.u32 Q2, [r14,#(-264)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -2)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #(4 * 62)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(248)] -vmla.s16 Q1, Q2, r8 -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q3, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * 70)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r12,#(280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q1, [r14, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-248)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -54)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-216)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], 
#+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, 
[r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], 
#+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, 
[r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -add sp, sp, #3584 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_64_toom4_mve.s b/tests/poly/auto/poly_u16_mul_64_toom4_mve.s deleted file mode 100644 index 1e57d8e..0000000 --- a/tests/poly/auto/poly_u16_mul_64_toom4_mve.s +++ /dev/null @@ -1,379 +0,0 @@ -.syntax unified -.type poly_u16_mul_16_C, %function -.global poly_u16_mul_16_C -.syntax unified -.type poly_u16_mul_64_toom4_mve, %function -.global poly_u16_mul_64_toom4_mve -poly_u16_mul_64_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #448 -add sp, sp, #504 -add r1, r1, #504 -add r2, r2, #504 -mov r14, #1 -mov r12, #2 -mov r11, #3 -mov r10, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -118)] -vldrw.u32 Q2, [r1, #(4 * -110)] -vldrw.u32 Q3, [r1, #(4 * -102)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r11 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q6, Q5, r12 -vmla.s16 Q5, Q1, r11 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r10 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -114)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * -106)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * -98)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [sp,#(-248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-440)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r11 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q4, Q7, r12 -vmla.s16 Q7, Q1, r11 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q6, Q2, r11 -vmla.s16 Q6, Q3, r10 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -118)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * -110)] 
-vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * -102)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [sp,#(-232)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-424)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r11 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q5, Q6, r12 -vmla.s16 Q6, Q1, r11 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q4, Q2, r11 -vmla.s16 Q4, Q3, r10 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -114)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * -106)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * -98)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [sp,#(-216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-408)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r11 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q7, Q4, r12 -vmla.s16 Q4, Q1, r11 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q5, Q2, r11 -vmla.s16 Q5, Q3, r10 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [sp,#(-200)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-456)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-392)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #64 -add r10, r2, #64 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(256) -add r2, sp, #(288) -add r0, sp, #(320) -bl poly_u16_mul_16_C -add r1, sp, #(192) -add r2, sp, #(224) -add r0, sp, #(256) -bl poly_u16_mul_16_C -add r1, sp, #(128) -add r2, sp, #(160) -add r0, sp, #(192) -bl poly_u16_mul_16_C -add r1, sp, #(64) -add r2, sp, #(96) -add r0, sp, #(128) -bl poly_u16_mul_16_C -add r1, sp, #(0) -add r2, sp, #(32) -add r0, sp, #(64) -bl poly_u16_mul_16_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_16_C -add r1, r11, #(32) -add r2, r10, #(32) -add r0, sp, #(384) -bl poly_u16_mul_16_C -add sp, sp, #504 -mov r14, #-64 -mov r12, #45 -mov r11, #-8 -mov r10, #43691 -mov r9, #16 -mov r8, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [sp, #(4 * -46)] -vldrw.u32 Q1, [sp, #(4 * -94)] -vadd.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [sp, #(4 * -110)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [sp, #(4 * -62)] -vldrw.u32 Q4, [sp, #(4 * -78)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [sp, #(4 * -30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [sp, #(4 * -42)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r11 -vldrw.u32 Q5, [sp, #(4 * -74)] -vmla.s16 Q0, Q3, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [sp,#(-376)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r9 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(-248)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [sp, #(4 * -90)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r8 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [sp,#(-312)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [sp,#(-440)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [sp,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [sp, #(4 * -26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [sp, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [sp, #(4 * -70)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [sp,#(-360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [sp,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * -86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [sp,#(-424)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [sp,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 
Q2, [sp, #(4 * -54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [sp, #(4 * -22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [sp, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -110)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-440)] -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [sp, #(4 * -66)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vldrw.u32 Q7, [sp, #(4 * -78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [sp, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [sp,#(-184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * -82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [sp, #(4 * -62)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [sp,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [sp, #(4 * -94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(-376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [sp, #(4 * -30)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [sp, #(4 * -18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #(4 * -106)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q1, Q2, r11 -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vldrw.u32 Q3, [sp, #(4 * -74)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(-296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [sp, #(4 * -42)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [sp,#(-168)] -vshr.u16 
Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [sp, #(4 * -58)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q1, [sp, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(-360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [sp, #(4 * -26)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [sp,#(-104)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -add sp, sp, #448 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_768_toom4_mve.s b/tests/poly/auto/poly_u16_mul_768_toom4_mve.s deleted file mode 100644 index ae6d8a7..0000000 --- a/tests/poly/auto/poly_u16_mul_768_toom4_mve.s +++ /dev/null @@ -1,3759 +0,0 @@ -.syntax unified -.type poly_u16_mul_192_C, %function -.global poly_u16_mul_192_C -.syntax unified -.type poly_u16_mul_768_toom4_mve, %function -.global poly_u16_mul_768_toom4_mve -poly_u16_mul_768_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #5376 -add sp, sp, #504 
-add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -add r1, r1, #504 -add r8, r1, #1008 -add r2, r2, #504 -add r7, r2, #1008 -mov r6, #1 -mov r5, #2 -mov r4, #3 -mov r3, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -30)] -vldrw.u32 Q2, [r1, #(4 * 66)] -vldrw.u32 Q3, [r8, #(4 * -90)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(24)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-216)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -26)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 70)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -86)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-456)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(264)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(40)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-200)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -22)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 74)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -82)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-440)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(56)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-184)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -18)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 78)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -78)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-424)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(296)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r4 
-vstrw.u32 Q7, [r14,#(72)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-168)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -14)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 82)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -74)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-408)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(88)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-152)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -70)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-392)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(104)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-136)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -66)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-376)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(120)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-120)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -62)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-360)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, 
[sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(136)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-104)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 2)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -58)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-344)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-88)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -90)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 6)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -54)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-328)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-72)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 10)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -50)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-312)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-56)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 14)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -46)] -vadd.u16 
Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-296)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-40)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 18)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -42)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-280)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-24)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 22)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -38)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-264)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-8)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 26)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 122)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -34)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-248)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(8)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 30)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, 
#(4 * 126)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -30)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-232)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -62)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 34)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -26)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-216)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(504)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(40)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -58)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 38)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -22)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-200)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-488)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(56)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -54)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 42)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -18)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-184)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(72)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -50)] -vshl.u16 Q7, 
Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 46)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -14)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-168)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-456)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -46)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 50)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -10)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-152)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(104)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -42)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 54)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -6)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-136)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-424)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(120)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -38)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 58)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -2)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-120)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(136)] 
-vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -34)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 62)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * 2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-104)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-392)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -30)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 66)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -90)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-88)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(168)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -26)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 70)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -86)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-72)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-360)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(184)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -22)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 74)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -82)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-56)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(440)] 
-vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(200)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -18)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 78)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -78)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-40)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-328)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-88)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -14)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 82)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -74)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-24)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(232)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -70)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-8)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-296)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -66)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-280)] -vsub.u16 Q5, Q7, Q6 
-vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(264)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -62)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(24)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-264)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 2)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -58)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-8)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(296)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -90)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 6)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -54)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(56)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-232)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-456)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 10)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -50)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(72)] 
-vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(328)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 14)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -46)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(88)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-200)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-424)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 18)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -42)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(360)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 22)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -38)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(120)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-168)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-392)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 26)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 122)] -vadd.u16 Q5, Q5, Q7 
-vldrw.u32 Q3, [r7, #(4 * -34)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(392)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 30)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 126)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -30)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(152)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-136)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-360)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -62)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 34)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -26)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(424)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -58)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 38)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -22)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(184)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-104)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-328)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -54)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 
* 42)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -18)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(456)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -50)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 46)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -14)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-72)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-296)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -46)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 50)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -10)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(488)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -42)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 54)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -6)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-40)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-264)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(504)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, 
r3 -vldrw.u32 Q0, [r2, #(4 * -38)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 58)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -2)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-488)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -34)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 62)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(280)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-8)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-232)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r11,#(296)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(248)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(8)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #768 -add r10, r2, #768 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(3072) -add r2, sp, #(3456) -add r0, sp, #(3840) -bl poly_u16_mul_192_C -add r1, sp, #(2304) -add r2, sp, #(2688) -add r0, sp, #(3072) -bl poly_u16_mul_192_C -add r1, sp, #(1536) -add r2, sp, #(1920) -add r0, sp, #(2304) -bl poly_u16_mul_192_C -add r1, sp, #(768) -add r2, sp, #(1152) -add r0, sp, #(1536) -bl poly_u16_mul_192_C -add r1, sp, #(0) -add r2, sp, #(384) -add r0, sp, #(768) -bl poly_u16_mul_192_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_192_C -add r1, r11, #(384) -add r2, r10, #(384) -add r0, sp, #(4608) -bl poly_u16_mul_192_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 
-add r9, r10, #1008 -mov r8, #-64 -mov r6, #45 -mov r5, #-8 -mov r4, #43691 -mov r3, #16 -mov r2, #30 -mov r1, #61167 -mov r0, #-65 -vldrw.u32 Q0, [r11, #(4 * 78)] -vldrw.u32 Q1, [r14, #(4 * 6)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * 66)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #(4 * -114)] -vldrw.u32 Q4, [r12, #(4 * -54)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #(4 * 18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r0 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r11, #(4 * 82)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r5 -vldrw.u32 Q5, [r12, #(4 * -50)] -vmla.s16 Q0, Q3, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(24)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r3 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-456)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * 10)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r2 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-216)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r1 -vstrw.u32 Q2, [sp,#(264)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r11,#(312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -46)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 
-vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -42)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -38)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-168)] -vshr.u16 Q0, Q0, 
#2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -34)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 26)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -30)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 30)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(344)] -vsub.u16 
Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -26)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 34)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -22)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(424)] -vadd.u16 Q4, Q4, Q1 
-vldrw.u32 Q0, [sp, #(4 * 98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -82)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -94)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 50)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #(4 * 114)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * -18)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-328)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 42)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(440)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -78)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -90)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 54)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #(4 * 118)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * -14)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-312)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 46)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(456)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -74)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -86)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 58)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #(4 * 122)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * -10)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-296)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 50)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(472)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -70)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -82)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 62)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #(4 * 126)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * -6)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-280)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 54)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(488)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -66)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -78)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -122)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * -2)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-264)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(504)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 118)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -62)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -74)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -118)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 2)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(232)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-248)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-488)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -58)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -70)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -114)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 6)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(248)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-232)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-472)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 126)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -54)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -66)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -110)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 10)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(264)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-216)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(504)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-456)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -62)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -106)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 14)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(280)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-440)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -118)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -58)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -102)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 18)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(296)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-424)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -54)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -98)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 22)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(312)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-408)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -50)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -94)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 26)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(328)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-392)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -46)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -90)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 30)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(344)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-376)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -42)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -86)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 34)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(360)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-360)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -38)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -82)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 38)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(376)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-344)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -94)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -22)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -34)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -78)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 42)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(392)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-328)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -18)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -30)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -74)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 66)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(264)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 46)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 78)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -114)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-456)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 6)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(24)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 18)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(72)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -14)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -26)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -70)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 70)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(280)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 50)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 82)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -110)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-440)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 10)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(40)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 22)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(88)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -82)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 122)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -66)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 74)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(296)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 54)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -106)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-424)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 14)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(56)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 26)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(104)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -78)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 126)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -62)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 78)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(312)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 58)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -102)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-408)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 18)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(72)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 30)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(120)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -74)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -122)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -58)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(328)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 62)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -98)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-392)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 22)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(88)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 34)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(136)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -70)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -54)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 66)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-136)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 98)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 126)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -94)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 26)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(104)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 38)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(152)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -50)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 70)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-120)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 102)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -90)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 30)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(120)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 42)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -2)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -46)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 74)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-104)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 106)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -86)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 34)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(136)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 46)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(184)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 2)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -42)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 78)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-88)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 110)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -82)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 38)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 50)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(200)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 6)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -38)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 82)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-72)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 114)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -78)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 42)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 54)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(216)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 10)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -34)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 86)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-56)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 118)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -74)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-296)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 46)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(184)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 58)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(232)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 14)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -30)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 90)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-40)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 122)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(488)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -70)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-280)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 50)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(200)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 62)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 18)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -26)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 94)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 126)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(504)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -66)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 54)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(216)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 34)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 22)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -22)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 98)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-8)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -122)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-488)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -62)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-248)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 58)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(232)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 38)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 26)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -18)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 102)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(8)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -118)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -58)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 62)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(248)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 42)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 30)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 126)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(504)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 106)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(24)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -114)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -54)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-216)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 66)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(264)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 46)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 34)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -10)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 110)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(40)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -110)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(280)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 50)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 38)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -6)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 114)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(56)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -106)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(296)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 42)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -2)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -102)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -42)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 78)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 90)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * 46)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -110)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 122)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -98)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -38)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 82)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 94)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, 
#(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * 50)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -106)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 126)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -94)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -34)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 86)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 98)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * 54)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 10)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -102)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #(4 * -122)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, 
[r10, #(4 * -90)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -30)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 90)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 102)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * 58)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -98)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #(4 * -118)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -86)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -26)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 106)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 
-vldrw.u32 Q3, [sp, #(4 * 62)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #(4 * -94)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-376)] -vmla.s16 Q1, Q2, r5 -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q3, [r12, #(4 * 38)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #(4 * -82)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #(4 * -22)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q1, [r14, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 110)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(440)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 
-vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], 
#+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -add sp, sp, #5376 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_mul_832_toom4_mve.s b/tests/poly/auto/poly_u16_mul_832_toom4_mve.s deleted file mode 100644 index 2a95384..0000000 --- a/tests/poly/auto/poly_u16_mul_832_toom4_mve.s +++ /dev/null @@ -1,4065 +0,0 @@ -.syntax unified -.type poly_u16_mul_208_C, %function -.global poly_u16_mul_208_C -.syntax unified -.type poly_u16_mul_832_toom4_mve, %function -.global poly_u16_mul_832_toom4_mve -poly_u16_mul_832_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, 
sp, #5824 -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -add r1, r1, #504 -add r8, r1, #1008 -add r2, r2, #504 -add r7, r2, #1008 -mov r6, #1 -mov r5, #2 -mov r4, #3 -mov r3, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -22)] -vldrw.u32 Q2, [r1, #(4 * 82)] -vldrw.u32 Q3, [r8, #(4 * -66)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-24)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -18)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -62)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-200)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-8)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -14)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -58)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-184)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(8)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -10)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -54)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-168)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, 
[sp,#(-472)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -6)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -50)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-152)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(40)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -2)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -46)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-136)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(56)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 2)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -42)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-120)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(72)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 6)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -38)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-104)] -vadd.u16 
Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 10)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -34)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-88)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(104)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -90)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 14)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -30)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-72)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(120)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 18)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 122)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -26)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-56)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(136)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 22)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 126)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, 
#(4 * -22)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-40)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 26)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -18)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-24)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(504)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(168)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 30)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -14)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-8)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-488)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(184)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 34)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -10)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(200)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 38)] -vsub.u16 Q5, Q7, 
Q6 -vldrw.u32 Q2, [r8, #(4 * -110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -6)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(24)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-456)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -62)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 42)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -2)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(232)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -58)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 46)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 2)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(56)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-424)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -54)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 50)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * 6)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(264)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * 
-50)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 54)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * 10)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(88)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-392)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -46)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 58)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -90)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * 14)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(296)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -42)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 62)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 18)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(120)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-360)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -38)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 66)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -82)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * 22)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, 
[r12,#(328)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -34)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 70)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -78)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * 26)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(152)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-328)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -30)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 74)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -74)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * 30)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(360)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -26)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 78)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -70)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 34)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(184)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-296)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-456)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -22)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 82)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -66)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, 
[r12,#(-440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(392)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -18)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 86)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -62)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-264)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-88)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-424)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -14)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 90)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -58)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(424)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 94)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -54)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-232)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-392)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -50)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-216)] 
-vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(456)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 102)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -46)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(280)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-200)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-360)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 2)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -42)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(296)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-8)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(488)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 6)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 110)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -38)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(312)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-168)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-328)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(504)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -94)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 10)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -34)] -vadd.u16 Q7, Q0, Q2 
-vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-488)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -90)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 14)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 118)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -30)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(344)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-136)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-296)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -86)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 18)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -26)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(360)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-456)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -82)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 22)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 126)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -22)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(376)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-104)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-264)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -78)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 26)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * 
-122)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -18)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-424)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -74)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 30)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -118)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -14)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-72)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-232)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -70)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 34)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -10)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(424)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-392)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -66)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 38)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -110)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -6)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(440)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-40)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-200)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -62)] -vshl.u16 Q5, 
Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 42)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -2)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-360)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -58)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 46)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -102)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-8)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-168)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -54)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 50)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * 6)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(488)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(8)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-328)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -50)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 54)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -94)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * 10)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(504)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(24)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-136)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-312)] -vmla.s16 Q6, 
Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -46)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 58)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * 14)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r10,#(-488)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(40)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-120)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-296)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -42)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 62)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -86)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 18)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r10,#(-472)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(56)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-104)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -38)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 66)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -82)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * 22)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r10,#(-456)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(72)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(248)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-88)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-264)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -34)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 70)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -78)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * 26)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r10,#(-440)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(88)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(264)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-72)] -vmla.s16 Q4, Q7, r5 -vmla.s16 
Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -30)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 74)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -74)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * 30)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r10,#(-424)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(104)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(280)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-56)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-232)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -26)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 78)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -70)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 34)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r10,#(-408)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(120)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(296)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-40)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r10,#(-392)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(312)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(136)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #832 -add r10, r2, #832 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(3328) -add r2, sp, #(3744) -add r0, sp, #(4160) -bl poly_u16_mul_208_C -add r1, sp, #(2496) -add r2, sp, #(2912) -add r0, sp, #(3328) -bl poly_u16_mul_208_C -add r1, sp, #(1664) -add r2, sp, #(2080) -add r0, sp, #(2496) -bl poly_u16_mul_208_C -add r1, sp, #(832) -add r2, sp, #(1248) -add r0, sp, #(1664) -bl poly_u16_mul_208_C -add r1, sp, #(0) -add r2, sp, #(416) -add r0, sp, #(832) -bl poly_u16_mul_208_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_208_C -add r1, r11, #(416) -add r2, r10, #(416) -add r0, sp, #(4992) -bl 
poly_u16_mul_208_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #-64 -mov r6, #45 -mov r5, #-8 -mov r4, #43691 -mov r3, #16 -mov r2, #30 -mov r1, #61167 -mov r0, #-65 -vldrw.u32 Q0, [r10, #(4 * -94)] -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * 82)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #(4 * -50)] -vldrw.u32 Q4, [r12, #(4 * -6)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #(4 * 114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r0 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r10, #(4 * -90)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r5 -vldrw.u32 Q5, [r12, #(4 * -2)] -vmla.s16 Q0, Q3, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r3 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-200)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * 42)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r2 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r1 -vstrw.u32 Q2, [sp,#(328)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -46)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 2)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 
Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 6)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -38)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 126)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 10)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, 
[r14, #(4 * 54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 14)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 18)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 
-vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 22)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 26)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, 
#2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 30)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 34)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 78)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(472)] -vsub.u16 
Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -10)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 38)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 42)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-200)] -vadd.u16 Q4, Q4, Q1 
-vldrw.u32 Q0, [r14, #(4 * -122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 46)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 90)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 50)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 94)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -114)] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -70)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -34)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 54)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(376)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(24)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-152)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -66)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -30)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 58)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(392)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(40)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(216)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-136)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -62)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -26)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 62)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(408)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(56)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-120)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -58)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -22)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 66)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(424)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(72)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(248)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-104)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -54)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -18)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 70)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(440)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(88)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-88)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -94)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -50)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -62)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -14)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 74)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(456)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(104)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(280)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-72)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -46)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -58)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -10)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 78)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(472)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(120)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(296)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-360)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-56)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 34)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -42)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -54)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -6)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 82)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(488)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(136)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 126)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-344)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-40)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -82)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 38)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -38)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -50)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -2)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 86)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(504)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(152)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-328)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-24)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -78)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 42)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -34)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -46)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 2)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 90)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r12,#(-488)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(168)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-312)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-8)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -74)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 46)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -30)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -42)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 6)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 94)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r12,#(-472)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(184)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-296)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(8)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -70)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 50)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -26)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -38)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 10)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 98)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r12,#(-456)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(200)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-280)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(24)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -34)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(328)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 102)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 38)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 58)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -30)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 18)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 106)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-8)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 42)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 118)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(472)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 62)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -26)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 22)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 110)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(8)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 46)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(184)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 122)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(488)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 66)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -22)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 26)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 114)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(24)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -82)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 50)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(200)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 126)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(504)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 70)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -18)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 30)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(40)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -78)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 54)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(216)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -122)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-488)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 74)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -2)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -14)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 34)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 122)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(56)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -74)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-296)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 58)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(232)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -118)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-472)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 78)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 2)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -10)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 38)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 126)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -70)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-280)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 62)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(248)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 82)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 6)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -6)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 42)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -122)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(88)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -66)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-264)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -22)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 66)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(264)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -110)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-440)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 86)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 10)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -2)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 46)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(104)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -62)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-248)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -18)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(280)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -106)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-424)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 90)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 14)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 2)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 50)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -114)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(120)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -58)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-232)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -14)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(296)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -102)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-408)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 94)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 18)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 6)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 54)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -110)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(136)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -54)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-216)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 78)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(312)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -98)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-392)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 98)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 22)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 10)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 58)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 126)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(504)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -106)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(152)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -50)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-200)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 82)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(328)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -94)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-376)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 102)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 26)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 14)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 62)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -102)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(168)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -46)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-184)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 86)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(344)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -90)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-360)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 106)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 30)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 18)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 66)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -98)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(184)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -42)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-168)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -54)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 90)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(360)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 110)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 34)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 22)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 70)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -94)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(200)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -38)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-152)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -50)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 94)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(376)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 114)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 38)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 26)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 74)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -90)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(216)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -34)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-136)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -46)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 98)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(392)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 118)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 42)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 30)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 78)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -86)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 58)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(232)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -30)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-120)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -42)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 102)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(408)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 122)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 46)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 34)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 82)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -82)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 62)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(248)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -26)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-104)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -38)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 106)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(424)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 126)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 50)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 38)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 86)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -78)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 66)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(264)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -22)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-88)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -34)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 110)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(440)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -122)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 54)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 42)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 90)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -74)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 70)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(280)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -18)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-72)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -30)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 114)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(456)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -62)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -118)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 58)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 46)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 94)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -70)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 74)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(296)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -14)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-56)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -26)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 118)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(472)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -58)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-232)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -114)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 62)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 50)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 98)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -66)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 78)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(312)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -10)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 34)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -54)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * 66)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -82)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #(4 * -62)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -6)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 38)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #(4 * 126)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(504)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -50)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -106)] -vsub.u16 Q5, 
Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * 70)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -78)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #(4 * -58)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -2)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 42)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r12, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -46)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * 74)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -74)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-296)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #(4 * -54)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * 2)] -vadd.u16 Q7, 
Q7, Q2 -vstrw.u32 Q7, [r10,#(8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -10)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 46)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -42)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * 78)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #(4 * -70)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q1, Q2, r5 -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q3, [r12, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #(4 * 6)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #(4 * 50)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r12,#(-456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -38)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-152)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, 
[r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], 
#+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, 
[r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 
Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -add sp, sp, #5824 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256.s b/tests/poly/auto/poly_u16_toom4_fwd_256.s deleted file mode 100644 index 862a5b9..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256.s +++ /dev/null @@ -1,182 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_256_mve, %function -.global poly_u16_toom4_fwd_256_mve -poly_u16_toom4_fwd_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 
Q3, [r0, #(4 * 100)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-240)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-224)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-208)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-192)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-304)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 
52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-176)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-160)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-272)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-144)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-256)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-128)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(368)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_bottom.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_bottom.s deleted file mode 100644 index cdce8ae..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_bottom.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type 
poly_u16_toom4_fwd_dual_bottom_256_mve, %function -.global poly_u16_toom4_fwd_dual_bottom_256_mve -poly_u16_toom4_fwd_dual_bottom_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #-384 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, 
[r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s deleted file mode 100644 index 500ab41..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(144)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-240)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-48)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(384)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(400)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(176)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-208)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-16)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(416)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-192)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(0)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(432)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(240)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(208)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(16)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(448)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(256)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(464)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(272)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(240)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-144)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(48)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(480)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(288)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(256)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-128)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(64)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(304)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(496)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s deleted file mode 100644 index ab5d324..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r14, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, 
Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-144)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r12,#(-288)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(432)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(288)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-128)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r12,#(-272)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(304)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-112)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r12,#(-256)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(464)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(320)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-96)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r12,#(-240)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(336)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-48)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r12,#(-192)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(240)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-336)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-32)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r12,#(-176)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(256)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-320)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-16)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(128)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r12,#(-160)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(272)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-304)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(0)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(144)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r12,#(-144)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(288)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-432)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(432)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-288)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s deleted file mode 100644 index 9a2599d..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_oop_256_mve -poly_u16_toom4_fwd_dual_packed_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_top.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_top.s deleted file mode 100644 index 27ca088..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_top.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_top_256_mve, %function -.global poly_u16_toom4_fwd_dual_top_256_mve -poly_u16_toom4_fwd_dual_top_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #512 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 
-vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, 
Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] 
-vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_top_oop.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_top_oop.s deleted file mode 100644 index a076a12..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_top_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_top_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_top_oop_256_mve -poly_u16_toom4_fwd_dual_top_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #512 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(-32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_512.s b/tests/poly/auto/poly_u16_toom4_fwd_512.s deleted file mode 100644 index c2b43fe..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_512.s +++ /dev/null @@ -1,351 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_512_mve, %function -.global poly_u16_toom4_fwd_512_mve -poly_u16_toom4_fwd_512_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 64)] -vldrw.u32 Q2, [r14, #(4 * -124)] -vldrw.u32 Q3, [r14, #(4 * -60)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(16)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(272)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -120)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -56)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-496)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(256)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(32)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 8)] 
-vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -116)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -52)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-480)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(272)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(48)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(304)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -112)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * -48)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-464)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(288)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(64)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(320)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 80)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -108)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * -44)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-432)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-448)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(304)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(80)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(336)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 84)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -40)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-416)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(320)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(96)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 
Q4, [r14,#(352)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -100)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -36)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-400)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(336)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(112)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(368)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 92)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -96)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * -32)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-384)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(352)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(128)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(384)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 32)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * -28)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-368)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(368)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(400)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 36)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -24)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-352)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-368)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(384)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 
Q4, [r14,#(160)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(416)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 40)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -20)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-336)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-352)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(400)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(432)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 44)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * -16)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-320)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-336)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(416)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(448)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 48)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -76)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * -12)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-304)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-320)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(432)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(208)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(464)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 52)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -8)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-288)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[r14,#(-304)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(448)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(224)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 56)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -68)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -4)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-272)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-288)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(464)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(240)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(496)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 60)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -64)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 0)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-256)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-272)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(480)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(256)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-496)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-240)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(496)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-256)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_768.s b/tests/poly/auto/poly_u16_toom4_fwd_768.s deleted file mode 100644 index 03ea09d..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_768.s +++ /dev/null @@ -1,520 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_768_mve, %function -.global poly_u16_toom4_fwd_768_mve -poly_u16_toom4_fwd_768_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 
-vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 96)] -vldrw.u32 Q2, [r14, #(4 * -60)] -vldrw.u32 Q3, [r14, #(4 * 36)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-480)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-96)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 40)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(288)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-240)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(384)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-464)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-80)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -52)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 44)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(304)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-224)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(400)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-448)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-64)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -48)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 48)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(320)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-208)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(416)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-432)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-48)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, 
[r0, #(4 * 112)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -44)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 52)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(336)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-192)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(432)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-416)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-32)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 56)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(352)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-176)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(448)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-400)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-16)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 60)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(368)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(464)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-384)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(0)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 64)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(384)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-144)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(480)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-368)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(16)] -vmla.s16 Q5, Q2, r8 
-vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -28)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 68)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(400)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(496)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-352)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(32)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 36)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 72)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(416)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-112)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-336)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(48)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 40)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -20)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 76)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(432)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-320)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(64)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 44)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 80)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-80)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-304)] -vmla.s16 Q7, Q4, 
r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(80)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 48)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -12)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 84)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(464)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-288)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(96)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 52)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 88)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-48)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-272)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(112)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 56)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -4)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 92)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(496)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-256)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(128)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 60)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 0)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 96)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-16)] -vsub.u16 Q7, Q6, Q4 
-vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-240)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(144)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 4)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 100)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(0)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-224)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(160)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 68)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 104)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-464)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(16)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-208)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(176)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 72)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 12)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 108)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-192)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(192)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 76)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 112)] -vadd.u16 Q6, Q0, Q2 
-vstrw.u32 Q4, [r11,#(-432)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(48)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-176)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(208)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 80)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 20)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 116)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-416)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-160)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(224)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 84)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 120)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-400)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(80)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-304)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-144)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(240)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 88)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 28)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 124)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-384)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-288)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-128)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(256)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 92)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, 
[r14, #(4 * 32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #(4 * -124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-368)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(112)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-272)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-112)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(272)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r11,#(-352)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r14,#(-256)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(128)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_832.s b/tests/poly/auto/poly_u16_toom4_fwd_832.s deleted file mode 100644 index dc90468..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_832.s +++ /dev/null @@ -1,562 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_832_mve, %function -.global poly_u16_toom4_fwd_832_mve -poly_u16_toom4_fwd_832_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 104)] -vldrw.u32 Q2, [r14, #(4 * -44)] -vldrw.u32 Q3, [r14, #(4 * 60)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-352)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(64)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 64)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-176)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(416)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-336)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(80)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, 
Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 68)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(496)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(432)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-320)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(96)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 72)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-144)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(448)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-304)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(112)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -28)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 76)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(464)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-288)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(128)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 80)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-464)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-112)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(480)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-272)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, 
r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(144)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -20)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 84)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(496)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-256)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(160)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 88)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-432)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-80)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-496)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-240)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(176)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -12)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 92)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-416)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-480)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-224)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(192)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 36)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 96)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-400)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-48)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, 
[r14,#(-464)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-208)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(208)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 40)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -4)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 100)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-384)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-448)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-192)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(224)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 44)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 0)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 104)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-368)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-16)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-432)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-176)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(240)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 48)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 4)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 108)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-352)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(0)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-416)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-160)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(256)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 52)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 112)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, 
[r11,#(-336)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(16)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-400)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-144)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(272)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 56)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 12)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 116)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-320)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-384)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-128)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(288)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 60)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 120)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-304)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(48)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-368)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-112)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(304)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 20)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 124)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-288)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-352)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-96)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(320)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 68)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 
24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #(4 * -124)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-272)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(80)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-336)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-80)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(336)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 72)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 28)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r12, #(4 * -120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-256)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-320)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-64)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(352)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 76)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #(4 * -116)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-240)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(112)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-304)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-48)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(368)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 80)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 36)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r12, #(4 * -112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-224)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-288)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-32)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(384)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 84)] 
-vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 40)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #(4 * -108)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-208)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(144)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-272)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-16)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(400)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 88)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 44)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r12, #(4 * -104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-192)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-256)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(0)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(416)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 92)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 48)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #(4 * -100)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-176)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(176)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-240)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(16)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(432)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 96)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 52)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r12, #(4 * -96)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-160)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(192)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-224)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(32)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 
-vstrw.u32 Q6, [r12,#(448)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 100)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 56)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #(4 * -92)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-144)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(208)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-208)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(48)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(464)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vshl.u16 Q5, Q5, #1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q4, [r14,#(-192)] -vadd.u16 Q5, Q5, Q7 -vstrw.u32 Q5, [r14,#(224)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s b/tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s deleted file mode 100644 index ab0d57f..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve, %function -.global poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve -poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r0, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(144)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-240)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-48)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(384)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(400)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(176)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-208)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-16)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(416)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-192)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(0)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(432)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(240)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, 
[r14,#(208)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(16)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(448)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(256)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(464)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(272)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(240)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-144)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(48)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(480)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(288)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(256)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-128)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(64)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(304)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(496)] -vpop {d8-d15} -pop {r4-r11,lr} -bx 
lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s b/tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s deleted file mode 100644 index dfd6a5d..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s +++ /dev/null @@ -1,200 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve, %function -.global poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve -poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r14, #1008 -add r11, r0, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r14,#(-144)] -vmla.s16 Q6, Q5, r9 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r8 -vstrw.u32 Q3, [r12,#(-288)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(432)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(288)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r14,#(-128)] -vmla.s16 Q4, Q7, r9 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r8 -vstrw.u32 Q3, [r12,#(-272)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(304)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, 
[r14,#(-112)] -vmla.s16 Q5, Q6, r9 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r8 -vstrw.u32 Q3, [r12,#(-256)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(464)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(320)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r14,#(-96)] -vmla.s16 Q7, Q4, r9 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r8 -vstrw.u32 Q3, [r12,#(-240)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(336)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r14,#(-48)] -vmla.s16 Q6, Q5, r9 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q5, Q1, r8 -vstrw.u32 Q3, [r12,#(-192)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(240)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-336)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r14,#(-32)] -vmla.s16 Q4, Q7, r9 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q7, Q1, r8 -vstrw.u32 Q3, [r12,#(-176)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(256)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 
56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-320)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r14,#(-16)] -vmla.s16 Q5, Q6, r9 -vstrw.u32 Q0, [r1,#(128)] -vmla.s16 Q6, Q1, r8 -vstrw.u32 Q3, [r12,#(-160)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(272)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-304)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r14,#(0)] -vmla.s16 Q7, Q4, r9 -vstrw.u32 Q0, [r1,#(144)] -vmla.s16 Q4, Q1, r8 -vstrw.u32 Q3, [r12,#(-144)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(288)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-432)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(432)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-288)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_fwd_oop_256.s b/tests/poly/auto/poly_u16_toom4_fwd_oop_256.s deleted file mode 100644 index 294c21e..0000000 --- a/tests/poly/auto/poly_u16_toom4_fwd_oop_256.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_oop_256_mve, %function -.global poly_u16_toom4_fwd_oop_256_mve -poly_u16_toom4_fwd_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r0, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 
Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 
48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r9 
-vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_dual_bottom_256.s b/tests/poly/auto/poly_u16_toom4_inv_dual_bottom_256.s deleted file mode 100644 index d6b99ea..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_dual_bottom_256.s +++ /dev/null @@ -1,381 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_bottom_256_mve, %function -.global poly_u16_toom4_inv_dual_bottom_256_mve -poly_u16_toom4_inv_dual_bottom_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -mov r14, #0 -mov r12, #0 -mov r11, #0 -mov r10, #21840 -mov r9, #45 -mov r8, #43691 -mov r7, #8 -mov r6, #-30 -mov r5, #4369 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -mov r1, #-1 -vldrw.u32 Q4, [r0, #(4 * -96)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r2 -vldrw.u32 Q6, [r0, #(4 * -92)] -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -88)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -80)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -84)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 
-vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -92)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -84)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -88)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -96)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -88)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -92)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -100)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -92)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -96)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -104)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -96)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -108)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -100)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -104)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -112)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -104)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -108)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -116)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vst43.u16 {Q0,Q1,Q2,Q3}, [r0] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r11 -vmov.u16 Q0[1], r12 -vmov.u16 Q0[2], r14 -vldrw.u32 Q1, [r0, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s b/tests/poly/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s deleted file mode 100644 index 26e8f23..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_bottom_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_bottom_oop_256_mve -poly_u16_toom4_inv_dual_bottom_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -mov r14, #0 -mov r12, #0 -mov r11, #0 -mov r10, #21840 -mov r9, #45 -mov r8, #43691 -mov r7, #8 -mov r6, #-30 -mov r5, #4369 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q4, [r0, #(4 * -96)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r2 -vldrw.u32 Q6, [r0, #(4 * -92)] -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] 
-vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -88)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -80)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -84)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 24)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 20)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -76)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 16)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -68)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -72)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 44)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -64)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -56)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -60)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 60)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -52)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 48)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -44)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -48)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 76)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 68)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -40)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -32)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -36)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 92)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 84)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -28)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 80)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -20)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -24)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 108)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -16)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 96)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -8)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -12)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 124)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 120)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -4)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 112)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r11 -vmov.u16 Q0[1], r12 -vmov.u16 Q0[2], r14 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_dual_top_256.s b/tests/poly/auto/poly_u16_toom4_inv_dual_top_256.s deleted file mode 100644 index e87718e..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_dual_top_256.s +++ /dev/null @@ -1,381 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_top_256_mve, %function -.global poly_u16_toom4_inv_dual_top_256_mve -poly_u16_toom4_inv_dual_top_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -mov r1, #1 -vldrw.u32 Q4, [r14, #(4 * -124)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r1 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, 
Q2 -vldrw.u32 Q7, [r14, #(4 * -116)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -108)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -104)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #(4 * -96)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -92)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -80)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #(4 * -72)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -68)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -60)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -56)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #(4 * -48)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -44)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -36)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -32)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vst43.u16 {Q0,Q1,Q2,Q3}, [r0] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r0, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_dual_top_oop_256.s b/tests/poly/auto/poly_u16_toom4_inv_dual_top_oop_256.s deleted file mode 100644 index c69fe71..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_dual_top_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_top_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_top_oop_256_mve -poly_u16_toom4_inv_dual_top_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -vldrw.u32 Q4, [r14, #(4 * -124)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 
Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -116)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -108)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 24)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 20)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -104)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 16)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -96)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 44)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -92)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 60)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -80)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 48)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 76)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 68)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -68)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -60)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 92)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 84)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -56)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 80)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 108)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -44)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 96)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 124)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 120)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -32)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 112)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_full_256.s b/tests/poly/auto/poly_u16_toom4_inv_full_256.s deleted file mode 100644 index dfbbd76..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_full_256.s +++ /dev/null @@ -1,765 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_256_mve, %function -.global poly_u16_toom4_inv_256_mve -poly_u16_toom4_inv_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r7, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vldrw.u32 Q1, [r0, #(4 * 2)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -62)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -122)] -vldrw.u32 Q4, [r0, #(4 * 66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, 
Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [r0, #(4 * 70)] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 6)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r7 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -118)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 74)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -114)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, 
r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 78)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 82)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 
-vldrw.u32 Q5, [r0, #(4 * 86)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 90)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 94)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 
-vstrw.u32 Q1, [r0,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 98)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -62)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-248)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 102)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 
* 66)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -58)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 2)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 6)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -58)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-232)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 106)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 70)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 6)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 10)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -22)] 
-vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -54)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-216)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 110)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 74)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -114)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 10)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 14)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -50)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-200)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 114)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 
-vldrw.u32 Q7, [r14, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 14)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 18)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -46)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-184)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 118)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 18)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 22)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 
Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -42)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-168)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 122)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 22)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 26)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -38)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-152)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 126)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, 
r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 26)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 30)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r0, #(4 * -34)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(-136)] -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q3, [r0, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r14,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 34)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(136)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_full_512.s b/tests/poly/auto/poly_u16_toom4_inv_full_512.s deleted file mode 100644 index 466055f..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_full_512.s +++ /dev/null @@ -1,1511 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_512_mve, %function -.global 
poly_u16_toom4_inv_512_mve -poly_u16_toom4_inv_512_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #-64 -mov r9, #45 -mov r8, #-8 -mov r7, #43691 -mov r6, #16 -mov r5, #30 -mov r4, #61167 -mov r3, #-65 -mov r2, #36409 -mov r1, #1 -vldrw.u32 Q0, [r12, #(4 * 10)] -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 2)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * -118)] -vldrw.u32 Q4, [r14, #(4 * 6)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r11, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r3 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #(4 * 14)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r8 -vldrw.u32 Q5, [r14, #(4 * 10)] -vmla.s16 Q0, Q3, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r6 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(-472)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r2 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r5 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r4 -vstrw.u32 Q2, [r0,#(8)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 14)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 18)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 22)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 
-106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 26)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 30)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 
Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 34)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 38)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, 
[r12,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 42)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 46)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, 
[r12, #(4 * -78)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 50)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 54)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 
Q6, [r11, #(4 * -66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 58)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -70)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 62)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -66)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, 
r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 66)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 70)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 78)] 
-vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 2)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(8)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 74)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 10)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -114)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 82)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 6)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(24)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 78)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 14)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -114)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 
Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 10)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(40)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 82)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 18)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -110)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -114)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -106)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 14)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(56)] -vmla.s16 
Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 86)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 22)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -106)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -102)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 18)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(72)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 90)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 26)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -102)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-424)] -vsub.u16 
Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -98)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 22)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(88)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 94)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 30)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -98)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -94)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 26)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(104)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 98)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, 
r7 -vldrw.u32 Q7, [r14, #(4 * 30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -94)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -90)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 30)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(120)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 102)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -90)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -86)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-344)] -vadd.u16 
Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 34)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(136)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 106)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -86)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -90)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -82)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 38)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(152)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 110)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(168)] -vadd.u16 Q0, Q0, 
Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -82)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -86)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -78)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 42)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(168)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 114)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -78)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -82)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -74)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -10)] 
-vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -18)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 46)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(184)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 118)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -74)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -78)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -70)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -6)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 50)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(200)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 122)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 58)] -vadd.u16 Q7, Q7, Q2 
-vstrw.u32 Q7, [r12,#(232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -6)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -70)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -74)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -66)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -10)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 54)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(216)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 126)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 62)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -2)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -66)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -70)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * 6)] 
-vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 58)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(232)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r12, #(4 * -122)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 66)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 2)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -62)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -66)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-264)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -2)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r0, #(4 * 62)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(248)] -vmla.s16 Q1, Q2, r8 -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q3, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * 70)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r12,#(280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r12, #(4 * 
-58)]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r12,#(-232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r4
-vldrw.u32 Q1, [r14, #(4 * -62)]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-248)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -54)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-216)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_toom4_inv_full_768.s b/tests/poly/auto/poly_u16_toom4_inv_full_768.s
deleted file mode 100644
index e3e8e89..0000000
--- a/tests/poly/auto/poly_u16_toom4_inv_full_768.s
+++ /dev/null
@@ -1,2303 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_768_mve, %function
-.global poly_u16_toom4_inv_768_mve
-poly_u16_toom4_inv_768_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-mov r8, #-64
-mov r7, #45
-mov r6, #-8
-mov r5, #43691
-mov r4, #16
-mov r3, #30
-mov r2, #61167
-mov r1, #-65
-vldrw.u32 Q0, [r11, #(4 * 78)]
-vldrw.u32 Q1, [r14, #(4 * 6)]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #(4 * 66)]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r11, #(4 * -114)]
-vldrw.u32 Q4, [r12, #(4 * -54)]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #(4 * -126)]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r10, #(4 * 18)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r1
-vsub.u16 Q3, Q3, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r11, #(4 * 82)]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r6
-vldrw.u32 Q5, [r12, #(4 * -50)]
-vmla.s16 Q0, Q3, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(24)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r4
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-456)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r5
-vmul.u16 Q0, Q0, r5
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r14, #(4 * 10)]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r3
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r12,#(-216)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r2
-vstrw.u32 Q2,
[r0,#(264)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r11,#(312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -46)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -42)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(344)] 
-vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -38)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -34)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 26)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 86)] -vsub.u16 
Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -30)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 30)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -26)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 34)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -86)] -vsub.u16 Q5, Q5, 
Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -22)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -18)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 42)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 
-vldrw.u32 Q4, [r10, #(4 * 54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -14)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -10)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 
Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -6)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -2)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(504)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 
Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 2)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -58)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 6)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -54)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 
Q4, [r10, #(4 * -110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 10)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -50)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 14)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -46)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, 
Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 18)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 78)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 22)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -38)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 26)] -vmla.s16 Q6, Q2, 
r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 30)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 90)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 34)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 
Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 94)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 38)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(376)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 98)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 42)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(392)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, 
r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 102)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 66)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(264)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 46)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 78)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 106)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -114)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 6)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(24)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 18)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 
Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 70)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(280)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 50)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 82)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 110)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -110)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 10)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(40)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 22)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -10)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 74)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(296)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 54)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 86)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 114)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, 
r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -106)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 14)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(56)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 26)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -18)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 126)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 78)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(312)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 58)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 90)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 118)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -102)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 18)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(72)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 30)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, 
Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 82)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 62)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 94)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 122)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -98)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 22)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(88)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 34)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -10)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 86)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 66)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 98)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 126)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 
-94)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 26)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(104)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 38)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 6)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 90)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 70)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 102)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -122)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -90)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 30)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(120)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 42)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 10)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -2)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -46)] 
-vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 94)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(376)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 74)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 106)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -118)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -86)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 34)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(136)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 46)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 2)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 98)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 78)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 110)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -82)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 
Q2, [r11,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 38)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(152)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 50)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 6)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 102)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 82)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 114)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -110)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -78)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 42)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(168)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 54)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 10)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 
106)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 86)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 118)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -106)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -74)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 46)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 14)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 110)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 90)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 122)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -102)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -70)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-280)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 50)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(200)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 62)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 18)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 114)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 94)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #(4 * 126)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(504)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -98)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -66)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 54)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 66)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 22)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 118)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, 
[r0,#(472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 98)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -2)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-8)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -122)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -94)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -62)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 58)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(232)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 70)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 26)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 122)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 102)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 2)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(8)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -118)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -90)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -58)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 62)] 
-vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(248)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 74)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 30)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 126)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(504)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 106)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -114)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -54)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 66)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(264)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 78)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 34)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -122)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, 
[r12, #(4 * 110)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -110)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -50)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 70)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(280)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 82)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 38)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -118)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 114)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -106)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -78)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -46)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 74)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r14,#(296)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 86)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 42)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -114)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 118)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -102)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -42)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 78)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 90)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 46)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -110)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 122)] -vmla.s16 Q6, 
Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -98)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -38)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 82)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 94)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 50)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -106)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 126)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -94)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -34)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 86)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 
-vldrw.u32 Q0, [r10, #(4 * 98)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 54)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 10)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -102)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -122)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -90)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -30)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 90)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 102)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 58)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -98)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -118)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 
-vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -86)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -26)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 106)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 62)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #(4 * -94)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-376)] -vmla.s16 Q1, Q2, r6 -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q3, [r12, #(4 * 38)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #(4 * -82)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #(4 * -22)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q1, [r14, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 110)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(440)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline 
at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_full_832.s b/tests/poly/auto/poly_u16_toom4_inv_full_832.s deleted file mode 100644 index 5963703..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_full_832.s +++ /dev/null @@ -1,2493 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_832_mve, %function -.global poly_u16_toom4_inv_832_mve -poly_u16_toom4_inv_832_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #-64 -mov r7, #45 -mov r6, #-8 -mov r5, #43691 -mov r4, #16 -mov r3, #30 -mov r2, #61167 -mov r1, #-65 -vldrw.u32 Q0, [r10, #(4 * -94)] -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 82)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #(4 * -50)] -vldrw.u32 Q4, [r12, #(4 * -6)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #(4 * 114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r1 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r10, #(4 * -90)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r6 -vldrw.u32 Q5, [r12, #(4 * -2)] -vmla.s16 Q0, Q3, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r4 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-200)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r5 -vmul.u16 Q0, Q0, r5 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * 42)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r3 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r2 -vstrw.u32 Q2, [r0,#(328)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -46)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 
-vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 2)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 6)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -38)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 126)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 
-vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 10)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 14)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -70)] 
-vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 18)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 22)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, 
#(4 * 26)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 30)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 34)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 
Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(296)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-56)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-232)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r0, #(4 * 122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -86)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -50)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 38)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(312)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-40)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-216)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * 126)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -82)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -46)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 42)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(328)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-24)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(504)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-200)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -78)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -42)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 46)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(344)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-8)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-184)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -118)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -74)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -38)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 50)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(360)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(8)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-168)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -70)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -34)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 54)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(376)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(24)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-152)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -66)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -30)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 58)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(392)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(40)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(216)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-136)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -62)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -26)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 62)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(408)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(56)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-120)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -58)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -22)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 66)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(424)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(72)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(248)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-104)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -54)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -18)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 70)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(440)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(88)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-88)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -94)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -50)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -62)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -14)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 74)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(456)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(104)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(280)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-72)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -46)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -58)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -10)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 78)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(472)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(120)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(296)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-360)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-56)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 34)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -42)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -54)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -6)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 82)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(488)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(136)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 126)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-344)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-40)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -82)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 38)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -38)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -50)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -2)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 86)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(504)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(152)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-328)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-24)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -78)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 42)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -34)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -46)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 2)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 90)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r12,#(-488)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(168)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-312)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-8)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -74)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 46)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -30)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -42)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 6)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 94)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r12,#(-472)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(184)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-296)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(8)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -70)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 50)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -26)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -38)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 10)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 98)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r12,#(-456)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(200)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-280)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(24)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -34)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(328)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 102)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 38)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 58)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -30)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 18)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(344)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 106)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-8)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 42)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 118)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(472)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 62)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -26)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 22)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(360)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 110)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(8)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 46)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(184)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 122)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(488)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 66)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -22)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 26)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(376)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 114)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(24)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -82)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 50)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(200)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 126)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(504)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 70)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -18)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 30)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(392)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 118)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(40)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -78)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 54)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(216)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -122)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-488)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 74)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -2)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -14)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 34)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(408)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 122)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(56)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -74)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-296)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 58)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(232)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -118)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-472)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 78)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 2)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -10)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 38)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(424)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 126)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -70)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-280)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 62)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(248)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 82)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 6)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -6)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 42)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(440)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -122)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(88)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -66)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-264)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -22)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 66)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(264)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -110)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-440)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 86)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 10)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -2)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 46)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(456)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -118)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(104)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -62)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-248)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -18)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(280)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -106)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-424)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 90)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 14)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 2)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 50)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(472)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -114)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(120)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -58)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-232)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -14)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(296)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -102)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-408)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 94)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 18)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 6)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 54)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(488)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -110)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(136)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -54)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-216)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 78)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(312)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -98)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-392)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 98)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 22)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 10)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 58)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 126)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(504)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -106)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(152)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -50)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-200)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 82)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(328)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -94)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-376)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 102)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 26)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 14)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 62)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-488)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -102)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(168)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -46)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-184)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 86)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(344)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -90)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-360)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 106)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 30)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 18)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 66)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-472)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -98)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(184)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -42)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-168)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -54)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 90)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(360)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 110)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 34)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 22)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 70)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-456)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -94)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(200)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -38)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-152)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -50)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 94)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(376)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 114)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 38)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 26)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 74)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-440)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -90)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(216)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -34)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-136)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -46)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 98)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(392)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 118)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 42)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 30)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 78)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-424)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -86)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 58)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(232)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -30)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-120)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -42)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 102)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(408)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 122)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 46)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 34)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 82)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-408)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -82)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 62)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(248)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -26)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-104)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -38)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 106)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(424)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 126)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 50)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 38)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 86)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-392)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -78)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 66)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(264)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -22)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-88)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -34)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 110)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(440)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -122)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 54)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 42)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 90)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-376)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -74)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 70)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(280)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -18)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-72)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -30)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 26)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 114)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 58)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -90)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -70)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 74)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -14)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -26)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 30)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 62)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 
-vldrw.u32 Q4, [r10, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -86)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -66)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -10)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 34)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -54)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 66)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -82)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -62)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -6)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 38)] 
-vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 126)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(504)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -50)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 70)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -78)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -58)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -2)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 42)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r12, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -46)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 74)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 110)] -vadd.u16 Q1, 
Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -74)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-296)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -54)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * 2)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -10)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 46)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -42)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 78)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #(4 * -70)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q1, Q2, r6 -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q3, [r12, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #(4 * 6)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #(4 * 50)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 
-vstrw.u32 Q1, [r12,#(-456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -38)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-152)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_half_256.s b/tests/poly/auto/poly_u16_toom4_inv_half_256.s deleted file mode 100644 index 7b0cf8c..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_half_256.s +++ /dev/null @@ -1,340 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_256_mve, %function -.global poly_u16_toom4_inv_half_256_mve -poly_u16_toom4_inv_half_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -mov r14, #-64 -mov r12, #45 -mov r11, #-8 -mov r10, #43691 -mov r9, #16 -mov r8, #30 -mov r7, #61167 -mov r6, #-65 -mov r5, #36409 -mov r4, #1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vldrw.u32 Q1, [r0, #(4 * -62)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -94)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r0, #(4 * 2)] -vldrw.u32 Q4, [r0, #(4 * -30)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r0, #(4 * 66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r6 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r0, #(4 * 38)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r11 -vldrw.u32 Q5, [r0, #(4 * -26)] -vmla.s16 Q0, Q3, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-248)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r9 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r0,#(8)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r5 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * -58)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r8 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(-120)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r7 -vstrw.u32 Q2, [r0,#(-376)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r0,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 
-vldrw.u32 Q4, [r0, #(4 * 70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #(4 * 42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #(4 * -22)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-360)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 10)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #(4 * 74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #(4 * 46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #(4 * -18)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-344)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #(4 * 78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, 
r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #(4 * 50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #(4 * -14)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-328)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #(4 * 82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #(4 * 54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #(4 * -10)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -42)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-312)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #(4 * 86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #(4 * 58)] -vadd.u16 Q1, Q1, 
Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #(4 * -6)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-296)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #(4 * 90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #(4 * 62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #(4 * -2)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -34)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-280)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #(4 * 94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 
Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-264)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(248)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_half_512.s b/tests/poly/auto/poly_u16_toom4_inv_half_512.s deleted file mode 100644 index df3c8ab..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_half_512.s +++ /dev/null @@ -1,661 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_512_mve, %function -.global poly_u16_toom4_inv_half_512_mve -poly_u16_toom4_inv_half_512_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r7, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vldrw.u32 Q1, [r0, #(4 * 2)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -62)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -122)] -vldrw.u32 Q4, [r0, #(4 * 66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [r0, #(4 * 70)] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 6)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r7 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q0, Q0, Q2 
-vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -118)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 74)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -114)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 78)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r14, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 82)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 86)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, 
Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 90)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 94)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 
-vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 98)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 102)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-120)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, 
#(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 106)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-104)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 110)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(168)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-88)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 114)] -vmla.s16 Q6, Q2, 
r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(184)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 118)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(200)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 122)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(216)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, 
Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-24)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 126)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(232)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 
-vstrw.u32 Q5, [r0,#(504)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(8)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_half_768.s b/tests/poly/auto/poly_u16_toom4_inv_half_768.s deleted file mode 100644 index 164402e..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_half_768.s +++ /dev/null @@ -1,982 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_768_mve, %function -.global poly_u16_toom4_inv_half_768_mve -poly_u16_toom4_inv_half_768_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #-64 -mov r10, #45 -mov r9, #-8 -mov r8, #43691 -mov r7, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r14, #(4 * 102)] -vldrw.u32 Q1, [r0, #(4 * 66)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -30)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * 6)] -vldrw.u32 Q4, [r14, #(4 * -90)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r12, #(4 * -54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * 106)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r9 -vldrw.u32 Q5, [r14, #(4 * -86)] -vmla.s16 Q0, Q3, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(264)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r7 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(24)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 70)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(-360)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [r0,#(-120)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 10)] -vsub.u16 Q5, Q5, Q2 
-vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -82)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-104)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -78)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-88)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -42)] -vsub.u16 Q1, Q1, 
Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -74)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -70)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, 
r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -66)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -62)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(504)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 
-vldrw.u32 Q3, [r14, #(4 * -58)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -54)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -50)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 
Q1, [r0,#(408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -46)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -42)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r14,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -38)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -34)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 122)] 
-vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -30)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 126)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -26)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(504)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-120)] -vshr.u16 Q0, Q0, 
#2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -22)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -18)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, 
[r12,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 78)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -14)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -10)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 
Q2, [r14, #(4 * 86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -6)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -2)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 
-vldrw.u32 Q6, [r12, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 2)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-232)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/auto/poly_u16_toom4_inv_half_832.s b/tests/poly/auto/poly_u16_toom4_inv_half_832.s deleted file mode 100644 index 3181d69..0000000 --- a/tests/poly/auto/poly_u16_toom4_inv_half_832.s +++ /dev/null @@ -1,1062 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_832_mve, %function -.global 
poly_u16_toom4_inv_half_832_mve -poly_u16_toom4_inv_half_832_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #-64 -mov r10, #45 -mov r9, #-8 -mov r8, #43691 -mov r7, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r12, #(4 * -110)] -vldrw.u32 Q1, [r0, #(4 * 82)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -22)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * 38)] -vldrw.u32 Q4, [r14, #(4 * -66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r12, #(4 * -6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #(4 * -106)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r9 -vldrw.u32 Q5, [r14, #(4 * -62)] -vmla.s16 Q0, Q3, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(328)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r7 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 86)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(-264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [r0,#(-88)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -58)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(168)] 
-vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -54)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -50)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 98)] -vadd.u16 Q5, 
Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -46)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -42)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-184)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -38)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -34)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-328)] -vadd.u16 
Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -30)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -26)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 122)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 78)] -vsub.u16 
Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -22)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 126)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -18)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(504)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 42)] -vsub.u16 
Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -14)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -10)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 
-vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -6)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -2)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 
-vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 2)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 6)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 10)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 
Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 14)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 18)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 
-vstrw.u32 Q2, [r14,#(472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 122)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 22)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 126)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 26)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(504)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, 
[r14, #(4 * -78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(264)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -122)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 30)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 34)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -70)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, 
[r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-40)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/manual/karatsuba.h b/tests/poly/karatsuba.h similarity index 100% rename from tests/poly/manual/karatsuba.h rename to tests/poly/karatsuba.h diff --git a/tests/poly/manual/karatsuba.s b/tests/poly/karatsuba.s similarity index 100% rename from tests/poly/manual/karatsuba.s rename to tests/poly/karatsuba.s diff --git a/tests/poly/manual/karatsuba_const.h b/tests/poly/karatsuba_const.h similarity index 100% rename from tests/poly/manual/karatsuba_const.h rename to tests/poly/karatsuba_const.h diff --git a/tests/poly/main.c b/tests/poly/main.c index c722df9..03fb585 100644 --- a/tests/poly/main.c +++ b/tests/poly/main.c @@ -950,5 +950,9 @@ int main(void) ret |= test_mat_vec_mul_ntt_incomplete(); #endif /* TEST_MAT_VEC_MUL_NTT_INCOMPLETE */ + if(ret == 0){ + debug_printf( "ALL GOOD!\n" ); + } + return( ret ); } diff --git a/tests/poly/misc.c b/tests/poly/misc.c new file mode 120000 index 0000000..9326b99 --- 
/dev/null +++ b/tests/poly/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/poly/misc.h b/tests/poly/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/poly/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/poly/manual/montgomery.h b/tests/poly/montgomery.h similarity index 100% rename from tests/poly/manual/montgomery.h rename to tests/poly/montgomery.h diff --git a/tests/poly/manual/montgomery.s b/tests/poly/montgomery.s similarity index 100% rename from tests/poly/manual/montgomery.s rename to tests/poly/montgomery.s diff --git a/tests/poly/manual/montgomery_const.h b/tests/poly/montgomery_const.h similarity index 100% rename from tests/poly/manual/montgomery_const.h rename to tests/poly/montgomery_const.h diff --git a/tests/poly/poly.c b/tests/poly/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/poly/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/poly/poly.h b/tests/poly/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/poly/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/poly/poly.mk b/tests/poly/poly.mk new file mode 100644 index 0000000..720cae7 --- /dev/null +++ b/tests/poly/poly.mk @@ -0,0 +1,27 @@ +# Test name - needs to match the directory name +TESTS += poly + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +POLY_PLATFORMS += m55-an547 +POLY_PLATFORMS += m85-an555 + +# C sources required for this test +POLY_SOURCES += main.c +POLY_SOURCES += misc.c +POLY_SOURCES += poly.c + +# Assembly sources required for this test +POLY_ASM_DIR = ./auto +POLY_ASMS += montgomery.s +POLY_ASMS += karatsuba.s +POLY_ASMS += poly_u16_32.s +POLY_ASMS += poly_u16_32_acc.s +POLY_ASMS += $(POLY_ASM_DIR)/inv_ntt_u32_33556993_28678040_incomplete.s 
+POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_incomplete_double.s +POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_incomplete.s +POLY_ASMS += $(POLY_ASM_DIR)/inv_ntt_u32_33556993_28678040_complete.s +POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_complete.s +POLY_ASMS += $(POLY_ASM_DIR)/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s +POLY_ASMS += $(POLY_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s diff --git a/tests/poly/manual/poly_u16_32.s b/tests/poly/poly_u16_32.s similarity index 100% rename from tests/poly/manual/poly_u16_32.s rename to tests/poly/poly_u16_32.s diff --git a/tests/poly/manual/poly_u16_32_acc.s b/tests/poly/poly_u16_32_acc.s similarity index 100% rename from tests/poly/manual/poly_u16_32_acc.s rename to tests/poly/poly_u16_32_acc.s diff --git a/tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_complete.s b/tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index 4d756a1..0000000 --- a/tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,3396 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots_inv: -.word 20558213 // zeta^510 * 2^31 = 28678040^510 * 2^31 -.word 66424611 // zeta^382 * 2^31 = 28678040^382 * 2^31 -.word 59465515 // zeta^446 * 2^31 = 28678040^446 * 2^31 -.word 39560591 // zeta^318 * 2^31 = 28678040^318 * 2^31 -.word 2042724475 // zeta^510 * (q^(-1) mod 2^32) * 2^31 = 28678040^510 * 375649793 * 2^31 -.word 2817904349 // zeta^382 * (q^(-1) mod 2^32) * 2^31 = 28678040^382 * 375649793 * 2^31 -.word 2405453525 // zeta^446 * (q^(-1) mod 2^32) * 2^31 = 28678040^446 * 375649793 * 2^31 -.word 2621436017 // zeta^318 * (q^(-1) mod 2^32) * 2^31 = 28678040^318 * 375649793 * 2^31 -.word 35339857 // zeta^511 * 2^31 = 28678040^511 * 2^31 -.word 13377101 // zeta^447 * 2^31 = 28678040^447 * 2^31 -.word 33252123 // zeta^479 * 2^31 = 28678040^479 * 2^31 -.word 16713319 // zeta^415 * 2^31 = 28678040^415 * 2^31 -.word 10815985 // zeta^383 * 2^31 = 28678040^383 * 2^31 -.word 56247925 // zeta^319 * 2^31 = 28678040^319 * 2^31 -.word 26943959 // zeta^351 * 2^31 = 28678040^351 * 2^31 -.word 51316823 // zeta^287 * 2^31 = 28678040^287 * 2^31 -.word 3650773007 // zeta^383 * (q^(-1) mod 2^32) * 2^31 = 28678040^383 * 375649793 * 2^31 -.word 4021439371 // zeta^319 * (q^(-1) mod 2^32) * 2^31 = 28678040^319 * 375649793 * 2^31 -.word 1538999337 // zeta^351 * (q^(-1) mod 2^32) * 2^31 = 28678040^351 * 375649793 * 2^31 -.word 3611844009 // zeta^287 * (q^(-1) mod 2^32) * 2^31 = 28678040^287 * 375649793 * 2^31 -.word 42042379 // zeta^478 * 2^31 = 28678040^478 * 2^31 -.word 26419651 // zeta^350 * 2^31 = 28678040^350 * 2^31 -.word 61522009 // zeta^414 * 2^31 = 28678040^414 * 2^31 -.word 23758817 
// zeta^286 * 2^31 = 28678040^286 * 2^31 -.word 2254105077 // zeta^478 * (q^(-1) mod 2^32) * 2^31 = 28678040^478 * 375649793 * 2^31 -.word 3415374909 // zeta^350 * (q^(-1) mod 2^32) * 2^31 = 28678040^350 * 375649793 * 2^31 -.word 3742677415 // zeta^414 * (q^(-1) mod 2^32) * 2^31 = 28678040^414 * 375649793 * 2^31 -.word 3187687967 // zeta^286 * (q^(-1) mod 2^32) * 2^31 = 28678040^286 * 375649793 * 2^31 -.word 35776599 // zeta^495 * 2^31 = 28678040^495 * 2^31 -.word 6731445 // zeta^431 * 2^31 = 28678040^431 * 2^31 -.word 3030459 // zeta^463 * 2^31 = 28678040^463 * 2^31 -.word 41085059 // zeta^399 * 2^31 = 28678040^399 * 2^31 -.word 6685305 // zeta^367 * 2^31 = 28678040^367 * 2^31 -.word 24840267 // zeta^303 * 2^31 = 28678040^303 * 2^31 -.word 21119839 // zeta^335 * 2^31 = 28678040^335 * 2^31 -.word 32376869 // zeta^271 * 2^31 = 28678040^271 * 2^31 -.word 2658056071 // zeta^367 * (q^(-1) mod 2^32) * 2^31 = 28678040^367 * 375649793 * 2^31 -.word 495707573 // zeta^303 * (q^(-1) mod 2^32) * 2^31 = 28678040^303 * 375649793 * 2^31 -.word 440627873 // zeta^335 * (q^(-1) mod 2^32) * 2^31 = 28678040^335 * 375649793 * 2^31 -.word 3991890395 // zeta^271 * (q^(-1) mod 2^32) * 2^31 = 28678040^271 * 375649793 * 2^31 -.word 11319751 // zeta^494 * 2^31 = 28678040^494 * 2^31 -.word 57449959 // zeta^366 * 2^31 = 28678040^366 * 2^31 -.word 47736605 // zeta^430 * 2^31 = 28678040^430 * 2^31 -.word 25310795 // zeta^302 * 2^31 = 28678040^302 * 2^31 -.word 316214329 // zeta^494 * (q^(-1) mod 2^32) * 2^31 = 28678040^494 * 375649793 * 2^31 -.word 2994890777 // zeta^366 * (q^(-1) mod 2^32) * 2^31 = 28678040^366 * 375649793 * 2^31 -.word 2883238627 // zeta^430 * (q^(-1) mod 2^32) * 2^31 = 28678040^430 * 375649793 * 2^31 -.word 1834006453 // zeta^302 * (q^(-1) mod 2^32) * 2^31 = 28678040^302 * 375649793 * 2^31 -.word 5649915 // zeta^503 * 2^31 = 28678040^503 * 2^31 -.word 25847843 // zeta^439 * 2^31 = 28678040^439 * 2^31 -.word 62444027 // zeta^471 * 2^31 = 28678040^471 * 2^31 -.word 57855139 // 
zeta^407 * 2^31 = 28678040^407 * 2^31 -.word 43953263 // zeta^375 * 2^31 = 28678040^375 * 2^31 -.word 3973257 // zeta^311 * 2^31 = 28678040^311 * 2^31 -.word 45754835 // zeta^343 * 2^31 = 28678040^343 * 2^31 -.word 47438647 // zeta^279 * 2^31 = 28678040^279 * 2^31 -.word 1254205841 // zeta^375 * (q^(-1) mod 2^32) * 2^31 = 28678040^375 * 375649793 * 2^31 -.word 3800349047 // zeta^311 * (q^(-1) mod 2^32) * 2^31 = 28678040^311 * 375649793 * 2^31 -.word 3397129261 // zeta^343 * (q^(-1) mod 2^32) * 2^31 = 28678040^343 * 375649793 * 2^31 -.word 3896527561 // zeta^279 * (q^(-1) mod 2^32) * 2^31 = 28678040^279 * 375649793 * 2^31 -.word 34946213 // zeta^462 * 2^31 = 28678040^462 * 2^31 -.word 33401995 // zeta^334 * 2^31 = 28678040^334 * 2^31 -.word 57707227 // zeta^398 * 2^31 = 28678040^398 * 2^31 -.word 43655235 // zeta^270 * 2^31 = 28678040^270 * 2^31 -.word 4090836315 // zeta^462 * (q^(-1) mod 2^32) * 2^31 = 28678040^462 * 375649793 * 2^31 -.word 2389950837 // zeta^334 * (q^(-1) mod 2^32) * 2^31 = 28678040^334 * 375649793 * 2^31 -.word 1383072549 // zeta^398 * (q^(-1) mod 2^32) * 2^31 = 28678040^398 * 375649793 * 2^31 -.word 2793176509 // zeta^270 * (q^(-1) mod 2^32) * 2^31 = 28678040^270 * 375649793 * 2^31 -.word 30218957 // zeta^487 * 2^31 = 28678040^487 * 2^31 -.word 13073717 // zeta^423 * 2^31 = 28678040^423 * 2^31 -.word 41547715 // zeta^455 * 2^31 = 28678040^455 * 2^31 -.word 51082899 // zeta^391 * 2^31 = 28678040^391 * 2^31 -.word 6539853 // zeta^359 * 2^31 = 28678040^359 * 2^31 -.word 52712977 // zeta^295 * 2^31 = 28678040^295 * 2^31 -.word 15171525 // zeta^327 * 2^31 = 28678040^327 * 2^31 -.word 41070365 // zeta^263 * 2^31 = 28678040^263 * 2^31 -.word 1097807795 // zeta^359 * (q^(-1) mod 2^32) * 2^31 = 28678040^359 * 375649793 * 2^31 -.word 1402229743 // zeta^295 * (q^(-1) mod 2^32) * 2^31 = 28678040^295 * 375649793 * 2^31 -.word 857879099 // zeta^327 * (q^(-1) mod 2^32) * 2^31 = 28678040^327 * 375649793 * 2^31 -.word 2467328739 // zeta^263 * (q^(-1) mod 2^32) * 
2^31 = 28678040^263 * 375649793 * 2^31 -.word 1421525 // zeta^502 * 2^31 = 28678040^502 * 2^31 -.word 5608953 // zeta^374 * 2^31 = 28678040^374 * 2^31 -.word 3344309 // zeta^438 * 2^31 = 28678040^438 * 2^31 -.word 54192527 // zeta^310 * 2^31 = 28678040^310 * 2^31 -.word 2006884651 // zeta^502 * (q^(-1) mod 2^32) * 2^31 = 28678040^502 * 375649793 * 2^31 -.word 1547838471 // zeta^374 * (q^(-1) mod 2^32) * 2^31 = 28678040^374 * 375649793 * 2^31 -.word 1835403851 // zeta^438 * (q^(-1) mod 2^32) * 2^31 = 28678040^438 * 375649793 * 2^31 -.word 3288902769 // zeta^310 * (q^(-1) mod 2^32) * 2^31 = 28678040^310 * 375649793 * 2^31 -.word 55532487 // zeta^507 * 2^31 = 28678040^507 * 2^31 -.word 25878283 // zeta^443 * 2^31 = 28678040^443 * 2^31 -.word 7519477 // zeta^475 * 2^31 = 28678040^475 * 2^31 -.word 10400227 // zeta^411 * 2^31 = 28678040^411 * 2^31 -.word 66449241 // zeta^379 * 2^31 = 28678040^379 * 2^31 -.word 4428811 // zeta^315 * 2^31 = 28678040^315 * 2^31 -.word 30618985 // zeta^347 * 2^31 = 28678040^347 * 2^31 -.word 46942975 // zeta^283 * 2^31 = 28678040^283 * 2^31 -.word 1923058343 // zeta^379 * (q^(-1) mod 2^32) * 2^31 = 28678040^379 * 375649793 * 2^31 -.word 3711490549 // zeta^315 * (q^(-1) mod 2^32) * 2^31 = 28678040^315 * 375649793 * 2^31 -.word 1530848407 // zeta^347 * (q^(-1) mod 2^32) * 2^31 = 28678040^347 * 375649793 * 2^31 -.word 3263539969 // zeta^283 * (q^(-1) mod 2^32) * 2^31 = 28678040^283 * 375649793 * 2^31 -.word 34238409 // zeta^470 * 2^31 = 28678040^470 * 2^31 -.word 7278675 // zeta^342 * 2^31 = 28678040^342 * 2^31 -.word 26316985 // zeta^406 * 2^31 = 28678040^406 * 2^31 -.word 1738533 // zeta^278 * 2^31 = 28678040^278 * 2^31 -.word 1976527415 // zeta^470 * (q^(-1) mod 2^32) * 2^31 = 28678040^470 * 375649793 * 2^31 -.word 3553111469 // zeta^342 * (q^(-1) mod 2^32) * 2^31 = 28678040^342 * 375649793 * 2^31 -.word 1070704967 // zeta^406 * (q^(-1) mod 2^32) * 2^31 = 28678040^406 * 375649793 * 2^31 -.word 280554203 // zeta^278 * (q^(-1) mod 2^32) * 
2^31 = 28678040^278 * 375649793 * 2^31 -.word 29493541 // zeta^491 * 2^31 = 28678040^491 * 2^31 -.word 46179537 // zeta^427 * 2^31 = 28678040^427 * 2^31 -.word 61070425 // zeta^459 * 2^31 = 28678040^459 * 2^31 -.word 47641435 // zeta^395 * 2^31 = 28678040^395 * 2^31 -.word 8700655 // zeta^363 * 2^31 = 28678040^363 * 2^31 -.word 49217369 // zeta^299 * 2^31 = 28678040^299 * 2^31 -.word 14037329 // zeta^331 * 2^31 = 28678040^331 * 2^31 -.word 57068693 // zeta^267 * 2^31 = 28678040^267 * 2^31 -.word 2143064849 // zeta^363 * (q^(-1) mod 2^32) * 2^31 = 28678040^363 * 375649793 * 2^31 -.word 3997596327 // zeta^299 * (q^(-1) mod 2^32) * 2^31 = 28678040^299 * 375649793 * 2^31 -.word 594737327 // zeta^331 * (q^(-1) mod 2^32) * 2^31 = 28678040^331 * 375649793 * 2^31 -.word 1214449003 // zeta^267 * (q^(-1) mod 2^32) * 2^31 = 28678040^267 * 375649793 * 2^31 -.word 5988919 // zeta^486 * 2^31 = 28678040^486 * 2^31 -.word 27781261 // zeta^358 * 2^31 = 28678040^358 * 2^31 -.word 33650523 // zeta^422 * 2^31 = 28678040^422 * 2^31 -.word 40314383 // zeta^294 * 2^31 = 28678040^294 * 2^31 -.word 2046739401 // zeta^486 * (q^(-1) mod 2^32) * 2^31 = 28678040^486 * 375649793 * 2^31 -.word 2556008819 // zeta^358 * (q^(-1) mod 2^32) * 2^31 = 28678040^358 * 375649793 * 2^31 -.word 2602309285 // zeta^422 * (q^(-1) mod 2^32) * 2^31 = 28678040^422 * 375649793 * 2^31 -.word 3711528945 // zeta^294 * (q^(-1) mod 2^32) * 2^31 = 28678040^294 * 375649793 * 2^31 -.word 25356533 // zeta^499 * 2^31 = 28678040^499 * 2^31 -.word 59712043 // zeta^435 * 2^31 = 28678040^435 * 2^31 -.word 59431885 // zeta^467 * 2^31 = 28678040^467 * 2^31 -.word 42783775 // zeta^403 * 2^31 = 28678040^403 * 2^31 -.word 15118727 // zeta^371 * 2^31 = 28678040^371 * 2^31 -.word 16104593 // zeta^307 * 2^31 = 28678040^307 * 2^31 -.word 66551101 // zeta^339 * 2^31 = 28678040^339 * 2^31 -.word 27099659 // zeta^275 * 2^31 = 28678040^275 * 2^31 -.word 256676985 // zeta^371 * (q^(-1) mod 2^32) * 2^31 = 28678040^371 * 375649793 * 2^31 
-.word 2042883439 // zeta^307 * (q^(-1) mod 2^32) * 2^31 = 28678040^307 * 375649793 * 2^31 -.word 2098783427 // zeta^339 * (q^(-1) mod 2^32) * 2^31 = 28678040^339 * 375649793 * 2^31 -.word 1730866165 // zeta^275 * (q^(-1) mod 2^32) * 2^31 = 28678040^275 * 375649793 * 2^31 -.word 52622279 // zeta^454 * 2^31 = 28678040^454 * 2^31 -.word 48542309 // zeta^326 * 2^31 = 28678040^326 * 2^31 -.word 28412919 // zeta^390 * 2^31 = 28678040^390 * 2^31 -.word 61490063 // zeta^262 * 2^31 = 28678040^262 * 2^31 -.word 111596089 // zeta^454 * (q^(-1) mod 2^32) * 2^31 = 28678040^454 * 375649793 * 2^31 -.word 2392801179 // zeta^326 * (q^(-1) mod 2^32) * 2^31 = 28678040^326 * 375649793 * 2^31 -.word 122296841 // zeta^390 * (q^(-1) mod 2^32) * 2^31 = 28678040^390 * 375649793 * 2^31 -.word 4112339569 // zeta^262 * (q^(-1) mod 2^32) * 2^31 = 28678040^262 * 375649793 * 2^31 -.word 17544659 // zeta^483 * 2^31 = 28678040^483 * 2^31 -.word 26761761 // zeta^419 * 2^31 = 28678040^419 * 2^31 -.word 28138345 // zeta^451 * 2^31 = 28678040^451 * 2^31 -.word 6006005 // zeta^387 * 2^31 = 28678040^387 * 2^31 -.word 49338991 // zeta^355 * 2^31 = 28678040^355 * 2^31 -.word 59052279 // zeta^291 * 2^31 = 28678040^291 * 2^31 -.word 54131019 // zeta^323 * 2^31 = 28678040^323 * 2^31 -.word 49172137 // zeta^259 * 2^31 = 28678040^259 * 2^31 -.word 2285599633 // zeta^355 * (q^(-1) mod 2^32) * 2^31 = 28678040^355 * 375649793 * 2^31 -.word 1420334345 // zeta^291 * (q^(-1) mod 2^32) * 2^31 = 28678040^291 * 375649793 * 2^31 -.word 1832318133 // zeta^323 * (q^(-1) mod 2^32) * 2^31 = 28678040^323 * 375649793 * 2^31 -.word 203443031 // zeta^259 * (q^(-1) mod 2^32) * 2^31 = 28678040^259 * 375649793 * 2^31 -.word 41164657 // zeta^506 * 2^31 = 28678040^506 * 2^31 -.word 23553921 // zeta^378 * 2^31 = 28678040^378 * 2^31 -.word 51075303 // zeta^442 * 2^31 = 28678040^442 * 2^31 -.word 11244857 // zeta^314 * 2^31 = 28678040^314 * 2^31 -.word 2292337295 // zeta^506 * (q^(-1) mod 2^32) * 2^31 = 28678040^506 * 375649793 * 2^31 
-.word 2218762879 // zeta^378 * (q^(-1) mod 2^32) * 2^31 = 28678040^378 * 375649793 * 2^31 -.word 3660688665 // zeta^442 * (q^(-1) mod 2^32) * 2^31 = 28678040^442 * 375649793 * 2^31 -.word 2196022471 // zeta^314 * (q^(-1) mod 2^32) * 2^31 = 28678040^314 * 375649793 * 2^31 -.word 27161421 // zeta^509 * 2^31 = 28678040^509 * 2^31 -.word 12259351 // zeta^445 * 2^31 = 28678040^445 * 2^31 -.word 42183787 // zeta^477 * 2^31 = 28678040^477 * 2^31 -.word 260949 // zeta^413 * 2^31 = 28678040^413 * 2^31 -.word 49379395 // zeta^381 * 2^31 = 28678040^381 * 2^31 -.word 45318697 // zeta^317 * 2^31 = 28678040^317 * 2^31 -.word 65417737 // zeta^349 * 2^31 = 28678040^349 * 2^31 -.word 60522221 // zeta^285 * 2^31 = 28678040^285 * 2^31 -.word 2945787325 // zeta^381 * (q^(-1) mod 2^32) * 2^31 = 28678040^381 * 375649793 * 2^31 -.word 2724075479 // zeta^317 * (q^(-1) mod 2^32) * 2^31 = 28678040^317 * 375649793 * 2^31 -.word 2827626487 // zeta^349 * (q^(-1) mod 2^32) * 2^31 = 28678040^349 * 375649793 * 2^31 -.word 482722579 // zeta^285 * (q^(-1) mod 2^32) * 2^31 = 28678040^285 * 375649793 * 2^31 -.word 3629237 // zeta^474 * 2^31 = 28678040^474 * 2^31 -.word 60326323 // zeta^346 * 2^31 = 28678040^346 * 2^31 -.word 30569867 // zeta^410 * 2^31 = 28678040^410 * 2^31 -.word 31921231 // zeta^282 * 2^31 = 28678040^282 * 2^31 -.word 3571167563 // zeta^474 * (q^(-1) mod 2^32) * 2^31 = 28678040^474 * 375649793 * 2^31 -.word 3851189325 // zeta^346 * (q^(-1) mod 2^32) * 2^31 = 28678040^346 * 375649793 * 2^31 -.word 1517877365 // zeta^410 * (q^(-1) mod 2^32) * 2^31 = 28678040^410 * 375649793 * 2^31 -.word 1275593137 // zeta^282 * (q^(-1) mod 2^32) * 2^31 = 28678040^282 * 375649793 * 2^31 -.word 51477925 // zeta^493 * 2^31 = 28678040^493 * 2^31 -.word 23177153 // zeta^429 * 2^31 = 28678040^429 * 2^31 -.word 42516129 // zeta^461 * 2^31 = 28678040^461 * 2^31 -.word 23261199 // zeta^397 * 2^31 = 28678040^397 * 2^31 -.word 50523083 // zeta^365 * 2^31 = 28678040^365 * 2^31 -.word 29024109 // zeta^301 * 
2^31 = 28678040^301 * 2^31 -.word 62634975 // zeta^333 * 2^31 = 28678040^333 * 2^31 -.word 5116371 // zeta^269 * 2^31 = 28678040^269 * 2^31 -.word 2363949621 // zeta^365 * (q^(-1) mod 2^32) * 2^31 = 28678040^365 * 375649793 * 2^31 -.word 2792055443 // zeta^301 * (q^(-1) mod 2^32) * 2^31 = 28678040^301 * 375649793 * 2^31 -.word 3296655905 // zeta^333 * (q^(-1) mod 2^32) * 2^31 = 28678040^333 * 375649793 * 2^31 -.word 4093127725 // zeta^269 * (q^(-1) mod 2^32) * 2^31 = 28678040^269 * 375649793 * 2^31 -.word 55626043 // zeta^490 * 2^31 = 28678040^490 * 2^31 -.word 15630981 // zeta^362 * 2^31 = 28678040^362 * 2^31 -.word 43717491 // zeta^426 * 2^31 = 28678040^426 * 2^31 -.word 14342369 // zeta^298 * 2^31 = 28678040^298 * 2^31 -.word 2004845765 // zeta^490 * (q^(-1) mod 2^32) * 2^31 = 28678040^490 * 375649793 * 2^31 -.word 3862343547 // zeta^362 * (q^(-1) mod 2^32) * 2^31 = 28678040^362 * 375649793 * 2^31 -.word 2436590221 // zeta^426 * (q^(-1) mod 2^32) * 2^31 = 28678040^426 * 375649793 * 2^31 -.word 2109337887 // zeta^298 * (q^(-1) mod 2^32) * 2^31 = 28678040^298 * 375649793 * 2^31 -.word 6776583 // zeta^501 * 2^31 = 28678040^501 * 2^31 -.word 33530533 // zeta^437 * 2^31 = 28678040^437 * 2^31 -.word 43598203 // zeta^469 * 2^31 = 28678040^469 * 2^31 -.word 59373651 // zeta^405 * 2^31 = 28678040^405 * 2^31 -.word 37946425 // zeta^373 * 2^31 = 28678040^373 * 2^31 -.word 47668559 // zeta^309 * 2^31 = 28678040^309 * 2^31 -.word 10775673 // zeta^341 * 2^31 = 28678040^341 * 2^31 -.word 3826249 // zeta^277 * 2^31 = 28678040^277 * 2^31 -.word 262354375 // zeta^373 * (q^(-1) mod 2^32) * 2^31 = 28678040^373 * 375649793 * 2^31 -.word 703707313 // zeta^309 * (q^(-1) mod 2^32) * 2^31 = 28678040^309 * 375649793 * 2^31 -.word 2790542727 // zeta^341 * (q^(-1) mod 2^32) * 2^31 = 28678040^341 * 375649793 * 2^31 -.word 2635626423 // zeta^277 * (q^(-1) mod 2^32) * 2^31 = 28678040^277 * 375649793 * 2^31 -.word 53733071 // zeta^458 * 2^31 = 28678040^458 * 2^31 -.word 10734019 // zeta^330 * 
2^31 = 28678040^330 * 2^31 -.word 25306471 // zeta^394 * 2^31 = 28678040^394 * 2^31 -.word 54139625 // zeta^266 * 2^31 = 28678040^266 * 2^31 -.word 284438321 // zeta^458 * (q^(-1) mod 2^32) * 2^31 = 28678040^458 * 375649793 * 2^31 -.word 3541161021 // zeta^330 * (q^(-1) mod 2^32) * 2^31 = 28678040^330 * 375649793 * 2^31 -.word 2646073497 // zeta^394 * (q^(-1) mod 2^32) * 2^31 = 28678040^394 * 375649793 * 2^31 -.word 3100573463 // zeta^266 * (q^(-1) mod 2^32) * 2^31 = 28678040^266 * 375649793 * 2^31 -.word 1468391 // zeta^485 * 2^31 = 28678040^485 * 2^31 -.word 4426959 // zeta^421 * 2^31 = 28678040^421 * 2^31 -.word 42735737 // zeta^453 * 2^31 = 28678040^453 * 2^31 -.word 38665093 // zeta^389 * 2^31 = 28678040^389 * 2^31 -.word 33133879 // zeta^357 * 2^31 = 28678040^357 * 2^31 -.word 7139481 // zeta^293 * 2^31 = 28678040^293 * 2^31 -.word 8438111 // zeta^325 * 2^31 = 28678040^325 * 2^31 -.word 50341189 // zeta^261 * 2^31 = 28678040^261 * 2^31 -.word 3126759625 // zeta^357 * (q^(-1) mod 2^32) * 2^31 = 28678040^357 * 375649793 * 2^31 -.word 523569511 // zeta^293 * (q^(-1) mod 2^32) * 2^31 = 28678040^293 * 375649793 * 2^31 -.word 1408300193 // zeta^325 * (q^(-1) mod 2^32) * 2^31 = 28678040^325 * 375649793 * 2^31 -.word 2172685499 // zeta^261 * (q^(-1) mod 2^32) * 2^31 = 28678040^261 * 375649793 * 2^31 -.word 47558821 // zeta^498 * 2^31 = 28678040^498 * 2^31 -.word 33268441 // zeta^370 * 2^31 = 28678040^370 * 2^31 -.word 63536237 // zeta^434 * 2^31 = 28678040^434 * 2^31 -.word 26272521 // zeta^306 * 2^31 = 28678040^306 * 2^31 -.word 664584539 // zeta^498 * (q^(-1) mod 2^32) * 2^31 = 28678040^498 * 375649793 * 2^31 -.word 2409420583 // zeta^370 * (q^(-1) mod 2^32) * 2^31 = 28678040^370 * 375649793 * 2^31 -.word 3799958931 // zeta^434 * (q^(-1) mod 2^32) * 2^31 = 28678040^434 * 375649793 * 2^31 -.word 835286775 // zeta^306 * (q^(-1) mod 2^32) * 2^31 = 28678040^306 * 375649793 * 2^31 -.word 1854317 // zeta^505 * 2^31 = 28678040^505 * 2^31 -.word 2223865 // zeta^441 * 2^31 
= 28678040^441 * 2^31 -.word 22962475 // zeta^473 * 2^31 = 28678040^473 * 2^31 -.word 36888515 // zeta^409 * 2^31 = 28678040^409 * 2^31 -.word 59868297 // zeta^377 * 2^31 = 28678040^377 * 2^31 -.word 15191207 // zeta^313 * 2^31 = 28678040^313 * 2^31 -.word 59108143 // zeta^345 * 2^31 = 28678040^345 * 2^31 -.word 4355773 // zeta^281 * 2^31 = 28678040^281 * 2^31 -.word 538432887 // zeta^377 * (q^(-1) mod 2^32) * 2^31 = 28678040^377 * 375649793 * 2^31 -.word 3252336985 // zeta^313 * (q^(-1) mod 2^32) * 2^31 = 28678040^313 * 375649793 * 2^31 -.word 1330506449 // zeta^345 * (q^(-1) mod 2^32) * 2^31 = 28678040^345 * 375649793 * 2^31 -.word 4169984835 // zeta^281 * (q^(-1) mod 2^32) * 2^31 = 28678040^281 * 375649793 * 2^31 -.word 27411989 // zeta^466 * 2^31 = 28678040^466 * 2^31 -.word 52176833 // zeta^338 * 2^31 = 28678040^338 * 2^31 -.word 52660121 // zeta^402 * 2^31 = 28678040^402 * 2^31 -.word 23140553 // zeta^274 * 2^31 = 28678040^274 * 2^31 -.word 652643307 // zeta^466 * (q^(-1) mod 2^32) * 2^31 = 28678040^466 * 375649793 * 2^31 -.word 4178403903 // zeta^338 * (q^(-1) mod 2^32) * 2^31 = 28678040^338 * 375649793 * 2^31 -.word 1113879143 // zeta^402 * (q^(-1) mod 2^32) * 2^31 = 28678040^402 * 375649793 * 2^31 -.word 3574776119 // zeta^274 * (q^(-1) mod 2^32) * 2^31 = 28678040^274 * 375649793 * 2^31 -.word 50275685 // zeta^489 * 2^31 = 28678040^489 * 2^31 -.word 12903773 // zeta^425 * 2^31 = 28678040^425 * 2^31 -.word 25228433 // zeta^457 * 2^31 = 28678040^457 * 2^31 -.word 55395235 // zeta^393 * 2^31 = 28678040^393 * 2^31 -.word 3868449 // zeta^361 * 2^31 = 28678040^361 * 2^31 -.word 66432231 // zeta^297 * 2^31 = 28678040^297 * 2^31 -.word 31236859 // zeta^329 * 2^31 = 28678040^329 * 2^31 -.word 13658415 // zeta^265 * 2^31 = 28678040^265 * 2^31 -.word 2938651359 // zeta^361 * (q^(-1) mod 2^32) * 2^31 = 28678040^361 * 375649793 * 2^31 -.word 814700825 // zeta^297 * (q^(-1) mod 2^32) * 2^31 = 28678040^297 * 375649793 * 2^31 -.word 1618291461 // zeta^329 * (q^(-1) mod 
2^32) * 2^31 = 28678040^329 * 375649793 * 2^31 -.word 49245393 // zeta^265 * (q^(-1) mod 2^32) * 2^31 = 28678040^265 * 375649793 * 2^31 -.word 34409967 // zeta^482 * 2^31 = 28678040^482 * 2^31 -.word 12619783 // zeta^354 * 2^31 = 28678040^354 * 2^31 -.word 54561811 // zeta^418 * 2^31 = 28678040^418 * 2^31 -.word 61632377 // zeta^290 * 2^31 = 28678040^290 * 2^31 -.word 2233616401 // zeta^482 * (q^(-1) mod 2^32) * 2^31 = 28678040^482 * 375649793 * 2^31 -.word 2820912633 // zeta^354 * (q^(-1) mod 2^32) * 2^31 = 28678040^354 * 375649793 * 2^31 -.word 684470765 // zeta^418 * (q^(-1) mod 2^32) * 2^31 = 28678040^418 * 375649793 * 2^31 -.word 3345631879 // zeta^290 * (q^(-1) mod 2^32) * 2^31 = 28678040^290 * 375649793 * 2^31 -.word 7605279 // zeta^497 * 2^31 = 28678040^497 * 2^31 -.word 58319315 // zeta^433 * 2^31 = 28678040^433 * 2^31 -.word 16342937 // zeta^465 * 2^31 = 28678040^465 * 2^31 -.word 48148431 // zeta^401 * 2^31 = 28678040^401 * 2^31 -.word 62377755 // zeta^369 * 2^31 = 28678040^369 * 2^31 -.word 35459369 // zeta^305 * 2^31 = 28678040^305 * 2^31 -.word 27513701 // zeta^337 * 2^31 = 28678040^337 * 2^31 -.word 18346679 // zeta^273 * 2^31 = 28678040^273 * 2^31 -.word 4057153253 // zeta^369 * (q^(-1) mod 2^32) * 2^31 = 28678040^369 * 375649793 * 2^31 -.word 3867838679 // zeta^305 * (q^(-1) mod 2^32) * 2^31 = 28678040^305 * 375649793 * 2^31 -.word 589962907 // zeta^337 * (q^(-1) mod 2^32) * 2^31 = 28678040^337 * 375649793 * 2^31 -.word 1692873545 // zeta^273 * (q^(-1) mod 2^32) * 2^31 = 28678040^273 * 375649793 * 2^31 -.word 1824951 // zeta^450 * 2^31 = 28678040^450 * 2^31 -.word 40410247 // zeta^322 * 2^31 = 28678040^322 * 2^31 -.word 25935987 // zeta^386 * 2^31 = 28678040^386 * 2^31 -.word 53409853 // zeta^258 * 2^31 = 28678040^258 * 2^31 -.word 3034533193 // zeta^450 * (q^(-1) mod 2^32) * 2^31 = 28678040^450 * 375649793 * 2^31 -.word 1425582457 // zeta^322 * (q^(-1) mod 2^32) * 2^31 = 28678040^322 * 375649793 * 2^31 -.word 1695333773 // zeta^386 * (q^(-1) mod 
2^32) * 2^31 = 28678040^386 * 375649793 * 2^31 -.word 2628741571 // zeta^258 * (q^(-1) mod 2^32) * 2^31 = 28678040^258 * 375649793 * 2^31 -.word 44896477 // zeta^481 * 2^31 = 28678040^481 * 2^31 -.word 66621379 // zeta^417 * 2^31 = 28678040^417 * 2^31 -.word 35702907 // zeta^449 * 2^31 = 28678040^449 * 2^31 -.word 44158149 // zeta^385 * 2^31 = 28678040^385 * 2^31 -.word 32881793 // zeta^353 * 2^31 = 28678040^353 * 2^31 -.word 18033685 // zeta^289 * 2^31 = 28678040^289 * 2^31 -.word 29367795 // zeta^321 * 2^31 = 28678040^321 * 2^31 -.word 16787671 // zeta^257 * 2^31 = 28678040^257 * 2^31 -.word 3741535615 // zeta^353 * (q^(-1) mod 2^32) * 2^31 = 28678040^353 * 375649793 * 2^31 -.word 3094455787 // zeta^289 * (q^(-1) mod 2^32) * 2^31 = 28678040^289 * 375649793 * 2^31 -.word 3934216205 // zeta^321 * (q^(-1) mod 2^32) * 2^31 = 28678040^321 * 375649793 * 2^31 -.word 2459712809 // zeta^257 * (q^(-1) mod 2^32) * 2^31 = 28678040^257 * 375649793 * 2^31 -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 57130935 // zeta^440 * 2^31 = 
28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 
28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 
28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 
28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 
28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31 -.word 17514581 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 4460971 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_n256_u32_33556993_28678040, %function -.global inv_ntt_n256_u32_33556993_28678040 -inv_ntt_n256_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 
3919317503 -movw r4, #:lower16:modulus_inv -movt r4, #:upper16:modulus_inv -vldrw.s32 Q4, [r0, #0] -vldrw.s32 Q5, [r0, #16] -vsub.s32 Q6, Q4, Q5 -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vqrdmulh.s32 Q5, Q6, Q5 -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #48] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #64] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #80 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-64] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q6, Q6, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q6, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q6, Q1, Q2
-vsub.s32 Q3, Q4, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q4, Q7
-vqrdmlah.s32 Q6, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q4, Q5, Q6
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q6
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q4, Q7
-vldrw.s32 Q6, [r0, #(64+0)]
-vmul.u32 Q4, Q4, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q4, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q4, Q6, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q6, Q6, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q4, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q4, Q4, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q4, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q4, Q1, Q2
-vsub.s32 Q3, Q6, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q6, Q7
-vqrdmlah.s32 Q4, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q6, Q5, Q4
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q4
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q6, Q7
-vldrw.s32 Q4, [r0, #(64+0)]
-vmul.u32 Q6, Q6, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q6, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q6, Q4, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q6, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q6, Q6, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q6, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q6, Q1, Q2
-vsub.s32 Q3, Q4, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q4, Q7
-vqrdmlah.s32 Q6, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q4, Q5, Q6
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q6
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q4, Q7
-vldrw.s32 Q6, [r0, #(64+0)]
-vmul.u32 Q4, Q4, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q4, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q4, Q6, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q6, Q6, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q4, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q4, Q4, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q4, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q4, Q1, Q2
-vsub.s32 Q3, Q6, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q6, Q7
-vqrdmlah.s32 Q4, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q6, Q5, Q4
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q4
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q6, Q7
-vldrw.s32 Q4, [r0, #(64+0)]
-vmul.u32 Q6, Q6, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q6, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q6, Q4, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q6, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q6, Q6, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q6, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q6, Q1, Q2
-vsub.s32 Q3, Q4, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q4, Q7
-vqrdmlah.s32 Q6, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q4, Q5, Q6
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q6
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q4, Q7
-vldrw.s32 Q6, [r0, #(64+0)]
-vmul.u32 Q4, Q4, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q4, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q4, Q6, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q6, Q6, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q4, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q4, Q4, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q4, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q4, Q1, Q2
-vsub.s32 Q3, Q6, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q6, Q7
-vqrdmlah.s32 Q4, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q6, Q5, Q4
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q4
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q6, Q7
-vldrw.s32 Q4, [r0, #(64+0)]
-vmul.u32 Q6, Q6, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q6, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q6, Q4, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q4, Q4, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q6, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q6, Q6, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q6, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q6, Q1, Q2
-vsub.s32 Q3, Q4, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q4, Q7
-vqrdmlah.s32 Q6, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q4, Q5, Q6
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q6
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q4, Q7
-vldrw.s32 Q6, [r0, #(64+0)]
-vmul.u32 Q4, Q4, Q5
-vldrw.s32 Q5, [r0, #(64+16)]
-vqrdmlah.s32 Q3, Q4, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vsub.s32 Q4, Q6, Q5
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]
-vadd.s32 Q6, Q6, Q5
-vldrw.s32 Q5, [r11, #32]
-vmul.u32 Q7, Q5, r4
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q4, Q5
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vmul.u32 Q4, Q4, Q7
-vldrw.s32 Q7, [r0, #32]
-vqrdmlah.s32 Q5, Q4, r12
-vldrw.s32 Q0, [r0, #48]
-vsub.s32 Q1, Q7, Q0
-vldrw.s32 Q2, [r11, #48]
-vadd.s32 Q7, Q7, Q0
-vqrdmulh.s32 Q4, Q1, Q2
-vsub.s32 Q3, Q6, Q7
-vldrw.s32 Q2, [r11, #64]
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q6, Q7
-vqrdmlah.s32 Q4, Q1, r12
-vldrw.s32 Q7, [r11], #80
-vsub.s32 Q6, Q5, Q4
-vqrdmulh.s32 Q2, Q3, Q7
-vadd.s32 Q1, Q5, Q4
-vldrw.s32 Q5, [r11, #-64]
-vmul.u32 Q3, Q3, Q5
-vqrdmlah.s32 Q2, Q3, r12
-vqrdmulh.s32 Q3, Q6, Q7
-vmul.u32 Q6, Q6, Q5
-vqrdmlah.s32 Q3, Q6, r12
-vst40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vst41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vst42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vst43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-sub r0, r0, #1024
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #0]
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #16]
-vsub.s32 Q0, Q2, Q3
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #32]
-vadd.s32 Q2, Q2, Q3
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #48]
-vqrdmulh.s32 Q3, Q0, r8
-vsub.s32 Q1, Q4, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q4, Q4, Q5
-vqrdmlah.s32 Q3, Q0, r12
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q2, Q4
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q5, Q1, r12
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #64]
-vqrdmulh.s32 Q4, Q0, r10
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #80]
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q4, Q0, r12
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[16]: Already loaded as Q6
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #96]
-vadd.s32 Q6, Q6, Q7
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #112]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q6, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #128]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[36]: Load as Q5
-vldrw.u32 Q5, [r0, #144]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[32]: Already loaded as Q4
-// input[36]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q4, Q4, Q5
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #176]
-vqrdmulh.s32 Q5, Q0, r8
-vsub.s32 Q1, Q2, Q6
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q6
-vqrdmlah.s32 Q5, Q0, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vqrdmulh.s32 Q6, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q6, Q1, r12
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q5, Q6
-vmul.u32 Q0, Q0, r9
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #208]
-vadd.s32 Q5, Q5, Q6
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q6, Q1, r10
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q5, [r0,#(144)]
-// Release input[36] from Q5
-vqrdmlah.s32 Q6, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q3
-// input[52]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #224]
-vadd.s32 Q3, Q3, Q7
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q6, [r0,#(176)]
-// Release input[44] from Q6
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q6
-vldrw.u32 Q6, [r0, #272]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(208)]
-// Release input[52] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q5
-// input[68]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #288]
-vadd.s32 Q5, Q5, Q6
-// input[76]: Load as Q3
-vldrw.u32 Q3, [r0, #304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(240)]
-// Release input[60] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(256)]
-// Release input[64] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(272)]
-// Release input[68] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[80]: Already loaded as Q4
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #352]
-vadd.s32 Q4, Q4, Q7
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #368]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(304)]
-// Release input[76] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #384]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q6
-vldrw.u32 Q6, [r0, #400]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-// Release input[88] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[96]: Already loaded as Q3
-// input[100]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q3, Q3, Q6
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(368)]
-// Release input[92] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #464]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(400)]
-// Release input[100] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #480]
-vadd.s32 Q5, Q5, Q7
-// input[124]: Load as Q3
-vldrw.u32 Q3, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[128]: Load as Q4
-vldrw.u32 Q4, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[132]: Load as Q6
-vldrw.u32 Q6, [r14, #-480]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(464)]
-// Release input[116] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q4
-// input[132]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #-464]
-vadd.s32 Q4, Q4, Q6
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #-448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(496)]
-// Release input[124] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #-432]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-496)]
-// Release input[128] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #-400]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #-368]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #-352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #-288]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #-272]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #-224]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #-208]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #-192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #-176]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #-144]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #-128]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #-112]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #-96]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #-64]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #-32]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #-16]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #64]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #80]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(0)]
-// Release input[0] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Already loaded as Q4
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #144]
-vadd.s32 Q4, Q4, Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #208]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #96]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Already loaded as Q3
-// input[24]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #224]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #112]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #384]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #400]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #288]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #480]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #368]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #432]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #-432]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #-368]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #-480]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #-352]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #-464]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #-400]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #-272]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #-384]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #-320]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #-176]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #-112]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #-96]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #-32]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #-144]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #-16]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #-192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #-128]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #-64]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #256]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0 // XXXXX
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 50631221
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 2147319755
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #256]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #-240]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1,
Q2, r12 -vqrdmulh.s32 Q4, Q5, r6 -vsub.s32 Q2, Q0, Q3 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q4, Q5, r12 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #16] -vqrdmulh.s32 Q3, Q2, r10 -vsub.s32 Q6, Q1, Q4 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q3, Q2, r12 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #272] -vqrdmulh.s32 Q2, Q0, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q0, Q0, r2 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vqrdmlah.s32 Q2, Q0, r12 -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q1, Q1, r2 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #-480] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(0)] -vqrdmulh.s32 Q4, Q6, r10 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q1 -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[132]: Already loaded as Q3 -vqrdmlah.s32 Q4, Q6, r12 -vadd.s32 Q5, Q5, Q7 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #-224] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q2, Q3, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q4, Q4, #1 -vpt.s32 LT, Q4, r11 -vaddt.s32 Q4, Q4, r12 -vpt.s32 GE, Q4, r4 -vsubt.s32 Q4, Q4, r12 -vstrw.u32 Q4, [r14,#(-240)] -// Release input[192] from Q4 -vqrdmulh.s32 Q1, Q2, r6 -vsub.s32 Q0, Q5, Q3 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q2, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q3, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q3, Q0, r12 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #288] -vqrdmulh.s32 Q0, Q5, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 
-vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q5, Q5, r2 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vqrdmlah.s32 Q0, Q5, r12 -// Release input[4] from Q5 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #-464] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(16)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q7 -// input[8]: Already loaded as Q2 -// input[72]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[136]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #304] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-464)] -// Release input[136] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #-448] 
-vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(32)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q6 -// input[12]: Already loaded as Q1 -// input[76]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[140]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #-192] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #64] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #320] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #-432] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(48)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q7 -// input[16]: Already loaded as Q3 -// input[80]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[144]: Already loaded as Q5 -vqrdmlah.s32 
Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #-416] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(64)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q6 -// input[20]: Already loaded as Q2 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[148]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #-160] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, 
Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #352] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #-400] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(80)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q7 -// input[24]: Already loaded as Q1 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[152]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #-144] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #368] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 
-vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #-384] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #384] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, 
[r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #400] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #-352] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// 
input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #416] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #-336] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 
-vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #432] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #-320] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #448] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 
-vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #-304] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #208] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #-288] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] 
-vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #480] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: 
Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #496] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 
-vqrdmlah.s32 Q2, Q4, r12
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q3, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(240)]
-vqrdmulh.s32 Q2, Q3, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q3, Q3, r9
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q7
-vqrdmlah.s32 Q2, Q3, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-// Restore MVE vector registers
-vpop {d0-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3172
-// Instruction count: 2670
\ No newline at end of file
diff --git a/tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s b/tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s
deleted file mode 100644
index 418f64a..0000000
--- a/tests/saber/auto/inv_ntt_n256_u32_33556993_28678040_incomplete.s
+++ /dev/null
@@ -1,2527 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, 
subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots_inv: -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 
2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 
2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 
= 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 
= 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 
28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31 -.word 36501331 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 17843885 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_n256_u32_33556993_28678040_incomplete, %function -.global inv_ntt_n256_u32_33556993_28678040_incomplete -inv_ntt_n256_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #16] -vsub.s32 Q0, Q2, Q3 -// input[8]: 
Load as Q4 -vldrw.u32 Q4, [r0, #32] -vadd.s32 Q2, Q2, Q3 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #48] -vqrdmulh.s32 Q3, Q0, r8 -vsub.s32 Q1, Q4, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q4, Q4, Q5 -vqrdmlah.s32 Q3, Q0, r12 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q2, Q4 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q5, Q1, r12 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #64] -vqrdmulh.s32 Q4, Q0, r10 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #80] -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q4, Q0, r12 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[16]: Already loaded as Q6 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #96] -vadd.s32 Q6, Q6, Q7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q6, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #128] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #144] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[32]: Already loaded as Q4 -// input[36]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -// 
input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #160] -vadd.s32 Q4, Q4, Q5 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #176] -vqrdmulh.s32 Q5, Q0, r8 -vsub.s32 Q1, Q2, Q6 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q6 -vqrdmlah.s32 Q5, Q0, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q6, Q1, r12 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #192] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q5, Q6 -vmul.u32 Q0, Q0, r9 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #208] -vadd.s32 Q5, Q5, Q6 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q6, Q1, r10 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q5, [r0,#(144)] -// Release input[36] from Q5 -vqrdmlah.s32 Q6, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q3 -// input[52]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vadd.s32 Q3, Q3, Q7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #240] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q6, [r0,#(176)] -// Release input[44] from Q6 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #256] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q6 -vldrw.u32 Q6, [r0, #272] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(208)] -// Release input[52] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// 
input[64]: Already loaded as Q5 -// input[68]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #288] -vadd.s32 Q5, Q5, Q6 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #304] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(240)] -// Release input[60] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(256)] -// Release input[64] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(272)] -// Release input[68] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[80]: Already loaded as Q4 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #352] -vadd.s32 Q4, Q4, Q7 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #368] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q6 -vldrw.u32 Q6, [r0, #400] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q5, 
Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[96]: Already loaded as Q3 -// input[100]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #416] -vadd.s32 Q3, Q3, Q6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #448] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(400)] -// Release input[100] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vadd.s32 Q5, Q5, Q7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #496] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #-496] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #-480] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(448)] -// Release input[112] 
from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(464)] -// Release input[116] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q4 -// input[132]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vadd.s32 Q4, Q4, Q6 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #-448] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #-416] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-496)] -// Release input[128] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Already loaded as Q3 -// input[148]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #-400] -vadd.s32 Q3, Q3, Q7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #-352] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 
-vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-416)] -// Release input[148] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[160]: Already loaded as Q5 -// input[164]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #-336] -vadd.s32 Q5, Q5, Q6 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #-304] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[180]: Load as Q7 -vldrw.u32 Q7, [r14, #-288] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-352)] -// Release input[164] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q4 -// input[180]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #-272] -vadd.s32 Q4, Q4, Q7 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #-240] -vqrdmulh.s32 Q2, Q0, r10 
-vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[196]: Load as Q6 -vldrw.u32 Q6, [r14, #-224] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-304)] -// Release input[176] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-288)] -// Release input[180] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Already loaded as Q3 -// input[196]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #-208] -vadd.s32 Q3, Q3, Q6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #-192] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #-176] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #-160] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-224)] -// Release input[196] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[208]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #-144] -vadd.s32 Q5, Q5, Q7 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 
Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #-112] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #-96] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-176)] -// Release input[208] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-160)] -// Release input[212] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[224]: Already loaded as Q4 -// input[228]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vadd.s32 Q4, Q4, Q6 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #-64] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #-48] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[244]: Load as Q7 -vldrw.u32 Q7, [r14, #-32] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-96)] -// Release input[228] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q3 -// input[244]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #-16] -vadd.s32 Q3, Q3, Q7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 
Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-64)] -// Release input[236] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[0]: Load as Q5 -vldrw.u32 Q5, [r0, #0] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #64] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-32)] -// Release input[244] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Already loaded as Q5 -// input[16]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vadd.s32 Q5, Q5, Q6 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #192] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(0)] -// Release input[252] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #80] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(0)] -// Release input[0] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Already loaded as Q4 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #144] -vadd.s32 Q4, Q4, Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #208] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 
-vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #96] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Already loaded as Q3 -// input[24]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #160] -vadd.s32 Q3, Q3, Q6 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #224] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(208)] -// Release input[52] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #48] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #112] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(96)] -// Release input[24] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[12]: Already loaded as Q5 -// input[28]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vadd.s32 Q5, Q5, Q7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #240] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(224)] -// Release input[56] 
from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #256] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #320] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(112)] -// Release input[28] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q4 -// input[80]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vadd.s32 Q4, Q4, Q6 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #448] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(256)] -// Release input[64] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(320)] -// Release input[80] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -// input[68]: Already loaded as Q3 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #400] -vadd.s32 Q3, Q3, Q7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(448)] -// Release 
input[112] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[72]: Load as Q5 -vldrw.u32 Q5, [r0, #288] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #352] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(400)] -// Release input[100] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -// input[72]: Already loaded as Q5 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #416] -vadd.s32 Q5, Q5, Q6 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #368] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(288)] -// Release input[72] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(352)] -// Release input[88] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[76]: Already loaded as Q4 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #432] -vadd.s32 Q4, Q4, Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #496] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, 
Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #-496] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q6 -vldrw.u32 Q6, [r14, #-432] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(368)] -// Release input[92] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q3 -// input[144]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #-368] -vadd.s32 Q3, Q3, Q6 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #-304] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(496)] -// Release input[124] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[132]: Load as Q5 -vldrw.u32 Q5, [r14, #-480] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #-416] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-432)] -// Release input[144] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[132]: Already loaded as Q5 -// input[148]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #-352] -vadd.s32 Q5, Q5, Q7 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-304)] -// Release 
input[176] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #-464] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #-400] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-480)] -// Release input[132] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-416)] -// Release input[148] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -// input[136]: Already loaded as Q4 -// input[152]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #-336] -vadd.s32 Q4, Q4, Q6 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #-384] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-400)] -// Release input[152] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -// input[140]: Already loaded as Q3 -// input[156]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #-320] -vadd.s32 Q3, Q3, Q7 -// input[188]: Load as Q4 -vldrw.u32 Q4, [r14, #-256] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from 
Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[192]: Load as Q5 -vldrw.u32 Q5, [r14, #-240] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[208]: Load as Q6 -vldrw.u32 Q6, [r14, #-176] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-384)] -// Release input[156] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Already loaded as Q5 -// input[208]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #-112] -vadd.s32 Q5, Q5, Q6 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(-256)] -// Release input[188] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #-160] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-240)] -// Release input[192] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-176)] -// Release input[208] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[196]: Already loaded as Q4 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vadd.s32 Q4, Q4, Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 
-vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q6 -vldrw.u32 Q6, [r14, #-144] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-160)] -// Release input[212] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[200]: Already loaded as Q3 -// input[216]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vadd.s32 Q3, Q3, Q6 -// input[248]: Load as Q4 -vldrw.u32 Q4, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r14,#(-32)] -// Release input[244] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[204]: Load as Q5 -vldrw.u32 Q5, [r14, #-192] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[220]: Load as Q7 -vldrw.u32 Q7, [r14, #-128] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-144)] -// Release input[216] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[204]: Already loaded as Q5 -// input[220]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vadd.s32 Q5, Q5, Q7 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, 
[r14,#(-16)] -// Release input[248] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[0]: Load as Q4 -vldrw.u32 Q4, [r0, #0] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q6 -vldrw.u32 Q6, [r0, #256] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-192)] -// Release input[204] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-128)] -// Release input[220] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -// Release input[0] from Q4 -// Release input[64] from Q6 -mov r10, #0 -.equ const_barrett, 63 -movw r9, #:lower16:const_barrett -movt r9, #:upper16:const_barrett -vidup.u32 Q0, r10, #1 -vshl.u32 Q0, Q0, #6 -vldrw.32 Q1, [r0, Q0, UXTW #2] -vqrdmulh.s32 Q2, Q1, r9 -neg r12, r12 -vmla.s32 Q1, Q2, r12 -neg r12, r12 -vstrw.32 Q1, [r0, Q0, UXTW #2] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -mov r11, #0 // XXXXX -.equ q_half, 16778496 -movw r4, #:lower16:q_half -movt r4, #:upper16:q_half -.equ pow_2_n_mod_q, 34739919 -movw r3, #:lower16:pow_2_n_mod_q -movt r3, #:upper16:pow_2_n_mod_q -.equ pow_2_n_mod_q_twisted, 4294311729 -movw r2, #:lower16:pow_2_n_mod_q_twisted -movt r2, #:upper16:pow_2_n_mod_q_twisted -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vsub.s32 Q2, Q0, Q1 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #-496] -vadd.s32 Q0, Q0, Q1 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #-240] -vqrdmulh.s32 Q1, Q2, r8 -vsub.s32 Q5, Q3, Q4 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q4 -vqrdmlah.s32 Q1, Q2, r12 -vqrdmulh.s32 Q4, Q5, r6 -vsub.s32 Q2, Q0, Q3 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q4, Q5, r12 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #16] -vqrdmulh.s32 Q3, Q2, r10 -vsub.s32 Q6, 
Q1, Q4 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q3, Q2, r12 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #272] -vqrdmulh.s32 Q2, Q0, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q0, Q0, r2 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vqrdmlah.s32 Q2, Q0, r12 -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q1, Q1, r2 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #-480] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(0)] -vqrdmulh.s32 Q4, Q6, r10 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q1 -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[132]: Already loaded as Q3 -vqrdmlah.s32 Q4, Q6, r12 -vadd.s32 Q5, Q5, Q7 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #-224] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q2, Q3, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q4, Q4, #1 -vpt.s32 LT, Q4, r11 -vaddt.s32 Q4, Q4, r12 -vpt.s32 GE, Q4, r4 -vsubt.s32 Q4, Q4, r12 -vstrw.u32 Q4, [r14,#(-240)] -// Release input[192] from Q4 -vqrdmulh.s32 Q1, Q2, r6 -vsub.s32 Q0, Q5, Q3 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q2, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q3, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q3, Q0, r12 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #288] -vqrdmulh.s32 Q0, Q5, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q5, Q5, r2 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vqrdmlah.s32 Q0, Q5, r12 -// Release input[4] from Q5 -vqrdmulh.s32 Q3, 
Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #-464] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(16)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q7 -// input[8]: Already loaded as Q2 -// input[72]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[136]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #304] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-464)] -// Release input[136] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #-448] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(32)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, 
[r0,#(288)] -// Release input[72] from Q6 -// input[12]: Already loaded as Q1 -// input[76]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[140]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #-192] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #64] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #320] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #-432] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(48)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q7 -// input[16]: Already loaded as Q3 -// input[80]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[144]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 
-vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #-416] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(64)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q6 -// input[20]: Already loaded as Q2 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[148]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #-160] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 
-vqrdmlah.s32 Q5, Q0, r12 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #352] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #-400] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(80)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q7 -// input[24]: Already loaded as Q1 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[152]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #-144] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #368] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, 
Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #-384] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #384] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] 
from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #400] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #-352] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 
Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #416] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #-336] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// 
input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #432] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #-320] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #448] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, 
r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #-304] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #208] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #-288] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: 
Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #480] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 
GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #496] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, 
r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2303 -// Instruction count: 1802 \ No newline at end of file diff --git a/tests/saber/auto/inv_ntt_u32_33556993_28678040_complete.s b/tests/saber/auto/inv_ntt_u32_33556993_28678040_complete.s deleted file mode 100644 index f7b1e83..0000000 --- a/tests/saber/auto/inv_ntt_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,3468 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots_inv: -.word 20558213 // zeta^510 * 2^31 = 28678040^510 * 2^31 -.word 66424611 // zeta^382 * 2^31 = 28678040^382 * 2^31 -.word 59465515 // zeta^446 * 2^31 = 28678040^446 * 2^31 -.word 39560591 // zeta^318 * 2^31 = 28678040^318 * 2^31 -.word 2042724475 // zeta^510 * (q^(-1) mod 2^32) * 2^31 = 28678040^510 * 375649793 * 2^31 -.word 2817904349 // zeta^382 * (q^(-1) mod 2^32) * 2^31 = 28678040^382 * 375649793 * 2^31 -.word 2405453525 // zeta^446 * (q^(-1) mod 2^32) * 2^31 = 28678040^446 * 375649793 * 2^31 -.word 2621436017 // zeta^318 * (q^(-1) mod 2^32) * 2^31 = 28678040^318 * 375649793 * 2^31 -.word 35339857 // zeta^511 * 2^31 = 28678040^511 * 2^31 -.word 13377101 // zeta^447 * 2^31 = 28678040^447 * 2^31 -.word 33252123 // zeta^479 * 2^31 = 28678040^479 * 2^31 -.word 16713319 // zeta^415 * 2^31 = 28678040^415 * 2^31 -.word 3232754607 // zeta^511 * (q^(-1) mod 2^32) * 2^31 = 28678040^511 * 375649793 * 2^31 -.word 2219762611 // zeta^447 * (q^(-1) mod 2^32) * 2^31 = 28678040^447 * 375649793 * 2^31 -.word 3344411365 // zeta^479 * (q^(-1) mod 2^32) * 2^31 = 28678040^479 * 375649793 * 2^31 -.word 2600796057 // zeta^415 * (q^(-1) mod 2^32) * 2^31 = 28678040^415 * 375649793 * 2^31 -.word 10815985 // zeta^383 * 2^31 = 28678040^383 * 2^31 -.word 56247925 // zeta^319 * 2^31 = 28678040^319 * 2^31 -.word 26943959 // zeta^351 * 2^31 = 28678040^351 * 2^31 -.word 51316823 // 
zeta^287 * 2^31 = 28678040^287 * 2^31 -.word 3650773007 // zeta^383 * (q^(-1) mod 2^32) * 2^31 = 28678040^383 * 375649793 * 2^31 -.word 4021439371 // zeta^319 * (q^(-1) mod 2^32) * 2^31 = 28678040^319 * 375649793 * 2^31 -.word 1538999337 // zeta^351 * (q^(-1) mod 2^32) * 2^31 = 28678040^351 * 375649793 * 2^31 -.word 3611844009 // zeta^287 * (q^(-1) mod 2^32) * 2^31 = 28678040^287 * 375649793 * 2^31 -.word 42042379 // zeta^478 * 2^31 = 28678040^478 * 2^31 -.word 26419651 // zeta^350 * 2^31 = 28678040^350 * 2^31 -.word 61522009 // zeta^414 * 2^31 = 28678040^414 * 2^31 -.word 23758817 // zeta^286 * 2^31 = 28678040^286 * 2^31 -.word 2254105077 // zeta^478 * (q^(-1) mod 2^32) * 2^31 = 28678040^478 * 375649793 * 2^31 -.word 3415374909 // zeta^350 * (q^(-1) mod 2^32) * 2^31 = 28678040^350 * 375649793 * 2^31 -.word 3742677415 // zeta^414 * (q^(-1) mod 2^32) * 2^31 = 28678040^414 * 375649793 * 2^31 -.word 3187687967 // zeta^286 * (q^(-1) mod 2^32) * 2^31 = 28678040^286 * 375649793 * 2^31 -.word 35776599 // zeta^495 * 2^31 = 28678040^495 * 2^31 -.word 6731445 // zeta^431 * 2^31 = 28678040^431 * 2^31 -.word 3030459 // zeta^463 * 2^31 = 28678040^463 * 2^31 -.word 41085059 // zeta^399 * 2^31 = 28678040^399 * 2^31 -.word 351632809 // zeta^495 * (q^(-1) mod 2^32) * 2^31 = 28678040^495 * 375649793 * 2^31 -.word 369646411 // zeta^431 * (q^(-1) mod 2^32) * 2^31 = 28678040^431 * 375649793 * 2^31 -.word 2670661701 // zeta^463 * (q^(-1) mod 2^32) * 2^31 = 28678040^463 * 375649793 * 2^31 -.word 1702245757 // zeta^399 * (q^(-1) mod 2^32) * 2^31 = 28678040^399 * 375649793 * 2^31 -.word 6685305 // zeta^367 * 2^31 = 28678040^367 * 2^31 -.word 24840267 // zeta^303 * 2^31 = 28678040^303 * 2^31 -.word 21119839 // zeta^335 * 2^31 = 28678040^335 * 2^31 -.word 32376869 // zeta^271 * 2^31 = 28678040^271 * 2^31 -.word 2658056071 // zeta^367 * (q^(-1) mod 2^32) * 2^31 = 28678040^367 * 375649793 * 2^31 -.word 495707573 // zeta^303 * (q^(-1) mod 2^32) * 2^31 = 28678040^303 * 375649793 * 2^31 -.word 
440627873 // zeta^335 * (q^(-1) mod 2^32) * 2^31 = 28678040^335 * 375649793 * 2^31 -.word 3991890395 // zeta^271 * (q^(-1) mod 2^32) * 2^31 = 28678040^271 * 375649793 * 2^31 -.word 11319751 // zeta^494 * 2^31 = 28678040^494 * 2^31 -.word 57449959 // zeta^366 * 2^31 = 28678040^366 * 2^31 -.word 47736605 // zeta^430 * 2^31 = 28678040^430 * 2^31 -.word 25310795 // zeta^302 * 2^31 = 28678040^302 * 2^31 -.word 316214329 // zeta^494 * (q^(-1) mod 2^32) * 2^31 = 28678040^494 * 375649793 * 2^31 -.word 2994890777 // zeta^366 * (q^(-1) mod 2^32) * 2^31 = 28678040^366 * 375649793 * 2^31 -.word 2883238627 // zeta^430 * (q^(-1) mod 2^32) * 2^31 = 28678040^430 * 375649793 * 2^31 -.word 1834006453 // zeta^302 * (q^(-1) mod 2^32) * 2^31 = 28678040^302 * 375649793 * 2^31 -.word 5649915 // zeta^503 * 2^31 = 28678040^503 * 2^31 -.word 25847843 // zeta^439 * 2^31 = 28678040^439 * 2^31 -.word 62444027 // zeta^471 * 2^31 = 28678040^471 * 2^31 -.word 57855139 // zeta^407 * 2^31 = 28678040^407 * 2^31 -.word 3048839173 // zeta^503 * (q^(-1) mod 2^32) * 2^31 = 28678040^503 * 375649793 * 2^31 -.word 3067803101 // zeta^439 * (q^(-1) mod 2^32) * 2^31 = 28678040^439 * 375649793 * 2^31 -.word 2624519173 // zeta^471 * (q^(-1) mod 2^32) * 2^31 = 28678040^471 * 375649793 * 2^31 -.word 2262798685 // zeta^407 * (q^(-1) mod 2^32) * 2^31 = 28678040^407 * 375649793 * 2^31 -.word 43953263 // zeta^375 * 2^31 = 28678040^375 * 2^31 -.word 3973257 // zeta^311 * 2^31 = 28678040^311 * 2^31 -.word 45754835 // zeta^343 * 2^31 = 28678040^343 * 2^31 -.word 47438647 // zeta^279 * 2^31 = 28678040^279 * 2^31 -.word 1254205841 // zeta^375 * (q^(-1) mod 2^32) * 2^31 = 28678040^375 * 375649793 * 2^31 -.word 3800349047 // zeta^311 * (q^(-1) mod 2^32) * 2^31 = 28678040^311 * 375649793 * 2^31 -.word 3397129261 // zeta^343 * (q^(-1) mod 2^32) * 2^31 = 28678040^343 * 375649793 * 2^31 -.word 3896527561 // zeta^279 * (q^(-1) mod 2^32) * 2^31 = 28678040^279 * 375649793 * 2^31 -.word 34946213 // zeta^462 * 2^31 = 28678040^462 * 
2^31 -.word 33401995 // zeta^334 * 2^31 = 28678040^334 * 2^31 -.word 57707227 // zeta^398 * 2^31 = 28678040^398 * 2^31 -.word 43655235 // zeta^270 * 2^31 = 28678040^270 * 2^31 -.word 4090836315 // zeta^462 * (q^(-1) mod 2^32) * 2^31 = 28678040^462 * 375649793 * 2^31 -.word 2389950837 // zeta^334 * (q^(-1) mod 2^32) * 2^31 = 28678040^334 * 375649793 * 2^31 -.word 1383072549 // zeta^398 * (q^(-1) mod 2^32) * 2^31 = 28678040^398 * 375649793 * 2^31 -.word 2793176509 // zeta^270 * (q^(-1) mod 2^32) * 2^31 = 28678040^270 * 375649793 * 2^31 -.word 30218957 // zeta^487 * 2^31 = 28678040^487 * 2^31 -.word 13073717 // zeta^423 * 2^31 = 28678040^423 * 2^31 -.word 41547715 // zeta^455 * 2^31 = 28678040^455 * 2^31 -.word 51082899 // zeta^391 * 2^31 = 28678040^391 * 2^31 -.word 3945457459 // zeta^487 * (q^(-1) mod 2^32) * 2^31 = 28678040^487 * 375649793 * 2^31 -.word 1399362763 // zeta^423 * (q^(-1) mod 2^32) * 2^31 = 28678040^423 * 375649793 * 2^31 -.word 923248189 // zeta^455 * (q^(-1) mod 2^32) * 2^31 = 28678040^455 * 375649793 * 2^31 -.word 2083145581 // zeta^391 * (q^(-1) mod 2^32) * 2^31 = 28678040^391 * 375649793 * 2^31 -.word 6539853 // zeta^359 * 2^31 = 28678040^359 * 2^31 -.word 52712977 // zeta^295 * 2^31 = 28678040^295 * 2^31 -.word 15171525 // zeta^327 * 2^31 = 28678040^327 * 2^31 -.word 41070365 // zeta^263 * 2^31 = 28678040^263 * 2^31 -.word 1097807795 // zeta^359 * (q^(-1) mod 2^32) * 2^31 = 28678040^359 * 375649793 * 2^31 -.word 1402229743 // zeta^295 * (q^(-1) mod 2^32) * 2^31 = 28678040^295 * 375649793 * 2^31 -.word 857879099 // zeta^327 * (q^(-1) mod 2^32) * 2^31 = 28678040^327 * 375649793 * 2^31 -.word 2467328739 // zeta^263 * (q^(-1) mod 2^32) * 2^31 = 28678040^263 * 375649793 * 2^31 -.word 1421525 // zeta^502 * 2^31 = 28678040^502 * 2^31 -.word 5608953 // zeta^374 * 2^31 = 28678040^374 * 2^31 -.word 3344309 // zeta^438 * 2^31 = 28678040^438 * 2^31 -.word 54192527 // zeta^310 * 2^31 = 28678040^310 * 2^31 -.word 2006884651 // zeta^502 * (q^(-1) mod 2^32) * 
2^31 = 28678040^502 * 375649793 * 2^31 -.word 1547838471 // zeta^374 * (q^(-1) mod 2^32) * 2^31 = 28678040^374 * 375649793 * 2^31 -.word 1835403851 // zeta^438 * (q^(-1) mod 2^32) * 2^31 = 28678040^438 * 375649793 * 2^31 -.word 3288902769 // zeta^310 * (q^(-1) mod 2^32) * 2^31 = 28678040^310 * 375649793 * 2^31 -.word 55532487 // zeta^507 * 2^31 = 28678040^507 * 2^31 -.word 25878283 // zeta^443 * 2^31 = 28678040^443 * 2^31 -.word 7519477 // zeta^475 * 2^31 = 28678040^475 * 2^31 -.word 10400227 // zeta^411 * 2^31 = 28678040^411 * 2^31 -.word 579496505 // zeta^507 * (q^(-1) mod 2^32) * 2^31 = 28678040^507 * 375649793 * 2^31 -.word 1491046133 // zeta^443 * (q^(-1) mod 2^32) * 2^31 = 28678040^443 * 375649793 * 2^31 -.word 2637878539 // zeta^475 * (q^(-1) mod 2^32) * 2^31 = 28678040^475 * 375649793 * 2^31 -.word 866659357 // zeta^411 * (q^(-1) mod 2^32) * 2^31 = 28678040^411 * 375649793 * 2^31 -.word 66449241 // zeta^379 * 2^31 = 28678040^379 * 2^31 -.word 4428811 // zeta^315 * 2^31 = 28678040^315 * 2^31 -.word 30618985 // zeta^347 * 2^31 = 28678040^347 * 2^31 -.word 46942975 // zeta^283 * 2^31 = 28678040^283 * 2^31 -.word 1923058343 // zeta^379 * (q^(-1) mod 2^32) * 2^31 = 28678040^379 * 375649793 * 2^31 -.word 3711490549 // zeta^315 * (q^(-1) mod 2^32) * 2^31 = 28678040^315 * 375649793 * 2^31 -.word 1530848407 // zeta^347 * (q^(-1) mod 2^32) * 2^31 = 28678040^347 * 375649793 * 2^31 -.word 3263539969 // zeta^283 * (q^(-1) mod 2^32) * 2^31 = 28678040^283 * 375649793 * 2^31 -.word 34238409 // zeta^470 * 2^31 = 28678040^470 * 2^31 -.word 7278675 // zeta^342 * 2^31 = 28678040^342 * 2^31 -.word 26316985 // zeta^406 * 2^31 = 28678040^406 * 2^31 -.word 1738533 // zeta^278 * 2^31 = 28678040^278 * 2^31 -.word 1976527415 // zeta^470 * (q^(-1) mod 2^32) * 2^31 = 28678040^470 * 375649793 * 2^31 -.word 3553111469 // zeta^342 * (q^(-1) mod 2^32) * 2^31 = 28678040^342 * 375649793 * 2^31 -.word 1070704967 // zeta^406 * (q^(-1) mod 2^32) * 2^31 = 28678040^406 * 375649793 * 2^31 -.word 
280554203 // zeta^278 * (q^(-1) mod 2^32) * 2^31 = 28678040^278 * 375649793 * 2^31 -.word 29493541 // zeta^491 * 2^31 = 28678040^491 * 2^31 -.word 46179537 // zeta^427 * 2^31 = 28678040^427 * 2^31 -.word 61070425 // zeta^459 * 2^31 = 28678040^459 * 2^31 -.word 47641435 // zeta^395 * 2^31 = 28678040^395 * 2^31 -.word 3525667035 // zeta^491 * (q^(-1) mod 2^32) * 2^31 = 28678040^491 * 375649793 * 2^31 -.word 738952495 // zeta^427 * (q^(-1) mod 2^32) * 2^31 = 28678040^427 * 375649793 * 2^31 -.word 2855509415 // zeta^459 * (q^(-1) mod 2^32) * 2^31 = 28678040^459 * 375649793 * 2^31 -.word 2166266533 // zeta^395 * (q^(-1) mod 2^32) * 2^31 = 28678040^395 * 375649793 * 2^31 -.word 8700655 // zeta^363 * 2^31 = 28678040^363 * 2^31 -.word 49217369 // zeta^299 * 2^31 = 28678040^299 * 2^31 -.word 14037329 // zeta^331 * 2^31 = 28678040^331 * 2^31 -.word 57068693 // zeta^267 * 2^31 = 28678040^267 * 2^31 -.word 2143064849 // zeta^363 * (q^(-1) mod 2^32) * 2^31 = 28678040^363 * 375649793 * 2^31 -.word 3997596327 // zeta^299 * (q^(-1) mod 2^32) * 2^31 = 28678040^299 * 375649793 * 2^31 -.word 594737327 // zeta^331 * (q^(-1) mod 2^32) * 2^31 = 28678040^331 * 375649793 * 2^31 -.word 1214449003 // zeta^267 * (q^(-1) mod 2^32) * 2^31 = 28678040^267 * 375649793 * 2^31 -.word 5988919 // zeta^486 * 2^31 = 28678040^486 * 2^31 -.word 27781261 // zeta^358 * 2^31 = 28678040^358 * 2^31 -.word 33650523 // zeta^422 * 2^31 = 28678040^422 * 2^31 -.word 40314383 // zeta^294 * 2^31 = 28678040^294 * 2^31 -.word 2046739401 // zeta^486 * (q^(-1) mod 2^32) * 2^31 = 28678040^486 * 375649793 * 2^31 -.word 2556008819 // zeta^358 * (q^(-1) mod 2^32) * 2^31 = 28678040^358 * 375649793 * 2^31 -.word 2602309285 // zeta^422 * (q^(-1) mod 2^32) * 2^31 = 28678040^422 * 375649793 * 2^31 -.word 3711528945 // zeta^294 * (q^(-1) mod 2^32) * 2^31 = 28678040^294 * 375649793 * 2^31 -.word 25356533 // zeta^499 * 2^31 = 28678040^499 * 2^31 -.word 59712043 // zeta^435 * 2^31 = 28678040^435 * 2^31 -.word 59431885 // zeta^467 * 
2^31 = 28678040^467 * 2^31 -.word 42783775 // zeta^403 * 2^31 = 28678040^403 * 2^31 -.word 232958219 // zeta^499 * (q^(-1) mod 2^32) * 2^31 = 28678040^499 * 375649793 * 2^31 -.word 2298121173 // zeta^435 * (q^(-1) mod 2^32) * 2^31 = 28678040^435 * 375649793 * 2^31 -.word 4009174579 // zeta^467 * (q^(-1) mod 2^32) * 2^31 = 28678040^467 * 375649793 * 2^31 -.word 4154483169 // zeta^403 * (q^(-1) mod 2^32) * 2^31 = 28678040^403 * 375649793 * 2^31 -.word 15118727 // zeta^371 * 2^31 = 28678040^371 * 2^31 -.word 16104593 // zeta^307 * 2^31 = 28678040^307 * 2^31 -.word 66551101 // zeta^339 * 2^31 = 28678040^339 * 2^31 -.word 27099659 // zeta^275 * 2^31 = 28678040^275 * 2^31 -.word 256676985 // zeta^371 * (q^(-1) mod 2^32) * 2^31 = 28678040^371 * 375649793 * 2^31 -.word 2042883439 // zeta^307 * (q^(-1) mod 2^32) * 2^31 = 28678040^307 * 375649793 * 2^31 -.word 2098783427 // zeta^339 * (q^(-1) mod 2^32) * 2^31 = 28678040^339 * 375649793 * 2^31 -.word 1730866165 // zeta^275 * (q^(-1) mod 2^32) * 2^31 = 28678040^275 * 375649793 * 2^31 -.word 52622279 // zeta^454 * 2^31 = 28678040^454 * 2^31 -.word 48542309 // zeta^326 * 2^31 = 28678040^326 * 2^31 -.word 28412919 // zeta^390 * 2^31 = 28678040^390 * 2^31 -.word 61490063 // zeta^262 * 2^31 = 28678040^262 * 2^31 -.word 111596089 // zeta^454 * (q^(-1) mod 2^32) * 2^31 = 28678040^454 * 375649793 * 2^31 -.word 2392801179 // zeta^326 * (q^(-1) mod 2^32) * 2^31 = 28678040^326 * 375649793 * 2^31 -.word 122296841 // zeta^390 * (q^(-1) mod 2^32) * 2^31 = 28678040^390 * 375649793 * 2^31 -.word 4112339569 // zeta^262 * (q^(-1) mod 2^32) * 2^31 = 28678040^262 * 375649793 * 2^31 -.word 17544659 // zeta^483 * 2^31 = 28678040^483 * 2^31 -.word 26761761 // zeta^419 * 2^31 = 28678040^419 * 2^31 -.word 28138345 // zeta^451 * 2^31 = 28678040^451 * 2^31 -.word 6006005 // zeta^387 * 2^31 = 28678040^387 * 2^31 -.word 1268942893 // zeta^483 * (q^(-1) mod 2^32) * 2^31 = 28678040^483 * 375649793 * 2^31 -.word 3876122591 // zeta^419 * (q^(-1) mod 2^32) * 
2^31 = 28678040^419 * 375649793 * 2^31 -.word 148946583 // zeta^451 * (q^(-1) mod 2^32) * 2^31 = 28678040^451 * 375649793 * 2^31 -.word 375516427 // zeta^387 * (q^(-1) mod 2^32) * 2^31 = 28678040^387 * 375649793 * 2^31 -.word 49338991 // zeta^355 * 2^31 = 28678040^355 * 2^31 -.word 59052279 // zeta^291 * 2^31 = 28678040^291 * 2^31 -.word 54131019 // zeta^323 * 2^31 = 28678040^323 * 2^31 -.word 49172137 // zeta^259 * 2^31 = 28678040^259 * 2^31 -.word 2285599633 // zeta^355 * (q^(-1) mod 2^32) * 2^31 = 28678040^355 * 375649793 * 2^31 -.word 1420334345 // zeta^291 * (q^(-1) mod 2^32) * 2^31 = 28678040^291 * 375649793 * 2^31 -.word 1832318133 // zeta^323 * (q^(-1) mod 2^32) * 2^31 = 28678040^323 * 375649793 * 2^31 -.word 203443031 // zeta^259 * (q^(-1) mod 2^32) * 2^31 = 28678040^259 * 375649793 * 2^31 -.word 41164657 // zeta^506 * 2^31 = 28678040^506 * 2^31 -.word 23553921 // zeta^378 * 2^31 = 28678040^378 * 2^31 -.word 51075303 // zeta^442 * 2^31 = 28678040^442 * 2^31 -.word 11244857 // zeta^314 * 2^31 = 28678040^314 * 2^31 -.word 2292337295 // zeta^506 * (q^(-1) mod 2^32) * 2^31 = 28678040^506 * 375649793 * 2^31 -.word 2218762879 // zeta^378 * (q^(-1) mod 2^32) * 2^31 = 28678040^378 * 375649793 * 2^31 -.word 3660688665 // zeta^442 * (q^(-1) mod 2^32) * 2^31 = 28678040^442 * 375649793 * 2^31 -.word 2196022471 // zeta^314 * (q^(-1) mod 2^32) * 2^31 = 28678040^314 * 375649793 * 2^31 -.word 27161421 // zeta^509 * 2^31 = 28678040^509 * 2^31 -.word 12259351 // zeta^445 * 2^31 = 28678040^445 * 2^31 -.word 42183787 // zeta^477 * 2^31 = 28678040^477 * 2^31 -.word 260949 // zeta^413 * 2^31 = 28678040^413 * 2^31 -.word 2261683891 // zeta^509 * (q^(-1) mod 2^32) * 2^31 = 28678040^509 * 375649793 * 2^31 -.word 183096809 // zeta^445 * (q^(-1) mod 2^32) * 2^31 = 28678040^445 * 375649793 * 2^31 -.word 2523693461 // zeta^477 * (q^(-1) mod 2^32) * 2^31 = 28678040^477 * 375649793 * 2^31 -.word 2895730347 // zeta^413 * (q^(-1) mod 2^32) * 2^31 = 28678040^413 * 375649793 * 2^31 -.word 
49379395 // zeta^381 * 2^31 = 28678040^381 * 2^31 -.word 45318697 // zeta^317 * 2^31 = 28678040^317 * 2^31 -.word 65417737 // zeta^349 * 2^31 = 28678040^349 * 2^31 -.word 60522221 // zeta^285 * 2^31 = 28678040^285 * 2^31 -.word 2945787325 // zeta^381 * (q^(-1) mod 2^32) * 2^31 = 28678040^381 * 375649793 * 2^31 -.word 2724075479 // zeta^317 * (q^(-1) mod 2^32) * 2^31 = 28678040^317 * 375649793 * 2^31 -.word 2827626487 // zeta^349 * (q^(-1) mod 2^32) * 2^31 = 28678040^349 * 375649793 * 2^31 -.word 482722579 // zeta^285 * (q^(-1) mod 2^32) * 2^31 = 28678040^285 * 375649793 * 2^31 -.word 3629237 // zeta^474 * 2^31 = 28678040^474 * 2^31 -.word 60326323 // zeta^346 * 2^31 = 28678040^346 * 2^31 -.word 30569867 // zeta^410 * 2^31 = 28678040^410 * 2^31 -.word 31921231 // zeta^282 * 2^31 = 28678040^282 * 2^31 -.word 3571167563 // zeta^474 * (q^(-1) mod 2^32) * 2^31 = 28678040^474 * 375649793 * 2^31 -.word 3851189325 // zeta^346 * (q^(-1) mod 2^32) * 2^31 = 28678040^346 * 375649793 * 2^31 -.word 1517877365 // zeta^410 * (q^(-1) mod 2^32) * 2^31 = 28678040^410 * 375649793 * 2^31 -.word 1275593137 // zeta^282 * (q^(-1) mod 2^32) * 2^31 = 28678040^282 * 375649793 * 2^31 -.word 51477925 // zeta^493 * 2^31 = 28678040^493 * 2^31 -.word 23177153 // zeta^429 * 2^31 = 28678040^429 * 2^31 -.word 42516129 // zeta^461 * 2^31 = 28678040^461 * 2^31 -.word 23261199 // zeta^397 * 2^31 = 28678040^397 * 2^31 -.word 1768092763 // zeta^493 * (q^(-1) mod 2^32) * 2^31 = 28678040^493 * 375649793 * 2^31 -.word 2982666815 // zeta^429 * (q^(-1) mod 2^32) * 2^31 = 28678040^429 * 375649793 * 2^31 -.word 134581087 // zeta^461 * (q^(-1) mod 2^32) * 2^31 = 28678040^461 * 375649793 * 2^31 -.word 3424757233 // zeta^397 * (q^(-1) mod 2^32) * 2^31 = 28678040^397 * 375649793 * 2^31 -.word 50523083 // zeta^365 * 2^31 = 28678040^365 * 2^31 -.word 29024109 // zeta^301 * 2^31 = 28678040^301 * 2^31 -.word 62634975 // zeta^333 * 2^31 = 28678040^333 * 2^31 -.word 5116371 // zeta^269 * 2^31 = 28678040^269 * 2^31 -.word 
2363949621 // zeta^365 * (q^(-1) mod 2^32) * 2^31 = 28678040^365 * 375649793 * 2^31 -.word 2792055443 // zeta^301 * (q^(-1) mod 2^32) * 2^31 = 28678040^301 * 375649793 * 2^31 -.word 3296655905 // zeta^333 * (q^(-1) mod 2^32) * 2^31 = 28678040^333 * 375649793 * 2^31 -.word 4093127725 // zeta^269 * (q^(-1) mod 2^32) * 2^31 = 28678040^269 * 375649793 * 2^31 -.word 55626043 // zeta^490 * 2^31 = 28678040^490 * 2^31 -.word 15630981 // zeta^362 * 2^31 = 28678040^362 * 2^31 -.word 43717491 // zeta^426 * 2^31 = 28678040^426 * 2^31 -.word 14342369 // zeta^298 * 2^31 = 28678040^298 * 2^31 -.word 2004845765 // zeta^490 * (q^(-1) mod 2^32) * 2^31 = 28678040^490 * 375649793 * 2^31 -.word 3862343547 // zeta^362 * (q^(-1) mod 2^32) * 2^31 = 28678040^362 * 375649793 * 2^31 -.word 2436590221 // zeta^426 * (q^(-1) mod 2^32) * 2^31 = 28678040^426 * 375649793 * 2^31 -.word 2109337887 // zeta^298 * (q^(-1) mod 2^32) * 2^31 = 28678040^298 * 375649793 * 2^31 -.word 6776583 // zeta^501 * 2^31 = 28678040^501 * 2^31 -.word 33530533 // zeta^437 * 2^31 = 28678040^437 * 2^31 -.word 43598203 // zeta^469 * 2^31 = 28678040^469 * 2^31 -.word 59373651 // zeta^405 * 2^31 = 28678040^405 * 2^31 -.word 820174585 // zeta^501 * (q^(-1) mod 2^32) * 2^31 = 28678040^501 * 375649793 * 2^31 -.word 1139199835 // zeta^437 * (q^(-1) mod 2^32) * 2^31 = 28678040^437 * 375649793 * 2^31 -.word 3555298437 // zeta^469 * (q^(-1) mod 2^32) * 2^31 = 28678040^469 * 375649793 * 2^31 -.word 1035814317 // zeta^405 * (q^(-1) mod 2^32) * 2^31 = 28678040^405 * 375649793 * 2^31 -.word 37946425 // zeta^373 * 2^31 = 28678040^373 * 2^31 -.word 47668559 // zeta^309 * 2^31 = 28678040^309 * 2^31 -.word 10775673 // zeta^341 * 2^31 = 28678040^341 * 2^31 -.word 3826249 // zeta^277 * 2^31 = 28678040^277 * 2^31 -.word 262354375 // zeta^373 * (q^(-1) mod 2^32) * 2^31 = 28678040^373 * 375649793 * 2^31 -.word 703707313 // zeta^309 * (q^(-1) mod 2^32) * 2^31 = 28678040^309 * 375649793 * 2^31 -.word 2790542727 // zeta^341 * (q^(-1) mod 2^32) * 
2^31 = 28678040^341 * 375649793 * 2^31 -.word 2635626423 // zeta^277 * (q^(-1) mod 2^32) * 2^31 = 28678040^277 * 375649793 * 2^31 -.word 53733071 // zeta^458 * 2^31 = 28678040^458 * 2^31 -.word 10734019 // zeta^330 * 2^31 = 28678040^330 * 2^31 -.word 25306471 // zeta^394 * 2^31 = 28678040^394 * 2^31 -.word 54139625 // zeta^266 * 2^31 = 28678040^266 * 2^31 -.word 284438321 // zeta^458 * (q^(-1) mod 2^32) * 2^31 = 28678040^458 * 375649793 * 2^31 -.word 3541161021 // zeta^330 * (q^(-1) mod 2^32) * 2^31 = 28678040^330 * 375649793 * 2^31 -.word 2646073497 // zeta^394 * (q^(-1) mod 2^32) * 2^31 = 28678040^394 * 375649793 * 2^31 -.word 3100573463 // zeta^266 * (q^(-1) mod 2^32) * 2^31 = 28678040^266 * 375649793 * 2^31 -.word 1468391 // zeta^485 * 2^31 = 28678040^485 * 2^31 -.word 4426959 // zeta^421 * 2^31 = 28678040^421 * 2^31 -.word 42735737 // zeta^453 * 2^31 = 28678040^453 * 2^31 -.word 38665093 // zeta^389 * 2^31 = 28678040^389 * 2^31 -.word 1874632217 // zeta^485 * (q^(-1) mod 2^32) * 2^31 = 28678040^485 * 375649793 * 2^31 -.word 3630205233 // zeta^421 * (q^(-1) mod 2^32) * 2^31 = 28678040^421 * 375649793 * 2^31 -.word 2166661511 // zeta^453 * (q^(-1) mod 2^32) * 2^31 = 28678040^453 * 375649793 * 2^31 -.word 1536243323 // zeta^389 * (q^(-1) mod 2^32) * 2^31 = 28678040^389 * 375649793 * 2^31 -.word 33133879 // zeta^357 * 2^31 = 28678040^357 * 2^31 -.word 7139481 // zeta^293 * 2^31 = 28678040^293 * 2^31 -.word 8438111 // zeta^325 * 2^31 = 28678040^325 * 2^31 -.word 50341189 // zeta^261 * 2^31 = 28678040^261 * 2^31 -.word 3126759625 // zeta^357 * (q^(-1) mod 2^32) * 2^31 = 28678040^357 * 375649793 * 2^31 -.word 523569511 // zeta^293 * (q^(-1) mod 2^32) * 2^31 = 28678040^293 * 375649793 * 2^31 -.word 1408300193 // zeta^325 * (q^(-1) mod 2^32) * 2^31 = 28678040^325 * 375649793 * 2^31 -.word 2172685499 // zeta^261 * (q^(-1) mod 2^32) * 2^31 = 28678040^261 * 375649793 * 2^31 -.word 47558821 // zeta^498 * 2^31 = 28678040^498 * 2^31 -.word 33268441 // zeta^370 * 2^31 = 
28678040^370 * 2^31 -.word 63536237 // zeta^434 * 2^31 = 28678040^434 * 2^31 -.word 26272521 // zeta^306 * 2^31 = 28678040^306 * 2^31 -.word 664584539 // zeta^498 * (q^(-1) mod 2^32) * 2^31 = 28678040^498 * 375649793 * 2^31 -.word 2409420583 // zeta^370 * (q^(-1) mod 2^32) * 2^31 = 28678040^370 * 375649793 * 2^31 -.word 3799958931 // zeta^434 * (q^(-1) mod 2^32) * 2^31 = 28678040^434 * 375649793 * 2^31 -.word 835286775 // zeta^306 * (q^(-1) mod 2^32) * 2^31 = 28678040^306 * 375649793 * 2^31 -.word 1854317 // zeta^505 * 2^31 = 28678040^505 * 2^31 -.word 2223865 // zeta^441 * 2^31 = 28678040^441 * 2^31 -.word 22962475 // zeta^473 * 2^31 = 28678040^473 * 2^31 -.word 36888515 // zeta^409 * 2^31 = 28678040^409 * 2^31 -.word 1178728083 // zeta^505 * (q^(-1) mod 2^32) * 2^31 = 28678040^505 * 375649793 * 2^31 -.word 2481965831 // zeta^441 * (q^(-1) mod 2^32) * 2^31 = 28678040^441 * 375649793 * 2^31 -.word 128011477 // zeta^473 * (q^(-1) mod 2^32) * 2^31 = 28678040^473 * 375649793 * 2^31 -.word 3495870013 // zeta^409 * (q^(-1) mod 2^32) * 2^31 = 28678040^409 * 375649793 * 2^31 -.word 59868297 // zeta^377 * 2^31 = 28678040^377 * 2^31 -.word 15191207 // zeta^313 * 2^31 = 28678040^313 * 2^31 -.word 59108143 // zeta^345 * 2^31 = 28678040^345 * 2^31 -.word 4355773 // zeta^281 * 2^31 = 28678040^281 * 2^31 -.word 538432887 // zeta^377 * (q^(-1) mod 2^32) * 2^31 = 28678040^377 * 375649793 * 2^31 -.word 3252336985 // zeta^313 * (q^(-1) mod 2^32) * 2^31 = 28678040^313 * 375649793 * 2^31 -.word 1330506449 // zeta^345 * (q^(-1) mod 2^32) * 2^31 = 28678040^345 * 375649793 * 2^31 -.word 4169984835 // zeta^281 * (q^(-1) mod 2^32) * 2^31 = 28678040^281 * 375649793 * 2^31 -.word 27411989 // zeta^466 * 2^31 = 28678040^466 * 2^31 -.word 52176833 // zeta^338 * 2^31 = 28678040^338 * 2^31 -.word 52660121 // zeta^402 * 2^31 = 28678040^402 * 2^31 -.word 23140553 // zeta^274 * 2^31 = 28678040^274 * 2^31 -.word 652643307 // zeta^466 * (q^(-1) mod 2^32) * 2^31 = 28678040^466 * 375649793 * 2^31 -.word 
4178403903 // zeta^338 * (q^(-1) mod 2^32) * 2^31 = 28678040^338 * 375649793 * 2^31 -.word 1113879143 // zeta^402 * (q^(-1) mod 2^32) * 2^31 = 28678040^402 * 375649793 * 2^31 -.word 3574776119 // zeta^274 * (q^(-1) mod 2^32) * 2^31 = 28678040^274 * 375649793 * 2^31 -.word 50275685 // zeta^489 * 2^31 = 28678040^489 * 2^31 -.word 12903773 // zeta^425 * 2^31 = 28678040^425 * 2^31 -.word 25228433 // zeta^457 * 2^31 = 28678040^457 * 2^31 -.word 55395235 // zeta^393 * 2^31 = 28678040^393 * 2^31 -.word 2869087387 // zeta^489 * (q^(-1) mod 2^32) * 2^31 = 28678040^489 * 375649793 * 2^31 -.word 433896611 // zeta^425 * (q^(-1) mod 2^32) * 2^31 = 28678040^425 * 375649793 * 2^31 -.word 157857135 // zeta^457 * (q^(-1) mod 2^32) * 2^31 = 28678040^457 * 375649793 * 2^31 -.word 2477464157 // zeta^393 * (q^(-1) mod 2^32) * 2^31 = 28678040^393 * 375649793 * 2^31 -.word 3868449 // zeta^361 * 2^31 = 28678040^361 * 2^31 -.word 66432231 // zeta^297 * 2^31 = 28678040^297 * 2^31 -.word 31236859 // zeta^329 * 2^31 = 28678040^329 * 2^31 -.word 13658415 // zeta^265 * 2^31 = 28678040^265 * 2^31 -.word 2938651359 // zeta^361 * (q^(-1) mod 2^32) * 2^31 = 28678040^361 * 375649793 * 2^31 -.word 814700825 // zeta^297 * (q^(-1) mod 2^32) * 2^31 = 28678040^297 * 375649793 * 2^31 -.word 1618291461 // zeta^329 * (q^(-1) mod 2^32) * 2^31 = 28678040^329 * 375649793 * 2^31 -.word 49245393 // zeta^265 * (q^(-1) mod 2^32) * 2^31 = 28678040^265 * 375649793 * 2^31 -.word 34409967 // zeta^482 * 2^31 = 28678040^482 * 2^31 -.word 12619783 // zeta^354 * 2^31 = 28678040^354 * 2^31 -.word 54561811 // zeta^418 * 2^31 = 28678040^418 * 2^31 -.word 61632377 // zeta^290 * 2^31 = 28678040^290 * 2^31 -.word 2233616401 // zeta^482 * (q^(-1) mod 2^32) * 2^31 = 28678040^482 * 375649793 * 2^31 -.word 2820912633 // zeta^354 * (q^(-1) mod 2^32) * 2^31 = 28678040^354 * 375649793 * 2^31 -.word 684470765 // zeta^418 * (q^(-1) mod 2^32) * 2^31 = 28678040^418 * 375649793 * 2^31 -.word 3345631879 // zeta^290 * (q^(-1) mod 2^32) * 
2^31 = 28678040^290 * 375649793 * 2^31 -.word 7605279 // zeta^497 * 2^31 = 28678040^497 * 2^31 -.word 58319315 // zeta^433 * 2^31 = 28678040^433 * 2^31 -.word 16342937 // zeta^465 * 2^31 = 28678040^465 * 2^31 -.word 48148431 // zeta^401 * 2^31 = 28678040^401 * 2^31 -.word 568928737 // zeta^497 * (q^(-1) mod 2^32) * 2^31 = 28678040^497 * 375649793 * 2^31 -.word 1726766125 // zeta^433 * (q^(-1) mod 2^32) * 2^31 = 28678040^433 * 375649793 * 2^31 -.word 1056873063 // zeta^465 * (q^(-1) mod 2^32) * 2^31 = 28678040^465 * 375649793 * 2^31 -.word 958621233 // zeta^401 * (q^(-1) mod 2^32) * 2^31 = 28678040^401 * 375649793 * 2^31 -.word 62377755 // zeta^369 * 2^31 = 28678040^369 * 2^31 -.word 35459369 // zeta^305 * 2^31 = 28678040^305 * 2^31 -.word 27513701 // zeta^337 * 2^31 = 28678040^337 * 2^31 -.word 18346679 // zeta^273 * 2^31 = 28678040^273 * 2^31 -.word 4057153253 // zeta^369 * (q^(-1) mod 2^32) * 2^31 = 28678040^369 * 375649793 * 2^31 -.word 3867838679 // zeta^305 * (q^(-1) mod 2^32) * 2^31 = 28678040^305 * 375649793 * 2^31 -.word 589962907 // zeta^337 * (q^(-1) mod 2^32) * 2^31 = 28678040^337 * 375649793 * 2^31 -.word 1692873545 // zeta^273 * (q^(-1) mod 2^32) * 2^31 = 28678040^273 * 375649793 * 2^31 -.word 1824951 // zeta^450 * 2^31 = 28678040^450 * 2^31 -.word 40410247 // zeta^322 * 2^31 = 28678040^322 * 2^31 -.word 25935987 // zeta^386 * 2^31 = 28678040^386 * 2^31 -.word 53409853 // zeta^258 * 2^31 = 28678040^258 * 2^31 -.word 3034533193 // zeta^450 * (q^(-1) mod 2^32) * 2^31 = 28678040^450 * 375649793 * 2^31 -.word 1425582457 // zeta^322 * (q^(-1) mod 2^32) * 2^31 = 28678040^322 * 375649793 * 2^31 -.word 1695333773 // zeta^386 * (q^(-1) mod 2^32) * 2^31 = 28678040^386 * 375649793 * 2^31 -.word 2628741571 // zeta^258 * (q^(-1) mod 2^32) * 2^31 = 28678040^258 * 375649793 * 2^31 -.word 44896477 // zeta^481 * 2^31 = 28678040^481 * 2^31 -.word 66621379 // zeta^417 * 2^31 = 28678040^417 * 2^31 -.word 35702907 // zeta^449 * 2^31 = 28678040^449 * 2^31 -.word 44158149 // 
zeta^385 * 2^31 = 28678040^385 * 2^31 -.word 732401955 // zeta^481 * (q^(-1) mod 2^32) * 2^31 = 28678040^481 * 375649793 * 2^31 -.word 3346599485 // zeta^417 * (q^(-1) mod 2^32) * 2^31 = 28678040^417 * 375649793 * 2^31 -.word 1671955845 // zeta^449 * (q^(-1) mod 2^32) * 2^31 = 28678040^449 * 375649793 * 2^31 -.word 1684661563 // zeta^385 * (q^(-1) mod 2^32) * 2^31 = 28678040^385 * 375649793 * 2^31 -.word 32881793 // zeta^353 * 2^31 = 28678040^353 * 2^31 -.word 18033685 // zeta^289 * 2^31 = 28678040^289 * 2^31 -.word 29367795 // zeta^321 * 2^31 = 28678040^321 * 2^31 -.word 16787671 // zeta^257 * 2^31 = 28678040^257 * 2^31 -.word 3741535615 // zeta^353 * (q^(-1) mod 2^32) * 2^31 = 28678040^353 * 375649793 * 2^31 -.word 3094455787 // zeta^289 * (q^(-1) mod 2^32) * 2^31 = 28678040^289 * 375649793 * 2^31 -.word 3934216205 // zeta^321 * (q^(-1) mod 2^32) * 2^31 = 28678040^321 * 375649793 * 2^31 -.word 2459712809 // zeta^257 * (q^(-1) mod 2^32) * 2^31 = 28678040^257 * 375649793 * 2^31 -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 
57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 
62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 
21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 
12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 
62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 28678040^272 * 375649793 * 2^31 -.word 17514581 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 4460971 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_u32_33556993_28678040, %function -.global inv_ntt_u32_33556993_28678040 -inv_ntt_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, 
roots_addr -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r4, #:lower16:modulus_inv -movt r4, #:upper16:modulus_inv -vldrw.s32 Q4, [r0, #0] -vldrw.s32 Q5, [r0, #16] -vsub.s32 Q6, Q4, Q5 -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vqrdmulh.s32 Q5, Q6, Q5 -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
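The three-instruction sequence that recurs throughout this file (`vqrdmulh.s32` on the twiddle, `vmul.u32` on the "twisted" twiddle held next to it in the table, then `vqrdmlah.s32` with the modulus in `r12`) is a rounding Montgomery-style modular multiplication. A minimal scalar Python model, assuming the standard VQRDMULH round-to-nearest semantics and ignoring saturation; `montmul` is an illustrative name, not a symbol from this file:

```python
# Scalar model of the MVE sequence used in this file:
#   vqrdmulh.s32  hi, a, b        ; hi  = round(2*a*b / 2^32)
#   vmul.u32      lo, a, b_tw     ; lo  = a * (b * -q^-1) mod 2^32
#   vqrdmlah.s32  hi, lo, q       ; hi += round(2*lo*q / 2^32)
# with q = 33556993 (r12) and modulus_inv = 3919317503 = -q^-1 mod 2^32 (r4).
# Sketch only; the real code operates on 4-lane vectors.

Q = 33556993                 # .equ modulus
QINV = 375649793             # q^-1 mod 2^32 (as in the table comments)
MODULUS_INV = 3919317503     # .equ modulus_inv: -q^-1 mod 2^32

def _signed32(x):
    """Interpret x mod 2^32 as a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def vqrdmulh(x, y):
    """VQRDMULH.S32 result: round(2*x*y / 2^32), saturation ignored."""
    return (2 * x * y + (1 << 31)) >> 32

def montmul(a, b):
    """Model of the three-instruction sequence: returns r with r*2^31 = a*b (mod q)."""
    b_tw = _signed32(b * MODULUS_INV)   # precomputed twisted companion constant
    hi = vqrdmulh(a, b)                 # vqrdmulh.s32
    lo = _signed32(a * b_tw)            # vmul.u32 (low 32 bits)
    return hi + vqrdmulh(lo, Q)         # vqrdmlah.s32 accumulate
```

Since `lo*q = -a*b (mod 2^32)`, the two rounded high halves sum to `(a*b + lo*q) / 2^31`, so the result is congruent to `a*b*2^-31 mod q`; the test values below stay away from the rounding-tie case `2*lo*q = 2^31 (mod 2^32)`.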
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vldrw.s32 Q4, [r0, #(64+0)] -vmul.u32 Q6, Q6, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q6, Q4, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q4, Q4, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q6, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! -vmul.u32 Q6, Q6, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q6, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q6, Q1, Q2 -vsub.s32 Q3, Q4, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q4, Q7 -vqrdmlah.s32 Q6, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q4, Q5, Q6 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q6 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q4, Q7 -vldrw.s32 Q6, [r0, #(64+0)] -vmul.u32 Q4, Q4, Q5 -vldrw.s32 Q5, [r0, #(64+16)] -vqrdmlah.s32 Q3, Q4, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vsub.s32 Q4, Q6, Q5 -vst43.s32 {Q0,Q1,Q2,Q3}, [r0] -vadd.s32 Q6, Q6, Q5 -vldrw.s32 Q5, [r11, #32] -vmul.u32 Q7, Q5, r4 -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q4, Q5 -vst41.s32 {Q0,Q1,Q2,Q3}, [r0]! 
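The `vst40.s32`/`vst41.s32`/`vst42.s32`/`vst43.s32 {Q0,Q1,Q2,Q3}, [r0]` groups above cooperate to write the four result vectors interleaved: element `i` of `Qj` lands at word offset `4*i + j`, and the variant carrying the writeback `!` advances `r0` past the 64-byte block. A small Python model of the combined memory effect, under the assumption that the group completes (the beat-wise execution of the individual instructions is not modelled):

```python
# Model of a complete VST4{0,1,2,3}.32 {Q0-Q3}, [r0] store group:
# the four instructions together emit 16 words so that element i of
# register Qj lands at word offset 4*i + j (an interleaving store).

def vst4x(q0, q1, q2, q3):
    """Return the 16-word memory image produced by a full vst40..vst43 group."""
    mem = [0] * 16
    for i in range(4):                      # element index within each vector
        for j, q in enumerate((q0, q1, q2, q3)):
            mem[4 * i + j] = q[i]           # structure i = (Q0[i],Q1[i],Q2[i],Q3[i])
    return mem
```

For example, `vst4x([0,1,2,3], [10,11,12,13], [20,21,22,23], [30,31,32,33])` starts with the structure `0, 10, 20, 30` rather than vector `Q0` contiguously, which is what lets the code transpose between layers without separate permute instructions.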
-vmul.u32 Q4, Q4, Q7 -vldrw.s32 Q7, [r0, #32] -vqrdmlah.s32 Q5, Q4, r12 -vldrw.s32 Q0, [r0, #48] -vsub.s32 Q1, Q7, Q0 -vldrw.s32 Q2, [r11, #64] -vadd.s32 Q7, Q7, Q0 -vqrdmulh.s32 Q4, Q1, Q2 -vsub.s32 Q3, Q6, Q7 -vldrw.s32 Q2, [r11, #80] -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q6, Q7 -vqrdmlah.s32 Q4, Q1, r12 -vldrw.s32 Q7, [r11], #96 -vsub.s32 Q6, Q5, Q4 -vqrdmulh.s32 Q2, Q3, Q7 -vadd.s32 Q1, Q5, Q4 -vldrw.s32 Q5, [r11, #-80] -vmul.u32 Q3, Q3, Q5 -vqrdmlah.s32 Q2, Q3, r12 -vqrdmulh.s32 Q3, Q6, Q7 -vmul.u32 Q6, Q6, Q5 -vqrdmlah.s32 Q3, Q6, r12 -vst40.s32 {Q0,Q1,Q2,Q3}, [r0] -vst41.s32 {Q0,Q1,Q2,Q3}, [r0] -vst42.s32 {Q0,Q1,Q2,Q3}, [r0] -vst43.s32 {Q0,Q1,Q2,Q3}, [r0]! -sub r0, r0, #1024 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #16] -vsub.s32 Q0, Q2, Q3 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #32] -vadd.s32 Q2, Q2, Q3 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #48] -vqrdmulh.s32 Q3, Q0, r8 -vsub.s32 Q1, Q4, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q4, Q4, Q5 -vqrdmlah.s32 Q3, Q0, r12 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q2, Q4 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q5, Q1, r12 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #64] -vqrdmulh.s32 Q4, Q0, r10 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #80] -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q4, Q0, r12 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[16]: Already loaded as Q6 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #96] -vadd.s32 Q6, Q6, Q7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q7, 
Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q6, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #128] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #144] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[32]: Already loaded as Q4 -// input[36]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #160] -vadd.s32 Q4, Q4, Q5 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #176] -vqrdmulh.s32 Q5, Q0, r8 -vsub.s32 Q1, Q2, Q6 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q6 -vqrdmlah.s32 Q5, Q0, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q6, Q1, r12 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #192] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q5, Q6 -vmul.u32 Q0, Q0, r9 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #208] -vadd.s32 Q5, Q5, Q6 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q6, Q1, r10 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q5, [r0,#(144)] -// Release input[36] from Q5 -vqrdmlah.s32 Q6, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q3 -// input[52]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vadd.s32 
Q3, Q3, Q7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #240] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q6, [r0,#(176)] -// Release input[44] from Q6 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #256] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q6 -vldrw.u32 Q6, [r0, #272] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(208)] -// Release input[52] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q5 -// input[68]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #288] -vadd.s32 Q5, Q5, Q6 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #304] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(240)] -// Release input[60] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(256)] -// Release input[64] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(272)] -// Release input[68] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[80]: Already loaded as Q4 -// input[84]: Already loaded as 
Q7 -vsub.s32 Q0, Q4, Q7 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #352] -vadd.s32 Q4, Q4, Q7 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #368] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q6 -vldrw.u32 Q6, [r0, #400] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[96]: Already loaded as Q3 -// input[100]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #416] -vadd.s32 Q3, Q3, Q6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #448] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(400)] -// Release input[100] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 
-ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vadd.s32 Q5, Q5, Q7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #496] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #-496] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #-480] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(448)] -// Release input[112] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(464)] -// Release input[116] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q4 -// input[132]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vadd.s32 Q4, Q4, Q6 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #-448] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #-416] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-496)] -// Release input[128] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, 
[r14,#(-480)]
-// Release input[132] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[144]: Already loaded as Q3
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #-400]
-vadd.s32 Q3, Q3, Q7
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #-368]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q6
-vldrw.u32 Q6, [r14, #-352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[160]: Already loaded as Q5
-// input[164]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q5, Q5, Q6
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #-288]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-352)]
-// Release input[164] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q4
-// input[180]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #-272]
-vadd.s32 Q4, Q4, Q7
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[196]: Load as Q6
-vldrw.u32 Q6, [r14, #-224]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-288)]
-// Release input[180] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q3
-// input[196]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #-208]
-vadd.s32 Q3, Q3, Q6
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #-192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-256)]
-// Release input[188] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #-176]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-224)]
-// Release input[196] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[208]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #-144]
-vadd.s32 Q5, Q5, Q7
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #-128]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #-112]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #-96]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-176)]
-// Release input[208] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[224]: Already loaded as Q4
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q4, Q4, Q6
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #-64]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[244]: Load as Q7
-vldrw.u32 Q7, [r14, #-32]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-96)]
-// Release input[228] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q3
-// input[244]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #-16]
-vadd.s32 Q3, Q3, Q7
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-64)]
-// Release input[236] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[0]: Load as Q5
-vldrw.u32 Q5, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[16]: Load as Q6
-vldrw.u32 Q6, [r0, #64]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-32)]
-// Release input[244] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[0]: Already loaded as Q5
-// input[16]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vadd.s32 Q5, Q5, Q6
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #192]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(0)]
-// Release input[252] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #80]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(0)]
-// Release input[0] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(64)]
-// Release input[16] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[4]: Already loaded as Q4
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #144]
-vadd.s32 Q4, Q4, Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #208]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #96]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(144)]
-// Release input[36] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(80)]
-// Release input[20] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[8]: Already loaded as Q3
-// input[24]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #160]
-vadd.s32 Q3, Q3, Q6
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #224]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(208)]
-// Release input[52] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Load as Q5
-vldrw.u32 Q5, [r0, #48]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[28]: Load as Q7
-vldrw.u32 Q7, [r0, #112]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(96)]
-// Release input[24] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[12]: Already loaded as Q5
-// input[28]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vadd.s32 Q5, Q5, Q7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #240]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r0,#(224)]
-// Release input[56] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[64]: Load as Q4
-vldrw.u32 Q4, [r0, #256]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(48)]
-// Release input[12] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(112)]
-// Release input[28] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[64]: Already loaded as Q4
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #384]
-vadd.s32 Q4, Q4, Q6
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #448]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r0,#(240)]
-// Release input[60] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(256)]
-// Release input[64] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(320)]
-// Release input[80] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[68]: Already loaded as Q3
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #400]
-vadd.s32 Q3, Q3, Q7
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r0,#(448)]
-// Release input[112] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Load as Q5
-vldrw.u32 Q5, [r0, #288]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #352]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(400)]
-// Release input[100] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(336)]
-// Release input[84] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-// input[72]: Already loaded as Q5
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[104]: Load as Q2
-vldrw.u32 Q2, [r0, #416]
-vadd.s32 Q5, Q5, Q6
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #480]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #368]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-// Release input[104] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r0,#(288)]
-// Release input[72] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r0,#(352)]
-// Release input[88] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[76]: Already loaded as Q4
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[108]: Load as Q2
-vldrw.u32 Q2, [r0, #432]
-vadd.s32 Q4, Q4, Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #496]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q6
-vldrw.u32 Q6, [r14, #-432]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r0,#(432)]
-// Release input[108] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r0,#(368)]
-// Release input[92] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[128]: Already loaded as Q3
-// input[144]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #-368]
-vadd.s32 Q3, Q3, Q6
-// input[176]: Load as Q4
-vldrw.u32 Q4, [r14, #-304]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r0,#(496)]
-// Release input[124] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Load as Q5
-vldrw.u32 Q5, [r14, #-480]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[148]: Load as Q7
-vldrw.u32 Q7, [r14, #-416]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-432)]
-// Release input[144] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[132]: Already loaded as Q5
-// input[148]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #-352]
-vadd.s32 Q5, Q5, Q7
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-304)]
-// Release input[176] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #-464]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #-400]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-480)]
-// Release input[132] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-416)]
-// Release input[148] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-// input[136]: Already loaded as Q4
-// input[152]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-// input[168]: Load as Q2
-vldrw.u32 Q2, [r14, #-336]
-vadd.s32 Q4, Q4, Q6
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #-272]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q5
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q7
-vldrw.u32 Q7, [r14, #-384]
-vadd.s32 Q6, Q6, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-// Release input[168] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-400)]
-// Release input[152] from Q6
-vqrdmlah.s32 Q5, Q1, r12
-// input[140]: Already loaded as Q3
-// input[156]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[172]: Load as Q2
-vldrw.u32 Q2, [r14, #-320]
-vadd.s32 Q3, Q3, Q7
-// input[188]: Load as Q4
-vldrw.u32 Q4, [r14, #-256]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q5, [r14,#(-272)]
-// Release input[184] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[192]: Load as Q5
-vldrw.u32 Q5, [r14, #-240]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q4
-vmul.u32 Q0, Q0, r9
-// input[208]: Load as Q6
-vldrw.u32 Q6, [r14, #-176]
-vadd.s32 Q7, Q7, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-320)]
-// Release input[172] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-384)]
-// Release input[156] from Q7
-vqrdmlah.s32 Q4, Q1, r12
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Already loaded as Q5
-// input[208]: Already loaded as Q6
-vsub.s32 Q0, Q5, Q6
-// input[224]: Load as Q2
-vldrw.u32 Q2, [r14, #-112]
-vadd.s32 Q5, Q5, Q6
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #-48]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q4, [r14,#(-256)]
-// Release input[188] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q3
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #-160]
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-112)]
-// Release input[224] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-240)]
-// Release input[192] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-176)]
-// Release input[208] from Q6
-vqrdmlah.s32 Q3, Q1, r12
-// input[196]: Already loaded as Q4
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q4, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #-96]
-vadd.s32 Q4, Q4, Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #-32]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q5
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q5
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q6
-vldrw.u32 Q6, [r14, #-144]
-vadd.s32 Q7, Q7, Q5
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q5, Q1, r10
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-160)]
-// Release input[212] from Q7
-vqrdmlah.s32 Q5, Q1, r12
-// input[200]: Already loaded as Q3
-// input[216]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vadd.s32 Q3, Q3, Q6
-// input[248]: Load as Q4
-vldrw.u32 Q4, [r14, #-16]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q1, Q2, Q4
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q4
-vqrdmlah.s32 Q6, Q0, r12
-vstrw.u32 Q5, [r14,#(-32)]
-// Release input[244] from Q5
-vqrdmulh.s32 Q4, Q1, r6
-vsub.s32 Q0, Q3, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q3, Q3, Q2
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Load as Q5
-vldrw.u32 Q5, [r14, #-192]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q6, Q4
-vmul.u32 Q0, Q0, r9
-// input[220]: Load as Q7
-vldrw.u32 Q7, [r14, #-128]
-vadd.s32 Q6, Q6, Q4
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmulh.s32 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q6, [r14,#(-144)]
-// Release input[216] from Q6
-vqrdmlah.s32 Q4, Q1, r12
-// input[204]: Already loaded as Q5
-// input[220]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #-64]
-vadd.s32 Q5, Q5, Q7
-// input[252]: Load as Q3
-vldrw.u32 Q3, [r14, #0]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q1, Q2, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q2, Q2, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vstrw.u32 Q4, [r14,#(-16)]
-// Release input[248] from Q4
-vqrdmulh.s32 Q3, Q1, r6
-vsub.s32 Q0, Q5, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q3, Q1, r12
-// input[0]: Load as Q4
-vldrw.u32 Q4, [r0, #0]
-vqrdmulh.s32 Q2, Q0, r10
-vsub.s32 Q1, Q7, Q3
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q6
-vldrw.u32 Q6, [r0, #256]
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmulh.s32 Q3, Q1, r10
-vstrw.u32 Q5, [r14,#(-192)]
-// Release input[204] from Q5
-vmul.u32 Q1, Q1, r9
-vstrw.u32 Q7, [r14,#(-128)]
-// Release input[220] from Q7
-vqrdmlah.s32 Q3, Q1, r12
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-// Release input[0] from Q4
-// Release input[64] from Q6
-mov r10, #0
-.equ const_barrett, 63
-movw r9, #:lower16:const_barrett
-movt r9, #:upper16:const_barrett
-vidup.u32 Q0, r10, #1
-vshl.u32 Q0, Q0, #6
-vldrw.32 Q1, [r0, Q0, UXTW #2]
-vqrdmulh.s32 Q2, Q1, r9
-neg r12, r12
-vmla.s32 Q1, Q2, r12
-neg r12, r12
-vstrw.32 Q1, [r0, Q0, UXTW #2]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-mov r11, #0 // XXXXX
-.equ q_half, 16778496
-movw r4, #:lower16:q_half
-movt r4, #:upper16:q_half
-.equ pow_2_n_mod_q, 50631221
-movw r3, #:lower16:pow_2_n_mod_q
-movt r3, #:upper16:pow_2_n_mod_q
-.equ pow_2_n_mod_q_twisted, 2147319755
-movw r2, #:lower16:pow_2_n_mod_q_twisted
-movt r2, #:upper16:pow_2_n_mod_q_twisted
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #256]
-vsub.s32 Q2, Q0, Q1
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #-496]
-vadd.s32 Q0, Q0, Q1
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #-240]
-vqrdmulh.s32 Q1, Q2, r8
-vsub.s32 Q5, Q3, Q4
-vmul.u32 Q2, Q2, r7
-vadd.s32 Q3, Q3, Q4
-vqrdmlah.s32 Q1, Q2, r12
-vqrdmulh.s32 Q4, Q5, r6
-vsub.s32 Q2, Q0, Q3
-vmul.u32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q4, Q5, r12
-// input[4]: Load as Q5
-vldrw.u32 Q5, [r0, #16]
-vqrdmulh.s32 Q3, Q2, r10
-vsub.s32 Q6, Q1, Q4
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q3, Q2, r12
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #272]
-vqrdmulh.s32 Q2, Q0, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q0, Q0, r2
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vqrdmlah.s32 Q2, Q0, r12
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q1, Q1, r2
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #-480]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(0)]
-vqrdmulh.s32 Q4, Q6, r10
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r9
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q1
-// input[4]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-// input[132]: Already loaded as Q3
-vqrdmlah.s32 Q4, Q6, r12
-vadd.s32 Q5, Q5, Q7
-// input[196]: Load as Q1
-vldrw.u32 Q1, [r14, #-224]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q2, Q3, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q4, Q4, #1
-vpt.s32 LT, Q4, r11
-vaddt.s32 Q4, Q4, r12
-vpt.s32 GE, Q4, r4
-vsubt.s32 Q4, Q4, r12
-vstrw.u32 Q4, [r14,#(-240)]
-// Release input[192] from Q4
-vqrdmulh.s32 Q1, Q2, r6
-vsub.s32 Q0, Q5, Q3
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q1, Q2, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #32]
-vqrdmulh.s32 Q3, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q3, Q0, r12
-// input[72]: Load as Q6
-vldrw.u32 Q6, [r0, #288]
-vqrdmulh.s32 Q0, Q5, r3
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q5, Q5, r2
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vqrdmlah.s32 Q0, Q5, r12
-// Release input[4] from Q5
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #-464]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(16)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q7
-// input[8]: Already loaded as Q2
-// input[72]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[136]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-224)]
-// Release input[196] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #304]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-464)]
-// Release input[136] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #-448]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(32)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(288)]
-// Release input[72] from Q6
-// input[12]: Already loaded as Q1
-// input[76]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[140]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #-192]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #64]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[80]: Load as Q6
-vldrw.u32 Q6, [r0, #320]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-448)]
-// Release input[140] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[144]: Load as Q5
-vldrw.u32 Q5, [r14, #-432]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(48)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q7
-// input[16]: Already loaded as Q3
-// input[80]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[144]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #-176]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-192)]
-// Release input[204] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #80]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[84]: Load as Q7
-vldrw.u32 Q7, [r0, #336]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-432)]
-// Release input[144] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[16] from Q3
-vqrdmulh.s32 Q3, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #-416]
-vqrdmlah.s32 Q3, Q6, r12
-vstrw.u32 Q0, [r0,#(64)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q6
-// input[20]: Already loaded as Q2
-// input[84]: Already loaded as Q7
-vsub.s32 Q0, Q2, Q7
-// input[148]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q7
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #-160]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #96]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #352]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-416)]
-// Release input[148] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[152]: Load as Q5
-vldrw.u32 Q5, [r14, #-400]
-vqrdmlah.s32 Q2, Q7, r12
-vstrw.u32 Q0, [r0,#(80)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(336)]
-// Release input[84] from Q7
-// input[24]: Already loaded as Q1
-// input[88]: Already loaded as Q6
-vsub.s32 Q0, Q1, Q6
-// input[152]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q6
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #-144]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #112]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #368]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-400)]
-// Release input[152] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #-384]
-vqrdmlah.s32 Q1, Q6, r12
-vstrw.u32 Q0, [r0,#(96)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(352)]
-// Release input[88] from Q6
-// input[28]: Already loaded as Q3
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q3, Q7
-// input[156]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q7
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #-128]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-144)]
-// Release input[216] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[96]: Load as Q6
-vldrw.u32 Q6, [r0, #384]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-384)]
-// Release input[156] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[28] from Q3
-vqrdmulh.s32 Q3, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #-368]
-vqrdmlah.s32 Q3, Q7, r12
-vstrw.u32 Q0, [r0,#(112)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q7
-// input[32]: Already loaded as Q2
-// input[96]: Already loaded as Q6
-vsub.s32 Q0, Q2, Q6
-// input[160]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q6
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #-112]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmulh.s32 Q3, Q4, r6
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q2, Q2, Q5
-vqrdmlah.s32 Q3, Q4, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #144]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q3
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q3
-vqrdmlah.s32 Q5, Q0, r12
-// input[100]: Load as Q7
-vldrw.u32 Q7, [r0, #400]
-vqrdmulh.s32 Q0, Q2, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q2, Q2, r2
-vstrw.u32 Q5, [r14,#(-368)]
-// Release input[160] from Q5
-vqrdmlah.s32 Q0, Q2, r12
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[164]: Load as Q5
-vldrw.u32 Q5, [r14, #-352]
-vqrdmlah.s32 Q2, Q6, r12
-vstrw.u32 Q0, [r0,#(128)]
-vqrdmulh.s32 Q3, Q4, r10
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q6
-// input[36]: Already loaded as Q1
-// input[100]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-// input[164]: Already loaded as Q5
-vqrdmlah.s32 Q3, Q4, r12
-vadd.s32 Q1, Q1, Q7
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #-96]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q2
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q2
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vqrdmulh.s32 Q2, Q4, r6
-vsub.s32 Q0, Q1, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q2, Q4, r12
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #160]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q7, Q2
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q7, Q7, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[104]: Load as Q6
-vldrw.u32 Q6, [r0, #416]
-vqrdmulh.s32 Q0, Q1, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q1, Q1, r2
-vstrw.u32 Q5, [r14,#(-352)]
-// Release input[164] from Q5
-vqrdmlah.s32 Q0, Q1, r12
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q7, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q7, Q7, r2
-// input[168]: Load as Q5
-vldrw.u32 Q5, [r14, #-336]
-vqrdmlah.s32 Q1, Q7, r12
-vstrw.u32 Q0, [r0,#(144)]
-vqrdmulh.s32 Q2, Q4, r10
-vshr.s32 Q1, Q1, #1
-vpt.s32 LT, Q1, r11
-vaddt.s32 Q1, Q1, r12
-vpt.s32 GE, Q1, r4
-vsubt.s32 Q1, Q1, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q1, [r0,#(400)]
-// Release input[100] from Q7
-// input[40]: Already loaded as Q3
-// input[104]: Already loaded as Q6
-vsub.s32 Q0, Q3, Q6
-// input[168]: Already loaded as Q5
-vqrdmlah.s32 Q2, Q4, r12
-vadd.s32 Q3, Q3, Q6
-// input[232]: Load as Q1
-vldrw.u32 Q1, [r14, #-80]
-vqrdmulh.s32 Q6, Q0, r8
-vsub.s32 Q4, Q5, Q1
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q1
-vqrdmlah.s32 Q6, Q0, r12
-vshr.s32 Q2, Q2, #1
-vpt.s32 LT, Q2, r11
-vaddt.s32 Q2, Q2, r12
-vpt.s32 GE, Q2, r4
-vsubt.s32 Q2, Q2, r12
-vstrw.u32 Q2, [r14,#(-96)]
-// Release input[228] from Q2
-vqrdmulh.s32 Q1, Q4, r6
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q3, Q3, Q5
-vqrdmlah.s32 Q1, Q4, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vqrdmulh.s32 Q5, Q0, r10
-vsub.s32 Q4, Q6, Q1
-vmul.u32 Q0, Q0, r9
-vadd.s32 Q6, Q6, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[108]: Load as Q7
-vldrw.u32 Q7, [r0, #432]
-vqrdmulh.s32 Q0, Q3, r3
-vshr.s32 Q5, Q5, #1
-vpt.s32 LT, Q5, r11
-vaddt.s32 Q5, Q5, r12
-vpt.s32 GE, Q5, r4
-vsubt.s32 Q5, Q5, r12
-vmul.u32 Q3, Q3, r2
-vstrw.u32 Q5, [r14,#(-336)]
-// Release input[168] from Q5
-vqrdmlah.s32 Q0, Q3, r12
-// Release input[40] from Q3
-vqrdmulh.s32 Q3, Q6, r3
-vshr.s32 Q0, Q0, #1
-vpt.s32 LT, Q0, r11
-vaddt.s32 Q0, Q0, r12
-vpt.s32 GE, Q0, r4
-vsubt.s32 Q0, Q0, r12
-vmul.u32 Q6, Q6, r2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #-320]
-vqrdmlah.s32 Q3, Q6, r12
-vstrw.u32 Q0, [r0,#(160)]
-vqrdmulh.s32 Q1, Q4, r10
-vshr.s32 Q3, Q3, #1
-vpt.s32 LT, Q3, r11
-vaddt.s32 Q3, Q3, r12
-vpt.s32 GE, Q3, r4
-vsubt.s32 Q3, Q3, r12
-vmul.u32 Q4, Q4, r9
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q6
-// input[44]: Already loaded as Q2
-// input[108]: Already loaded as Q7
-vsub.s32 Q0, Q2, Q7
-// input[172]: Already loaded as Q5
-vqrdmlah.s32 Q1, Q4, r12
-vadd.s32 Q2, Q2, Q7
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #-64]
-vqrdmulh.s32 Q7, Q0, r8
-vsub.s32 Q4, Q5, Q3
-vmul.u32 Q0, Q0, r7
-vadd.s32 Q5, Q5, Q3
-vqrdmlah.s32 Q7, Q0, r12
-vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #448] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #-304] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #208] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 
-vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #-288] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #480] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 
Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #496] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, 
[r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 3244 -// Instruction count: 2742 \ No newline at end of file diff --git a/tests/saber/auto/ntt_n256_u32_33556993_28678040_complete.s 
b/tests/saber/auto/ntt_n256_u32_33556993_28678040_complete.s deleted file mode 100644 index 9380e32..0000000 --- a/tests/saber/auto/ntt_n256_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,2907 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.word 13704133 // zeta^ 2 * 2^31 = 28678040^ 2 * 2^31 -.word 41177999 // zeta^130 * 2^31 = 28678040^130 * 2^31 -.word 26703739 // zeta^ 66 * 2^31 = 28678040^ 66 * 2^31 -.word 65289035 // zeta^194 * 2^31 = 28678040^194 * 2^31 -.word 1666225723 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 2 * 375649793 * 2^31 -.word 2599633521 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 28678040^130 * 375649793 * 2^31 -.word 2869384837 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 66 * 375649793 * 2^31 -.word 1260434101 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 28678040^194 * 375649793 * 2^31 -.word 50326315 // zeta^ 1 * 2^31 = 28678040^ 1 * 2^31 -.word 37746191 // zeta^ 65 * 2^31 = 28678040^ 65 * 2^31 -.word 49080301 // zeta^ 33 * 2^31 = 28678040^ 33 * 2^31 -.word 34232193 // zeta^ 97 * 2^31 = 28678040^ 97 * 2^31 -.word 1835254485 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 1 * 375649793 * 2^31 -.word 360751089 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 65 * 375649793 * 2^31 -.word 1200511507 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 33 * 375649793 * 2^31 -.word 553431679 // zeta^ 97 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 97 * 375649793 * 2^31 -.word 22955837 // zeta^129 * 2^31 = 28678040^129 * 2^31 -.word 31411079 // zeta^193 * 2^31 = 28678040^193 * 2^31 -.word 492607 // zeta^161 * 2^31 = 28678040^161 * 2^31 -.word 22217509 // zeta^225 * 2^31 = 28678040^225 * 2^31 -.word 5481609 // zeta^ 34 * 2^31 = 28678040^ 34 * 2^31 -.word 12552175 // zeta^162 * 
2^31 = 28678040^162 * 2^31 -.word 54494203 // zeta^ 98 * 2^31 = 28678040^ 98 * 2^31 -.word 32704019 // zeta^226 * 2^31 = 28678040^226 * 2^31 -.word 949335415 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 34 * 375649793 * 2^31 -.word 3610496529 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 28678040^162 * 375649793 * 2^31 -.word 1474054661 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 98 * 375649793 * 2^31 -.word 2061350893 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 28678040^226 * 375649793 * 2^31 -.word 48767307 // zeta^ 17 * 2^31 = 28678040^ 17 * 2^31 -.word 39600285 // zeta^ 81 * 2^31 = 28678040^ 81 * 2^31 -.word 31654617 // zeta^ 49 * 2^31 = 28678040^ 49 * 2^31 -.word 4736231 // zeta^113 * 2^31 = 28678040^113 * 2^31 -.word 2602093749 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 17 * 375649793 * 2^31 -.word 3705004387 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 81 * 375649793 * 2^31 -.word 427128615 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 49 * 375649793 * 2^31 -.word 237814041 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 28678040^113 * 375649793 * 2^31 -.word 18965555 // zeta^145 * 2^31 = 28678040^145 * 2^31 -.word 50771049 // zeta^209 * 2^31 = 28678040^209 * 2^31 -.word 8794671 // zeta^177 * 2^31 = 28678040^177 * 2^31 -.word 59508707 // zeta^241 * 2^31 = 28678040^241 * 2^31 -.word 43973433 // zeta^ 18 * 2^31 = 28678040^ 18 * 2^31 -.word 14453865 // zeta^146 * 2^31 = 28678040^146 * 2^31 -.word 14937153 // zeta^ 82 * 2^31 = 28678040^ 82 * 2^31 -.word 39701997 // zeta^210 * 2^31 = 28678040^210 * 2^31 -.word 720191175 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 18 * 375649793 * 2^31 -.word 3181088151 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 28678040^146 * 375649793 * 2^31 -.word 116563391 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 82 * 375649793 * 2^31 -.word 3642323987 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 28678040^210 * 375649793 * 2^31 -.word 53455571 // zeta^ 9 * 2^31 = 28678040^ 9 * 2^31 -.word 35877127 // zeta^ 73 * 2^31 
= 28678040^ 73 * 2^31 -.word 681755 // zeta^ 41 * 2^31 = 28678040^ 41 * 2^31 -.word 63245537 // zeta^105 * 2^31 = 28678040^105 * 2^31 -.word 4245721901 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 9 * 375649793 * 2^31 -.word 2676675833 // zeta^ 73 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 73 * 375649793 * 2^31 -.word 3480266469 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 41 * 375649793 * 2^31 -.word 1356315935 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 28678040^105 * 375649793 * 2^31 -.word 11718751 // zeta^137 * 2^31 = 28678040^137 * 2^31 -.word 41885553 // zeta^201 * 2^31 = 28678040^201 * 2^31 -.word 54210213 // zeta^169 * 2^31 = 28678040^169 * 2^31 -.word 16838301 // zeta^233 * 2^31 = 28678040^233 * 2^31 -.word 40841465 // zeta^ 50 * 2^31 = 28678040^ 50 * 2^31 -.word 3577749 // zeta^178 * 2^31 = 28678040^178 * 2^31 -.word 33845545 // zeta^114 * 2^31 = 28678040^114 * 2^31 -.word 19555165 // zeta^242 * 2^31 = 28678040^242 * 2^31 -.word 3459680519 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 50 * 375649793 * 2^31 -.word 495008363 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 28678040^178 * 375649793 * 2^31 -.word 1885546711 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 28678040^114 * 375649793 * 2^31 -.word 3630382755 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 28678040^242 * 375649793 * 2^31 -.word 62758213 // zeta^ 25 * 2^31 = 28678040^ 25 * 2^31 -.word 8005843 // zeta^ 89 * 2^31 = 28678040^ 89 * 2^31 -.word 51922779 // zeta^ 57 * 2^31 = 28678040^ 57 * 2^31 -.word 7245689 // zeta^121 * 2^31 = 28678040^121 * 2^31 -.word 124982459 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 25 * 375649793 * 2^31 -.word 2964460845 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 89 * 375649793 * 2^31 -.word 1042630309 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 57 * 375649793 * 2^31 -.word 3756534407 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 28678040^121 * 375649793 * 2^31 -.word 30225471 // zeta^153 * 2^31 = 28678040^153 * 2^31 -.word 44151511 // zeta^217 * 2^31 = 
28678040^217 * 2^31 -.word 64890121 // zeta^185 * 2^31 = 28678040^185 * 2^31 -.word 65259669 // zeta^249 * 2^31 = 28678040^249 * 2^31 -.word 12974361 // zeta^ 10 * 2^31 = 28678040^ 10 * 2^31 -.word 41807515 // zeta^138 * 2^31 = 28678040^138 * 2^31 -.word 56379967 // zeta^ 74 * 2^31 = 28678040^ 74 * 2^31 -.word 13380915 // zeta^202 * 2^31 = 28678040^202 * 2^31 -.word 1194393831 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 10 * 375649793 * 2^31 -.word 1648893797 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 28678040^138 * 375649793 * 2^31 -.word 753806273 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 74 * 375649793 * 2^31 -.word 4010528973 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 28678040^202 * 375649793 * 2^31 -.word 16772797 // zeta^ 5 * 2^31 = 28678040^ 5 * 2^31 -.word 58675875 // zeta^ 69 * 2^31 = 28678040^ 69 * 2^31 -.word 59974505 // zeta^ 37 * 2^31 = 28678040^ 37 * 2^31 -.word 33980107 // zeta^101 * 2^31 = 28678040^101 * 2^31 -.word 2122281795 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 5 * 375649793 * 2^31 -.word 2886667101 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 69 * 375649793 * 2^31 -.word 3771397783 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 37 * 375649793 * 2^31 -.word 1168207669 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 28678040^101 * 375649793 * 2^31 -.word 28448893 // zeta^133 * 2^31 = 28678040^133 * 2^31 -.word 24378249 // zeta^197 * 2^31 = 28678040^197 * 2^31 -.word 62687027 // zeta^165 * 2^31 = 28678040^165 * 2^31 -.word 65645595 // zeta^229 * 2^31 = 28678040^229 * 2^31 -.word 52771617 // zeta^ 42 * 2^31 = 28678040^ 42 * 2^31 -.word 23396495 // zeta^170 * 2^31 = 28678040^170 * 2^31 -.word 51483005 // zeta^106 * 2^31 = 28678040^106 * 2^31 -.word 11487943 // zeta^234 * 2^31 = 28678040^234 * 2^31 -.word 2185629407 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 42 * 375649793 * 2^31 -.word 1858377073 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 28678040^170 * 375649793 * 2^31 -.word 432623747 // zeta^106 * (q^(-1) mod 2^32) 
* 2^31 = 28678040^106 * 375649793 * 2^31
-.word 2290121529 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 28678040^234 * 375649793 * 2^31
-.word 63287737 // zeta^ 21 * 2^31 = 28678040^ 21 * 2^31
-.word 56338313 // zeta^ 85 * 2^31 = 28678040^ 85 * 2^31
-.word 19445427 // zeta^ 53 * 2^31 = 28678040^ 53 * 2^31
-.word 29167561 // zeta^117 * 2^31 = 28678040^117 * 2^31
-.word 1659340871 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 21 * 375649793 * 2^31
-.word 1504424567 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 85 * 375649793 * 2^31
-.word 3591259981 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 53 * 375649793 * 2^31
-.word 4032612919 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 28678040^117 * 375649793 * 2^31
-.word 7740335 // zeta^149 * 2^31 = 28678040^149 * 2^31
-.word 23515783 // zeta^213 * 2^31 = 28678040^213 * 2^31
-.word 33583453 // zeta^181 * 2^31 = 28678040^181 * 2^31
-.word 60337403 // zeta^245 * 2^31 = 28678040^245 * 2^31
-.word 35192755 // zeta^ 26 * 2^31 = 28678040^ 26 * 2^31
-.word 36544119 // zeta^154 * 2^31 = 28678040^154 * 2^31
-.word 6787663 // zeta^ 90 * 2^31 = 28678040^ 90 * 2^31
-.word 63484749 // zeta^218 * 2^31 = 28678040^218 * 2^31
-.word 3019374157 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 26 * 375649793 * 2^31
-.word 2777089929 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 28678040^154 * 375649793 * 2^31
-.word 443777969 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 90 * 375649793 * 2^31
-.word 723799731 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 28678040^218 * 375649793 * 2^31
-.word 61997615 // zeta^ 13 * 2^31 = 28678040^ 13 * 2^31
-.word 4479011 // zeta^ 77 * 2^31 = 28678040^ 77 * 2^31
-.word 38089877 // zeta^ 45 * 2^31 = 28678040^ 45 * 2^31
-.word 16590903 // zeta^109 * 2^31 = 28678040^109 * 2^31
-.word 201839569 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 13 * 375649793 * 2^31
-.word 998311389 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 77 * 375649793 * 2^31
-.word 1502911851 // zeta^ 45 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 45 * 375649793 * 2^31
-.word 1931017673 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 28678040^109 * 375649793 * 2^31
-.word 43852787 // zeta^141 * 2^31 = 28678040^141 * 2^31
-.word 24597857 // zeta^205 * 2^31 = 28678040^205 * 2^31
-.word 43936833 // zeta^173 * 2^31 = 28678040^173 * 2^31
-.word 15636061 // zeta^237 * 2^31 = 28678040^237 * 2^31
-.word 55869129 // zeta^ 58 * 2^31 = 28678040^ 58 * 2^31
-.word 16038683 // zeta^186 * 2^31 = 28678040^186 * 2^31
-.word 43560065 // zeta^122 * 2^31 = 28678040^122 * 2^31
-.word 25949329 // zeta^250 * 2^31 = 28678040^250 * 2^31
-.word 2098944823 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 58 * 375649793 * 2^31
-.word 634278629 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 28678040^186 * 375649793 * 2^31
-.word 2076204415 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 28678040^122 * 375649793 * 2^31
-.word 2002629999 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 28678040^250 * 375649793 * 2^31
-.word 6591765 // zeta^ 29 * 2^31 = 28678040^ 29 * 2^31
-.word 1696249 // zeta^ 93 * 2^31 = 28678040^ 93 * 2^31
-.word 21795289 // zeta^ 61 * 2^31 = 28678040^ 61 * 2^31
-.word 17734591 // zeta^125 * 2^31 = 28678040^125 * 2^31
-.word 3812244715 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 29 * 375649793 * 2^31
-.word 1467340807 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 93 * 375649793 * 2^31
-.word 1570891815 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 61 * 375649793 * 2^31
-.word 1349179969 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 28678040^125 * 375649793 * 2^31
-.word 66853037 // zeta^157 * 2^31 = 28678040^157 * 2^31
-.word 24930199 // zeta^221 * 2^31 = 28678040^221 * 2^31
-.word 54854635 // zeta^189 * 2^31 = 28678040^189 * 2^31
-.word 39952565 // zeta^253 * 2^31 = 28678040^253 * 2^31
-.word 5623923 // zeta^  6 * 2^31 = 28678040^  6 * 2^31
-.word 38701067 // zeta^134 * 2^31 = 28678040^134 * 2^31
-.word 18571677 // zeta^ 70 * 2^31 = 28678040^ 70 * 2^31
-.word 14491707 // zeta^198 * 2^31 = 28678040^198 * 2^31
-.word 182627725 // zeta^  6 * (q^(-1) mod 2^32) * 2^31 = 28678040^  6 * 375649793 * 2^31
-.word 4172670453 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 28678040^134 * 375649793 * 2^31
-.word 1902166115 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 70 * 375649793 * 2^31
-.word 4183371205 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 28678040^198 * 375649793 * 2^31
-.word 17941849 // zeta^  3 * 2^31 = 28678040^  3 * 2^31
-.word 12982967 // zeta^ 67 * 2^31 = 28678040^ 67 * 2^31
-.word 8061707 // zeta^ 35 * 2^31 = 28678040^ 35 * 2^31
-.word 17774995 // zeta^ 99 * 2^31 = 28678040^ 99 * 2^31
-.word 4091524263 // zeta^  3 * (q^(-1) mod 2^32) * 2^31 = 28678040^  3 * 375649793 * 2^31
-.word 2462649161 // zeta^ 67 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 67 * 375649793 * 2^31
-.word 2874632949 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 35 * 375649793 * 2^31
-.word 2009367661 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 99 * 375649793 * 2^31
-.word 61107981 // zeta^131 * 2^31 = 28678040^131 * 2^31
-.word 38975641 // zeta^195 * 2^31 = 28678040^195 * 2^31
-.word 40352225 // zeta^163 * 2^31 = 28678040^163 * 2^31
-.word 49569327 // zeta^227 * 2^31 = 28678040^227 * 2^31
-.word 26799603 // zeta^ 38 * 2^31 = 28678040^ 38 * 2^31
-.word 33463463 // zeta^166 * 2^31 = 28678040^166 * 2^31
-.word 39332725 // zeta^102 * 2^31 = 28678040^102 * 2^31
-.word 61125067 // zeta^230 * 2^31 = 28678040^230 * 2^31
-.word 583438349 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 38 * 375649793 * 2^31
-.word 1692658009 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 28678040^166 * 375649793 * 2^31
-.word 1738958475 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 28678040^102 * 375649793 * 2^31
-.word 2248227893 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 28678040^230 * 375649793 * 2^31
-.word 40014327 // zeta^ 19 * 2^31 = 28678040^ 19 * 2^31
-.word 562885 // zeta^ 83 * 2^31 = 28678040^ 83 * 2^31
-.word 51009393 // zeta^ 51 * 2^31 = 28678040^ 51 * 2^31
-.word 51995259 // zeta^115 * 2^31 = 28678040^115 * 2^31
-.word 2564101129 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 19 * 375649793 * 2^31
-.word 2196183867 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 83 * 375649793 * 2^31
-.word 2252083855 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 51 * 375649793 * 2^31
-.word 4038290309 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 28678040^115 * 375649793 * 2^31
-.word 24330211 // zeta^147 * 2^31 = 28678040^147 * 2^31
-.word 7682101 // zeta^211 * 2^31 = 28678040^211 * 2^31
-.word 7401943 // zeta^179 * 2^31 = 28678040^179 * 2^31
-.word 41757453 // zeta^243 * 2^31 = 28678040^243 * 2^31
-.word 65375453 // zeta^ 22 * 2^31 = 28678040^ 22 * 2^31
-.word 40797001 // zeta^150 * 2^31 = 28678040^150 * 2^31
-.word 59835311 // zeta^ 86 * 2^31 = 28678040^ 86 * 2^31
-.word 32875577 // zeta^214 * 2^31 = 28678040^214 * 2^31
-.word 4014413091 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 22 * 375649793 * 2^31
-.word 3224262327 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 28678040^150 * 375649793 * 2^31
-.word 741855825 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 86 * 375649793 * 2^31
-.word 2318439879 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 28678040^214 * 375649793 * 2^31
-.word 10045293 // zeta^ 11 * 2^31 = 28678040^ 11 * 2^31
-.word 53076657 // zeta^ 75 * 2^31 = 28678040^ 75 * 2^31
-.word 17896617 // zeta^ 43 * 2^31 = 28678040^ 43 * 2^31
-.word 58413331 // zeta^107 * 2^31 = 28678040^107 * 2^31
-.word 3080518291 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 11 * 375649793 * 2^31
-.word 3700229967 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 75 * 375649793 * 2^31
-.word 297370967 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 43 * 375649793 * 2^31
-.word 2151902445 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 28678040^107 * 375649793 * 2^31
-.word 19472551 // zeta^139 * 2^31 = 28678040^139 * 2^31
-.word 6043561 // zeta^203 * 2^31 = 28678040^203 * 2^31
-.word 20934449 // zeta^171 * 2^31 = 28678040^171 * 2^31
-.word 37620445 // zeta^235 * 2^31 = 28678040^235 * 2^31
-.word 12921459 // zeta^ 54 * 2^31 = 28678040^ 54 * 2^31
-.word 63769677 // zeta^182 * 2^31 = 28678040^182 * 2^31
-.word 61505033 // zeta^118 * 2^31 = 28678040^118 * 2^31
-.word 65692461 // zeta^246 * 2^31 = 28678040^246 * 2^31
-.word 1006064525 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 54 * 375649793 * 2^31
-.word 2459563443 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 28678040^182 * 375649793 * 2^31
-.word 2747128823 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 28678040^118 * 375649793 * 2^31
-.word 2288082643 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 28678040^246 * 375649793 * 2^31
-.word 20171011 // zeta^ 27 * 2^31 = 28678040^ 27 * 2^31
-.word 36495001 // zeta^ 91 * 2^31 = 28678040^ 91 * 2^31
-.word 62685175 // zeta^ 59 * 2^31 = 28678040^ 59 * 2^31
-.word 664745 // zeta^123 * 2^31 = 28678040^123 * 2^31
-.word 1031427325 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 27 * 375649793 * 2^31
-.word 2764118887 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 91 * 375649793 * 2^31
-.word 583476745 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 59 * 375649793 * 2^31
-.word 2371908951 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 28678040^123 * 375649793 * 2^31
-.word 56713759 // zeta^155 * 2^31 = 28678040^155 * 2^31
-.word 59594509 // zeta^219 * 2^31 = 28678040^219 * 2^31
-.word 41235703 // zeta^187 * 2^31 = 28678040^187 * 2^31
-.word 11581499 // zeta^251 * 2^31 = 28678040^251 * 2^31
-.word 23458751 // zeta^ 14 * 2^31 = 28678040^ 14 * 2^31
-.word 9406759 // zeta^142 * 2^31 = 28678040^142 * 2^31
-.word 33711991 // zeta^ 78 * 2^31 = 28678040^ 78 * 2^31
-.word 32167773 // zeta^206 * 2^31 = 28678040^206 * 2^31
-.word 1501790785 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 14 * 375649793 * 2^31
-.word 2911894745 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 28678040^142 * 375649793 * 2^31
-.word 1905016457 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 78 * 375649793 * 2^31
-.word 204130979 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 28678040^206 * 375649793 * 2^31
-.word 26043621 // zeta^  7 * 2^31 = 28678040^  7 * 2^31
-.word 51942461 // zeta^ 71 * 2^31 = 28678040^ 71 * 2^31
-.word 14401009 // zeta^ 39 * 2^31 = 28678040^ 39 * 2^31
-.word 60574133 // zeta^103 * 2^31 = 28678040^103 * 2^31
-.word 1827638555 // zeta^  7 * (q^(-1) mod 2^32) * 2^31 = 28678040^  7 * 375649793 * 2^31
-.word 3437088195 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 71 * 375649793 * 2^31
-.word 2892737551 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 39 * 375649793 * 2^31
-.word 3197159499 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 28678040^103 * 375649793 * 2^31
-.word 16031087 // zeta^135 * 2^31 = 28678040^135 * 2^31
-.word 25566271 // zeta^199 * 2^31 = 28678040^199 * 2^31
-.word 54040269 // zeta^167 * 2^31 = 28678040^167 * 2^31
-.word 36895029 // zeta^231 * 2^31 = 28678040^231 * 2^31
-.word 41803191 // zeta^ 46 * 2^31 = 28678040^ 46 * 2^31
-.word 19377381 // zeta^174 * 2^31 = 28678040^174 * 2^31
-.word 9664027 // zeta^110 * 2^31 = 28678040^110 * 2^31
-.word 55794235 // zeta^238 * 2^31 = 28678040^238 * 2^31
-.word 2460960841 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 46 * 375649793 * 2^31
-.word 1411728667 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 28678040^174 * 375649793 * 2^31
-.word 1300076517 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 28678040^110 * 375649793 * 2^31
-.word 3978752965 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 28678040^238 * 375649793 * 2^31
-.word 19675339 // zeta^ 23 * 2^31 = 28678040^ 23 * 2^31
-.word 21359151 // zeta^ 87 * 2^31 = 28678040^ 87 * 2^31
-.word 63140729 // zeta^ 55 * 2^31 = 28678040^ 55 * 2^31
-.word 23160723 // zeta^119 * 2^31 = 28678040^119 * 2^31
-.word 398439733 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 23 * 375649793 * 2^31
-.word 897838033 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 87 * 375649793 * 2^31
-.word 494618247 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 55 * 375649793 * 2^31
-.word 3040761453 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 28678040^119 * 375649793 * 2^31
-.word 9258847 // zeta^151 * 2^31 = 28678040^151 * 2^31
-.word 4669959 // zeta^215 * 2^31 = 28678040^215 * 2^31
-.word 41266143 // zeta^183 * 2^31 = 28678040^183 * 2^31
-.word 61464071 // zeta^247 * 2^31 = 28678040^247 * 2^31
-.word 43355169 // zeta^ 30 * 2^31 = 28678040^ 30 * 2^31
-.word 5591977 // zeta^158 * 2^31 = 28678040^158 * 2^31
-.word 40694335 // zeta^ 94 * 2^31 = 28678040^ 94 * 2^31
-.word 25071607 // zeta^222 * 2^31 = 28678040^222 * 2^31
-.word 1107279327 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 30 * 375649793 * 2^31
-.word 552289879 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 28678040^158 * 375649793 * 2^31
-.word 879592385 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 94 * 375649793 * 2^31
-.word 2040862217 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 28678040^222 * 375649793 * 2^31
-.word 34737117 // zeta^ 15 * 2^31 = 28678040^ 15 * 2^31
-.word 45994147 // zeta^ 79 * 2^31 = 28678040^ 79 * 2^31
-.word 42273719 // zeta^ 47 * 2^31 = 28678040^ 47 * 2^31
-.word 60428681 // zeta^111 * 2^31 = 28678040^111 * 2^31
-.word 303076899 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 15 * 375649793 * 2^31
-.word 3854339421 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 79 * 375649793 * 2^31
-.word 3799259721 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 47 * 375649793 * 2^31
-.word 1636911223 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 28678040^111 * 375649793 * 2^31
-.word 26028927 // zeta^143 * 2^31 = 28678040^143 * 2^31
-.word 64083527 // zeta^207 * 2^31 = 28678040^207 * 2^31
-.word 60382541 // zeta^175 * 2^31 = 28678040^175 * 2^31
-.word 31337387 // zeta^239 * 2^31 = 28678040^239 * 2^31
-.word 27553395 // zeta^ 62 * 2^31 = 28678040^ 62 * 2^31
-.word 7648471 // zeta^190 * 2^31 = 28678040^190 * 2^31
-.word 689375 // zeta^126 * 2^31 = 28678040^126 * 2^31
-.word 46555773 // zeta^254 * 2^31 = 28678040^254 * 2^31
-.word 1673531277 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 62 * 375649793 * 2^31
-.word 1889513769 // zeta^190 * (q^(-1) mod 2^32) * 2^31 = 28678040^190 * 375649793 * 2^31
-.word 1477062945 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 28678040^126 * 375649793 * 2^31
-.word 2252242819 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 28678040^254 * 375649793 * 2^31
-.word 15797163 // zeta^ 31 * 2^31 = 28678040^ 31 * 2^31
-.word 40170027 // zeta^ 95 * 2^31 = 28678040^ 95 * 2^31
-.word 10866061 // zeta^ 63 * 2^31 = 28678040^ 63 * 2^31
-.word 56298001 // zeta^127 * 2^31 = 28678040^127 * 2^31
-.word 683123285 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 31 * 375649793 * 2^31
-.word 2755967957 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 95 * 375649793 * 2^31
-.word 273527923 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 63 * 375649793 * 2^31
-.word 644194287 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 28678040^127 * 375649793 * 2^31
-.word 50400667 // zeta^159 * 2^31 = 28678040^159 * 2^31
-.word 33861863 // zeta^223 * 2^31 = 28678040^223 * 2^31
-.word 53736885 // zeta^191 * 2^31 = 28678040^191 * 2^31
-.word 31774129 // zeta^255 * 2^31 = 28678040^255 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_n256_u32_33556993_28678040, %function
-.global ntt_n256_u32_33556993_28678040
-ntt_n256_u32_33556993_28678040:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d0-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Using modulus 33556993
-.equ modulus, 33556993
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #-240]
-vqrdmulh.s32 Q1, Q0, r10
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #-496]
-vmul.u32 Q0, Q0, r9
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #256]
-vqrdmlah.s32 Q1, Q0, r12
-vqrdmulh.s32 Q4, Q2, r10
-vsub.s32 Q0, Q3, Q1
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q1
-vqrdmlah.s32 Q4, Q2, r12
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #0]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q2, Q1, Q4
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q0, r12
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q6
-// input[196]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[132]: Load as Q2
-vldrw.u32 Q2, [r14, #-480]
-vmul.u32 Q4, Q4, r9
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #272]
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vqrdmulh.s32 Q1, Q2, r10
-vsub.s32 Q4, Q3, Q0
-vmul.u32 Q2, Q2, r9
-vadd.s32 Q3, Q3, Q0
-vqrdmlah.s32 Q1, Q2, r12
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #16]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q4, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #-208]
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q4, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vadd.s32 Q2, Q2, Q5
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vqrdmlah.s32 Q6, Q3, r12
-vstrw.u32 Q2, [r14,#(-480)]
-// Release input[132] from Q2
-vsub.s32 Q3, Q0, Q6
-vstrw.u32 Q3, [r0,#(272)]
-// Release input[68] from Q3
-vadd.s32 Q0, Q0, Q6
-// input[200]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #-464]
-vmul.u32 Q1, Q1, r9
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #288]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #32]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #-192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[204]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #-448]
-vmul.u32 Q0, Q0, r9
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #304]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #-176]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[208]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #-432]
-vmul.u32 Q2, Q2, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[16]: Load as Q0
-vldrw.u32 Q0, [r0, #64]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[212]: Load as Q1
-vldrw.u32 Q1, [r14, #-160]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-176)]
-// Release input[208] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[212]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #-416]
-vmul.u32 Q1, Q1, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(64)]
-// Release input[16] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #80]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[216]: Load as Q0
-vldrw.u32 Q0, [r14, #-144]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-160)]
-// Release input[212] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[216]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #-400]
-vmul.u32 Q0, Q0, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #352]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(80)]
-// Release input[20] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #96]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #-128]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-144)]
-// Release input[216] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[220]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #-384]
-vmul.u32 Q2, Q2, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #368]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #112]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #-112]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-128)]
-// Release input[220] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[224]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #-368]
-vmul.u32 Q1, Q1, r9
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #384]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #128]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[228]: Load as Q0
-vldrw.u32 Q0, [r14, #-96]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #-352]
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #400]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #144]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #416]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #160]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #-64]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vmul.u32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #-48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #-304]
-vmul.u32 Q0, Q0, r9
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #448]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #192]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #-32]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vmul.u32 Q2, Q2, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #208]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #-16]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #-272]
-vmul.u32 Q1, Q1, r9
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #480]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #224]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #-256]
-vmul.u32 Q0, Q0, r9
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #496]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #240]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #128]
-vmul.u32 Q2, Q2, r9
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #64]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #208]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #144]
-vmul.u32 Q1, Q1, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #80]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #16]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #224]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #160]
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #96]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #32]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #240]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #176]
-vmul.u32 Q2, Q2, r9
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #112]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #48]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #448]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #384]
-vmul.u32 Q1, Q1, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #256]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #464]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #400]
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[68]: Load as Q1
-vldrw.u32 Q1, [r0, #272]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #480]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #416]
-vmul.u32 Q2, Q2, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #352]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(272)]
-// Release input[68] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #288]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #496]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #432]
-vmul.u32 Q1, Q1, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #368]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #304]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #-304]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release
input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q1, Q1, r9
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #-400]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #-464]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #-256]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #-448]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #-48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #-112]
-vmul.u32 Q2, Q2, r9
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #-176]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #-240]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #-32]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #-96]
-vmul.u32 Q1, Q1, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #-160]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #-224]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #-16]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #-80]
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #-144]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #-208]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #-64]
-vmul.u32 Q2, Q2, r9
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #-128]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #-192]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[12]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vmul.u32 Q1, Q1, r9
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #0]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #112]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[28]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #96]
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #80]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #64]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[44]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #160]
-vmul.u32 Q2, Q2, r9
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #144]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #128]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #240]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #224]
-vmul.u32 Q1, Q1, r9
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #208]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(128)]
-// Release input[32] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #192]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #304]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[76]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #288]
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #272]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #256]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #368]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[92]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #352]
-vmul.u32 Q2, Q2, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #320]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #432]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(368)]
-// Release input[92] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[108]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #416]
-vmul.u32 Q1, Q1, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #400]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(320)]
-// Release input[80] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #384]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #496]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #480]
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #448]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #-448]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[140]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #-464]
-vmul.u32 Q2, Q2, r9
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #-480]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #-496]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[156]: Load as Q1
-vldrw.u32 Q1, [r14, #-384]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[156]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #-400]
-vmul.u32 Q1, Q1, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #-416]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #-432]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[172]: Load as Q0
-vldrw.u32 Q0, [r14, #-320]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-384)]
-// Release input[156] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[172]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #-352]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #-368]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #-256]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-320)]
-// Release input[172] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #-272]
-vmul.u32 Q2, Q2, r9
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #-288]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #-304]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #-192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[204]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vmul.u32 Q1, Q1, r9
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #-240]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[220]: Load as Q0
-vldrw.u32 Q0, [r14, #-128]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[220]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #-144]
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #-160]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #-176]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #-64]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-128)]
-// Release input[220] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[236]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #-80]
-vmul.u32 Q2, Q2, r9
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #-96]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #-112]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #-16]
-vmul.u32 Q1, Q1, r9
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #-32]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #-48]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-vqrdmulh.s32 Q0, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-// Modular inverse of 33556993 mod 2^32 = 375649793
-.equ modulus_inv, 3919317503
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vldrw.s32 Q5, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q5
-vldrw.s32 Q6, [r11, #-64]
-vmul.u32 Q3, Q3, Q6
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q5, Q2, Q5
-vsub.s32 Q7, Q1, Q4
-vmul.u32 Q2, Q2, Q6
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q5, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q5
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q5
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q5, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q6, Q5, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q5, Q7, Q5
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q7, Q7, Q6
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q5, Q7, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q7, Q4, Q5
-vstrw.s32 Q7, [r0, #-80]
-vadd.s32 Q4, Q4, Q5
-// Butterfly [0, 1, 2, 3]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-// Butterfly [16, 17, 18, 19]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-// Butterfly [32, 33, 34, 35]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-// Butterfly [48, 49, 50, 51]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-// Butterfly [64, 65, 66, 67]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-// Butterfly [80, 81, 82, 83]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-// Butterfly [96, 97, 98, 99]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-// Butterfly [112, 113, 114, 115]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6
-vstrw.s32 Q5, [r0, #-80]
-vadd.s32 Q4, Q4, Q6
-// Butterfly [128, 129, 130, 131]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q5, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q4, [r0, #-96]
-vqrdmlah.s32 Q5, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q5
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q5, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q4, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q4, Q4, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q4, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q4, Q5, Q6
-vstrw.s32 Q4, [r0, #-80]
-vadd.s32 Q5, Q5, Q6
-// Butterfly [144, 145, 146, 147]
-vldrw.s32 Q6, [r11], #80
-vqrdmulh.s32 Q4, Q3, Q6
-vldrw.s32 Q7, [r11, #-64]
-vmul.u32 Q3, Q3, Q7
-vstrw.s32 Q5, [r0, #-96]
-vqrdmlah.s32 Q4, Q3, r12
-vldrw.s32 Q3, [r11, #-48]
-vqrdmulh.s32 Q6, Q2, Q6
-vsub.s32 Q5, Q1, Q4
-vmul.u32 Q2, Q2, Q7
-vadd.s32 Q1, Q1, Q4
-vqrdmlah.s32 Q6, Q2, r12
-vldrw.s32 Q2, [r11, #-32]
-vqrdmulh.s32 Q3, Q1, Q3
-vsub.s32 Q4, Q0, Q6
-vmul.u32 Q1, Q1, Q2
-vadd.s32 Q0, Q0, Q6
-vqrdmlah.s32 Q3, Q1, r12
-vldrw.s32 Q6, [r11, #-16]
-vsub.s32 Q1, Q0, Q3
-vstrw.s32 Q1, [r0,#-48]
-vadd.s32 Q0, Q0, Q3
-vstrw.s32 Q0, [r0, #-64]
-vmul.u32 Q7, Q6, r10
-vld41.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmulh.s32 Q6, Q5, Q6
-vld40.s32 {Q0,Q1,Q2,Q3}, [r0]
-vmul.u32 Q5, Q5, Q7
-vld42.s32 {Q0,Q1,Q2,Q3}, [r0]
-vqrdmlah.s32 Q6, Q5, r12
-vld43.s32 {Q0,Q1,Q2,Q3}, [r0]!
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [160, 161, 162, 163] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [176, 177, 178, 179] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [192, 193, 194, 195] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [208, 209, 210, 211] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [224, 225, 226, 227] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vqrdmulh.s32 Q6, Q4, Q6 -vmul.u32 Q4, Q4, Q7 -vqrdmlah.s32 Q6, Q4, r12 -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-16] -vadd.s32 Q5, Q5, Q6 -vstrw.s32 Q5, [r0, #-32] -// Butterfly [240, 241, 242, 243] -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2875 -// Instruction count: 2421 \ No newline at end of file diff --git a/tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete.s b/tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete.s deleted file mode 100644 index e6259be..0000000 --- a/tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,2027 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial 
portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // 
zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 
2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 
28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 
28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 
28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040_incomplete, %function -.global ntt_n256_u32_33556993_28678040_incomplete -ntt_n256_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 
-vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 
-vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, 
#-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release 
input[160] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[228]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #-352]
-vmul.u32 Q0, Q0, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #400]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #144]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[232]: Load as Q2
-vldrw.u32 Q2, [r14, #-80]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-96)]
-// Release input[228] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[232]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q2, Q2, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #416]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #160]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[236]: Load as Q1
-vldrw.u32 Q1, [r14, #-64]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-80)]
-// Release input[232] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[236]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vmul.u32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #432]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(160)]
-// Release input[40] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[240]: Load as Q0
-vldrw.u32 Q0, [r14, #-48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-64)]
-// Release input[236] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[240]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #-304]
-vmul.u32 Q0, Q0, r9
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #448]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #192]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #-32]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-48)]
-// Release input[240] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[244]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #-288]
-vmul.u32 Q2, Q2, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #208]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[248]: Load as Q1
-vldrw.u32 Q1, [r14, #-16]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-32)]
-// Release input[244] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #-272]
-vmul.u32 Q1, Q1, r9
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #480]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #224]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[252]: Load as Q0
-vldrw.u32 Q0, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(480)]
-// Release input[120] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[252]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[188]: Load as Q3
-vldrw.u32 Q3, [r14, #-256]
-vmul.u32 Q0, Q0, r9
-// input[124]: Load as Q4
-vldrw.u32 Q4, [r0, #496]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #240]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(0)]
-// Release input[252] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-256)]
-// Release input[188] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #128]
-vmul.u32 Q2, Q2, r9
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #64]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #0]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[52]: Load as Q1
-vldrw.u32 Q1, [r0, #208]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(64)]
-// Release input[16] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[36]: Load as Q3
-vldrw.u32 Q3, [r0, #144]
-vmul.u32 Q1, Q1, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #80]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #16]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #224]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(208)]
-// Release input[52] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(144)]
-// Release input[36] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[56]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #160]
-vmul.u32 Q0, Q0, r9
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #96]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #32]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #240]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(96)]
-// Release input[24] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[44]: Load as Q3
-vldrw.u32 Q3, [r0, #176]
-vmul.u32 Q2, Q2, r9
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #112]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #48]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #448]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(176)]
-// Release input[44] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #384]
-vmul.u32 Q1, Q1, r9
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #320]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #256]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #464]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[100]: Load as Q3
-vldrw.u32 Q3, [r0, #400]
-vmul.u32 Q0, Q0, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[68]: Load as Q1
-vldrw.u32 Q1, [r0, #272]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #480]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(400)]
-// Release input[100] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #416]
-vmul.u32 Q2, Q2, r9
-// input[88]: Load as Q4
-vldrw.u32 Q4, [r0, #352]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(272)]
-// Release input[68] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #288]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #496]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(352)]
-// Release input[88] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #432]
-vmul.u32 Q1, Q1, r9
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #368]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #304]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #-304]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(432)]
-// Release input[108] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(368)]
-// Release input[92] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[176]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #-368]
-vmul.u32 Q0, Q0, r9
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #-432]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(304)]
-// Release input[76] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #-496]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #-288]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-432)]
-// Release input[144] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[180]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[164]: Load as Q3
-vldrw.u32 Q3, [r14, #-352]
-vmul.u32 Q2, Q2, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #-416]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #-480]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[184]: Load as Q1
-vldrw.u32 Q1, [r14, #-272]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-288)]
-// Release input[180] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-352)]
-// Release input[164] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q1, Q1, r9
-// input[152]: Load as Q4
-vldrw.u32 Q4, [r14, #-400]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #-464]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #-256]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-272)]
-// Release input[184] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-400)]
-// Release input[152] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[172]: Load as Q3
-vldrw.u32 Q3, [r14, #-320]
-vmul.u32 Q0, Q0, r9
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #-384]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #-448]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #-48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-320)]
-// Release input[172] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-384)]
-// Release input[156] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #-112]
-vmul.u32 Q2, Q2, r9
-// input[208]: Load as Q4
-vldrw.u32 Q4, [r14, #-176]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #-240]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #-32]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-176)]
-// Release input[208] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[228]: Load as Q3
-vldrw.u32 Q3, [r14, #-96]
-vmul.u32 Q1, Q1, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #-160]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #-224]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #-16]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-96)]
-// Release input[228] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #-80]
-vmul.u32 Q0, Q0, r9
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #-144]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-224)]
-// Release input[196] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[200]: Load as Q1
-vldrw.u32 Q1, [r14, #-208]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[236]: Load as Q3
-vldrw.u32 Q3, [r14, #-64]
-vmul.u32 Q2, Q2, r9
-// input[220]: Load as Q4
-vldrw.u32 Q4, [r14, #-128]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-208)]
-// Release input[200] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[204]: Load as Q0
-vldrw.u32 Q0, [r14, #-192]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #48]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-64)]
-// Release input[236] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-128)]
-// Release input[220] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[12]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #32]
-vmul.u32 Q1, Q1, r9
-// input[4]: Load as Q4
-vldrw.u32 Q4, [r0, #16]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-192)]
-// Release input[204] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #0]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[28]: Load as Q0
-vldrw.u32 Q0, [r0, #112]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(48)]
-// Release input[12] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(16)]
-// Release input[4] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[28]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #96]
-vmul.u32 Q0, Q0, r9
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #80]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #64]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[44]: Load as Q2
-vldrw.u32 Q2, [r0, #176]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(112)]
-// Release input[28] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[44]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[40]: Load as Q3
-vldrw.u32 Q3, [r0, #160]
-vmul.u32 Q2, Q2, r9
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #144]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #128]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #240]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(176)]
-// Release input[44] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #224]
-vmul.u32 Q1, Q1, r9
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #208]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(128)]
-// Release input[32] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[48]: Load as Q2
-vldrw.u32 Q2, [r0, #192]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #304]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[76]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #288]
-vmul.u32 Q0, Q0, r9
-// input[68]: Load as Q4
-vldrw.u32 Q4, [r0, #272]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(192)]
-// Release input[48] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #256]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #368]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(288)]
-// Release input[72] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(272)]
-// Release input[68] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[92]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #352]
-vmul.u32 Q2, Q2, r9
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #336]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[80]: Load as Q0
-vldrw.u32 Q0, [r0, #320]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[108]: Load as Q1
-vldrw.u32 Q1, [r0, #432]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(368)]
-// Release input[92] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(336)]
-// Release input[84] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[108]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #416]
-vmul.u32 Q1, Q1, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #400]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r0,#(320)]
-// Release input[80] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[96]: Load as Q2
-vldrw.u32 Q2, [r0, #384]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #496]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(432)]
-// Release input[108] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[124]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #480]
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #464]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(384)]
-// Release input[96] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #448]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #-448]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(496)]
-// Release input[124] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[140]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #-464]
-vmul.u32 Q2, Q2, r9
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #-480]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[128]: Load as Q0
-vldrw.u32 Q0, [r14, #-496]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[156]: Load as Q1
-vldrw.u32 Q1, [r14, #-384]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-480)]
-// Release input[132] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[156]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #-400]
-vmul.u32 Q1, Q1, r9
-// input[148]: Load as Q4
-vldrw.u32 Q4, [r14, #-416]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-496)]
-// Release input[128] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #-432]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[172]: Load as Q0
-vldrw.u32 Q0, [r14, #-320]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-384)]
-// Release input[156] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-416)]
-// Release input[148] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[172]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[168]: Load as Q3
-vldrw.u32 Q3, [r14, #-336]
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #-352]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #-368]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #-256]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-320)]
-// Release input[172] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-336)]
-// Release input[168] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #-272]
-vmul.u32 Q2, Q2, r9
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #-288]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #-304]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #-192]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[204]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[200]: Load as Q3
-vldrw.u32 Q3, [r14, #-208]
-vmul.u32 Q1, Q1, r9
-// input[196]: Load as Q4
-vldrw.u32 Q4, [r14, #-224]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[192]: Load as Q2
-vldrw.u32 Q2, [r14, #-240]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-// input[220]: Load as Q0
-vldrw.u32 Q0, [r14, #-128]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-208)]
-// Release input[200] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-224)]
-// Release input[196] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[220]: Already loaded as Q0
-vqrdmulh.s32 Q1, Q0, r10
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #-144]
-vmul.u32 Q0, Q0, r9
-// input[212]: Load as Q4
-vldrw.u32 Q4, [r14, #-160]
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-240)]
-// Release input[192] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q0, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q2, Q3, r12
-// input[208]: Load as Q1
-vldrw.u32 Q1, [r14, #-176]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q1, Q2
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q1, Q1, Q2
-vqrdmlah.s32 Q5, Q0, r12
-// input[236]: Load as Q2
-vldrw.u32 Q2, [r14, #-64]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-128)]
-// Release input[220] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-160)]
-// Release input[212] from Q4
-vadd.s32 Q1, Q1, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[236]: Already loaded as Q2
-vqrdmulh.s32 Q0, Q2, r10
-// input[232]: Load as Q3
-vldrw.u32 Q3, [r14, #-80]
-vmul.u32 Q2, Q2, r9
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #-96]
-vqrdmlah.s32 Q0, Q2, r12
-vstrw.u32 Q1, [r14,#(-176)]
-// Release input[208] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q2, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q1, Q3, r12
-// input[224]: Load as Q0
-vldrw.u32 Q0, [r14, #-112]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q0, Q1
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q0, Q0, Q1
-vqrdmlah.s32 Q5, Q2, r12
-// input[252]: Load as Q1
-vldrw.u32 Q1, [r14, #0]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-64)]
-// Release input[236] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-80)]
-// Release input[232] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q0, Q0, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[252]: Already loaded as Q1
-vqrdmulh.s32 Q2, Q1, r10
-// input[248]: Load as Q3
-vldrw.u32 Q3, [r14, #-16]
-vmul.u32 Q1, Q1, r9
-// input[244]: Load as Q4
-vldrw.u32 Q4, [r14, #-32]
-vqrdmlah.s32 Q2, Q1, r12
-vstrw.u32 Q0, [r14,#(-112)]
-// Release input[224] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q1, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q0, Q3, r12
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #-48]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q5, Q1, r12
-vqrdmulh.s32 Q0, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(0)]
-// Release input[252] from Q1
-vqrdmlah.s32 Q0, Q4, r12
-vstrw.u32 Q3, [r14,#(-16)]
-// Release input[248] from Q3
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-32)]
-// Release input[244] from Q4
-vadd.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-// Modular inverse of 33556993 mod 2^32 = 375649793
-.equ modulus_inv, 3919317503
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d0-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1995
-// Instruction count: 1557
\ No newline at end of file
diff --git a/tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s b/tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s
deleted file mode 100644
index c57d079..0000000
--- a/tests/saber/auto/ntt_n256_u32_33556993_28678040_incomplete_double.s
+++ /dev/null
@@ -1,2334 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31
-.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31
-.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31
-.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31
-.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31
-.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31
-.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31
-.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31
-.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31
-.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31
-.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31
-.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31
-.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31
-.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31
-.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31
-.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31
-.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31
-.word 3091135847 // zeta^
96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) 
mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 
2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 
2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 
2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_n256_u32_33556993_28678040_incomplete_double, %function -.global ntt_n256_u32_33556993_28678040_incomplete_double -ntt_n256_u32_33556993_28678040_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 
Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, 
[r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release 
input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 
-// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, 
#-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 
-vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 
Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] 
-vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 
-vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 
-vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 
-vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, 
Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd 
r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already 
loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 
Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q1, [r1, #96] -vqrdmulh.s32 Q7, Q1, r6 -vadd.s32 Q3, Q3, Q5 -vmul.u32 Q1, Q1, r5 -/// Twist in[8] by r6 -vstrw.u32 Q3, [r1, #64] -vqrdmlah.s32 Q7, Q1, r12 -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q3, r6 -vsub.s32 Q4, Q2, Q6 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q4, [r1,#32] -vqrdmlah.s32 Q7, Q3, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[8] from Q3 -vqrdmulh.s32 Q7, Q4, r8 -vadd.s32 Q2, Q2, Q6 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q2, [r1], #128 -vqrdmlah.s32 Q7, Q4, r12 -vneg.s32 Q7, Q7 -// Release input[4] from Q4 -vqrdmulh.s32 Q1, Q2, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q2, Q2, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[0] from Q2 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #176] 
-vqrdmulh.s32 Q6, Q3, r8
-vsub.s32 Q0, Q2, Q5
-vmul.u32 Q3, Q3, r7
-vstrw.u32 Q0, [r1, #96]
-vqrdmulh.s32 Q7, Q0, r6
-vadd.s32 Q2, Q2, Q5
-vmul.u32 Q0, Q0, r5
-/// Twist in[24] by r6
-vstrw.u32 Q2, [r1, #64]
-vqrdmlah.s32 Q7, Q0, r12
-// Release input[28] from Q0
-vqrdmlah.s32 Q6, Q3, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q2, r6
-vsub.s32 Q3, Q1, Q6
-vmul.u32 Q2, Q2, r5
-vstrw.u32 Q3, [r1,#32]
-vqrdmlah.s32 Q7, Q2, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[24] from Q2
-vqrdmulh.s32 Q7, Q3, r8
-vadd.s32 Q1, Q1, Q6
-vmul.u32 Q3, Q3, r7
-vstrw.u32 Q1, [r1], #128
-vqrdmlah.s32 Q7, Q3, r12
-vneg.s32 Q7, Q7
-// Release input[20] from Q3
-vqrdmulh.s32 Q0, Q1, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q1, Q1, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q0, [r1,#-112]
-// Release input[16] from Q1
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[44]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r10
-// input[40]: Load as Q1
-vldrw.u32 Q1, [r0, #160]
-vmul.u32 Q4, Q4, r9
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #144]
-vqrdmlah.s32 Q0, Q4, r12
-vqrdmulh.s32 Q3, Q1, r10
-vsub.s32 Q4, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q3, Q1, r12
-// input[32]: Load as Q0
-vldrw.u32 Q0, [r0, #128]
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #240]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[40] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[44] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[40] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[32] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #224] -vmul.u32 Q3, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #208] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #192] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[56] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[60] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[48] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #288] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #272] -vqrdmlah.s32 Q0, 
Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #256] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #368] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[72] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[64] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #352] -vmul.u32 Q3, Q3, r9 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #336] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[88] by r6 -vstrw.u32 Q1, [r1, #64] 
-vqrdmlah.s32 Q7, Q3, r12 -// Release input[92] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[88] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[84] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[80] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #416] -vmul.u32 Q4, Q4, r9 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #400] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #384] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #496] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[104] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[108] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[104] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[100] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, 
r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[96] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #480] -vmul.u32 Q3, Q3, r9 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #464] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #448] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #-448] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[120] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[124] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[120] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[116] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[112] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #-464] -vmul.u32 Q4, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] 
-vqrdmulh.s32 Q5, Q4, r6
-vsub.s32 Q1, Q0, Q3
-vmul.u32 Q4, Q4, r5
-vadd.s32 Q0, Q0, Q3
-vqrdmlah.s32 Q5, Q4, r12
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #-384]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q4, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q4, [r1, #96]
-vqrdmulh.s32 Q7, Q4, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q4, Q4, r5
-/// Twist in[136] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q4, r12
-// Release input[140] from Q4
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q7, Q1, r12
-vstrw.u32 Q7, [r1, #80]
-// Release input[136] from Q1
-vqrdmulh.s32 Q7, Q2, r8
-vadd.s32 Q0, Q0, Q6
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q7, Q2, r12
-vneg.s32 Q7, Q7
-// Release input[132] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q7, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[128] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[156]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r10
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #-400]
-vmul.u32 Q3, Q3, r9
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #-416]
-vqrdmlah.s32 Q0, Q3, r12
-vqrdmulh.s32 Q4, Q1, r10
-vsub.s32 Q3, Q2, Q0
-vmul.u32 Q1, Q1, r9
-vadd.s32 Q2, Q2, Q0
-vqrdmlah.s32 Q4, Q1, r12
-// input[144]: Load as Q0
-vldrw.u32 Q0, [r14, #-432]
-vqrdmulh.s32 Q5, Q3, r6
-vsub.s32 Q1, Q0, Q4
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r12
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #-320]
-vqrdmulh.s32 Q6, Q2, r8
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q3, [r1, #96]
-vqrdmulh.s32 Q7, Q3, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r5
-/// Twist in[152] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q7, Q3, r12
-// Release input[156] from Q3
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q7, Q7
-vstrw.u32 Q7, [r1, #112]
-vqrdmulh.s32 Q7, Q1, r6
-vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[152] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[148] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[144] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #-336] -vmul.u32 Q4, Q4, r9 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #-352] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #-368] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[168] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[172] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[168] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[164] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[160] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, 
[r11], #+8 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vmul.u32 Q3, Q3, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #-192] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[184] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[180] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[176] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vmul.u32 Q4, Q4, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[220]: Load 
as Q3 -vldrw.u32 Q3, [r14, #-128] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[200] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[204] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[200] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[196] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[192] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #-144] -vmul.u32 Q3, Q3, r9 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #-160] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #-176] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #-64] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[216] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[220] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[216] 
from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[208] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vmul.u32 Q4, Q4, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #0] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[232] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[236] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[224] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vmul.u32 Q3, Q3, 
r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -vqrdmulh.s32 Q4, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q6, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[248] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q6, Q3, r12 -// Release input[252] from Q3 -vqrdmlah.s32 Q4, Q2, r12 -vneg.s32 Q6, Q6 -vstrw.u32 Q6, [r1, #112] -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q2, Q0, Q4 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q6, Q1, r12 -vstrw.u32 Q6, [r1, #80] -// Release input[248] from Q1 -vqrdmulh.s32 Q6, Q2, r8 -vadd.s32 Q0, Q0, Q4 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q6, Q6 -// Release input[244] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q6, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[240] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 2302 -// Instruction count: 1848 \ No newline at end of file diff --git a/tests/saber/auto/ntt_u32_33556993_28678040_complete.s b/tests/saber/auto/ntt_u32_33556993_28678040_complete.s deleted file mode 100644 index 0443f24..0000000 --- a/tests/saber/auto/ntt_u32_33556993_28678040_complete.s +++ /dev/null @@ -1,2915 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person 
obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.word 13704133 // zeta^ 2 * 2^31 = 28678040^ 2 * 2^31 -.word 41177999 // zeta^130 * 2^31 = 28678040^130 * 2^31 -.word 26703739 // zeta^ 66 * 2^31 = 28678040^ 66 * 2^31 -.word 65289035 // zeta^194 * 2^31 = 28678040^194 * 2^31 -.word 1666225723 // zeta^ 2 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 2 * 375649793 * 2^31 -.word 2599633521 // zeta^130 * (q^(-1) mod 2^32) * 2^31 = 28678040^130 * 375649793 * 2^31 -.word 2869384837 // zeta^ 66 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 66 * 375649793 * 2^31 -.word 1260434101 // zeta^194 * (q^(-1) mod 2^32) * 2^31 = 28678040^194 * 375649793 * 2^31 -.word 50326315 // zeta^ 1 * 2^31 = 28678040^ 1 * 2^31 -.word 37746191 // zeta^ 65 * 2^31 = 28678040^ 65 * 2^31 -.word 49080301 // zeta^ 33 * 2^31 = 28678040^ 33 * 2^31 -.word 34232193 // zeta^ 97 * 2^31 = 28678040^ 97 * 2^31 -.word 1835254485 // zeta^ 1 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 1 * 375649793 * 2^31 -.word 360751089 // zeta^ 65 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 65 * 375649793 * 2^31 -.word 1200511507 // zeta^ 33 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 33 * 375649793 * 2^31 -.word 553431679 // zeta^ 97 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 97 * 375649793 * 2^31 -.word 22955837 // zeta^129 * 2^31 = 28678040^129 * 2^31 -.word 31411079 // zeta^193 * 2^31 = 28678040^193 * 2^31 -.word 492607 // zeta^161 * 2^31 = 28678040^161 * 2^31 -.word 22217509 // zeta^225 * 2^31 = 28678040^225 * 2^31 -.word 5481609 // zeta^ 34 * 2^31 = 28678040^ 34 * 2^31 -.word 12552175 // zeta^162 * 
2^31 = 28678040^162 * 2^31 -.word 54494203 // zeta^ 98 * 2^31 = 28678040^ 98 * 2^31 -.word 32704019 // zeta^226 * 2^31 = 28678040^226 * 2^31 -.word 949335415 // zeta^ 34 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 34 * 375649793 * 2^31 -.word 3610496529 // zeta^162 * (q^(-1) mod 2^32) * 2^31 = 28678040^162 * 375649793 * 2^31 -.word 1474054661 // zeta^ 98 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 98 * 375649793 * 2^31 -.word 2061350893 // zeta^226 * (q^(-1) mod 2^32) * 2^31 = 28678040^226 * 375649793 * 2^31 -.word 48767307 // zeta^ 17 * 2^31 = 28678040^ 17 * 2^31 -.word 39600285 // zeta^ 81 * 2^31 = 28678040^ 81 * 2^31 -.word 31654617 // zeta^ 49 * 2^31 = 28678040^ 49 * 2^31 -.word 4736231 // zeta^113 * 2^31 = 28678040^113 * 2^31 -.word 2602093749 // zeta^ 17 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 17 * 375649793 * 2^31 -.word 3705004387 // zeta^ 81 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 81 * 375649793 * 2^31 -.word 427128615 // zeta^ 49 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 49 * 375649793 * 2^31 -.word 237814041 // zeta^113 * (q^(-1) mod 2^32) * 2^31 = 28678040^113 * 375649793 * 2^31 -.word 18965555 // zeta^145 * 2^31 = 28678040^145 * 2^31 -.word 50771049 // zeta^209 * 2^31 = 28678040^209 * 2^31 -.word 8794671 // zeta^177 * 2^31 = 28678040^177 * 2^31 -.word 59508707 // zeta^241 * 2^31 = 28678040^241 * 2^31 -.word 43973433 // zeta^ 18 * 2^31 = 28678040^ 18 * 2^31 -.word 14453865 // zeta^146 * 2^31 = 28678040^146 * 2^31 -.word 14937153 // zeta^ 82 * 2^31 = 28678040^ 82 * 2^31 -.word 39701997 // zeta^210 * 2^31 = 28678040^210 * 2^31 -.word 720191175 // zeta^ 18 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 18 * 375649793 * 2^31 -.word 3181088151 // zeta^146 * (q^(-1) mod 2^32) * 2^31 = 28678040^146 * 375649793 * 2^31 -.word 116563391 // zeta^ 82 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 82 * 375649793 * 2^31 -.word 3642323987 // zeta^210 * (q^(-1) mod 2^32) * 2^31 = 28678040^210 * 375649793 * 2^31 -.word 53455571 // zeta^ 9 * 2^31 = 28678040^ 9 * 2^31 -.word 35877127 // zeta^ 73 * 2^31 
= 28678040^ 73 * 2^31 -.word 681755 // zeta^ 41 * 2^31 = 28678040^ 41 * 2^31 -.word 63245537 // zeta^105 * 2^31 = 28678040^105 * 2^31 -.word 4245721901 // zeta^ 9 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 9 * 375649793 * 2^31 -.word 2676675833 // zeta^ 73 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 73 * 375649793 * 2^31 -.word 3480266469 // zeta^ 41 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 41 * 375649793 * 2^31 -.word 1356315935 // zeta^105 * (q^(-1) mod 2^32) * 2^31 = 28678040^105 * 375649793 * 2^31 -.word 11718751 // zeta^137 * 2^31 = 28678040^137 * 2^31 -.word 41885553 // zeta^201 * 2^31 = 28678040^201 * 2^31 -.word 54210213 // zeta^169 * 2^31 = 28678040^169 * 2^31 -.word 16838301 // zeta^233 * 2^31 = 28678040^233 * 2^31 -.word 40841465 // zeta^ 50 * 2^31 = 28678040^ 50 * 2^31 -.word 3577749 // zeta^178 * 2^31 = 28678040^178 * 2^31 -.word 33845545 // zeta^114 * 2^31 = 28678040^114 * 2^31 -.word 19555165 // zeta^242 * 2^31 = 28678040^242 * 2^31 -.word 3459680519 // zeta^ 50 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 50 * 375649793 * 2^31 -.word 495008363 // zeta^178 * (q^(-1) mod 2^32) * 2^31 = 28678040^178 * 375649793 * 2^31 -.word 1885546711 // zeta^114 * (q^(-1) mod 2^32) * 2^31 = 28678040^114 * 375649793 * 2^31 -.word 3630382755 // zeta^242 * (q^(-1) mod 2^32) * 2^31 = 28678040^242 * 375649793 * 2^31 -.word 62758213 // zeta^ 25 * 2^31 = 28678040^ 25 * 2^31 -.word 8005843 // zeta^ 89 * 2^31 = 28678040^ 89 * 2^31 -.word 51922779 // zeta^ 57 * 2^31 = 28678040^ 57 * 2^31 -.word 7245689 // zeta^121 * 2^31 = 28678040^121 * 2^31 -.word 124982459 // zeta^ 25 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 25 * 375649793 * 2^31 -.word 2964460845 // zeta^ 89 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 89 * 375649793 * 2^31 -.word 1042630309 // zeta^ 57 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 57 * 375649793 * 2^31 -.word 3756534407 // zeta^121 * (q^(-1) mod 2^32) * 2^31 = 28678040^121 * 375649793 * 2^31 -.word 30225471 // zeta^153 * 2^31 = 28678040^153 * 2^31 -.word 44151511 // zeta^217 * 2^31 = 
28678040^217 * 2^31 -.word 64890121 // zeta^185 * 2^31 = 28678040^185 * 2^31 -.word 65259669 // zeta^249 * 2^31 = 28678040^249 * 2^31 -.word 12974361 // zeta^ 10 * 2^31 = 28678040^ 10 * 2^31 -.word 41807515 // zeta^138 * 2^31 = 28678040^138 * 2^31 -.word 56379967 // zeta^ 74 * 2^31 = 28678040^ 74 * 2^31 -.word 13380915 // zeta^202 * 2^31 = 28678040^202 * 2^31 -.word 1194393831 // zeta^ 10 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 10 * 375649793 * 2^31 -.word 1648893797 // zeta^138 * (q^(-1) mod 2^32) * 2^31 = 28678040^138 * 375649793 * 2^31 -.word 753806273 // zeta^ 74 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 74 * 375649793 * 2^31 -.word 4010528973 // zeta^202 * (q^(-1) mod 2^32) * 2^31 = 28678040^202 * 375649793 * 2^31 -.word 16772797 // zeta^ 5 * 2^31 = 28678040^ 5 * 2^31 -.word 58675875 // zeta^ 69 * 2^31 = 28678040^ 69 * 2^31 -.word 59974505 // zeta^ 37 * 2^31 = 28678040^ 37 * 2^31 -.word 33980107 // zeta^101 * 2^31 = 28678040^101 * 2^31 -.word 2122281795 // zeta^ 5 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 5 * 375649793 * 2^31 -.word 2886667101 // zeta^ 69 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 69 * 375649793 * 2^31 -.word 3771397783 // zeta^ 37 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 37 * 375649793 * 2^31 -.word 1168207669 // zeta^101 * (q^(-1) mod 2^32) * 2^31 = 28678040^101 * 375649793 * 2^31 -.word 28448893 // zeta^133 * 2^31 = 28678040^133 * 2^31 -.word 24378249 // zeta^197 * 2^31 = 28678040^197 * 2^31 -.word 62687027 // zeta^165 * 2^31 = 28678040^165 * 2^31 -.word 65645595 // zeta^229 * 2^31 = 28678040^229 * 2^31 -.word 52771617 // zeta^ 42 * 2^31 = 28678040^ 42 * 2^31 -.word 23396495 // zeta^170 * 2^31 = 28678040^170 * 2^31 -.word 51483005 // zeta^106 * 2^31 = 28678040^106 * 2^31 -.word 11487943 // zeta^234 * 2^31 = 28678040^234 * 2^31 -.word 2185629407 // zeta^ 42 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 42 * 375649793 * 2^31 -.word 1858377073 // zeta^170 * (q^(-1) mod 2^32) * 2^31 = 28678040^170 * 375649793 * 2^31 -.word 432623747 // zeta^106 * (q^(-1) mod 2^32) 
* 2^31 = 28678040^106 * 375649793 * 2^31 -.word 2290121529 // zeta^234 * (q^(-1) mod 2^32) * 2^31 = 28678040^234 * 375649793 * 2^31 -.word 63287737 // zeta^ 21 * 2^31 = 28678040^ 21 * 2^31 -.word 56338313 // zeta^ 85 * 2^31 = 28678040^ 85 * 2^31 -.word 19445427 // zeta^ 53 * 2^31 = 28678040^ 53 * 2^31 -.word 29167561 // zeta^117 * 2^31 = 28678040^117 * 2^31 -.word 1659340871 // zeta^ 21 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 21 * 375649793 * 2^31 -.word 1504424567 // zeta^ 85 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 85 * 375649793 * 2^31 -.word 3591259981 // zeta^ 53 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 53 * 375649793 * 2^31 -.word 4032612919 // zeta^117 * (q^(-1) mod 2^32) * 2^31 = 28678040^117 * 375649793 * 2^31 -.word 7740335 // zeta^149 * 2^31 = 28678040^149 * 2^31 -.word 23515783 // zeta^213 * 2^31 = 28678040^213 * 2^31 -.word 33583453 // zeta^181 * 2^31 = 28678040^181 * 2^31 -.word 60337403 // zeta^245 * 2^31 = 28678040^245 * 2^31 -.word 35192755 // zeta^ 26 * 2^31 = 28678040^ 26 * 2^31 -.word 36544119 // zeta^154 * 2^31 = 28678040^154 * 2^31 -.word 6787663 // zeta^ 90 * 2^31 = 28678040^ 90 * 2^31 -.word 63484749 // zeta^218 * 2^31 = 28678040^218 * 2^31 -.word 3019374157 // zeta^ 26 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 26 * 375649793 * 2^31 -.word 2777089929 // zeta^154 * (q^(-1) mod 2^32) * 2^31 = 28678040^154 * 375649793 * 2^31 -.word 443777969 // zeta^ 90 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 90 * 375649793 * 2^31 -.word 723799731 // zeta^218 * (q^(-1) mod 2^32) * 2^31 = 28678040^218 * 375649793 * 2^31 -.word 61997615 // zeta^ 13 * 2^31 = 28678040^ 13 * 2^31 -.word 4479011 // zeta^ 77 * 2^31 = 28678040^ 77 * 2^31 -.word 38089877 // zeta^ 45 * 2^31 = 28678040^ 45 * 2^31 -.word 16590903 // zeta^109 * 2^31 = 28678040^109 * 2^31 -.word 201839569 // zeta^ 13 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 13 * 375649793 * 2^31 -.word 998311389 // zeta^ 77 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 77 * 375649793 * 2^31 -.word 1502911851 // zeta^ 45 * (q^(-1) mod 2^32) * 
2^31 = 28678040^ 45 * 375649793 * 2^31 -.word 1931017673 // zeta^109 * (q^(-1) mod 2^32) * 2^31 = 28678040^109 * 375649793 * 2^31 -.word 43852787 // zeta^141 * 2^31 = 28678040^141 * 2^31 -.word 24597857 // zeta^205 * 2^31 = 28678040^205 * 2^31 -.word 43936833 // zeta^173 * 2^31 = 28678040^173 * 2^31 -.word 15636061 // zeta^237 * 2^31 = 28678040^237 * 2^31 -.word 55869129 // zeta^ 58 * 2^31 = 28678040^ 58 * 2^31 -.word 16038683 // zeta^186 * 2^31 = 28678040^186 * 2^31 -.word 43560065 // zeta^122 * 2^31 = 28678040^122 * 2^31 -.word 25949329 // zeta^250 * 2^31 = 28678040^250 * 2^31 -.word 2098944823 // zeta^ 58 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 58 * 375649793 * 2^31 -.word 634278629 // zeta^186 * (q^(-1) mod 2^32) * 2^31 = 28678040^186 * 375649793 * 2^31 -.word 2076204415 // zeta^122 * (q^(-1) mod 2^32) * 2^31 = 28678040^122 * 375649793 * 2^31 -.word 2002629999 // zeta^250 * (q^(-1) mod 2^32) * 2^31 = 28678040^250 * 375649793 * 2^31 -.word 6591765 // zeta^ 29 * 2^31 = 28678040^ 29 * 2^31 -.word 1696249 // zeta^ 93 * 2^31 = 28678040^ 93 * 2^31 -.word 21795289 // zeta^ 61 * 2^31 = 28678040^ 61 * 2^31 -.word 17734591 // zeta^125 * 2^31 = 28678040^125 * 2^31 -.word 3812244715 // zeta^ 29 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 29 * 375649793 * 2^31 -.word 1467340807 // zeta^ 93 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 93 * 375649793 * 2^31 -.word 1570891815 // zeta^ 61 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 61 * 375649793 * 2^31 -.word 1349179969 // zeta^125 * (q^(-1) mod 2^32) * 2^31 = 28678040^125 * 375649793 * 2^31 -.word 66853037 // zeta^157 * 2^31 = 28678040^157 * 2^31 -.word 24930199 // zeta^221 * 2^31 = 28678040^221 * 2^31 -.word 54854635 // zeta^189 * 2^31 = 28678040^189 * 2^31 -.word 39952565 // zeta^253 * 2^31 = 28678040^253 * 2^31 -.word 5623923 // zeta^ 6 * 2^31 = 28678040^ 6 * 2^31 -.word 38701067 // zeta^134 * 2^31 = 28678040^134 * 2^31 -.word 18571677 // zeta^ 70 * 2^31 = 28678040^ 70 * 2^31 -.word 14491707 // zeta^198 * 2^31 = 28678040^198 * 2^31 -.word 
182627725 // zeta^ 6 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 6 * 375649793 * 2^31 -.word 4172670453 // zeta^134 * (q^(-1) mod 2^32) * 2^31 = 28678040^134 * 375649793 * 2^31 -.word 1902166115 // zeta^ 70 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 70 * 375649793 * 2^31 -.word 4183371205 // zeta^198 * (q^(-1) mod 2^32) * 2^31 = 28678040^198 * 375649793 * 2^31 -.word 17941849 // zeta^ 3 * 2^31 = 28678040^ 3 * 2^31 -.word 12982967 // zeta^ 67 * 2^31 = 28678040^ 67 * 2^31 -.word 8061707 // zeta^ 35 * 2^31 = 28678040^ 35 * 2^31 -.word 17774995 // zeta^ 99 * 2^31 = 28678040^ 99 * 2^31 -.word 4091524263 // zeta^ 3 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 3 * 375649793 * 2^31 -.word 2462649161 // zeta^ 67 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 67 * 375649793 * 2^31 -.word 2874632949 // zeta^ 35 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 35 * 375649793 * 2^31 -.word 2009367661 // zeta^ 99 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 99 * 375649793 * 2^31 -.word 61107981 // zeta^131 * 2^31 = 28678040^131 * 2^31 -.word 38975641 // zeta^195 * 2^31 = 28678040^195 * 2^31 -.word 40352225 // zeta^163 * 2^31 = 28678040^163 * 2^31 -.word 49569327 // zeta^227 * 2^31 = 28678040^227 * 2^31 -.word 26799603 // zeta^ 38 * 2^31 = 28678040^ 38 * 2^31 -.word 33463463 // zeta^166 * 2^31 = 28678040^166 * 2^31 -.word 39332725 // zeta^102 * 2^31 = 28678040^102 * 2^31 -.word 61125067 // zeta^230 * 2^31 = 28678040^230 * 2^31 -.word 583438349 // zeta^ 38 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 38 * 375649793 * 2^31 -.word 1692658009 // zeta^166 * (q^(-1) mod 2^32) * 2^31 = 28678040^166 * 375649793 * 2^31 -.word 1738958475 // zeta^102 * (q^(-1) mod 2^32) * 2^31 = 28678040^102 * 375649793 * 2^31 -.word 2248227893 // zeta^230 * (q^(-1) mod 2^32) * 2^31 = 28678040^230 * 375649793 * 2^31 -.word 40014327 // zeta^ 19 * 2^31 = 28678040^ 19 * 2^31 -.word 562885 // zeta^ 83 * 2^31 = 28678040^ 83 * 2^31 -.word 51009393 // zeta^ 51 * 2^31 = 28678040^ 51 * 2^31 -.word 51995259 // zeta^115 * 2^31 = 28678040^115 * 2^31 -.word 
2564101129 // zeta^ 19 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 19 * 375649793 * 2^31 -.word 2196183867 // zeta^ 83 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 83 * 375649793 * 2^31 -.word 2252083855 // zeta^ 51 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 51 * 375649793 * 2^31 -.word 4038290309 // zeta^115 * (q^(-1) mod 2^32) * 2^31 = 28678040^115 * 375649793 * 2^31 -.word 24330211 // zeta^147 * 2^31 = 28678040^147 * 2^31 -.word 7682101 // zeta^211 * 2^31 = 28678040^211 * 2^31 -.word 7401943 // zeta^179 * 2^31 = 28678040^179 * 2^31 -.word 41757453 // zeta^243 * 2^31 = 28678040^243 * 2^31 -.word 65375453 // zeta^ 22 * 2^31 = 28678040^ 22 * 2^31 -.word 40797001 // zeta^150 * 2^31 = 28678040^150 * 2^31 -.word 59835311 // zeta^ 86 * 2^31 = 28678040^ 86 * 2^31 -.word 32875577 // zeta^214 * 2^31 = 28678040^214 * 2^31 -.word 4014413091 // zeta^ 22 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 22 * 375649793 * 2^31 -.word 3224262327 // zeta^150 * (q^(-1) mod 2^32) * 2^31 = 28678040^150 * 375649793 * 2^31 -.word 741855825 // zeta^ 86 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 86 * 375649793 * 2^31 -.word 2318439879 // zeta^214 * (q^(-1) mod 2^32) * 2^31 = 28678040^214 * 375649793 * 2^31 -.word 10045293 // zeta^ 11 * 2^31 = 28678040^ 11 * 2^31 -.word 53076657 // zeta^ 75 * 2^31 = 28678040^ 75 * 2^31 -.word 17896617 // zeta^ 43 * 2^31 = 28678040^ 43 * 2^31 -.word 58413331 // zeta^107 * 2^31 = 28678040^107 * 2^31 -.word 3080518291 // zeta^ 11 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 11 * 375649793 * 2^31 -.word 3700229967 // zeta^ 75 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 75 * 375649793 * 2^31 -.word 297370967 // zeta^ 43 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 43 * 375649793 * 2^31 -.word 2151902445 // zeta^107 * (q^(-1) mod 2^32) * 2^31 = 28678040^107 * 375649793 * 2^31 -.word 19472551 // zeta^139 * 2^31 = 28678040^139 * 2^31 -.word 6043561 // zeta^203 * 2^31 = 28678040^203 * 2^31 -.word 20934449 // zeta^171 * 2^31 = 28678040^171 * 2^31 -.word 37620445 // zeta^235 * 2^31 = 28678040^235 * 2^31 -.word 
12921459 // zeta^ 54 * 2^31 = 28678040^ 54 * 2^31 -.word 63769677 // zeta^182 * 2^31 = 28678040^182 * 2^31 -.word 61505033 // zeta^118 * 2^31 = 28678040^118 * 2^31 -.word 65692461 // zeta^246 * 2^31 = 28678040^246 * 2^31 -.word 1006064525 // zeta^ 54 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 54 * 375649793 * 2^31 -.word 2459563443 // zeta^182 * (q^(-1) mod 2^32) * 2^31 = 28678040^182 * 375649793 * 2^31 -.word 2747128823 // zeta^118 * (q^(-1) mod 2^32) * 2^31 = 28678040^118 * 375649793 * 2^31 -.word 2288082643 // zeta^246 * (q^(-1) mod 2^32) * 2^31 = 28678040^246 * 375649793 * 2^31 -.word 20171011 // zeta^ 27 * 2^31 = 28678040^ 27 * 2^31 -.word 36495001 // zeta^ 91 * 2^31 = 28678040^ 91 * 2^31 -.word 62685175 // zeta^ 59 * 2^31 = 28678040^ 59 * 2^31 -.word 664745 // zeta^123 * 2^31 = 28678040^123 * 2^31 -.word 1031427325 // zeta^ 27 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 27 * 375649793 * 2^31 -.word 2764118887 // zeta^ 91 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 91 * 375649793 * 2^31 -.word 583476745 // zeta^ 59 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 59 * 375649793 * 2^31 -.word 2371908951 // zeta^123 * (q^(-1) mod 2^32) * 2^31 = 28678040^123 * 375649793 * 2^31 -.word 56713759 // zeta^155 * 2^31 = 28678040^155 * 2^31 -.word 59594509 // zeta^219 * 2^31 = 28678040^219 * 2^31 -.word 41235703 // zeta^187 * 2^31 = 28678040^187 * 2^31 -.word 11581499 // zeta^251 * 2^31 = 28678040^251 * 2^31 -.word 23458751 // zeta^ 14 * 2^31 = 28678040^ 14 * 2^31 -.word 9406759 // zeta^142 * 2^31 = 28678040^142 * 2^31 -.word 33711991 // zeta^ 78 * 2^31 = 28678040^ 78 * 2^31 -.word 32167773 // zeta^206 * 2^31 = 28678040^206 * 2^31 -.word 1501790785 // zeta^ 14 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 14 * 375649793 * 2^31 -.word 2911894745 // zeta^142 * (q^(-1) mod 2^32) * 2^31 = 28678040^142 * 375649793 * 2^31 -.word 1905016457 // zeta^ 78 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 78 * 375649793 * 2^31 -.word 204130979 // zeta^206 * (q^(-1) mod 2^32) * 2^31 = 28678040^206 * 375649793 * 2^31 -.word 
26043621 // zeta^ 7 * 2^31 = 28678040^ 7 * 2^31 -.word 51942461 // zeta^ 71 * 2^31 = 28678040^ 71 * 2^31 -.word 14401009 // zeta^ 39 * 2^31 = 28678040^ 39 * 2^31 -.word 60574133 // zeta^103 * 2^31 = 28678040^103 * 2^31 -.word 1827638555 // zeta^ 7 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 7 * 375649793 * 2^31 -.word 3437088195 // zeta^ 71 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 71 * 375649793 * 2^31 -.word 2892737551 // zeta^ 39 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 39 * 375649793 * 2^31 -.word 3197159499 // zeta^103 * (q^(-1) mod 2^32) * 2^31 = 28678040^103 * 375649793 * 2^31 -.word 16031087 // zeta^135 * 2^31 = 28678040^135 * 2^31 -.word 25566271 // zeta^199 * 2^31 = 28678040^199 * 2^31 -.word 54040269 // zeta^167 * 2^31 = 28678040^167 * 2^31 -.word 36895029 // zeta^231 * 2^31 = 28678040^231 * 2^31 -.word 41803191 // zeta^ 46 * 2^31 = 28678040^ 46 * 2^31 -.word 19377381 // zeta^174 * 2^31 = 28678040^174 * 2^31 -.word 9664027 // zeta^110 * 2^31 = 28678040^110 * 2^31 -.word 55794235 // zeta^238 * 2^31 = 28678040^238 * 2^31 -.word 2460960841 // zeta^ 46 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 46 * 375649793 * 2^31 -.word 1411728667 // zeta^174 * (q^(-1) mod 2^32) * 2^31 = 28678040^174 * 375649793 * 2^31 -.word 1300076517 // zeta^110 * (q^(-1) mod 2^32) * 2^31 = 28678040^110 * 375649793 * 2^31 -.word 3978752965 // zeta^238 * (q^(-1) mod 2^32) * 2^31 = 28678040^238 * 375649793 * 2^31 -.word 19675339 // zeta^ 23 * 2^31 = 28678040^ 23 * 2^31 -.word 21359151 // zeta^ 87 * 2^31 = 28678040^ 87 * 2^31 -.word 63140729 // zeta^ 55 * 2^31 = 28678040^ 55 * 2^31 -.word 23160723 // zeta^119 * 2^31 = 28678040^119 * 2^31 -.word 398439733 // zeta^ 23 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 23 * 375649793 * 2^31 -.word 897838033 // zeta^ 87 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 87 * 375649793 * 2^31 -.word 494618247 // zeta^ 55 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 55 * 375649793 * 2^31 -.word 3040761453 // zeta^119 * (q^(-1) mod 2^32) * 2^31 = 28678040^119 * 375649793 * 2^31 -.word 
9258847 // zeta^151 * 2^31 = 28678040^151 * 2^31 -.word 4669959 // zeta^215 * 2^31 = 28678040^215 * 2^31 -.word 41266143 // zeta^183 * 2^31 = 28678040^183 * 2^31 -.word 61464071 // zeta^247 * 2^31 = 28678040^247 * 2^31 -.word 43355169 // zeta^ 30 * 2^31 = 28678040^ 30 * 2^31 -.word 5591977 // zeta^158 * 2^31 = 28678040^158 * 2^31 -.word 40694335 // zeta^ 94 * 2^31 = 28678040^ 94 * 2^31 -.word 25071607 // zeta^222 * 2^31 = 28678040^222 * 2^31 -.word 1107279327 // zeta^ 30 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 30 * 375649793 * 2^31 -.word 552289879 // zeta^158 * (q^(-1) mod 2^32) * 2^31 = 28678040^158 * 375649793 * 2^31 -.word 879592385 // zeta^ 94 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 94 * 375649793 * 2^31 -.word 2040862217 // zeta^222 * (q^(-1) mod 2^32) * 2^31 = 28678040^222 * 375649793 * 2^31 -.word 34737117 // zeta^ 15 * 2^31 = 28678040^ 15 * 2^31 -.word 45994147 // zeta^ 79 * 2^31 = 28678040^ 79 * 2^31 -.word 42273719 // zeta^ 47 * 2^31 = 28678040^ 47 * 2^31 -.word 60428681 // zeta^111 * 2^31 = 28678040^111 * 2^31 -.word 303076899 // zeta^ 15 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 15 * 375649793 * 2^31 -.word 3854339421 // zeta^ 79 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 79 * 375649793 * 2^31 -.word 3799259721 // zeta^ 47 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 47 * 375649793 * 2^31 -.word 1636911223 // zeta^111 * (q^(-1) mod 2^32) * 2^31 = 28678040^111 * 375649793 * 2^31 -.word 26028927 // zeta^143 * 2^31 = 28678040^143 * 2^31 -.word 64083527 // zeta^207 * 2^31 = 28678040^207 * 2^31 -.word 60382541 // zeta^175 * 2^31 = 28678040^175 * 2^31 -.word 31337387 // zeta^239 * 2^31 = 28678040^239 * 2^31 -.word 27553395 // zeta^ 62 * 2^31 = 28678040^ 62 * 2^31 -.word 7648471 // zeta^190 * 2^31 = 28678040^190 * 2^31 -.word 689375 // zeta^126 * 2^31 = 28678040^126 * 2^31 -.word 46555773 // zeta^254 * 2^31 = 28678040^254 * 2^31 -.word 1673531277 // zeta^ 62 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 62 * 375649793 * 2^31 -.word 1889513769 // zeta^190 * (q^(-1) mod 2^32) * 2^31 
= 28678040^190 * 375649793 * 2^31 -.word 1477062945 // zeta^126 * (q^(-1) mod 2^32) * 2^31 = 28678040^126 * 375649793 * 2^31 -.word 2252242819 // zeta^254 * (q^(-1) mod 2^32) * 2^31 = 28678040^254 * 375649793 * 2^31 -.word 15797163 // zeta^ 31 * 2^31 = 28678040^ 31 * 2^31 -.word 40170027 // zeta^ 95 * 2^31 = 28678040^ 95 * 2^31 -.word 10866061 // zeta^ 63 * 2^31 = 28678040^ 63 * 2^31 -.word 56298001 // zeta^127 * 2^31 = 28678040^127 * 2^31 -.word 683123285 // zeta^ 31 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 31 * 375649793 * 2^31 -.word 2755967957 // zeta^ 95 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 95 * 375649793 * 2^31 -.word 273527923 // zeta^ 63 * (q^(-1) mod 2^32) * 2^31 = 28678040^ 63 * 375649793 * 2^31 -.word 644194287 // zeta^127 * (q^(-1) mod 2^32) * 2^31 = 28678040^127 * 375649793 * 2^31 -.word 50400667 // zeta^159 * 2^31 = 28678040^159 * 2^31 -.word 33861863 // zeta^223 * 2^31 = 28678040^223 * 2^31 -.word 53736885 // zeta^191 * 2^31 = 28678040^191 * 2^31 -.word 31774129 // zeta^255 * 2^31 = 28678040^255 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_u32_33556993_28678040, %function -.global ntt_u32_33556993_28678040 -ntt_u32_33556993_28678040: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, 
[r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 
-// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, 
r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// 
Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// 
input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, 
[r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, 
r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 
-vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// 
input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 
Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 
Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, 
[r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// 
Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from 
Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 
-vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #144] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #224] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #208] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release 
input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #288] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #272] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #368] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #352] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #432] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, 
[r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #-448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 
-vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #-480] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #-384] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #-432] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #-320] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #-352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #-368] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #-288] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, 
[r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #-240] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #-144] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, 
Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q2, Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #-96] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #-32] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, 
Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vldrw.s32 Q5, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q5 -vldrw.s32 Q6, [r11, #-64] -vmul.u32 Q3, Q3, Q6 -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q5, Q2, Q5 -vsub.s32 Q7, Q1, Q4 -vmul.u32 Q2, Q2, Q6 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q5 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q5 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q5, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q6, Q5, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q5, Q7, Q5 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q7, Q7, Q6 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q5, Q7, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
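The `vqrdmulh` / `vmul` / `vqrdmlah` triple used throughout the butterflies above is a Montgomery-style modular multiplication: each twiddle `b` is paired with a precomputed (or, in the on-the-fly variant, `vmul`-computed) partner `b * (-q^-1) mod 2^32`, and `.equ modulus_inv, 3919317503` is exactly `2^32 - 375649793`, i.e. the negated inverse quoted in the comment above it. The two rounding errors cancel, so the result is congruent to `a*b*2^-31 mod q`. A minimal Python sketch of the arithmetic (an illustration of the technique, not the register-level code; it assumes the documented round-to-nearest semantics of the MVE instructions and does not model saturation):

```python
# Sketch of the vqrdmulh/vmul/vqrdmlah Montgomery multiplication used above.
# Assumptions: q = 33556993 as in the code, round-to-nearest (ties up)
# semantics for vqrdmulh/vqrdmlah, saturation ignored.

Q = 33556993                  # NTT-friendly prime used in this file
QINV = pow(Q, -1, 2**32)      # 375649793, as stated in the comment above

def s32(x):
    """Reduce x to its signed 32-bit representative."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

def vqrdmulh(a, b):
    """Doubling, rounding high-half multiply (vqrdmulh.s32)."""
    return (2 * a * b + 2**31) >> 32

def montmul(a, b):
    """Return a value congruent to a*b*2^-31 mod Q."""
    b_twisted = s32(-b * QINV)     # precomputed alongside each twiddle b
    hi = vqrdmulh(a, b)            # vqrdmulh.s32 Qd, Qa, r_twiddle
    lo = s32(a * b_twisted)        # vmul.u32 (low 32 bits only)
    return hi + vqrdmulh(lo, Q)    # vqrdmlah.s32 Qd, Qlo, r12 (r12 = q)

a, b = 12345, 67890
r = montmul(a, b)
assert (r - a * b * pow(2, -31, Q)) % Q == 0
```

The cancellation works because `lo * q ≡ -a*b (mod 2^32)`, so the low halves of the two doubled products sum to a multiple of `2^32` and the two rounding increments add up to exactly the carry needed.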
-vsub.s32 Q7, Q4, Q5 -vstrw.s32 Q7, [r0, #-80] -vadd.s32 Q4, Q4, Q5 -// Butterfly [0, 1, 2, 3] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [16, 17, 18, 19] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [32, 33, 34, 35] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [48, 49, 50, 51] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [64, 65, 66, 67] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [80, 81, 82, 83] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [96, 97, 98, 99] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [112, 113, 114, 115] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [128, 129, 130, 131] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [144, 145, 146, 147] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [160, 161, 162, 163] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [176, 177, 178, 179] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [192, 193, 194, 195] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q4, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q4, Q4, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q4, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-80] -vadd.s32 Q5, Q5, Q6 -// Butterfly [208, 209, 210, 211] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q4, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q5, [r0, #-96] -vqrdmlah.s32 Q4, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q5, Q1, Q4 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q4, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vld41.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmulh.s32 Q6, Q5, Q6 -vld40.s32 {Q0,Q1,Q2,Q3}, [r0] -vmul.u32 Q5, Q5, Q7 -vld42.s32 {Q0,Q1,Q2,Q3}, [r0] -vqrdmlah.s32 Q6, Q5, r12 -vld43.s32 {Q0,Q1,Q2,Q3}, [r0]! 
-vsub.s32 Q5, Q4, Q6 -vstrw.s32 Q5, [r0, #-80] -vadd.s32 Q4, Q4, Q6 -// Butterfly [224, 225, 226, 227] -vldrw.s32 Q6, [r11], #80 -vqrdmulh.s32 Q5, Q3, Q6 -vldrw.s32 Q7, [r11, #-64] -vmul.u32 Q3, Q3, Q7 -vstrw.s32 Q4, [r0, #-96] -vqrdmlah.s32 Q5, Q3, r12 -vldrw.s32 Q3, [r11, #-48] -vqrdmulh.s32 Q6, Q2, Q6 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, Q7 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q6, Q2, r12 -vldrw.s32 Q2, [r11, #-32] -vqrdmulh.s32 Q3, Q1, Q3 -vsub.s32 Q5, Q0, Q6 -vmul.u32 Q1, Q1, Q2 -vadd.s32 Q0, Q0, Q6 -vqrdmlah.s32 Q3, Q1, r12 -vldrw.s32 Q6, [r11, #-16] -vsub.s32 Q1, Q0, Q3 -vstrw.s32 Q1, [r0,#-48] -vadd.s32 Q0, Q0, Q3 -vstrw.s32 Q0, [r0, #-64] -vmul.u32 Q7, Q6, r10 -vqrdmulh.s32 Q6, Q4, Q6 -vmul.u32 Q4, Q4, Q7 -vqrdmlah.s32 Q6, Q4, r12 -vsub.s32 Q4, Q5, Q6 -vstrw.s32 Q4, [r0, #-16] -vadd.s32 Q5, Q5, Q6 -vstrw.s32 Q5, [r0, #-32] -// Butterfly [240, 241, 242, 243] -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 2883 -// Instruction count: 2429 \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_16_anticyclic_mve_simd.s b/tests/saber/auto/poly_u16_mul_16_anticyclic_mve_simd.s deleted file mode 100644 index 51e731a..0000000 --- a/tests/saber/auto/poly_u16_mul_16_anticyclic_mve_simd.s +++ /dev/null @@ -1,108 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all 
-/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_16_anticyclic_mve_simd, %function -.global poly_u16_mul_16_anticyclic_mve_simd -poly_u16_mul_16_anticyclic_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -vmov.u16 Q2, #0 -mov r12, #0 -vldrh.u16 Q3, [r2, #0] -vldrh.u16 Q4, [r2, #16] -vneg.s16 Q5, Q3 -ldrd r14, r11, [r1, #8] -ldrd r10, r9, [r1, #24] -vmul.u16 Q0, Q4, r14 -vmla.s16 Q0, Q3, r10 -vmul.u16 Q1, Q4, r10 -vmla.s16 Q1, Q5, r14 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -ldrd r8, r7, [r1, #0] -ldrd r6, r5, [r1, #16] -vmla.s16 Q0, Q4, r7 -vmla.s16 Q0, Q3, r5 -vmla.s16 Q1, Q4, r5 -vmla.s16 Q1, Q5, r7 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vmla.s16 Q0, Q4, r8 -vmla.s16 Q0, Q3, r6 -vmla.s16 Q1, Q4, r6 -vmla.s16 Q1, Q5, r8 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vneg.s16 Q5, Q4 -vmla.s16 Q0, Q3, r11 -vmla.s16 Q0, Q5, r9 -vmla.s16 Q1, Q4, r11 -vmla.s16 Q1, Q3, r9 -vsub.u16 Q0, Q0, Q2 -vmov.u16 Q2, #0 -vshlc Q0, r12, #16 -vshlc Q1, r12, #16 -vshlc Q2, r12, #16 -asrl r14, r11, #16 -asrl r10, r9, #16 -vmla.s16 Q0, Q3, r14 -vmla.s16 Q0, Q5, r10 -vmla.s16 Q1, Q4, r14 -vmla.s16 Q1, Q3, r10 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -asrl r8, r7, #16 -asrl r6, r5, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q0, Q5, r5 -vmla.s16 Q1, Q4, r7 -vmla.s16 Q1, Q3, r5 -vshlc Q0, r12, #32 
-vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vmla.s16 Q0, Q3, r8 -vmla.s16 Q0, Q5, r6 -vmla.s16 Q1, Q4, r8 -vmla.s16 Q1, Q3, r6 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -neg r9, r9 -vmla.s16 Q0, Q3, r9 -vmla.s16 Q0, Q5, r11 -vmla.s16 Q1, Q4, r9 -vmla.s16 Q1, Q3, r11 -vsub.u16 Q0, Q0, Q2 -vstrh.u16 Q0, [r0,#(0)] -vstrh.u16 Q1, [r0,#(16)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s b/tests/saber/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s deleted file mode 100644 index 3d06c04..0000000 --- a/tests/saber/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s +++ /dev/null @@ -1,425 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_16_anticyclic_opt_mve_simd, %function -.global poly_u16_mul_16_anticyclic_opt_mve_simd -poly_u16_mul_16_anticyclic_opt_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -vldrh.u16 Q2, [r0, #0] -vldrh.u16 Q3, [r0, #16] -ldrd r10, r11, [r1, #0] -ldrd r8, r9, [r1, #16] -ldrd r6, r7, [r1, #24] -vldrh.u16 Q0, [r2, #0] -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2, #16] -vmla.s16 Q3, Q1, r6 -ldrd r4, r5, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r9 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r11 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r7 -asrl r4, r5, #16 -asrl r6, r7, #16 -asrl r10, r11, #16 -asrl r8, r9, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r5 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r11 -vmla.s16 Q3, Q0, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r8 -vstrh.u16 Q3, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -mov r14, #30 -wls r14, r14, loop_end -loop_start: -mov r11, r11 -mov r11, r11 -mov r11, r11 -vldrh.u16 Q5, [r0, #0] -vldrh.u16 Q4, [r0, #16] -ldrd r10, r9, [r1, #0] -ldrd r8, r7, [r1, #16] 
-ldrd r6, r5, [r1, #24] -vldrh.u16 Q7, [r2, #0] -vmla.s16 Q5, Q7, r6 -vldrh.u16 Q6, [r2, #16] -vmla.s16 Q4, Q6, r6 -ldrd r4, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q5, Q6, r4 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r4 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r8 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r10 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r5 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r3 -vsub.u16 Q5, Q5, Q3 -vmla.s16 Q4, Q7, r5 -asrl r4, r3, #16 -asrl r6, r5, #16 -asrl r10, r9, #16 -asrl r8, r7, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r5 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r3 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r5 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r4 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r6 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r4 -vmla.s16 Q4, Q7, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r9 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0,#(0)] -vmla.s16 Q4, Q7, r8 -vstrh.u16 Q4, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -vldrh.u16 Q2, [r0, #0] -vldrh.u16 Q3, [r0, #16] -ldrd r10, r9, [r1, #0] -ldrd r8, r7, [r1, #16] -ldrd r6, r5, [r1, #24] -vldrh.u16 Q0, [r2, #0] -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2, #16] -vmla.s16 Q3, Q1, r6 -ldrd r4, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r9 -vshlc Q2, r12, #32 -vmla.s16 
Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r5 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r5 -asrl r4, r3, #16 -asrl r6, r5, #16 -asrl r10, r9, #16 -asrl r8, r7, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r5 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r3 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r9 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r8 -vstrh.u16 Q3, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -le r14, loop_start -loop_end: -vldrh.u16 Q5, [r0, #0] -vldrh.u16 Q4, [r0, #16] -ldrd r14, r9, [r1, #0] -ldrd r10, r7, [r1, #16] -ldrd r8, r5, [r1, #24] -vldrh.u16 Q7, [r2, #0] -vmla.s16 Q5, Q7, r8 -vldrh.u16 Q6, [r2, #16] -vmla.s16 Q4, Q6, r8 -ldrd r6, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q5, Q6, r6 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r14 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r5 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r3 -vsub.u16 Q5, Q5, Q3 -vmla.s16 Q4, Q7, r5 -asrl r6, r3, #16 -asrl r8, r5, #16 -asrl r14, r9, #16 -asrl r10, r7, 
#16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r5 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r3 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r5 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r6 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r6 -vmla.s16 Q4, Q7, r8 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r9 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r14 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0,#(0)] -vmla.s16 Q4, Q7, r10 -vstrh.u16 Q4, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -vldrh.u16 Q2, [r0, #0] -vldrh.u16 Q3, [r0, #16] -ldrd r14, r9, [r1, #0] -ldrd r10, r7, [r1, #16] -ldrd r8, r5, [r1, #24] -vldrh.u16 Q0, [r2, #0] -vmla.s16 Q2, Q0, r8 -vldrh.u16 Q1, [r2, #16] -vmla.s16 Q3, Q1, r8 -ldrd r6, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r6 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r14 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r5 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r5 -asrl r6, r3, #16 -asrl r8, r5, #16 -asrl r14, r9, #16 -asrl r10, r7, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r5 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r3 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r6 -vmla.s16 Q3, Q0, r8 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 
-vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r9 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r14 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r10 -vstrh.u16 Q3, [r0,#(16)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_16_mve_simd.s b/tests/saber/auto/poly_u16_mul_16_mve_simd.s deleted file mode 100644 index 3f1d6f9..0000000 --- a/tests/saber/auto/poly_u16_mul_16_mve_simd.s +++ /dev/null @@ -1,178 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_16_mve_simd, %function -.global poly_u16_mul_16_mve_simd -poly_u16_mul_16_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q0, [r2, #0] -vmul.u16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vmul.u16 Q1, Q0, r11 -vldrh.u16 Q2, [r2, #16] -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vmul.u16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vmov.u16 Q1, #0 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 
-vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_256_toom4_mve.s b/tests/saber/auto/poly_u16_mul_256_toom4_mve.s deleted file mode 100644 index 944c900..0000000 --- a/tests/saber/auto/poly_u16_mul_256_toom4_mve.s +++ /dev/null @@ -1,1287 +0,0 @@ -.syntax unified -.type poly_u16_mul_64_C, %function -.global poly_u16_mul_64_C -.syntax unified -.type poly_u16_mul_256_toom4_mve, %function -.global poly_u16_mul_256_toom4_mve -poly_u16_mul_256_toom4_mve: -push {r4-r11,lr} -vpush {d0-d15} -sub sp, sp, #1792 -add 
sp, sp, #504 -add r14, sp, #1008 -add r1, r1, #504 -add r2, r2, #504 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vldrw.u32 Q0, [r1, #-504] -vldrw.u32 Q1, [r1, #-376] -vldrw.u32 Q2, [r1, #-248] -vldrw.u32 Q3, [r1, #-120] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(264)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r1, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-360] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #-232] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #-104] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-488)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-248)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(280)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r1, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-344] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #-216] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #-88] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-472)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-232)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(296)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r1, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-328] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #-200] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #-72] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-456)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-216)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(312)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r1, 
#-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #-312] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #-184] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #-56] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-440)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-200)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(328)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r1, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-296] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #-168] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #-40] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-424)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-184)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(344)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r1, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-280] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #-152] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #-24] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-408)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-168)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(360)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r1, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-264] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #-136] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #-8] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-392)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-152)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(376)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, 
[r2, #-504] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-376] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #-248] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #-120] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-376)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-136)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(392)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r2, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-360] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #-232] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #-104] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-360)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-120)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(408)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r2, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-344] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #-216] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #-88] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-344)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-104)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(424)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r2, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-328] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #-200] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #-72] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-328)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-88)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(440)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 
-vldrw.u32 Q0, [r2, #-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-312] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #-184] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #-56] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-312)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-72)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(456)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r2, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-296] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #-168] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #-40] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-296)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-56)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(472)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r2, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-280] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #-152] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #-24] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-280)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-40)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(488)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r2, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-264] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #-136] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #-8] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-264)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-24)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(248)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(504)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, 
r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-248)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-264)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-8)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #256 -add r10, r2, #256 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(1024) -add r2, sp, #(1152) -add r0, sp, #(1280) -bl poly_u16_mul_64_C -add r1, sp, #(768) -add r2, sp, #(896) -add r0, sp, #(1024) -bl poly_u16_mul_64_C -add r1, sp, #(512) -add r2, sp, #(640) -add r0, sp, #(768) -bl poly_u16_mul_64_C -add r1, sp, #(256) -add r2, sp, #(384) -add r0, sp, #(512) -bl poly_u16_mul_64_C -add r1, sp, #(0) -add r2, sp, #(128) -add r0, sp, #(256) -bl poly_u16_mul_64_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_64_C -add r1, r11, #(128) -add r2, r10, #(128) -add r0, sp, #(1536) -bl poly_u16_mul_64_C -add sp, sp, #504 -add r14, sp, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r14, #-232] -vldrw.u32 Q1, [sp, #8] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #-248] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #-488] -vldrw.u32 Q4, [sp, #264] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #-216] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [sp, #280] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [sp, #24] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [sp,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [sp,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 
Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-472] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #296] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #40] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-456] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #56] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #312] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #56] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-440] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-456] -vshr.u16 Q5, Q5, 
#1 -vldrw.u32 Q4, [r14, #72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #328] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #72] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-424] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #344] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #88] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-408] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 
-vldrw.u32 Q4, [r14, #-136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #360] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #104] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-392] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #376] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #120] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-376] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #136] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #392] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, 
[sp,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #136] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-360] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #152] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-88] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-248] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-248)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #408] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #264] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-232] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #152] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-488] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #8] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(8)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #24] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-344] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #168] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-72] -vadd.u16 Q1, 
Q1, Q0 -vldrw.u32 Q7, [sp, #-232] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-232)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #424] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #280] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-216] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #168] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #-472] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #24] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(24)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #40] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-328] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #184] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-216] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-216)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #440] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #296] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-200] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #184] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-456] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #40] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(40)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 
Q0, [r14, #56] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-312] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #200] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-40] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-200] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #456] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #312] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #200] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #-440] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #56] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(56)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #72] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-296] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #216] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-24] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-184] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-184)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #472] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #328] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 
-vldrw.u32 Q7, [r14, #-168] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #216] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-424] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #72] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(72)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #88] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-280] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #232] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-168] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-168)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #488] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #344] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-152] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #232] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #-408] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #88] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(88)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #104] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-264] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #248] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 
Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-152] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-152)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #504] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #360] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-136] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #248] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-392] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #104] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(104)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #120] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-248] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #264] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #-136] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(-136)] -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q3, [sp, #376] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #-120] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r14, #-376] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r14,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q1, [sp, #120] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, 
[sp,#(120)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #136] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(136)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, 
[r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -add sp, sp, #1792 -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s b/tests/saber/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s deleted file mode 100644 index 63a5849..0000000 --- 
a/tests/saber/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s
+++ /dev/null
@@ -1,773 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd, %function
-.global poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd
-poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd:
-push {r4-r11,lr}
-vpush {d0-d15}
-sub sp, sp, #224
-vld20.u16 {Q4, Q5}, [r2]
-vld21.u16 {Q4, Q5}, [r2]!
-vld20.u16 {Q6, Q7}, [r2]
-vld21.u16 {Q6, Q7}, [r2]!
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add sp, sp, #224 -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s b/tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s deleted file mode 100644 index afb4552..0000000 --- a/tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s +++ /dev/null @@ -1,268 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// 
in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd, %function
-.global poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd
-poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd:
-push {r4-r11,lr}
-vpush {d0-d15}
-nop
-nop
-nop
-nop
-nop
-nop
-vld20.u16 {Q4, Q5}, [r2]
-vld21.u16 {Q4, Q5}, [r2]!
-vld20.u16 {Q6, Q7}, [r2]
-vld21.u16 {Q6, Q7}, [r2]!
-vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #24] -vmul.u16 Q0, Q4, r11 -ldrd r10, r9, [r1, #56] -vmul.u16 Q1, Q4, r9 -vneg.s16 Q3, Q6 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #16] -vmla.s16 Q1, Q6, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #48] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #8] -vmla.s16 Q1, Q6, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #40] -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #0] -vmla.s16 Q1, Q6, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #32] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q6, r11 -vstrh.u16 Q1, [r0,#(16)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(0)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vld20.u16 {Q2, Q3}, [r1] -vld21.u16 {Q2, Q3}, [r1]! -vst20.u16 {Q1, Q2}, [r1] -vst21.u16 {Q1, Q2}, [r1]! -vst20.u16 {Q3, Q4}, [r1] -vst21.u16 {Q3, Q4}, [r1]! -vadd.u16 Q0, Q0, Q1 -vadd.u16 Q2, Q2, Q3 -vst20.u16 {Q0, Q1}, [r1] -vst21.u16 {Q0, Q1}, [r1]! -vst20.u16 {Q2, Q3}, [r1] -vst21.u16 {Q2, Q3}, [r1]! 
-vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #-104] -vmul.u16 Q0, Q5, r11 -ldrd r10, r9, [r1, #-72] -vmul.u16 Q1, Q5, r9 -vneg.s16 Q3, Q7 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #-112] -vmla.s16 Q1, Q7, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #-80] -vmla.s16 Q1, Q7, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #-120] -vmla.s16 Q1, Q7, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #-88] -vmla.s16 Q1, Q7, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #-128] -vmla.s16 Q1, Q7, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #-96] -vmla.s16 Q1, Q7, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q7, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q7, r11 -vstrh.u16 Q1, [r0,#(48)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(32)] -vadd.u16 Q4, Q4, Q5 -vadd.u16 Q6, Q6, Q7 -vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #-40] -vmul.u16 Q0, Q4, r11 -ldrd r10, r9, [r1, #-8] -vmul.u16 Q1, Q4, r9 -vneg.s16 Q3, Q6 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #-48] -vmla.s16 Q1, Q6, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #-16] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, 
r10, [r1, #-56] -vmla.s16 Q1, Q6, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #-24] -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #-64] -vmla.s16 Q1, Q6, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #-32] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q6, r11 -vstrh.u16 Q1, [r0,#(80)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(64)] -nop -nop -nop -nop -nop -nop -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s b/tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s deleted file mode 100644 index 86e91d2..0000000 --- a/tests/saber/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s +++ /dev/null @@ -1,749 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd, %function -.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd -poly_u16_mul_32_anticyclic_karatsuba_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -sub sp, sp, #224 -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc 
Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc 
Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, 
#64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! 
-vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, 
r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, 
r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add sp, sp, #224 -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_32_anticyclic_mve_simd.s b/tests/saber/auto/poly_u16_mul_32_anticyclic_mve_simd.s deleted file mode 100644 index 32f2d87..0000000 --- a/tests/saber/auto/poly_u16_mul_32_anticyclic_mve_simd.s +++ /dev/null @@ -1,274 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software 
is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_mve_simd, %function -.global poly_u16_mul_32_anticyclic_mve_simd -poly_u16_mul_32_anticyclic_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -nop // XXX -mov r14, #0x42 -mov r14, #0x3 -vmsr p0, r14 -vldrh.u16 Q0, [r2, #0] -vldrh.u16 Q1, [r2, #16] -vldrh.u16 Q2, [r2, #32] -vldrh.u16 Q3, [r2, #48] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -ldrh r10, [r1, #46] -ldrh r9, [r1, #62] -vmul.u16 Q4, Q0, r14 -vmul.u16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmul.u16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmul.u16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #12] -ldrh r11, [r1, #28] -ldrh r10, [r1, #44] -ldrh r9, [r1, #60] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 
-vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #10] -ldrh r11, [r1, #26] -ldrh r10, [r1, #42] -ldrh r9, [r1, #58] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #8] -ldrh r11, [r1, #24] -ldrh r10, [r1, #40] -ldrh r9, [r1, #56] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #6] -ldrh r11, [r1, #22] -ldrh r10, [r1, #38] -ldrh r9, [r1, #54] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 
Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #4] -ldrh r11, [r1, #20] -ldrh r10, [r1, #36] -ldrh r9, [r1, #52] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #2] -ldrh r11, [r1, #18] -ldrh r10, [r1, #34] -ldrh r9, [r1, #50] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #0] -ldrh r11, [r1, #16] -ldrh r10, [r1, #32] -ldrh r9, [r1, #48] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vmla.s16 Q6, Q3, r9 -neg r12, r12 -vstrh.u16 Q4, [r0,#(0)] -vstrh.u16 Q5, [r0,#(16)] -vstrh.u16 Q6, [r0,#(32)] -vstrh.u16 Q7, [r0,#(48)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s 
b/tests/saber/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s deleted file mode 100644 index 7900495..0000000 --- a/tests/saber/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s +++ /dev/null @@ -1,274 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_32_anticyclic_opt_mve_simd, %function -.global poly_u16_mul_32_anticyclic_opt_mve_simd -poly_u16_mul_32_anticyclic_opt_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -ldrh r14, [r1, #30] -vldrh.u16 Q4, [r2, #0] -vmul.u16 Q1, Q4, r14 -ldrh r11, [r1, #46] -vldrh.u16 Q5, [r2, #16] -vmul.u16 Q2, Q4, r11 -ldrh r10, [r1, #62] -vldrh.u16 Q6, [r2, #32] -vmul.u16 Q3, Q4, r10 -ldrh r9, [r1, #14] -vldrh.u16 Q7, [r2, #48] -vmla.s16 Q3, Q5, r11 -neg r10, r10 -vmla.s16 Q2, Q5, r14 -neg r11, r11 -vmla.s16 Q3, Q6, r14 -neg r14, r14 -vmla.s16 Q3, Q7, r9 -ldrh r8, [r1, #12] -vmul.u16 Q0, Q7, r14 -ldrh r7, [r1, #28] -vmla.s16 Q0, Q6, r11 -ldrh r6, [r1, #44] -vmla.s16 Q0, Q5, r10 -ldrh r5, [r1, #60] -vmla.s16 Q0, Q4, r9 -vmla.s16 Q1, Q5, r9 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r11 -vmla.s16 Q1, Q6, r10 -neg r12, r12 -vmla.s16 Q2, Q6, r9 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #10] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #26] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #42] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #58] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r6 -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r5 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r9 -vmla.s16 Q3, Q5, r10 -neg r9, r9 -vmla.s16 Q2, Q5, r11 -neg r10, r10 -vmla.s16 Q3, Q6, r11 -neg r11, r11 -vmla.s16 Q3, Q7, r14 -ldrh r8, [r1, #8] -vmla.s16 Q0, Q7, r11 -ldrh r7, [r1, #24] -vmla.s16 Q0, Q6, r10 -ldrh r6, [r1, #40] -vmla.s16 Q0, Q5, r9 -ldrh r5, [r1, #56] -vmla.s16 Q0, Q4, r14 -vmla.s16 Q1, Q5, r14 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r10 -vmla.s16 Q1, Q6, r9 -neg r12, r12 
-vmla.s16 Q2, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r9 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #6] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #22] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #38] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #54] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r6 -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r5 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r9 -vmla.s16 Q3, Q5, r10 -neg r9, r9 -vmla.s16 Q2, Q5, r11 -neg r10, r10 -vmla.s16 Q3, Q6, r11 -neg r11, r11 -vmla.s16 Q3, Q7, r14 -ldrh r8, [r1, #4] -vmla.s16 Q0, Q7, r11 -ldrh r7, [r1, #20] -vmla.s16 Q0, Q6, r10 -ldrh r6, [r1, #36] -vmla.s16 Q0, Q5, r9 -ldrh r5, [r1, #52] -vmla.s16 Q0, Q4, r14 -vmla.s16 Q1, Q5, r14 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r10 -vmla.s16 Q1, Q6, r9 -neg r12, r12 -vmla.s16 Q2, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r9 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #2] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #18] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #34] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #50] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r6 -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r5 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r9 -vmla.s16 Q3, Q5, r10 -neg r9, r9 -vmla.s16 Q2, Q5, r11 -neg r10, r10 -vmla.s16 Q3, Q6, 
r11 -neg r11, r11 -vmla.s16 Q3, Q7, r14 -ldrh r8, [r1, #0] -vmla.s16 Q0, Q7, r11 -ldrh r7, [r1, #16] -vmla.s16 Q0, Q6, r10 -ldrh r6, [r1, #32] -vmla.s16 Q0, Q5, r9 -ldrh r5, [r1, #48] -vmla.s16 Q0, Q4, r14 -vmla.s16 Q1, Q5, r14 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r10 -vmla.s16 Q1, Q6, r9 -neg r12, r12 -vmla.s16 Q2, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r9 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #-2] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #14] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #30] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #46] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vstrh.u16 Q3, [r0,#(48)] -vmla.s16 Q1, Q7, r6 -vstrh.u16 Q0, [r0,#(0)] -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vstrh.u16 Q1, [r0,#(16)] -vmla.s16 Q2, Q7, r5 -vstrh.u16 Q2, [r0,#(32)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_32_mve_simd.s b/tests/saber/auto/poly_u16_mul_32_mve_simd.s deleted file mode 100644 index b6b25e7..0000000 --- a/tests/saber/auto/poly_u16_mul_32_mve_simd.s +++ /dev/null @@ -1,386 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_mve_simd, %function -.global poly_u16_mul_32_mve_simd -poly_u16_mul_32_mve_simd: -push {r4-r11,lr} -vpush {d0-d15} -mov r0, r0 -mov r0, r0 -mov r12, #0 -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -ldrh r10, [r1, #46] -ldrh r9, [r1, #62] -vldrh.u16 Q0, [r2, #0] -vmul.u16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vmul.u16 Q1, Q0, r11 -vldrh.u16 Q2, [r2, #16] -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vmul.u16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vldrh.u16 Q3, [r2, #32] -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vmul.u16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vldrh.u16 Q4, [r2, #48] -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vmul.u16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vmul.u16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vmul.u16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vmov.u16 Q1, #0 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #12] -ldrh r11, [r1, #28] -ldrh r10, [r1, #44] -ldrh r9, [r1, #60] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, 
[r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #10] -ldrh r11, [r1, #26] -ldrh r10, [r1, #42] -ldrh r9, [r1, #58] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #8] -ldrh r11, [r1, #24] -ldrh r10, [r1, #40] -ldrh r9, [r1, #56] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, 
r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #6] -ldrh r11, [r1, #22] -ldrh r10, [r1, #38] -ldrh r9, [r1, #54] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #4] -ldrh r11, [r1, #20] -ldrh r10, [r1, #36] -ldrh r9, [r1, #52] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 
Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #2] -ldrh r11, [r1, #18] -ldrh r10, [r1, #34] -ldrh r9, [r1, #50] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #0] -ldrh r11, [r1, #16] -ldrh r10, [r1, #32] -ldrh r9, [r1, #48] -vldrh.u16 Q1, [r0, #0] -vmla.s16 Q1, Q0, r14 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #16] -vmla.s16 Q1, Q0, r11 
-vmla.s16 Q1, Q2, r14 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #32] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #48] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #64] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #80] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #96] -vmla.s16 Q1, Q4, r9 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #112] -vstrh.u16 Q1, [r0,#(112)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_512_toom4_mve.s b/tests/saber/auto/poly_u16_mul_512_toom4_mve.s deleted file mode 100644 index cbfbece..0000000 --- a/tests/saber/auto/poly_u16_mul_512_toom4_mve.s +++ /dev/null @@ -1,2501 +0,0 @@ -.syntax unified -.type poly_u16_mul_128_C, %function -.global poly_u16_mul_128_C -.syntax unified -.type poly_u16_mul_512_toom4_mve, %function -.global poly_u16_mul_512_toom4_mve -poly_u16_mul_512_toom4_mve: -push {r4-r11,lr} -vpush {d0-d15} -sub sp, sp, #3584 -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r1, r1, #504 -add r10, r1, #1008 -add r2, r2, #504 -add r9, r2, #1008 -mov r8, #1 -mov r7, #2 -mov r6, #3 -mov r5, #7 -vldrw.u32 Q0, [r1, #-504] -vldrw.u32 Q1, [r1, #-248] -vldrw.u32 Q2, [r1, #8] -vldrw.u32 Q3, [r1, #264] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-488)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(24)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-232] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #24] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #280] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-472)] -vadd.u16 Q7, Q1, Q3 
-vstrw.u32 Q4, [sp,#(8)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-472)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(40)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-216] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #40] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #296] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-456)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-456)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(56)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-200] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #56] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #312] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-440)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(40)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r1, #-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #-184] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #72] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #328] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-424)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-424)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(88)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-168] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #88] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #344] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-408)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[sp,#(72)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-408)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(104)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-152] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #104] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #360] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-392)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(120)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-136] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #120] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #376] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(104)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r1, #-376] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #-120] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #136] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #392] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-360)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-360)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(152)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #-360] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-104] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #152] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #408] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-344)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[sp,#(136)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-344)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #-344] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-88] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #168] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #424] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-328)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-328)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #-328] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-72] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #184] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #440] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(168)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r1, #-312] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #-56] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #200] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #456] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-296)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-296)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #-296] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-40] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #216] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #472] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-280)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[sp,#(200)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #-280] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-24] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #232] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #488] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-264)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #-264] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-8] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #248] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #504] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(232)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-248)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #-504] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-248] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #8] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #264] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-232)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-232] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #24] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #280] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-216)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[sp,#(264)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-216)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-216] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #40] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #296] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-200)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-200] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #56] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #312] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-184)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(296)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-184)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-184] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #72] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #328] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-168)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-168] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #88] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #344] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-152)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-152)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-152] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #104] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #360] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-136)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-136] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #120] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #376] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-120)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-120)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #-376] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-120] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #136] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #392] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-104)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #-360] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-104] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #152] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #408] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-88)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-88)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #-344] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-88] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #168] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #424] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-72)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(440)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #-328] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-72] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #184] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #440] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-56)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-88)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-56)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #-312] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-56] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #200] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #456] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-40)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #-296] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-40] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #216] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #472] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] 
-vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-24)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #-280] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-24] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #232] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #488] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-8)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #-264] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-8] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #248] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #504] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(8)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(8)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(24)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-8)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(504)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #512 -add r10, r2, #512 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(2048) -add r2, sp, #(2304) -add r0, sp, #(2560) -bl poly_u16_mul_128_C -add r1, sp, #(1536) -add r2, sp, #(1792) -add r0, sp, #(2048) -bl poly_u16_mul_128_C -add r1, sp, #(1024) -add r2, sp, #(1280) -add r0, sp, #(1536) -bl poly_u16_mul_128_C -add r1, sp, #(512) -add r2, sp, #(768) -add r0, sp, #(1024) -bl poly_u16_mul_128_C -add r1, sp, #(0) -add r2, sp, #(256) -add r0, sp, #(512) -bl poly_u16_mul_128_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_128_C -add r1, r11, 
#(256) -add r2, r10, #(256) -add r0, sp, #(3072) -bl poly_u16_mul_128_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #-64 -mov r9, #45 -mov r8, #-8 -mov r6, #43691 -mov r5, #16 -mov r4, #30 -mov r3, #61167 -mov r2, #-65 -mov r1, #36409 -mov r0, #1 -vldrw.u32 Q0, [r12, #40] -vldrw.u32 Q1, [r14, #-488] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #8] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #-472] -vldrw.u32 Q4, [r14, #24] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r11, #-456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r2 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #56] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r8 -vldrw.u32 Q5, [r14, #40] -vmla.s16 Q0, Q3, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r5 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(-472)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r1 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #-472] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r4 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r3 -vstrw.u32 Q2, [sp,#(8)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-440] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #72] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #56] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-456] -vadd.u16 Q5, 
Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-424] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #88] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #72] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-440] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-408] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #88] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-424] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, 
[r12,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-392] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #104] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-408] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #120] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-392] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-376] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-408] -vshr.u16 Q3, Q3, #1 
-vldrw.u32 Q6, [r11, #-360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #136] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-376] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-360] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #152] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-360] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-344] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, 
Q5 -vldrw.u32 Q6, [r12, #184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #168] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-344] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-328] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #184] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-328] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-312] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #216] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #200] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 
Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-312] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #216] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-296] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #232] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 
-vldrw.u32 Q1, [r14, #-280] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #264] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #248] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-264] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #280] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #264] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-248] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, 
[sp,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #248] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #296] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #280] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vstrw.u32 Q1, [r14,#(-248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-232] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vstrw.u32 Q0, [sp,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #312] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #8] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(8)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #296] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #40] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-216] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-472] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-488] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-456] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #328] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #24] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(24)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #312] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #56] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-200] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-456] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-472] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-440] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-168] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #344] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #40] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(40)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #328] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 
-vldrw.u32 Q7, [r14, #56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #72] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-184] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-440] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-456] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-424] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-152] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #360] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #56] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #344] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #88] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-168] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-424] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-440] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-408] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-152] -vsub.u16 Q3, Q3, Q2 
-vldrw.u32 Q5, [sp, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #376] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #72] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(72)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #360] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #104] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-152] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-408] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-424] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-392] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #392] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #88] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(88)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #376] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #120] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-136] -vadd.u16 Q5, 
Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-392] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-408] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-376] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #408] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #104] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(104)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #392] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #136] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-120] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-376] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-392] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-360] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-136] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #424] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #120] -vadd.u16 
Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #408] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #152] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-104] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-360] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-344] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-88] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-120] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-72] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #440] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #136] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(136)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #424] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #152] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #168] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-88] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-344] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-360] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-328] -vadd.u16 Q0, 
Q0, Q4 -vstrw.u32 Q0, [r11,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-72] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-104] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-56] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #456] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #152] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(152)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #440] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #168] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-72] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-328] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-344] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-312] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-56] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-88] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-40] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #472] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #168] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(168)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #456] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #184] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, 
#200] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-56] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-312] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-328] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-296] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-40] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-72] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-24] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #488] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #184] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #472] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #200] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #216] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-40] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-296] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-312] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-280] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-24] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-56] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-8] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 
-vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #504] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #200] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(200)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #488] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #216] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #232] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-24] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-280] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-296] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-264] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-8] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-40] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #8] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #-488] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #216] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(216)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #504] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #232] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #248] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-8] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-264] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, 
r3 -vldrw.u32 Q2, [r14, #-280] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-248] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #8] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-24] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r2 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #-472] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #232] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(232)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r12, #-488] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q7, [r14, #248] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #264] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #8] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-248] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #-264] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-264)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-232] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #504] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #24] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-8] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #248] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(248)] -vmla.s16 Q1, Q2, r8 -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q3, [r14, #264] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(264)] 
-vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #280] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r12,#(280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r12, #-232] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q1, [r14, #-248] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-248)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-216] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-216)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, 
[r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, 
[r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #256
-add sp, sp, #3584
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_mul_64_toom4_mve.s b/tests/saber/auto/poly_u16_mul_64_toom4_mve.s
deleted file mode 100644
index a80da86..0000000
--- a/tests/saber/auto/poly_u16_mul_64_toom4_mve.s
+++ /dev/null
@@ -1,379 +0,0 @@
-.syntax unified
-.type poly_u16_mul_16_C, %function
-.global poly_u16_mul_16_C
-.syntax unified
-.type poly_u16_mul_64_toom4_mve, %function
-.global poly_u16_mul_64_toom4_mve
-poly_u16_mul_64_toom4_mve: -push {r4-r11,lr} -vpush {d0-d15} -sub sp, sp, #448 -add sp, sp, #504 -add r1, r1, #504 -add r2, r2, #504 -mov r14, #1 -mov r12, #2 -mov r11, #3 -mov r10, #7 -vldrw.u32 Q0, [r1, #-504] -vldrw.u32 Q1, [r1, #-472] -vldrw.u32 Q2, [r1, #-440] -vldrw.u32 Q3, [r1, #-408] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r11 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q6, Q5, r12 -vmla.s16 Q5, Q1, r11 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r10 -vldrw.u32 Q0, [r1, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-456] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #-424] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #-392] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [sp,#(-248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-440)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r11 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q4, Q7, r12 -vmla.s16 Q7, Q1, r11 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q6, Q2, r11 -vmla.s16 Q6, Q3, r10 -vldrw.u32 Q0, [r2, #-504] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-472] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #-440] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #-408] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [sp,#(-232)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-424)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r11 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q5, Q6, r12 -vmla.s16 Q6, Q1, r11 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q4, Q2, r11 -vmla.s16 Q4, Q3, r10 -vldrw.u32 Q0, [r2, #-488] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-456] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #-424] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #-392] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [sp,#(-216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-408)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r11 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q7, Q4, r12 -vmla.s16 Q4, Q1, r11 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 
Q7, [sp,#(-264)] -vmla.s16 Q5, Q2, r11 -vmla.s16 Q5, Q3, r10 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [sp,#(-200)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-456)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-392)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #64 -add r10, r2, #64 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(256) -add r2, sp, #(288) -add r0, sp, #(320) -bl poly_u16_mul_16_C -add r1, sp, #(192) -add r2, sp, #(224) -add r0, sp, #(256) -bl poly_u16_mul_16_C -add r1, sp, #(128) -add r2, sp, #(160) -add r0, sp, #(192) -bl poly_u16_mul_16_C -add r1, sp, #(64) -add r2, sp, #(96) -add r0, sp, #(128) -bl poly_u16_mul_16_C -add r1, sp, #(0) -add r2, sp, #(32) -add r0, sp, #(64) -bl poly_u16_mul_16_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_16_C -add r1, r11, #(32) -add r2, r10, #(32) -add r0, sp, #(384) -bl poly_u16_mul_16_C -add sp, sp, #504 -mov r14, #-64 -mov r12, #45 -mov r11, #-8 -mov r10, #43691 -mov r9, #16 -mov r8, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [sp, #-184] -vldrw.u32 Q1, [sp, #-376] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #-440] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [sp, #-248] -vldrw.u32 Q4, [sp, #-312] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [sp, #-120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [sp, #-168] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r11 -vldrw.u32 Q5, [sp, #-296] -vmla.s16 Q0, Q3, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [sp,#(-376)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r9 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(-248)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [sp, #-360] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r8 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [sp,#(-312)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, 
[sp,#(-440)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [sp,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #-232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [sp, #-104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [sp, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [sp, #-280] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [sp,#(-360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [sp,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #-344] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [sp,#(-424)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [sp,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #-408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #-216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [sp, #-88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [sp, #-136] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #-440] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-440)] -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [sp, #-264] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vldrw.u32 Q7, [sp, #-312] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [sp, #-184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [sp,#(-184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #-328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [sp, #-248] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [sp,#(-248)] -vshr.u16 
Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [sp, #-376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(-376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [sp, #-120] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #-392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #-200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [sp, #-72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #-424] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q1, Q2, r11 -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vldrw.u32 Q3, [sp, #-296] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(-296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [sp, #-168] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [sp,#(-168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [sp, #-232] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q1, [sp, #-360] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(-360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [sp, #-104] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [sp,#(-104)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 
Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -add sp, sp, #448 -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_mul_768_toom4_mve.s b/tests/saber/auto/poly_u16_mul_768_toom4_mve.s deleted file mode 100644 index 5f961e9..0000000 --- a/tests/saber/auto/poly_u16_mul_768_toom4_mve.s +++ /dev/null @@ -1,3759 +0,0 @@ -.syntax unified -.type poly_u16_mul_192_C, %function -.global poly_u16_mul_192_C -.syntax unified -.type poly_u16_mul_768_toom4_mve, %function -.global poly_u16_mul_768_toom4_mve -poly_u16_mul_768_toom4_mve: -push {r4-r11,lr} -vpush {d0-d15} -sub sp, sp, #5376 -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -add r1, r1, #504 -add r8, r1, #1008 -add r2, r2, #504 -add r7, r2, #1008 -mov r6, #1 -mov r5, #2 -mov r4, #3 -mov r3, #7 -vldrw.u32 Q0, [r1, #-504] -vldrw.u32 Q1, [r1, #-120] -vldrw.u32 Q2, [r1, #264] -vldrw.u32 Q3, [r8, #-360] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(24)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-216)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-104] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #280] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-344] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-456)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(264)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(40)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-200)] -vmla.s16 Q6, Q2, r4 
-vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-88] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #296] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-328] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-440)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(56)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-184)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-72] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #312] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-312] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-424)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(296)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(72)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-168)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #-56] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #328] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-296] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-408)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(88)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-152)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-40] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #344] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-280] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-392)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(104)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-136)] -vmla.s16 Q6, Q2, r4 
-vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-24] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #360] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-264] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-376)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(120)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-120)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-8] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #376] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-248] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-360)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(136)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-104)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-376] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #8] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #392] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-232] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-344)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-88)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-360] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #24] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #408] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-216] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-328)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-72)] -vmla.s16 Q6, Q2, r4 -vmla.s16 
Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-344] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #40] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #424] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-200] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-312)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-56)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-328] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #56] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #440] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-184] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-296)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-40)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-312] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #72] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #456] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-168] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-280)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-24)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-296] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #88] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #472] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-152] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-264)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-8)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 
-vldrw.u32 Q0, [r1, #-280] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #104] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #488] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-136] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-248)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(8)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-264] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #120] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #504] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-120] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-232)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-248] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #136] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #-488] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-104] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-216)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(504)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(40)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-232] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #152] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #-472] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-88] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-200)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-488)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(56)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 
-vldrw.u32 Q0, [r1, #-216] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #168] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #-456] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-72] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-184)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(72)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-200] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #184] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #-440] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-56] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-168)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-456)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-184] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #200] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #-424] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-40] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-152)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(104)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-168] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #216] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #-408] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-24] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-136)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-424)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(120)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 
-vldrw.u32 Q0, [r1, #-152] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #232] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #-392] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-8] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-120)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(136)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-136] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #248] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #-376] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #8] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-104)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-392)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-504] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-120] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #264] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-360] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-88)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(168)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-104] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #280] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-344] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-72)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-360)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(184)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 
-vldrw.u32 Q0, [r2, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-88] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #296] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-328] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-56)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(200)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-72] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #312] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-312] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-40)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-328)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-88)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-56] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #328] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-296] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-24)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(232)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #-40] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #344] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-280] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-8)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-296)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 
-vldrw.u32 Q0, [r2, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-24] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #360] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-264] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(264)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-8] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #376] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-248] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(24)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-264)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-376] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #8] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #392] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-232] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-8)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(296)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-360] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #24] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #408] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-216] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(56)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-232)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-456)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, 
[r2, #-344] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #40] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #424] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-200] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(328)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-328] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #56] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #440] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-184] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(88)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-200)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-424)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-312] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #72] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #456] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-168] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(360)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-296] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #88] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #472] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-152] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(120)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-168)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-392)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-280] 
-vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #104] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #488] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-136] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(392)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-264] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #120] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #504] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-120] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(152)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-136)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-360)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-248] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #136] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #-488] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-104] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(424)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-232] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #152] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #-472] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-88] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(184)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-104)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-328)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-216] 
-vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #168] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #-456] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-72] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(456)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-200] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #184] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #-440] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-56] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-72)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-296)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-184] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #200] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #-424] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-40] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(488)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-168] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #216] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #-408] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-24] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-40)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-264)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(504)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-152] 
-vshl.u16 Q5, Q5, #1
-vldrw.u32 Q1, [r2, #232]
-vsub.u16 Q4, Q5, Q7
-vldrw.u32 Q2, [r7, #-392]
-vadd.u16 Q5, Q5, Q7
-vldrw.u32 Q3, [r7, #-8]
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r11,#(264)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [r14,#(-24)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [sp,#(216)]
-vmla.s16 Q7, Q0, r4
-vstrw.u32 Q5, [r12,#(-248)]
-vmla.s16 Q5, Q6, r5
-vmla.s16 Q6, Q1, r4
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r11,#(-488)]
-vmla.s16 Q4, Q2, r4
-vmla.s16 Q4, Q3, r3
-vldrw.u32 Q0, [r2, #-136]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r2, #248]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r7, #-376]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r7, #8]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r11,#(280)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [r14,#(-8)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(232)]
-vmla.s16 Q6, Q0, r4
-vstrw.u32 Q7, [r12,#(-232)]
-vmla.s16 Q7, Q4, r5
-vmla.s16 Q4, Q1, r4
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r11,#(-472)]
-vmla.s16 Q5, Q2, r4
-vmla.s16 Q5, Q3, r3
-vshl.u16 Q6, Q6, #1
-vstrw.u32 Q5, [r11,#(296)]
-vsub.u16 Q5, Q6, Q4
-vstrw.u32 Q5, [sp,#(248)]
-vadd.u16 Q6, Q6, Q4
-vstrw.u32 Q6, [r14,#(8)]
-sub sp, sp, #504
-sub r1, r1, #504
-sub r2, r2, #504
-add r11, r1, #768
-add r10, r2, #768
-mov r9, r1
-mov r8, r2
-mov r7, r0
-add r1, sp, #(3072)
-add r2, sp, #(3456)
-add r0, sp, #(3840)
-bl poly_u16_mul_192_C
-add r1, sp, #(2304)
-add r2, sp, #(2688)
-add r0, sp, #(3072)
-bl poly_u16_mul_192_C
-add r1, sp, #(1536)
-add r2, sp, #(1920)
-add r0, sp, #(2304)
-bl poly_u16_mul_192_C
-add r1, sp, #(768)
-add r2, sp, #(1152)
-add r0, sp, #(1536)
-bl poly_u16_mul_192_C
-add r1, sp, #(0)
-add r2, sp, #(384)
-add r0, sp, #(768)
-bl poly_u16_mul_192_C
-add r1, r9, #(0)
-add r2, r8, #(0)
-add r0, sp, #(0)
-bl poly_u16_mul_192_C
-add r1, r11, #(384)
-add r2, r10, #(384)
-add r0, sp, #(4608)
-bl poly_u16_mul_192_C
-add sp, sp, #504
-add r14, sp, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-mov r8, #-64
-mov r6, #45
-mov r5, #-8
-mov r4, #43691
-mov r3, #16
-mov r2, #30
-mov r1, #61167
-mov r0, #-65
-vldrw.u32 Q0, [r11, #312]
-vldrw.u32 Q1, [r14, #24]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [sp, #264]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r11, #-456]
-vldrw.u32 Q4, [r12, #-216]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [sp, #-504]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r10, #72]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r0
-vsub.u16 Q3, Q3, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r11, #328]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r5
-vldrw.u32 Q5, [r12, #-200]
-vmla.s16 Q0, Q3, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(24)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r3
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-456)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r4
-vmul.u16 Q0, Q0, r4
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r14, #40]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r2
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r12,#(-216)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r1
-vstrw.u32 Q2, [sp,#(264)]
-vsub.u16 Q0, Q0, Q2
-vstrw.u32 Q0, [r11,#(312)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #280]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-440]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-488]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #88]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #344]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #-184]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(40)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-440)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #56]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(280)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(328)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #296]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-424]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-472]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #104]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #360]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #-168]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(56)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-424)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #72]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(296)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(344)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #312]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-408]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-456]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #120]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #376]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #-152]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(72)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-408)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #88]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(312)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(360)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #328]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-392]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-440]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #136]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #392]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #-136]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(88)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-392)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #104]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(328)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(376)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #344]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-376]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-424]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #152]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #408]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #-120]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(104)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-376)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #120]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(344)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(392)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #360]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-360]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-408]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #168]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #424]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #-104]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(120)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-360)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #136]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(360)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(408)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #376]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-344]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-392]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #184]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #440]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #-88]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(136)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-344)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #152]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(424)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #392]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-328]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-376]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #200]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #456]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #-72]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-328)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #168]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(440)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #408]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-312]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-360]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #216]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #472]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #-56]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-312)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #184]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(456)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #424]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-296]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-344]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #232]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #488]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #-40]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-296)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #200]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(472)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #440]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-280]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-328]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #248]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #504]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #-24]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-280)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #216]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(488)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #456]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-264]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-312]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #264]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-488]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #-8]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-264)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #232]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(504)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #472]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-248]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-296]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #280]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-472]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #8]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(232)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-248)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #248]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-488)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #488]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-232]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-280]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #296]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-456]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #24]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(248)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-232)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #264]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-472)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #504]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-216]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-264]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #312]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-440]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #40]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(264)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-216)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #280]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [sp,#(504)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-456)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-488]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-200]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-248]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #328]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-424]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #56]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(280)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #296]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-440)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-472]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-184]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-232]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #344]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-408]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #72]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(296)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #312]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-424)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-456]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-168]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-216]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #360]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-392]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #88]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(312)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #328]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-408)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-440]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-152]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-200]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #376]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-376]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #104]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(328)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #344]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-392)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-424]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-136]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-184]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #392]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-360]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #120]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(344)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #360]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-376)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-408]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-120]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-168]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #408]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-344]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #136]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(360)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #376]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-360)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-392]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-104]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-152]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #424]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-328]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #152]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(376)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #392]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-344)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-376]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-88]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-136]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #440]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-312]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #168]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(392)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #408]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-328)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-360]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-72]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-120]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #456]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-296]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #264]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(264)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #184]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-216]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #312]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #424]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #-456]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-456)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #24]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(24)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #72]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(72)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-344]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-56]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-104]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #472]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-280]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #280]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(280)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #200]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-200]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #328]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #440]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #-440]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-440)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #40]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(40)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #88]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(88)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-328]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-40]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-88]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #488]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-264]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #296]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(296)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #216]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-184]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #344]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #456]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #-424]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-424)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #56]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(56)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #104]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(104)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-312]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-24]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-72]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #504]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-248]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #312]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(312)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #232]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-168]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #360]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #472]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #-408]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-408)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #72]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(72)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #120]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(120)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-296]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #-8]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-56]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #-488]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-232]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #328]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(328)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #248]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-152]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #376]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #488]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #-392]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-392)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #88]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(88)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #136]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(136)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-280]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #8]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-40]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #-472]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-216]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #344]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #264]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-136]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-136)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #392]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #504]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #-376]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #104]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(104)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #152]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(152)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-264]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #24]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #-24]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #-456]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-200]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #360]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #280]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-120]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-120)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #408]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #-488]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #-360]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #120]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(120)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #168]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-248]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #40]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #-8]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #-440]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-184]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #376]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #296]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-104]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-104)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #424]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #-472]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #-344]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #136]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(136)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #184]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(184)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-232]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #56]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #8]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #-424]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-168]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #392]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #312]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-88]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-88)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #440]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #-456]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #-328]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #152]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #200]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(200)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #-216]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #72]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #24]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #-408]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #-152]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #408]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #328]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-72]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-72)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #456]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #-440]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #-312]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #168]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #216]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(216)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #-200]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #88]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #40]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #-392]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #-136]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #424]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #344]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #-56]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-56)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #472]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #-424]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #-296] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #184] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #232] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #56] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-120] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(440)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #360] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #-40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #488] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-280] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #200] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(200)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #248] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #72] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-104] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, 
[sp,#(456)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #376] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #-24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #504] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(504)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-264] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #216] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #264] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #88] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-88] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(472)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #392] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #-8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-8)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-488] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-248] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #232] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(232)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #280] 
-vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #152] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #104] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-72] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(488)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #408] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(8)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-472] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-232] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #248] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(248)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #296] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #120] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #504] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(504)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #424] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, 
Q1 -vldrw.u32 Q7, [r10, #-456] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-344] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-216] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #264] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(264)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #312] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #136] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-40] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #440] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-440] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-200] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #280] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(280)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #328] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #152] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-280] -vsub.u16 Q1, 
Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-24] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #456] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-424] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-184] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #296] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(296)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #344] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #168] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #472] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-408] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, 
Q3 -vldrw.u32 Q2, [r11, #-168] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #312] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #360] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #184] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #488] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-392] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-152] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #328] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #376] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #200] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #24] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, 
[r14,#(-424)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #504] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-376] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-136] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #344] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #392] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #216] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #40] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-488] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-360] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-248] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-120] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #360] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, 
#408] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #232] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-472] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-344] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-104] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #424] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #248] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #-376] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-376)] -vmla.s16 Q1, Q2, r5 -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q3, [r12, #152] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #-328] 
-vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #-88] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q1, [r14, #392] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #440] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(440)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 
Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, 
[r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #384
-add sp, sp, #5376
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_mul_832_toom4_mve.s b/tests/saber/auto/poly_u16_mul_832_toom4_mve.s
deleted file mode 100644
index ae5897c..0000000
--- a/tests/saber/auto/poly_u16_mul_832_toom4_mve.s
+++ /dev/null
@@ -1,4065 +0,0 @@
-.syntax unified
-.type poly_u16_mul_208_C, %function
-.global poly_u16_mul_208_C
-.syntax unified
-.type poly_u16_mul_832_toom4_mve, %function
-.global poly_u16_mul_832_toom4_mve
-poly_u16_mul_832_toom4_mve:
-push {r4-r11,lr}
-vpush {d0-d15}
-sub sp, sp, #5824
-add sp, sp, #504
-add r14, sp, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-add r1, r1, #504
-add r8, r1, #1008
-add r2, r2, #504
-add r7, r2, #1008
-mov r6, #1
-mov r5, #2
-mov r4, #3
-mov r3, #7
-vldrw.u32 Q0, [r1, #-504]
-vldrw.u32 Q1, [r1, #-88]
-vldrw.u32 Q2, [r1, #328]
-vldrw.u32 Q3, [r8, #-264]
-vadd.u16 Q4, Q0, Q2
-vadd.u16 Q5, Q1, Q3
-vsub.u16 Q6, Q4, Q5
-vmla.s16 Q4, Q0, r4
-vstrw.u32 Q6, [r14,#(152)]
-vmla.s16 Q6, Q5, r5
-vmla.s16 Q5, Q1, r4
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r12,#(-24)]
-vmla.s16 Q7, Q2, r4
-vmla.s16 Q7, Q3,
r3 -vldrw.u32 Q0, [r1, #-488] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-72] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #344] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-248] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-200)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-8)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-472] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #-56] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #360] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-232] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-184)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(8)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-456] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #-40] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #376] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-216] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-168)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-440] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #-24] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #392] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-200] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-152)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(40)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 
-vldrw.u32 Q0, [r1, #-424] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #-8] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #408] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-184] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-136)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(56)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-408] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #8] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #424] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-168] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-120)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(72)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-392] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #24] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #440] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-152] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-104)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-376] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #40] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #456] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-136] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-88)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(104)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, 
[r1, #-360] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #56] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #472] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-120] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-72)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(120)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-344] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #72] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #488] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-104] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-56)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(136)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-328] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #88] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #504] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-88] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-40)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-312] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #104] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #-488] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-72] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-24)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(504)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(168)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-296] 
-vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #120] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #-472] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #-56] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-8)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-488)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(184)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-280] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #136] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #-456] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #-40] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(200)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-264] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #152] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #-440] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #-24] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(24)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-456)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-248] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #168] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #-424] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #-8] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(232)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-232] -vshl.u16 
Q4, Q4, #1 -vldrw.u32 Q1, [r1, #184] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #-408] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #8] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(56)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-424)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-216] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #200] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #-392] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #24] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(264)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-200] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #216] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #-376] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #40] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(88)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-392)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-184] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #232] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #-360] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #56] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(296)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-168] -vshl.u16 Q4, Q4, #1 
-vldrw.u32 Q1, [r1, #248] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #-344] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #72] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(120)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-360)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #-152] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #264] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #-328] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #88] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(328)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #-136] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #280] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #-312] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #104] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(152)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-328)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #-120] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #296] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #-296] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #120] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(360)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #-104] -vshl.u16 Q4, Q4, #1 
-vldrw.u32 Q1, [r1, #312] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #-280] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #136] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(184)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-296)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-456)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-504] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-88] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #328] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-264] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(392)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-488] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-72] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #344] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-248] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-264)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-88)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-424)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-472] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #-56] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #360] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-232] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(424)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-456] -vshl.u16 Q4, Q4, #1 
-vldrw.u32 Q1, [r2, #-40] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #376] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-216] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-232)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-392)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-440] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #-24] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #392] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-200] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(456)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-424] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #-8] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #408] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-184] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(280)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-200)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-360)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-408] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #8] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #424] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-168] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(296)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-8)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(488)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-392] -vshl.u16 Q4, Q4, #1 
-vldrw.u32 Q1, [r2, #24] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #440] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-152] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(312)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-168)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-328)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(504)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-376] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #40] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #456] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-136] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-488)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-360] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #56] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #472] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-120] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(344)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-136)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-296)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-344] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #72] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #488] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-104] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(360)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-456)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-328] -vshl.u16 Q4, Q4, #1 
-vldrw.u32 Q1, [r2, #88] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #504] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-88] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(376)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-104)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-264)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-312] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #104] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #-488] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-72] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-424)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-296] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #120] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #-472] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #-56] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-72)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-232)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-280] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #136] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #-456] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #-40] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(424)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-392)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-264] -vshl.u16 Q4, Q4, #1 
-vldrw.u32 Q1, [r2, #152] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #-440] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #-24] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(440)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-40)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-200)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-248] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #168] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #-424] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #-8] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-360)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-232] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #184] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #-408] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #8] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-8)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-168)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-216] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #200] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #-392] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #24] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(488)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(8)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-328)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-200] -vshl.u16 Q4, Q4, #1 -vldrw.u32 
Q1, [r2, #216] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #-376] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #40] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(504)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(24)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-136)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-184] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #232] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #-360] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #56] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r10,#(-488)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(40)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-120)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-296)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-168] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #248] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #-344] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #72] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r10,#(-472)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(56)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-104)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #-152] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #264] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #-328] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #88] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r10,#(-456)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(72)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(248)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-88)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-264)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #-136] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, 
#280] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #-312] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #104] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r10,#(-440)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(88)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(264)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-72)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #-120] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #296] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #-296] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #120] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r10,#(-424)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(104)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(280)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-56)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-232)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #-104] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #312] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #-280] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #136] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r10,#(-408)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(120)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(296)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-40)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r10,#(-392)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(312)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(136)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #832 -add r10, r2, #832 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(3328) -add r2, sp, #(3744) -add r0, sp, #(4160) -bl poly_u16_mul_208_C -add r1, sp, #(2496) -add r2, sp, #(2912) -add r0, sp, #(3328) -bl poly_u16_mul_208_C -add r1, sp, #(1664) -add r2, sp, #(2080) -add r0, sp, #(2496) -bl poly_u16_mul_208_C -add r1, sp, #(832) -add r2, sp, 
#(1248) -add r0, sp, #(1664) -bl poly_u16_mul_208_C -add r1, sp, #(0) -add r2, sp, #(416) -add r0, sp, #(832) -bl poly_u16_mul_208_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_208_C -add r1, r11, #(416) -add r2, r10, #(416) -add r0, sp, #(4992) -bl poly_u16_mul_208_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #-64 -mov r6, #45 -mov r5, #-8 -mov r4, #43691 -mov r3, #16 -mov r2, #30 -mov r1, #61167 -mov r0, #-65 -vldrw.u32 Q0, [r10, #-376] -vldrw.u32 Q1, [r14, #152] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #328] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #-200] -vldrw.u32 Q4, [r12, #-24] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r0 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r10, #-360] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r5 -vldrw.u32 Q5, [r12, #-8] -vmla.s16 Q0, Q3, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r3 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-200)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #168] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r2 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r1 -vstrw.u32 Q2, [sp,#(328)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-184] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #472] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-344] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #8] -vmla.s16 Q6, Q2, r6 
-vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #184] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-168] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #488] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-328] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #24] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #200] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-152] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #504] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-312] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #40] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 
-vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #216] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-136] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-488] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-296] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #56] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-120] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-472] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-280] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #72] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #248] 
-vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-104] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-264] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #88] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-88] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-440] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #104] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 
Q0, [sp,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-72] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-424] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #120] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-56] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-408] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-216] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #136] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #488] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r11, #-40] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-392] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #152] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #504] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-24] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #168] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #344] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-8] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-360] -vsub.u16 Q1, 
Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #184] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #8] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #200] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #24] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 
-vldrw.u32 Q6, [r10, #-136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #216] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(376)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #40] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #232] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(392)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #56] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #248] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, 
#3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(408)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #424] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #72] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-88] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #264] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(424)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #440] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #88] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-72] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #280] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(440)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r11,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #456] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-56] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #296] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #472] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-40] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #312] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #488] -vadd.u16 Q3, 
Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-24] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #328] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #504] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #152] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-8] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #344] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(504)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-488] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, 
[r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-136] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #8] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #360] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r12,#(-488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-472] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-8)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-120] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-168] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #24] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #376] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r12,#(-472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-456] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-280] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r11, #200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-104] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-152] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #40] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #392] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r12,#(-456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-440] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(24)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-88] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #328] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(328)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #408] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #-24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-376] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-424] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-200] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #152] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r14,#(152)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #456] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-248] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-72] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #72] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #344] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(344)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #424] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #-8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-8)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-360] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-184] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #168] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(168)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #472] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(472)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-56] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #88] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #360] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(360)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #440] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(8)] 
-vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-344] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-168] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #184] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #488] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(488)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-40] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #104] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #376] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(376)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #456] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-328] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-152] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #200] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(200)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #504] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(504)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #-24] 
-vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-72] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #120] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(392)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #472] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-312] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-136] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #216] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-488] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-488)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #-8] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-56] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #136] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(408)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #488] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-296] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-344] 
-vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-120] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #232] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(232)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-472] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-472)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #312] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #8] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-40] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #152] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(424)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #504] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-280] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-104] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #248] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(248)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-456] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #328] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #24] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-24] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #168] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, 
[sp, #440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(440)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-488] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-264] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-88] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #264] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(264)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-440] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #344] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #40] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-8] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #184] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(456)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-472] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-248] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-72] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #280] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(280)] 
-vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-424] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #360] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #56] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #8] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #200] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(472)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-456] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-232] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-56] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #296] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(296)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-408] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #376] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #72] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #216] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(488)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-440] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(136)] -vadd.u16 
Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-216] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-40] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #312] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-392] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #88] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #232] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #504] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(504)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-424] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #152] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-200] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-248] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-24] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #328] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-376] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #104] -vshr.u16 Q3, Q3, #1 
-vldrw.u32 Q6, [r9, #56] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #248] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-408] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #168] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(168)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-8] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #344] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-360] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #120] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #264] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-392] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #184] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(184)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-168] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-216] -vadd.u16 Q5, Q5, 
Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #8] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #360] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-344] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #136] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #280] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-376] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #200] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(200)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-152] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-200] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #24] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-328] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #152] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #296] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-440] -vadd.u16 
Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-360] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #216] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-136] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-184] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #40] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #392] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-312] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #472] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #168] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #312] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-344] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #232] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(232)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-120] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-168] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #56] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #408] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(408)] -vsub.u16 Q4, Q4, Q0 
-vldrw.u32 Q0, [r9, #-296] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #488] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #184] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #136] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #328] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-328] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #248] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-104] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-152] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #72] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #424] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(424)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-280] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #504] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #200] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #152] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #344] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-312] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #264] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(264)] -vadd.u16 Q0, Q0, Q4 
-vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-88] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-136] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #88] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #440] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(440)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-264] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-488] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #216] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #168] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #360] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-376] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-296] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #280] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-72] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-120] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #104] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #456] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-248] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-472] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #232] -vshr.u16 Q3, Q3, #1 -vldrw.u32 
Q6, [r9, #184] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #376] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-360] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-360)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-280] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #296] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-56] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-104] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #120] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #472] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-232] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #248] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #200] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #392] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-344] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-344)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-264] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #312] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-40] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-88] -vadd.u16 Q5, Q5, Q6 
-vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #136] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #488] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-216] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #264] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #216] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #408] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-328] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-248] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #328] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-24] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-72] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #152] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r14, #504] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(504)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-200] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #280] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #232] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #424] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-312] -vadd.u16 
Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r11, #-232] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #344] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-8] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-56] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #168] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r12, #-488] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-184] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #296] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #248] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #440] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-296] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-296)] -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r11, #-216] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q7, [r12, #360] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #8] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-40] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #184] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q2, [r12, #-472] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-472)] -vsub.u16 Q4, Q4, Q0 
-vldrw.u32 Q0, [r9, #-168] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #312] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #264] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #-280] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q1, Q2, r5 -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q3, [r12, #376] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #24] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #200] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vldrw.u32 Q1, [r12, #-456] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r12,#(-456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-152] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-152)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 
-vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], 
#+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 
-vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 
Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -add sp, sp, #5824 -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256.s b/tests/saber/auto/poly_u16_toom4_fwd_256.s deleted file mode 100644 index d66be05..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256.s +++ /dev/null @@ -1,182 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_256_mve, %function -.global poly_u16_toom4_fwd_256_mve -poly_u16_toom4_fwd_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, [r0, #128] -vldrw.u32 Q2, [r0, #256] -vldrw.u32 Q3, [r0, #384] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #272] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #400] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-240)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #160] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #416] 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-224)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #176] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #304] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #432] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-208)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #192] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #320] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #448] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-192)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-304)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #336] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #464] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-176)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #224] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #480] 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-160)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-272)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #240] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #368] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #496] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-144)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-256)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-128)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(368)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_bottom.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_bottom.s deleted file mode 100644 index cc6303a..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_bottom.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_bottom_256_mve, %function -.global poly_u16_toom4_fwd_dual_bottom_256_mve -poly_u16_toom4_fwd_dual_bottom_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #-384 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 
-vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, 
[r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 
Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-32)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s deleted file mode 100644 index 51d7ba4..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(144)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-240)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-48)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(384)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(400)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(176)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-208)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-16)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(416)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-192)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(0)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(432)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(240)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(208)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(16)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(448)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(256)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(464)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(272)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(240)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-144)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(48)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(480)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(288)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(256)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-128)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(64)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(304)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(496)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s deleted file mode 100644 index c9fa8c9..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -add r12, r14, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-144)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r12,#(-288)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(432)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(288)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-128)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r12,#(-272)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(304)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-112)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r12,#(-256)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(464)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(320)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-96)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r12,#(-240)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(336)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-48)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r12,#(-192)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(240)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-336)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-32)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r12,#(-176)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(256)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-320)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-16)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(128)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r12,#(-160)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(272)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-304)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(0)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(144)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r12,#(-144)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(288)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-432)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(432)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-288)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s deleted file mode 100644 index 24e5fd5..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s deleted file mode 100644 index 9a2599d..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_oop_256_mve -poly_u16_toom4_fwd_dual_packed_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_top.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_top.s deleted file mode 100644 index 969c014..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_top.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_top_256_mve, %function -.global poly_u16_toom4_fwd_dual_top_256_mve -poly_u16_toom4_fwd_dual_top_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #512 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, 
Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, 
[r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-32)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_top_oop.s b/tests/saber/auto/poly_u16_toom4_fwd_256_dual_top_oop.s deleted file mode 100644 index d76be58..0000000 --- 
a/tests/saber/auto/poly_u16_toom4_fwd_256_dual_top_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_top_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_top_oop_256_mve -poly_u16_toom4_fwd_dual_top_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #512 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(-32)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_512.s b/tests/saber/auto/poly_u16_toom4_fwd_512.s deleted file mode 100644 index c82123a..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_512.s +++ /dev/null @@ -1,351 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_512_mve, %function -.global poly_u16_toom4_fwd_512_mve -poly_u16_toom4_fwd_512_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, [r0, #256] -vldrw.u32 Q2, [r14, #-496] -vldrw.u32 Q3, [r14, #-240] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(16)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(272)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #272] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-480] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #-224] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-496)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(256)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(32)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #288] 
-vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-464] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #-208] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-480)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(272)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(48)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(304)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #304] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-448] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #-192] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-464)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(288)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(64)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(320)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #320] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-432] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #-176] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-432)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-448)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(304)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(80)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(336)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #336] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-416] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #-160] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-416)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(320)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(96)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(352)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #352] 
-vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-400] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #-144] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-400)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(336)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(112)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(368)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #368] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-384] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #-128] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-384)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(352)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(128)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(384)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #128] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #384] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-368] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #-112] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-368)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(368)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(400)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #144] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #400] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-352] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #-96] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-352)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-368)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(384)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(416)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #160] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, 
[r0, #416] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-336] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #-80] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-336)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-352)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(400)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(432)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #176] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #432] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-320] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #-64] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-320)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-336)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(416)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(448)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #192] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #448] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-304] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #-48] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-304)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-320)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(432)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(208)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(464)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #208] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #464] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-288] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #-32] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-288)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-304)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(448)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(224)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #224] -vshl.u16 Q5, Q5, #1 -vldrw.u32 
Q1, [r0, #480] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-272] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #-16] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-272)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-288)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(464)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(240)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(496)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #240] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #496] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-256] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-256)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-272)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(480)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(256)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-496)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-240)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(496)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-256)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_768.s b/tests/saber/auto/poly_u16_toom4_fwd_768.s deleted file mode 100644 index 37dd5da..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_768.s +++ /dev/null @@ -1,520 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_768_mve, %function -.global poly_u16_toom4_fwd_768_mve -poly_u16_toom4_fwd_768_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, [r0, #384] -vldrw.u32 Q2, [r14, #-240] -vldrw.u32 Q3, [r14, #144] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-480)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-96)] -vmla.s16 Q7, Q2, r8 -vmla.s16 
Q7, Q3, r7 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #400] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-224] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #160] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(288)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-240)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(384)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-464)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-80)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #416] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-208] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #176] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(304)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-224)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(400)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-448)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-64)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #432] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-192] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #192] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(320)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-208)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(416)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-432)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-48)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #448] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-176] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #208] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(336)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-192)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(432)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-416)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-32)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, 
Q3, r7 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #464] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-160] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #224] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(352)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-176)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(448)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-400)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-16)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #480] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-144] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #240] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(368)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(464)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-384)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(0)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #496] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-128] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #256] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(384)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-144)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(480)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-368)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(16)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #128] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-496] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-112] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #272] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(400)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(496)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-352)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(32)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, 
r7 -vldrw.u32 Q0, [r0, #144] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-480] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-96] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #288] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(416)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-112)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-336)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(48)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #160] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-464] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-80] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #304] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(432)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-320)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(64)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #176] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-448] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-64] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #320] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-80)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-304)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(80)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #192] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-432] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-48] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #336] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(464)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-288)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(96)] -vmla.s16 Q7, Q2, r8 -vmla.s16 
Q7, Q3, r7 -vldrw.u32 Q0, [r0, #208] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-416] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-32] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #352] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-48)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-272)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(112)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #224] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-400] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-16] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #368] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(496)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-256)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(128)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #240] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-384] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #0] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #384] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-16)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-240)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(144)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #256] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-368] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #16] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #400] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(0)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-224)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(160)] -vmla.s16 Q7, Q2, r8 
-vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #272] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-352] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #32] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #416] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-464)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(16)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-208)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(176)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #288] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-336] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #48] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #432] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-192)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(192)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #304] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-320] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #64] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #448] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-432)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(48)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-176)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(208)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #320] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-304] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #80] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #464] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-416)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-160)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(224)] -vmla.s16 Q7, 
Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #336] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-288] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #96] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #480] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-400)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(80)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-304)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-144)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(240)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #352] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-272] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #112] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #496] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-384)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-288)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-128)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(256)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #368] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-256] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #128] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #-496] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-368)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(112)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-272)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-112)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(272)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r11,#(-352)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r14,#(-256)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(128)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_832.s b/tests/saber/auto/poly_u16_toom4_fwd_832.s deleted file mode 100644 index 271e629..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_832.s +++ /dev/null @@ -1,562 +0,0 @@ -.syntax unified 
-.type poly_u16_toom4_fwd_832_mve, %function -.global poly_u16_toom4_fwd_832_mve -poly_u16_toom4_fwd_832_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, [r0, #416] -vldrw.u32 Q2, [r14, #-176] -vldrw.u32 Q3, [r14, #240] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-352)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(64)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #432] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-160] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #256] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-176)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(416)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-336)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(80)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #448] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-144] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #272] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(496)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(432)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-320)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(96)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #464] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-128] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #288] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-144)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(448)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-304)] -vmla.s16 Q7, Q4, r9 -vmla.s16 
Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(112)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #480] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-112] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #304] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(464)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-288)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(128)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #496] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-96] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #320] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-464)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-112)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(480)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-272)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(144)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-496] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-80] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #336] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(496)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-256)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(160)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-480] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #-64] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #352] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-432)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-80)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-496)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-240)] -vmla.s16 Q7, Q4, r9 
-vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(176)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #128] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-464] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #-48] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #368] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-416)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-480)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-224)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(192)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #144] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-448] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #-32] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #384] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-400)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-48)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-464)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-208)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(208)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #160] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-432] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #-16] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #400] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-384)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-448)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-192)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(224)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #176] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-416] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #0] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #416] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-368)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-16)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-432)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-176)] -vmla.s16 
Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(240)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #192] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-400] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #16] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #432] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-352)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(0)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-416)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-160)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(256)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #208] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-384] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #32] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #448] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-336)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(16)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-400)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-144)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(272)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #224] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-368] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #48] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #464] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-320)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-384)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-128)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(288)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #240] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-352] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #64] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #480] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-304)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(48)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-368)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-112)] 
-vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(304)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #256] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-336] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #80] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #496] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-288)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-352)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-96)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(320)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #272] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-320] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #96] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #-496] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-272)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(80)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-336)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-80)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(336)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #288] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-304] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #112] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r12, #-480] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-256)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-320)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-64)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(352)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #304] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-288] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #128] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #-464] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-240)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(112)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-304)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, 
[r12,#(-48)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(368)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #320] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-272] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #144] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r12, #-448] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-224)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-288)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-32)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(384)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #336] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-256] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #160] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #-432] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-208)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(144)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-272)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-16)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(400)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #352] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #-240] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #176] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r12, #-416] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-192)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-256)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(0)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(416)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #368] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #-224] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #192] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #-400] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-176)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(176)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-240)] -vmla.s16 Q6, Q0, r8 
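[Editor's note, not part of the patch: the `add r14, r0, #1008` / `add r12, r14, #1008` / `add r11, r12, #1008` preambles of these deleted kernels exist because the immediate offset of MVE `vldrw`/`vstrw` is a 7-bit multiple of 4, i.e. −508…+508 bytes. Auxiliary base pointers spaced 1008 bytes apart keep every access of a multi-kilobyte coefficient buffer inside that window; the 1008-byte spacing is this code generator's choice, not an architectural rule. A sketch of the addressing scheme:]

```python
# MVE VLDRW/VSTRW encode the offset as imm7 << 2, so a single base register
# reaches only [-508, +508] bytes. With bases at 0, 1008, 2016, 3024
# (r0, r14, r12, r11 in the deleted sources), the reachable windows overlap
# and jointly cover the whole buffer.

OFFSET_MIN, OFFSET_MAX = -508, 508  # signed imm7 * 4

def reachable_base(byte_offset, bases=(0, 1008, 2016, 3024)):
    """Return the first base whose imm7<<2 window covers byte_offset, else None."""
    for b in bases:
        d = byte_offset - b
        if OFFSET_MIN <= d <= OFFSET_MAX and d % 4 == 0:
            return b
    return None
```

This is why, e.g., element offset 512 is written as `[r14, #-496]` rather than `[r0, #512]` in the code above.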
-vstrw.u32 Q7, [r12,#(16)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(432)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #384] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #-208] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #208] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r12, #-384] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-160)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(192)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-224)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(32)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(448)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #400] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #-192] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #224] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #-368] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-144)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(208)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-208)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(48)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(464)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vshl.u16 Q5, Q5, #1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q4, [r14,#(-192)] -vadd.u16 Q5, Q5, Q7 -vstrw.u32 Q5, [r14,#(224)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s b/tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s deleted file mode 100644 index feae195..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve, %function -.global poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve -poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -add r12, r0, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, 
[r0, #128] -vldrw.u32 Q2, [r0, #256] -vldrw.u32 Q3, [r0, #384] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(144)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-240)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #272] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #400] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-48)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(384)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #160] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #416] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(400)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(176)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-208)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #176] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #304] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #432] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-16)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(416)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-192)] -vmla.s16 Q5, Q2, r9 
-vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #192] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #320] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #448] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(0)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(432)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(240)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(208)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #336] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #464] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(16)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(448)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(256)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #224] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #480] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(464)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(272)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(240)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-144)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #240] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #368] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #496] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(48)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(480)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(288)] -vmla.s16 Q6, Q0, r9 
-vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(256)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-128)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(64)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(304)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(496)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s b/tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s deleted file mode 100644 index 92f06fb..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s +++ /dev/null @@ -1,200 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve, %function -.global poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve -poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -add r12, r14, #1008 -add r11, r0, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, [r0, #128] -vldrw.u32 Q2, [r0, #256] -vldrw.u32 Q3, [r0, #384] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r14,#(-144)] -vmla.s16 Q6, Q5, r9 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r8 -vstrw.u32 Q3, [r12,#(-288)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #272] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #400] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(432)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(288)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r14,#(-128)] -vmla.s16 Q4, Q7, r9 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r8 -vstrw.u32 Q3, [r12,#(-272)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, 
[r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #160] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #416] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(304)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r14,#(-112)] -vmla.s16 Q5, Q6, r9 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r8 -vstrw.u32 Q3, [r12,#(-256)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #176] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #304] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #432] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(464)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(320)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r14,#(-96)] -vmla.s16 Q7, Q4, r9 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r8 -vstrw.u32 Q3, [r12,#(-240)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #192] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #320] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #448] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(336)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r14,#(-48)] -vmla.s16 Q6, Q5, r9 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q5, Q1, r8 -vstrw.u32 Q3, [r12,#(-192)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(240)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #336] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #464] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-336)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r14,#(-32)] 
-vmla.s16 Q4, Q7, r9 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q7, Q1, r8 -vstrw.u32 Q3, [r12,#(-176)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(256)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #224] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #480] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-320)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r14,#(-16)] -vmla.s16 Q5, Q6, r9 -vstrw.u32 Q0, [r1,#(128)] -vmla.s16 Q6, Q1, r8 -vstrw.u32 Q3, [r12,#(-160)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(272)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #240] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #368] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #496] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-304)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r14,#(0)] -vmla.s16 Q7, Q4, r9 -vstrw.u32 Q0, [r1,#(144)] -vmla.s16 Q4, Q1, r8 -vstrw.u32 Q3, [r12,#(-144)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(288)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-432)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(432)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-288)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_fwd_oop_256.s b/tests/saber/auto/poly_u16_toom4_fwd_oop_256.s deleted file mode 100644 index 99d1bea..0000000 --- a/tests/saber/auto/poly_u16_toom4_fwd_oop_256.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_oop_256_mve, %function -.global poly_u16_toom4_fwd_oop_256_mve -poly_u16_toom4_fwd_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -add r12, r0, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 
-mov r8, #7 -vldrw.u32 Q0, [r0, #0] -vldrw.u32 Q1, [r0, #128] -vldrw.u32 Q2, [r0, #256] -vldrw.u32 Q3, [r0, #384] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #16] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #272] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #400] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #32] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #160] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #416] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #48] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #176] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #304] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #432] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 
-vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #64] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #192] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #320] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #448] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #80] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #336] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #464] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #96] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #224] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #480] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #112] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #240] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #368] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #496] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, 
Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_dual_bottom_256.s b/tests/saber/auto/poly_u16_toom4_inv_dual_bottom_256.s deleted file mode 100644 index 988da90..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_dual_bottom_256.s +++ /dev/null @@ -1,381 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_bottom_256_mve, %function -.global poly_u16_toom4_inv_dual_bottom_256_mve -poly_u16_toom4_inv_dual_bottom_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -mov r14, #0 -mov r12, #0 -mov r11, #0 -mov r10, #21840 -mov r9, #45 -mov r8, #43691 -mov r7, #8 -mov r6, #-30 -mov r5, #4369 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -mov r1, #-1 -vldrw.u32 Q4, [r0, #-384] -vldrw.u32 Q5, [r0, #48] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r2 -vldrw.u32 Q6, [r0, #-368] -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #-352] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-320] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-336] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, 
Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #-368] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #-336] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-352] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #-384] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-352] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-368] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #-400] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #-368] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-384] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #-416] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-384] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-400] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #-432] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #-400] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-416] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #-448] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-416] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-432] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #-464] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vst43.u16 {Q0,Q1,Q2,Q3}, [r0] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r11 -vmov.u16 Q0[1], r12 -vmov.u16 Q0[2], r14 -vldrw.u32 Q1, [r0, #-448]! 
-vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s b/tests/saber/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s deleted file mode 100644 index 4758980..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_bottom_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_bottom_oop_256_mve -poly_u16_toom4_inv_dual_bottom_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -mov r14, #0 -mov r12, #0 -mov r11, #0 -mov r10, #21840 -mov r9, #45 -mov r8, #43691 -mov r7, #8 -mov r6, #-30 -mov r5, #4369 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q4, [r0, #-384] -vldrw.u32 Q5, [r0, #48] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r2 -vldrw.u32 Q6, [r0, #-368] -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #-352] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-320] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-336] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #96] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #80] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #-304] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #64] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #-272] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-288] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #176] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #160] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #-256] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #128] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-224] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-240] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #240] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #224] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #-208] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #192] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #-176] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-192] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #304] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #272] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #-160] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #256] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-128] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-144] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #368] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #336] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #-112] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #320] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #-80] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-96] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #432] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #416] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #400] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #-64] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #384] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #-32] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #-48] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #496] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #480] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #464] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #-16] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #448] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r11 -vmov.u16 Q0[1], r12 -vmov.u16 Q0[2], r14 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s b/tests/saber/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s deleted file mode 100644 index bee584a..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve -poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -vldrw.u32 Q4, [r14, #-496] -vldrw.u32 Q5, [r0, #384] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q6, [r14, #-368] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #256] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, 
#128] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-240] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-352] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-480] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #400] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #272] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-224] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #16] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #-336] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-464] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #416] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #160] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-208] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #32] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-320] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-448] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #432] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #304] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #176] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-192] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #48] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #-304] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-432] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #448] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #320] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #192] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-176] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #64] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-288] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-416] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #464] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #336] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-160] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #80] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #-272] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-400] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #480] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #224] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-144] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #96] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-256] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-384] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #496] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #368] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #240] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-128] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #112] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r1, #-448]! 
-vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_dual_top_256.s b/tests/saber/auto/poly_u16_toom4_inv_dual_top_256.s deleted file mode 100644 index e31e79b..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_dual_top_256.s +++ /dev/null @@ -1,381 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_top_256_mve, %function -.global poly_u16_toom4_inv_dual_top_256_mve -poly_u16_toom4_inv_dual_top_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -mov r1, #1 -vldrw.u32 Q4, [r14, #-496] -vldrw.u32 Q5, [r0, #48] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r1 -vldrw.u32 Q6, [r14, #-480] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-464] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #-432] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-448] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-416] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #-384] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-400] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-368] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #-336] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-352] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-320] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #-288] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-304] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-272] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #-240] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-256] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-224] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #-192] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-208] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-176] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #-144] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-160] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-128] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vst43.u16 {Q0,Q1,Q2,Q3}, [r0] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r0, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_dual_top_oop_256.s b/tests/saber/auto/poly_u16_toom4_inv_dual_top_oop_256.s deleted file mode 100644 index 95de04d..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_dual_top_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_top_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_top_oop_256_mve -poly_u16_toom4_inv_dual_top_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -vldrw.u32 Q4, [r14, #-496] -vldrw.u32 Q5, [r0, #48] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q6, [r14, #-480] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #32] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #16] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-464] 
-vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #0] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-432] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-448] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #112] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #96] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #80] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-416] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #64] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #-384] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-400] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #176] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #160] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #144] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-368] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #128] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-336] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-352] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #240] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #224] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #208] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-320] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #192] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #-288] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-304] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #304] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #288] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #272] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-272] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #256] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-240] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-256] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #368] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #352] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #336] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-224] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #320] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #-192] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-208] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #432] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #416] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #400] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #-176] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #384] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #-144] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #-160] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #496] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #480] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #464] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #-128] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #448] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r1, #-448]! 
-vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_full_256.s b/tests/saber/auto/poly_u16_toom4_inv_full_256.s deleted file mode 100644 index 30c5259..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_full_256.s +++ /dev/null @@ -1,765 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_256_mve, %function -.global poly_u16_toom4_inv_256_mve -poly_u16_toom4_inv_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r0, r0, #504 -add r14, r0, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r7, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [r14, #-232] -vldrw.u32 Q1, [r0, #8] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #-248] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #-488] -vldrw.u32 Q4, [r0, #264] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #-216] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [r0, #280] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #24] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r7 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-472] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 
-vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #296] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #40] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-456] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #56] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #312] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #56] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-440] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #328] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 
-vstrw.u32 Q1, [r0,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #72] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-424] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #344] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #88] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-408] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #360] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 
-vldrw.u32 Q1, [r0, #104] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-392] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #376] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #120] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-376] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #136] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #392] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #136] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, 
[r0,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-360] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #152] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-88] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-248] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-248)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #408] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #264] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-232] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #152] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-488] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #8] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #24] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-344] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #168] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-72] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-232] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-232)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #424] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #280] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 
-vldrw.u32 Q7, [r14, #-216] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #168] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #-472] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #24] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #40] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-328] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #184] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-216] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-216)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #440] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #296] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-200] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #184] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-456] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #40] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #56] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-312] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #200] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 
Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-40] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-200] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-200)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #456] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #312] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #200] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #-440] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #56] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #72] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-296] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #216] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-24] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-184] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-184)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #472] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #328] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-168] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #216] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #-424] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 
Q0, Q0, r6 -vldrw.u32 Q2, [r0, #72] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #88] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-280] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #232] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-168] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-168)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #488] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #344] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #-152] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #232] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #-408] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #88] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #104] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-264] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #248] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #-152] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-152)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #504] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, 
#360]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r0,#(360)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r8
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r14, #-136]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r14,#(-136)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r0, #248]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r7
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r14, #-392]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r14,#(-392)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r6
-vldrw.u32 Q2, [r0, #104]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(104)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r14, #120]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r14,#(120)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #-8]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r14, #-248]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #-264]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r14, #264]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r12
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r5
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r3
-vsub.u16 Q2, Q2, Q3
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q4, [r0, #-136]
-vadd.u16 Q4, Q4, Q3
-vstrw.u32 Q4, [r0,#(-136)]
-vmla.s16 Q1, Q2, r10
-vmla.s16 Q6, Q2, r11
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r9
-vldrw.u32 Q3, [r0, #376]
-vadd.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r0,#(376)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r8
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r14, #-120]
-vadd.u16 Q3, Q3, Q2
-vstrw.u32 Q3, [r14,#(-120)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r7
-vneg.s16 Q5, Q5
-vldrw.u32 Q1, [r14, #-376]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r14,#(-376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r6
-vldrw.u32 Q1, [r0, #120]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r0,#(120)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r14, #136]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r14,#(136)]
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_toom4_inv_full_512.s b/tests/saber/auto/poly_u16_toom4_inv_full_512.s
deleted file mode 100644
index 5d91b9c..0000000
--- a/tests/saber/auto/poly_u16_toom4_inv_full_512.s
+++ /dev/null
@@ -1,1511 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_512_mve, %function
-.global poly_u16_toom4_inv_512_mve
-poly_u16_toom4_inv_512_mve:
-push {r4-r11,lr}
-vpush {d0-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-mov r10, #-64
-mov r9, #45
-mov r8, #-8
-mov r7, #43691
-mov r6, #16
-mov r5, #30
-mov r4, #61167
-mov r3, #-65
-mov r2, #36409
-mov r1, #1
-vldrw.u32 Q0, [r12, #40]
-vldrw.u32 Q1, [r14, #-488]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #8]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r12, #-472]
-vldrw.u32 Q4, [r14, #24]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #-504]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r11, #-456]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r3
-vsub.u16 Q3, Q3, Q6
-vmla.s16 Q1, Q1, r1
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r12, #56]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r8
-vldrw.u32 Q5, [r14, #40]
-vmla.s16 Q0, Q3, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r7
-vstrw.u32 Q1, [r14,#(-488)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r6
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r12,#(-472)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r2
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r14, #-472]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r5
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r14,#(24)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r4
-vstrw.u32 Q2, [r0,#(8)]
-vsub.u16 Q0, Q0, Q2
-vstrw.u32 Q0, [r12,#(40)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #24]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #-456]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #-488]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #-440]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r3
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #72]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #56]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r7
-vstrw.u32 Q1, [r14,#(-472)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-456] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-424] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #88] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #72] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-440] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-408] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #88] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-424] -vadd.u16 Q5, Q5, Q6 
-vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-392] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #104] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-408] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #120] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-392] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, 
[r12,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-376] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #136] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-376] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-360] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #152] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-360] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-344] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, 
Q3, #1 -vldrw.u32 Q6, [r11, #-328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #168] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-344] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-328] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #184] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-328] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-312] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 
Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #216] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #200] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-312] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #216] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-296] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #232] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 
-vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-280] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #264] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #248] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-264] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #280] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #264] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 
Q0, Q0 -vldrw.u32 Q1, [r14, #-248] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #248] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #296] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #280] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-232] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #312] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #8] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(8)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #296] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #40] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 
-vldrw.u32 Q1, [r14, #-216] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-472] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-488] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-456] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #328] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #24] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(24)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #312] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #56] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-200] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-456] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-472] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-440] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-168] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #344] -vadd.u16 Q1, Q1, Q0 
-vldrw.u32 Q7, [r0, #40] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(40)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #328] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #72] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-184] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-440] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-456] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-424] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-152] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #360] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #56] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(56)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #344] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #88] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-168] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-424] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-440] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, 
[r11, #-408] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-152] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #376] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #72] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(72)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #360] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #104] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-152] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-408] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-424] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-392] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #392] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #88] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(88)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #376] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, 
Q1 -vldrw.u32 Q7, [r12, #120] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-136] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-392] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-408] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-376] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #408] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #104] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(104)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #392] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #136] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-120] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-376] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-392] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-360] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-136] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 
Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #424] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #120] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(120)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #408] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #152] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-104] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-360] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-344] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-88] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-120] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-72] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #440] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #136] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(136)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #424] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #152] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #168] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-88] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-344] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-344)] 
-vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-360] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-328] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-72] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-104] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-56] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #456] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #152] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(152)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #440] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #168] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-72] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-328] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-344] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-312] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-56] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-88] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-40] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #472] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #168] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(168)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #456] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 
-vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #184] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #200] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-56] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-312] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-328] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-296] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-40] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-72] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #-24] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #488] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #184] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(184)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #472] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #200] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #216] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-40] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-296] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-312] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-280] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, 
#-24] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-56] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #-8] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #504] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #200] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(200)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #488] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #216] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #232] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-24] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-280] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-296] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-264] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-8] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-40] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #8] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #-488] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #216] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(216)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #504] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #232] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #248] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, 
[r14, #-8] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #-264] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-280] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #-248] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #8] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-24] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #-472] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #232] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(232)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r12, #-488] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #248] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #264] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #8] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #-248] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #-264] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-264)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #-232] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #504] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #24] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-8] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r0, #248] -vadd.u16 Q4, Q4, Q3 
-vstrw.u32 Q4, [r0,#(248)]
-vmla.s16 Q1, Q2, r8
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r7
-vldrw.u32 Q3, [r14, #264]
-vadd.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r14,#(264)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r6
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r12, #280]
-vadd.u16 Q3, Q3, Q2
-vstrw.u32 Q3, [r12,#(280)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r2
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r5
-vneg.s16 Q5, Q5
-vldrw.u32 Q1, [r12, #-232]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r12,#(-232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r4
-vldrw.u32 Q1, [r14, #-248]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-248)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #-216]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-216)]
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_toom4_inv_full_768.s b/tests/saber/auto/poly_u16_toom4_inv_full_768.s
deleted file mode 100644
index 6bd186d..0000000
--- a/tests/saber/auto/poly_u16_toom4_inv_full_768.s
+++ /dev/null
@@ -1,2303 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_768_mve, %function
-.global poly_u16_toom4_inv_768_mve
-poly_u16_toom4_inv_768_mve:
-push {r4-r11,lr}
-vpush {d0-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-mov r8, #-64
-mov r7, #45
-mov r6, #-8
-mov r5, #43691
-mov r4, #16
-mov r3, #30
-mov r2, #61167
-mov r1, #-65
-vldrw.u32 Q0, [r11, #312]
-vldrw.u32 Q1, [r14, #24]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #264]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r11, #-456]
-vldrw.u32 Q4, [r12, #-216]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #-504]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r10, #72]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r1
-vsub.u16 Q3, Q3, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r11, #328]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r6
-vldrw.u32 Q5, [r12, #-200]
-vmla.s16 Q0, Q3, r7
-vshr.u16 Q1, Q1,
#3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(24)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r4 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-456)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r5 -vmul.u16 Q0, Q0, r5 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #40] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r3 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-216)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r2 -vstrw.u32 Q2, [r0,#(264)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r11,#(312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-440] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #344] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #-184] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #56] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-424] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #360] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #-168] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r11,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #72] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-408] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #376] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #-152] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #88] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-392] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #392] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #-136] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #104] -vadd.u16 Q3, Q3, 
Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-376] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #152] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #408] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #-120] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #120] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-360] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #168] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #424] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #-104] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #136] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, 
[r0,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-344] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #440] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #-88] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #152] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-328] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #456] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #-72] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #168] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #408] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r11, #-312] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #472] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #-56] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #184] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-296] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #488] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #-40] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #200] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-280] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #248] -vsub.u16 Q1, 
Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #504] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #-24] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #216] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-264] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-488] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #-8] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(504)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-248] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 
-vldrw.u32 Q4, [r10, #-472] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #8] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #248] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-232] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-456] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #24] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #504] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-216] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-440] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #40] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 
-vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-200] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-424] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #56] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-184] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-408] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #72] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-168] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-392] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #88] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-152] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-376] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #104] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #344] -vadd.u16 
Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-136] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #392] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-360] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #120] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-120] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #408] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-344] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #136] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 
-vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-104] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #424] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-328] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #152] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(376)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-88] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-136] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #440] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-312] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #168] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(392)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-360] 
-vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-72] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-120] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-296] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #264] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(264)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #184] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-216] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #312] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #424] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-456] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #24] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(24)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #72] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-56] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-104] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #472] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-280] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #280] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(280)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #200] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-200] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #328] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(328)] -vshr.u16 
Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #440] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-440] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #40] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(40)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #88] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-40] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-88] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #488] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-264] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #296] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(296)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #216] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-184] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #344] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #456] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-424] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #56] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(56)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #104] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-24] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-72] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #504] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 
Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-248] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #312] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(312)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #232] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-168] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #360] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #472] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-408] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #72] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(72)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #120] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-8] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-56] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-488] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-232] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #328] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #248] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-152] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #376] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #488] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-392] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #88] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(88)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #136] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #8] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-40] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-472] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-216] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #344] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #264] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #392] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #504] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-376] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #104] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(104)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #152] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #24] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-24] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-200] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #360] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #280] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 
Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #408] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-488] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-360] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #120] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(120)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #168] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-248] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #40] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-8] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-440] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-184] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #376] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(376)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #296] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #424] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-472] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-344] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #136] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(136)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #184] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-232] -vsub.u16 Q0, 
Q0, Q1 -vldrw.u32 Q2, [r11, #56] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #8] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-424] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-168] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #312] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #440] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-456] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-328] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #152] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(152)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #200] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #72] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #24] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-408] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-152] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #328] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #456] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 
Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-440] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-312] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #168] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(168)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #216] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #88] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #40] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-392] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-136] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #344] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #472] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-424] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-296] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #184] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #232] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #56] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 
-vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-120] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #360] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #488] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-280] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #200] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(200)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #248] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #72] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-104] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #376] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r11, #504] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r11,#(504)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-264] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, 
Q0, r2 -vldrw.u32 Q2, [r14, #216] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #264] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #88] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-88] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #392] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-8)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-488] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-248] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #232] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(232)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #280] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #152] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #104] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-72] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #408] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 
-vldrw.u32 Q7, [r12, #8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(8)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-472] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-232] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #248] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(248)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #296] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #120] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #504] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(504)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #424] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-456] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-344] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-216] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #264] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(264)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #312] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-104] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r11, #184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #136] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-40] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #440] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-440] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-200] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #280] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(280)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #328] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #152] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-24] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #456] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-424] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-424)] -vshr.u16 Q6, Q6, #1 
-vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-184] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #296] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(296)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #344] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #168] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #472] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-408] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-168] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #312] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #360] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #184] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, 
Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #8] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #488] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-392] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-152] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #328] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #376] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #200] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #24] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #504] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-376] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-136] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #344] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #392] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #216] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #40] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-488] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-360] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-248] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-120] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #360] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #408] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #232] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-472] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 
-vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-344] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-104] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #424] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #248] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #-376] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-376)] -vmla.s16 Q1, Q2, r6 -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q3, [r12, #152] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #-328] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #-88] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q1, [r14, #392] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #440] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(440)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_toom4_inv_full_832.s b/tests/saber/auto/poly_u16_toom4_inv_full_832.s
deleted file mode 100644
index fc9e7e1..0000000
--- a/tests/saber/auto/poly_u16_toom4_inv_full_832.s
+++ /dev/null
@@ -1,2493 +0,0 @@
-.syntax unified -.type poly_u16_toom4_inv_832_mve, %function -.global poly_u16_toom4_inv_832_mve -poly_u16_toom4_inv_832_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #-64 -mov r7, #45 -mov r6, #-8 -mov r5, #43691 -mov r4, #16 -mov r3, #30 -mov r2, #61167 -mov r1, #-65 -vldrw.u32 Q0, [r10, #-376] -vldrw.u32 Q1, [r14, #152] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #328] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #-200] -vldrw.u32 Q4, [r12, #-24] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r1 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r10, #-360] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r6 -vldrw.u32 Q5, [r12, #-8] -vmla.s16 Q0, Q3, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r4 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-200)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r5 -vmul.u16 Q0, Q0, r5 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #168] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r3 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r2 -vstrw.u32 Q2, [r0,#(328)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-184] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #472] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-344] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #8] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #184] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-168] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #488] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-328] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #24] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #200] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-152] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #504] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-312] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #40] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 
-vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #216] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-136] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-488] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-296] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #56] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-120] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-472] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-280] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #72] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 
-vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #248] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-104] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-456] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-264] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #88] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-88] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-440] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #104] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 
Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-72] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-424] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #120] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-56] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-408] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-216] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #136] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-232)] -vadd.u16 Q4, Q4, Q1 
-vldrw.u32 Q0, [r0, #488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-40] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-392] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #152] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #504] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-24] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #168] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #344] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-488] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #-8] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, 
Q3, #1 -vldrw.u32 Q6, [r9, #-360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #184] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-472] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #8] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #200] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-456] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #24] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, 
Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #216] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(376)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-440] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #40] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #232] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(392)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-424] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #56] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, 
#248] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(408)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #424] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-408] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #72] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-88] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #264] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(424)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #440] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-392] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #88] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-72] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #280] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(440)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, 
r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #456] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-376] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-56] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #296] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #472] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-40] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #312] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 
-vldrw.u32 Q1, [r14, #488] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #-24] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #328] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #504] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #152] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #-8] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #344] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(504)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-488] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(328)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-136] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #8] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #360] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r12,#(-488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-472] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-8)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-120] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-168] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #24] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #376] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r12,#(-472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-456] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, 
[r14, #-280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-104] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-152] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #40] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #392] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r12,#(-456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-440] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(24)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-88] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #56] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #328] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #408] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-376] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-424] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-200] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #152] 
-vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(152)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #456] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-248] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-72] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #72] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #344] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #424] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #-8] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-8)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-360] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-408] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-184] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #168] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(168)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #472] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(472)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-56] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #88] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #360] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #440] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #8] -vadd.u16 
Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(8)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-344] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-392] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-168] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #184] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #488] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(488)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-40] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #104] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #376] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(376)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #456] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #24] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-328] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-376] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-152] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #200] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(200)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #504] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(504)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #280] -vsub.u16 Q3, Q3, 
Q2 -vldrw.u32 Q5, [r0, #-24] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-72] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #120] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #472] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #40] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-312] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-360] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-136] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #216] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-488] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-488)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-8] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-56] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #136] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #488] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #56] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-296] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 
-vldrw.u32 Q1, [r12, #-344] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-120] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #232] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(232)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-472] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-472)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #312] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #8] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-40] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #152] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #504] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #72] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-280] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-328] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-104] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #248] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(248)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-456] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #328] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #24] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #-24] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #168] -vadd.u16 
Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-488] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #88] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-264] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-312] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-88] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #264] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(264)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-440] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #344] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #40] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #-8] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #184] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-472] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #104] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-248] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-296] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-72] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #280] -vadd.u16 Q2, Q2, Q0 
-vstrw.u32 Q2, [r14,#(280)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-424] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #360] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #56] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #8] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #200] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-456] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #120] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-232] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-280] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-56] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #296] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(296)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-408] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #376] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #72] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #216] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-440] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #136] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 
Q7, [r12,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-216] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-264] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-40] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #312] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-392] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #88] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #232] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #504] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(504)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-424] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #152] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-200] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-248] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #-24] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #328] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-376] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, 
#104] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #56] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #248] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-488] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-408] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #168] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(168)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-184] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-232] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #-8] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #344] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-360] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #120] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #264] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-472] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-392] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #184] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(184)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-168] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, 
[r12, #-216] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #8] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #360] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-344] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #136] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #280] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-456] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-376] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #200] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(200)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-152] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-200] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #24] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #376] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-328] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #152] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #296] -vadd.u16 Q1, Q1, Q0 
-vldrw.u32 Q7, [r14, #-440] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-360] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #216] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-136] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-184] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #40] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #392] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-312] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #472] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #168] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #312] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-424] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-344] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #232] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(232)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-120] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-168] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #56] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #408] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r14,#(408)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-296] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #488] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #184] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #136] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #328] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-408] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-328] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #248] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-104] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-152] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #72] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #424] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(424)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-280] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #504] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #200] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #152] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #344] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-392] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-312] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #264] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, 
[r12,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-88] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-136] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #88] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #440] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(440)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-264] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-488] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #216] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #168] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #360] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-376] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-296] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #280] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-72] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-120] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #104] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #456] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-248] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-472] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #232] 
-vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #184] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #376] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-360] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-280] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #296] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-56] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-104] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #120] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #472] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-232] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #248] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #200] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #392] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-344] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-264] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #312] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-40] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, 
#-88] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #136] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #488] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-216] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #264] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #216] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #408] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-328] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-248] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #328] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-24] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-72] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #152] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #504] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(504)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-200] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #280] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #232] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #424] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 
Q7, [r14, #-312] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #-232] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #344] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #-8] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-56] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #168] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r12, #-488] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-184] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #296] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #248] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #440] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #-296] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-296)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #-216] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #360] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #8] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #-40] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #184] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r12, #-472] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-472)] 
-vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #-168] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #-392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #312] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #264] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #-280] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q1, Q2, r6 -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q3, [r12, #376] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #24] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #200] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q1, [r12, #-456] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r12,#(-456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #-152] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-152)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_half_256.s b/tests/saber/auto/poly_u16_toom4_inv_half_256.s deleted file mode 100644 index 64b31d5..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_half_256.s +++ /dev/null @@ -1,340 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_256_mve, %function -.global poly_u16_toom4_inv_half_256_mve -poly_u16_toom4_inv_half_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r0, r0, #504 -mov r14, #-64 -mov r12, #45 -mov r11, #-8 -mov r10, #43691 -mov r9, #16 -mov r8, #30 -mov r7, #61167 -mov r6, #-65 -mov r5, #36409 -mov r4, #1 -vldrw.u32 Q0, [r0, #136] -vldrw.u32 Q1, [r0, #-248] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, 
[r0, #-376] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r0, #8] -vldrw.u32 Q4, [r0, #-120] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r0, #264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r6 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r0, #152] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r11 -vldrw.u32 Q5, [r0, #-104] -vmla.s16 Q0, Q3, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-248)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r9 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r0,#(8)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r5 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #-232] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r8 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(-120)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r7 -vstrw.u32 Q2, [r0,#(-376)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r0,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-360] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #24] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #-88] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #-216] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-360)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-344] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #40] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #296] -vsub.u16 
Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #-72] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #-200] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-344)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-328] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #56] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #-56] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #-184] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-328)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #72] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #216] -vadd.u16 Q1, Q1, Q0 
-vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #-40] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #-168] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-312)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #88] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #-24] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #-152] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-296)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #104] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #-8] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 
-vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #-136] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-280)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #120] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-264)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(248)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/auto/poly_u16_toom4_inv_half_512.s b/tests/saber/auto/poly_u16_toom4_inv_half_512.s deleted file mode 100644 index f44a729..0000000 --- a/tests/saber/auto/poly_u16_toom4_inv_half_512.s +++ /dev/null @@ -1,661 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_512_mve, %function -.global poly_u16_toom4_inv_half_512_mve -poly_u16_toom4_inv_half_512_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r0, r0, #504 -add r14, r0, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r7, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [r14, #-232] -vldrw.u32 Q1, [r0, #8] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #-248] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #-488] -vldrw.u32 Q4, [r0, 
#264] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #-504] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #-216] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [r0, #280] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #24] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r7 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-472] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #296] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #40] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-456] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #56] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, 
r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #312] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #56] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-440] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #328] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #72] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-424] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #344] -vmla.s16 Q4, Q2, 
r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #88] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-408] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #360] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #104] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-392] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #376] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, 
Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #120] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-376] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #136] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #392] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #136] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-360] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #152] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-88] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #408] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #152] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(392)] -vshr.u16 Q0, Q0, 
#2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-120)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-344] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #168] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-72] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #424] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #168] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-104)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-328] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #184] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-56] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #440] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(168)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #184] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-88)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 
Q2, [r14, #-312] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #200] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-40] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #456] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(184)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #200] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-296] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #216] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #-24] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #472] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(200)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #216] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-280] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #232] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 
-vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #-8] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #488] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(216)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #232] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-24)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-264] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #248] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #8] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #504] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(232)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #248] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #-248] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #264] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, 
r9
-vstrw.u32 Q1, [r0,#(248)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r8
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-248)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r7
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r0,#(504)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r6
-vstrw.u32 Q0, [r0,#(-8)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r14,#(8)]
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_toom4_inv_half_768.s b/tests/saber/auto/poly_u16_toom4_inv_half_768.s
deleted file mode 100644
index e78c981..0000000
--- a/tests/saber/auto/poly_u16_toom4_inv_half_768.s
+++ /dev/null
@@ -1,982 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_half_768_mve, %function
-.global poly_u16_toom4_inv_half_768_mve
-poly_u16_toom4_inv_half_768_mve:
-push {r4-r11,lr}
-vpush {d0-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-mov r11, #-64
-mov r10, #45
-mov r9, #-8
-mov r8, #43691
-mov r7, #16
-mov r6, #30
-mov r5, #61167
-mov r4, #-65
-mov r3, #36409
-mov r2, #1
-vldrw.u32 Q0, [r14, #408]
-vldrw.u32 Q1, [r0, #264]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #-120]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r14, #24]
-vldrw.u32 Q4, [r14, #-360]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #-504]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r12, #-216]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r11
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r4
-vsub.u16 Q3, Q3, Q6
-vmla.s16 Q1, Q1, r2
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r14, #424]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r9
-vldrw.u32 Q5, [r14, #-344]
-vmla.s16 Q0, Q3, r10
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r8
-vstrw.u32 Q1, [r0,#(264)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r7
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r14,#(24)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r3
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r0, #280]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r6
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r14,#(-360)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r5
-vstrw.u32 Q2,
[r0,#(-120)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #40] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-200] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #440] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-328] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #296] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-104)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #56] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #-184] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #456] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-312] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #312] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-88)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #72] -vsub.u16 Q5, Q5, Q2 
-vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-168] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #472] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-296] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #328] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #88] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #-152] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #488] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-280] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #344] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #104] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-136] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 
-vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #504] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-264] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #360] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #120] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #-120] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-488] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-248] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #376] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(504)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #136] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-104] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-472] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-232] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, 
#3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #392] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #152] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #-88] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-456] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-216] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #408] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-72] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-440] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-200] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 
-vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #424] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #-56] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-424] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-184] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #440] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-40] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-408] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-168] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #456] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 
-vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #-24] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-392] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-152] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #472] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-8] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-376] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-136] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #488] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #248] -vsub.u16 
Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #8] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-360] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-120] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #504] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #24] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-344] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-104] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(504)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-488] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #40] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 
-vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-328] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-88] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-472] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #56] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-312] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-72] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-456] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #312] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #72] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-296] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-56] -vmla.s16 Q4, 
Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-440] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #328] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-280] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-40] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-424] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #344] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-264] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-24] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(344)] 
-vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-408] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #360] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-8] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-392] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #232] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #376] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #8] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-376] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-8)] 
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r5
-vstrw.u32 Q0, [r0,#(232)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(-248)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #248]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r14, #392]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #-136]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r12, #152]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r11
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r4
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r2
-vsub.u16 Q2, Q2, Q3
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r9
-vmla.s16 Q6, Q2, r10
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r8
-vstrw.u32 Q1, [r14,#(-376)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r7
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r3
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r6
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r5
-vstrw.u32 Q0, [r0,#(248)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(-232)]
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/saber/auto/poly_u16_toom4_inv_half_832.s b/tests/saber/auto/poly_u16_toom4_inv_half_832.s
deleted file mode 100644
index 13bf65c..0000000
--- a/tests/saber/auto/poly_u16_toom4_inv_half_832.s
+++ /dev/null
@@ -1,1062 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_half_832_mve, %function
-.global poly_u16_toom4_inv_half_832_mve
-poly_u16_toom4_inv_half_832_mve:
-push {r4-r11,lr}
-vpush {d0-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-mov r11, #-64
-mov r10, #45
-mov r9, #-8
-mov r8, #43691
-mov r7, #16
-mov r6, #30
-mov r5, #61167
-mov r4, #-65
-mov r3, #36409
-mov r2, #1
-vldrw.u32 Q0, [r12, #-440]
-vldrw.u32 Q1, [r0, #328]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #-88]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r14, #152]
-vldrw.u32 Q4, [r14, #-264]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #-504]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r12, #-24]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r11
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r4
-vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #-424] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r9 -vldrw.u32 Q5, [r14, #-248] -vmla.s16 Q0, Q3, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(328)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r7 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #344] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(-264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [r0,#(-88)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #168] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-488] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #-8] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-408] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-232] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #360] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #184] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-472] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #8] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-392] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-216] -vmla.s16 Q4, Q2, 
r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #376] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #200] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-456] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #24] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-376] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-200] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #392] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #-24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #216] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-440] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #40] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-360] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-184] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(216)] -vshr.u16 Q4, 
Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #408] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #-8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #232] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-424] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #56] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-344] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-168] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #424] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #8] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #248] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-408] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #72] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-328] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-152] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #440] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-168)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #24] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #264] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-392] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #88] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-312] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-136] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #456] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #40] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #280] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-376] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #104] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-296] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-120] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #472] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #56] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, 
[r14, #296] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-360] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #120] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-280] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-104] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #488] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #72] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #312] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-344] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #136] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-264] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-88] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #504] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #88] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #328] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-328] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #152] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 
Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-248] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-72] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(504)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-488] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #104] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #344] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-312] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #168] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-232] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-56] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-472] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #120] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #360] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-296] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #184] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-216] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-40] 
-vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-456] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #136] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #376] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-280] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #200] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-200] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #-24] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-440] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #152] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #392] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-264] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #216] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-184] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #-8] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r14,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-424] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #168] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #408] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-248] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #232] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-168] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #8] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-408] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #184] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #424] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-232] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #248] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-152] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #24] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-392] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, 
[r14,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #200] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #440] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-216] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #264] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-136] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #40] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-376] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #216] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-200] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #280] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-120] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #56] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-360] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #232] 
-vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #472] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-184] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #296] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-104] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #72] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-344] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #248] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #488] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-168] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #312] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-88] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #88] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-328] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #264] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #504] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-152] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #328] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, 
Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-72] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #104] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(504)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-312] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(264)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #280] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-488] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-136] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #344] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #-56] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #120] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-296] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #296] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-472] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #-120] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #360] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #-40] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, 
Q2, r9 -vldrw.u32 Q5, [r14, #136] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #-280] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #312] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #-456] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #-104] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #376] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-40)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/saber/manual/karatsuba.h b/tests/saber/karatsuba.h similarity index 100% rename from tests/saber/manual/karatsuba.h rename to tests/saber/karatsuba.h diff --git a/tests/saber/manual/karatsuba.s b/tests/saber/karatsuba.s similarity index 100% rename from tests/saber/manual/karatsuba.s rename to tests/saber/karatsuba.s diff --git a/tests/saber/manual/karatsuba_const.h b/tests/saber/karatsuba_const.h similarity index 100% rename from tests/saber/manual/karatsuba_const.h rename to tests/saber/karatsuba_const.h diff --git a/tests/saber/misc.c b/tests/saber/misc.c new 
file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/saber/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/saber/misc.h b/tests/saber/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/saber/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/saber/manual/montgomery.h b/tests/saber/montgomery.h similarity index 100% rename from tests/saber/manual/montgomery.h rename to tests/saber/montgomery.h diff --git a/tests/saber/manual/montgomery.s b/tests/saber/montgomery.s similarity index 100% rename from tests/saber/manual/montgomery.s rename to tests/saber/montgomery.s diff --git a/tests/saber/manual/montgomery_const.h b/tests/saber/montgomery_const.h similarity index 100% rename from tests/saber/manual/montgomery_const.h rename to tests/saber/montgomery_const.h diff --git a/tests/saber/poly.c b/tests/saber/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/saber/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/saber/poly.h b/tests/saber/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/saber/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/saber/rng.c b/tests/saber/rng.c old mode 100755 new mode 100644 diff --git a/tests/saber/rng.h b/tests/saber/rng.h old mode 100755 new mode 100644 index cccf3b8..a7aeeeb --- a/tests/saber/rng.h +++ b/tests/saber/rng.h @@ -12,6 +12,7 @@ void randombytes_init(const int i); +#define randombytes SABER_randombytes int randombytes(unsigned char *x, unsigned long long xlen); #endif /* rng_h */ diff --git a/tests/saber/saber.mk b/tests/saber/saber.mk new file mode 100644 index 0000000..08f9329 --- /dev/null +++ b/tests/saber/saber.mk @@ -0,0 +1,28 @@ +# Test name - needs to match the directory name +TESTS += saber + +# All further variables must be prefixed with the capitalized test name + +# Platforms this 
test should run on (matching the directory name in envs/) +SABER_PLATFORMS += m55-an547 +SABER_PLATFORMS += m85-an555 + +# C sources required for this test +SABER_SOURCES += main.c +SABER_SOURCES += misc.c +SABER_SOURCES += kem.c +SABER_SOURCES += fips202.c +SABER_SOURCES += verify.c +SABER_SOURCES += SABER_indcpa.c +SABER_SOURCES += pack_unpack.c +SABER_SOURCES += poly_ntt.c +SABER_SOURCES += cbd.c +SABER_SOURCES += rng.c + + +# Assembly sources required for this test +SABER_ASMS += saber_round.s +SABER_ASMS += montgomery.s +SABER_ASMS += auto/inv_ntt_u32_33556993_28678040_incomplete.s +SABER_ASMS += auto/ntt_u32_33556993_28678040_incomplete.s +SABER_ASMS += auto/ntt_u32_33556993_28678040_incomplete_double.s diff --git a/tests/schoolbook/auto/poly_u16_mul_128_mve_comba.s b/tests/schoolbook/auto/poly_u16_mul_128_mve_comba.s deleted file mode 100644 index e5e2ffa..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_128_mve_comba.s +++ /dev/null @@ -1,6664 +0,0 @@ -.syntax unified -.type poly_u16_mul_128_comba_mve, %function -.global poly_u16_mul_128_comba_mve -poly_u16_mul_128_comba_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #28] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #-12] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #4] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #-10] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #6] 
-vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #28] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #10] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #-4] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #12] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #28] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #0] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #16] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #44] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q7, [r1, #18] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #-12] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #4] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #20] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #-10] -vldrw.u32 Q7, [Q2, #44] -strh r14, [r0,#+24] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #6] -vmladavax.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #22] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #-8] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #8] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #24] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #-6] -vldrw.u32 Q1, [Q2, #44] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #10] -vmladavax.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #26] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #-4] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vmladavax.s16 r8, 
Q1, Q5 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, #44] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #30] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #0] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #60] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -strh r12, [r0,#+38] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #4] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #20] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #36] -vmladavax.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #-10] -vldrw.u32 Q4, [Q2, #60] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #38] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #-8] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #8] -vmladavax.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #-6] -vldrw.u32 Q1, [Q2, #60] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #10] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #26] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #-4] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #12] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #28] 
-vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #60] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #30] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #0] -strh r12, [r0,#+50] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #16] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #32] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #48] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #76] -strh r10, [r0,#+52] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #18] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #50] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -vldrw.u32 Q0, [Q2, #76] -strh r8, [r0,#+54] -vmladavx.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #4] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #20] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #36] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #52] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #-10] -vldrw.u32 Q4, [Q2, #76] -strh r6, [r0,#+56] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #38] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #54] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #-8] -vldrw.u32 Q7, [Q2, #76] -strh r4, [r0,#+58] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #8] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r4, 
Q0, Q1 -vldrh.u16 Q3, [r1, #24] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #40] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #56] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #-6] -vldrw.u32 Q3, [Q2, #76] -strh r14, [r0,#+60] -vmladavx.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #10] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #26] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-4] -vldrw.u32 Q6, [Q2, #76] -strh r12, [r0,#+62] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #60] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #-2] -vldrw.u32 Q1, [Q2, #76] -strh r10, [r0,#+64] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #30] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #46] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #62] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #0] -vldrw.u32 Q5, [Q2, #76] -strh r8, [r0,#+66] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #16] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #32] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #48] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #64] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #92] -strh r6, [r0,#+68] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #2] -vldrw.u32 Q3, [Q2, #76] 
-vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #18] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #34] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #50] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #66] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -vldrw.u32 Q6, [Q2, #92] -strh r4, [r0,#+70] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #4] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #20] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #36] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #52] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #68] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #-10] -vldrw.u32 Q4, [Q2, #92] -strh r14, [r0,#+72] -vmladavx.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #38] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #54] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #70] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #-8] -vldrw.u32 Q1, [Q2, #92] -strh r12, [r0,#+74] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #8] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #72] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #-6] -vldrw.u32 Q7, [Q2, #92] -strh r10, [r0,#+76] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #10] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #26] -vldrw.u32 Q4, [Q2, #60] 
-vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #58] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #74] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #-4] -vldrw.u32 Q5, [Q2, #92] -strh r8, [r0,#+78] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #12] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #28] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #44] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #60] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #76] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #-2] -vldrw.u32 Q3, [Q2, #92] -strh r6, [r0,#+80] -vmladavx.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #14] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #46] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #62] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #78] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #0] -vldrw.u32 Q0, [Q2, #92] -strh r4, [r0,#+82] -vmladavx.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #48] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #64] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #80] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #108] -strh r14, [r0,#+84] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #34] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 
r14, Q4, Q5 -vldrh.u16 Q6, [r1, #50] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #66] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #82] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -vldrw.u32 Q6, [Q2, #108] -strh r12, [r0,#+86] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #4] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #20] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #36] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #52] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #68] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #84] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #-10] -vldrw.u32 Q6, [Q2, #108] -strh r10, [r0,#+88] -vmladavx.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #6] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #22] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #38] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #54] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #70] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #86] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #108] -strh r8, [r0,#+90] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #24] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #40] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #72] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #88] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #-6] -vldrw.u32 Q6, [Q2, #108] -strh r6, [r0,#+92] -vmladavx.s16 
r6, Q5, Q6 -vldrh.u16 Q7, [r1, #10] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #26] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #42] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #58] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #74] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #90] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-4] -vldrw.u32 Q6, [Q2, #108] -strh r4, [r0,#+94] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #60] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #76] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #92] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #-2] -vldrw.u32 Q6, [Q2, #108] -strh r14, [r0,#+96] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #14] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #62] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #78] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #94] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #108] -strh r12, [r0,#+98] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #32] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #48] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #64] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, 
[r1, #80] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #96] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #124] -strh r10, [r0,#+100] -vmladavx.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #34] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #50] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #66] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #82] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #98] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -vldrw.u32 Q0, [Q2, #124] -strh r8, [r0,#+102] -vmladavx.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #4] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #20] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #36] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #52] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #68] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #84] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #100] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #-10] -vldrw.u32 Q3, [Q2, #124] -strh r6, [r0,#+104] -vmladavx.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #6] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #22] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #38] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #54] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #70] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #86] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #102] -vldrw.u32 
Q3, [Q2, #12] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #-8] -vldrw.u32 Q5, [Q2, #124] -strh r4, [r0,#+106] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #8] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #24] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #40] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #56] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #72] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #88] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #104] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #-6] -vldrw.u32 Q7, [Q2, #124] -strh r14, [r0,#+108] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #10] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #26] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #58] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #74] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #90] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #106] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #-4] -vldrw.u32 Q1, [Q2, #124] -strh r12, [r0,#+110] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #12] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #28] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #60] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #76] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #92] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #108] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, 
#124] -strh r10, [r0,#+112] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #46] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #62] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #78] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #94] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #110] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #124] -strh r8, [r0,#+114] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #32] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #48] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #64] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #80] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #96] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #112] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #140] -strh r6, [r0,#+116] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #2] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #18] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #34] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #50] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #66] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #82] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #98] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #114] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #-12] -vldrw.u32 Q5, [Q2, #140] -strh r4, 
[r0,#+118] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #4] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #20] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #36] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #52] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #68] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #84] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #100] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #116] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #-10] -vldrw.u32 Q1, [Q2, #140] -strh r14, [r0,#+120] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #6] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #22] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #38] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #54] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #70] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #86] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #102] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #118] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #140] -strh r12, [r0,#+122] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #24] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #40] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #72] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #88] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #104] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, 
Q5, Q6 -vldrh.u16 Q7, [r1, #120] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #-6] -vldrw.u32 Q3, [Q2, #140] -strh r10, [r0,#+124] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #10] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #26] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #74] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #90] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #106] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #122] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #-4] -vldrw.u32 Q7, [Q2, #140] -strh r8, [r0,#+126] -vmladavx.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #12] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #60] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #76] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #92] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #108] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #124] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, #140] -strh r6, [r0,#+128] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #46] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #62] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #78] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r6, Q6, Q7 
-vldrh.u16 Q0, [r1, #94] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #110] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #126] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #0] -vldrw.u32 Q0, [Q2, #140] -strh r4, [r0,#+130] -vmladavx.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #48] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #64] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #80] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #96] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #112] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #128] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #156] -strh r14, [r0,#+132] -vmladavx.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #18] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #50] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #66] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #82] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #98] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #114] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #130] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vldrw.u32 Q4, [Q2, #156] -strh r12, [r0,#+134] -vmladavx.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #4] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #20] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r12, Q7, Q0 
-vldrh.u16 Q1, [r1, #36] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #52] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #68] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #84] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #100] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #116] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #132] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #-10] -vldrw.u32 Q3, [Q2, #156] -strh r10, [r0,#+136] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #6] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #22] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #38] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #54] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #70] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #86] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #102] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #118] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #134] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #-8] -vldrw.u32 Q1, [Q2, #156] -strh r8, [r0,#+138] -vmladavx.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #8] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #72] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #88] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #104] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #120] 
-vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #136] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #-6] -vldrw.u32 Q0, [Q2, #156] -strh r6, [r0,#+140] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #10] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #26] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #74] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #90] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #106] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #122] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #138] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #-4] -vldrw.u32 Q7, [Q2, #156] -strh r4, [r0,#+142] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #12] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #60] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #76] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #92] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #108] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #124] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #140] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #-2] -vldrw.u32 Q6, [Q2, #156] -strh r14, [r0,#+144] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #14] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, 
[Q2, #108] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #62] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #78] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #94] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #110] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #142] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #0] -vldrw.u32 Q5, [Q2, #156] -strh r12, [r0,#+146] -vmladavx.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #16] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #32] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #48] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #64] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #80] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #96] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #112] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #128] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #144] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #-14] -vldrw.u32 Q4, [Q2, #172] -strh r10, [r0,#+148] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #2] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #18] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #34] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #50] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #66] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #82] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #98] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #114] -vldrw.u32 Q6, [Q2, #44] 
-vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #130] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #146] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #-12] -vldrw.u32 Q5, [Q2, #172] -strh r8, [r0,#+150] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #4] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #20] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #36] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #52] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #68] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #84] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #100] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #116] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #132] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #148] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #-10] -vldrw.u32 Q6, [Q2, #172] -strh r6, [r0,#+152] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #6] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #22] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #38] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #54] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #70] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #86] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #102] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #118] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #134] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #150] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #-8] -vldrw.u32 Q7, [Q2, #172] -strh r4, [r0,#+154] -vmladavx.s16 
r4, Q6, Q7 -vldrh.u16 Q0, [r1, #8] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #24] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #40] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #56] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #72] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #88] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #104] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #120] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #136] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #152] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #-6] -vldrw.u32 Q0, [Q2, #172] -strh r14, [r0,#+156] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #10] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #26] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #74] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #90] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #106] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #122] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #138] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #154] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #-4] -vldrw.u32 Q1, [Q2, #172] -strh r12, [r0,#+158] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #12] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #28] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 
Q1, [r1, #60] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #76] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #92] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #108] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #124] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #140] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #156] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #-2] -vldrw.u32 Q3, [Q2, #172] -strh r10, [r0,#+160] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #14] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #46] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #62] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #78] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #94] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #110] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #126] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #142] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #158] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #0] -vldrw.u32 Q4, [Q2, #172] -strh r8, [r0,#+162] -vmladavx.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #32] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #48] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #64] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #80] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #96] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #112] 
-vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #128] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #144] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #160] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #188] -strh r6, [r0,#+164] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #18] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #50] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #66] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #82] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #98] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #114] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #130] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #146] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #162] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -vldrw.u32 Q0, [Q2, #188] -strh r4, [r0,#+166] -vmladavx.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #4] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #20] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #36] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #52] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #68] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #84] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #100] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #116] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #132] -vldrw.u32 Q5, [Q2, #44] 
-vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #148] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #164] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #-10] -vldrw.u32 Q4, [Q2, #188] -strh r14, [r0,#+168] -vmladavx.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #38] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #54] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #70] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #86] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #102] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #118] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #134] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #150] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #166] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #-8] -vldrw.u32 Q7, [Q2, #188] -strh r12, [r0,#+170] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #8] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #24] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #40] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #56] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #72] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #88] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #104] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #120] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #136] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #152] -vldrw.u32 Q6, [Q2, #28] 
-vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #168] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #-6] -vldrw.u32 Q3, [Q2, #188] -strh r10, [r0,#+172] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #10] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #26] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #74] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #90] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #106] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #122] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #138] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #154] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #170] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #-4] -vldrw.u32 Q6, [Q2, #188] -strh r8, [r0,#+174] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #60] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #76] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #92] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #108] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #124] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #140] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #156] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #172] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r8, 
Q6, Q7 -vldrh.u16 Q0, [r1, #-2] -vldrw.u32 Q1, [Q2, #188] -strh r6, [r0,#+176] -vmladavx.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #30] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #46] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #62] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #78] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #94] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #110] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #126] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #142] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #158] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #174] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #0] -vldrw.u32 Q5, [Q2, #188] -strh r4, [r0,#+178] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #16] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #32] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #48] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #64] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #80] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #96] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #112] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #128] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #144] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #160] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #176] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #204] -strh r14, [r0,#+180] -vmladavx.s16 r14, Q7, Q0 
-vldrh.u16 Q1, [r1, #2] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #18] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #34] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #50] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #66] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #82] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #98] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #114] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #130] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #146] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #162] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #178] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -vldrw.u32 Q6, [Q2, #204] -strh r12, [r0,#+182] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #4] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #20] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #36] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #52] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #68] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #84] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #100] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #116] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #132] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #148] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #164] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #180] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #-10] 
-vldrw.u32 Q4, [Q2, #204] -strh r10, [r0,#+184] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #38] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #54] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #70] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #86] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #102] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #118] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #134] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #150] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #166] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #182] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #-8] -vldrw.u32 Q1, [Q2, #204] -strh r8, [r0,#+186] -vmladavx.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #8] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #72] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #88] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #104] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #120] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #136] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #152] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #168] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #184] -vldrw.u32 Q5, [Q2, 
#12] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #-6] -vldrw.u32 Q7, [Q2, #204] -strh r6, [r0,#+188] -vmladavx.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #10] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #26] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #58] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #74] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #90] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #106] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #122] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #138] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #154] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #170] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #186] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #-4] -vldrw.u32 Q5, [Q2, #204] -strh r4, [r0,#+190] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #12] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #28] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #44] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #60] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #76] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #92] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #108] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #124] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #140] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #156] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #172] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r4, Q5, Q6 
-vldrh.u16 Q7, [r1, #188] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #-2] -vldrw.u32 Q3, [Q2, #204] -strh r14, [r0,#+192] -vmladavx.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #14] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #46] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #62] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #78] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #94] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #110] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #126] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #142] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #158] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #174] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #190] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #0] -vldrw.u32 Q0, [Q2, #204] -strh r12, [r0,#+194] -vmladavx.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #48] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #64] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #80] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #96] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #112] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #128] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #144] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #160] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r12, Q6, Q7 
-vldrh.u16 Q0, [r1, #176] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #192] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #220] -strh r10, [r0,#+196] -vmladavx.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #34] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #50] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #66] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #82] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #98] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #114] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #130] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #146] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #162] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #178] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #194] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -vldrw.u32 Q6, [Q2, #220] -strh r8, [r0,#+198] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #4] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #20] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #36] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #52] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #68] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #84] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #100] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #116] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, 
[r1, #132] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #148] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #164] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #180] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #196] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #-10] -vldrw.u32 Q6, [Q2, #220] -strh r6, [r0,#+200] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #6] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #22] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #38] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #54] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #70] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #86] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #102] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #118] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #134] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #150] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #166] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #182] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #198] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #220] -strh r4, [r0,#+202] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #24] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #40] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #72] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #88] -vldrw.u32 Q4, [Q2, 
#124] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #104] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #120] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #136] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #152] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #168] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #184] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #200] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #-6] -vldrw.u32 Q6, [Q2, #220] -strh r14, [r0,#+204] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #10] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #26] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #42] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #58] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #74] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #90] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #106] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #122] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #138] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #154] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #170] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #186] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #202] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-4] -vldrw.u32 Q6, [Q2, #220] -strh r12, [r0,#+206] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -vldrw.u32 Q5, [Q2, #172] 
-vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #60] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #76] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #92] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #108] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #124] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #140] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #156] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #172] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #188] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #204] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #-2] -vldrw.u32 Q6, [Q2, #220] -strh r10, [r0,#+208] -vmladavx.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #14] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #62] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #78] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #94] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #110] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #142] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #158] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #174] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #190] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #206] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #220] -strh r8, [r0,#+210] 
-vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #32] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #48] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #64] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #80] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #96] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #112] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #128] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #144] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #160] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #176] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #192] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #208] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #236] -strh r6, [r0,#+212] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #34] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #50] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #66] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #82] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #98] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #114] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #130] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #146] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #162] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #178] 
-vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #194] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #210] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -vldrw.u32 Q0, [Q2, #236] -strh r4, [r0,#+214] -vmladavx.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #4] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #20] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #36] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #52] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #68] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #84] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #100] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #116] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #132] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #148] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #164] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #180] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #196] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #212] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #-10] -vldrw.u32 Q3, [Q2, #236] -strh r14, [r0,#+216] -vmladavx.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #6] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #22] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #38] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #54] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #70] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #86] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #102] -vldrw.u32 Q3, [Q2, 
#124] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #118] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #134] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #150] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #166] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #182] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #198] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #214] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #-8] -vldrw.u32 Q5, [Q2, #236] -strh r12, [r0,#+218] -vmladavx.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #8] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #24] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #40] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #56] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #72] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #88] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #104] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #120] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #136] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #152] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #168] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #184] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #200] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #216] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #-6] -vldrw.u32 Q7, [Q2, #236] -strh r10, [r0,#+220] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #10] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #26] -vldrw.u32 Q4, [Q2, #204] 
-vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #58] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #74] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #90] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #106] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #122] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #138] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #154] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #170] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #186] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #202] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #218] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #-4] -vldrw.u32 Q1, [Q2, #236] -strh r8, [r0,#+222] -vmladavx.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #12] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #28] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #60] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #76] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #92] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #108] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #124] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #140] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #156] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #172] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #188] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 
Q6, [r1, #204] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #220] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, #236] -strh r6, [r0,#+224] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #46] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #62] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #78] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #94] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #110] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #126] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #142] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #158] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #174] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #190] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #206] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #222] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #236] -strh r4, [r0,#+226] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #32] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #48] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #64] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #80] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #96] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #112] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #128] -vldrw.u32 Q0, 
[Q2, #108] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #144] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #160] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #176] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #192] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #208] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #224] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #252] -strh r14, [r0,#+228] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #2] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #18] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #34] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #50] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #66] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #82] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #98] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #114] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #130] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #146] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #162] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #178] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #194] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #210] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #226] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #-12] -vldrw.u32 Q5, [Q2, #252] -strh r12, [r0,#+230] -vmladavx.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #4] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #20] -vldrw.u32 Q1, [Q2, #220] 
-vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #36] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #52] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #68] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #84] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #100] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #116] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #132] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #148] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #164] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #180] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #196] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #212] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #228] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #-10] -vldrw.u32 Q1, [Q2, #252] -strh r10, [r0,#+232] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #6] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #22] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #38] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #54] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #70] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #86] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #102] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #118] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #134] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #150] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #166] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r10, 
Q1, Q3 -vldrh.u16 Q4, [r1, #182] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #198] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #214] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #230] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #252] -strh r8, [r0,#+234] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #24] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #40] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #72] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #88] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #104] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #120] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #136] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #152] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #168] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #184] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #200] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #216] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #232] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #-6] -vldrw.u32 Q3, [Q2, #252] -strh r6, [r0,#+236] -vmladavx.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #10] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #26] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, 
#74] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #90] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #106] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #122] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #138] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #154] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #170] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #186] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #202] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #218] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #234] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #-4] -vldrw.u32 Q7, [Q2, #252] -strh r4, [r0,#+238] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #12] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #60] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #76] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #92] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #108] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #124] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #140] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #156] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #172] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #188] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #204] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #220] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 
r4, Q6, Q7 -vldrh.u16 Q0, [r1, #236] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, #252] -strh r14, [r0,#+240] -vmladavx.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #46] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #62] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #78] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #94] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #110] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #126] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #142] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #158] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #174] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #190] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #206] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #222] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #238] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #0] -vldrw.u32 Q0, [Q2, #252] -strh r12, [r0,#+242] -vmladavx.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #48] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #64] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #80] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #96] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #112] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r12, 
Q7, Q0 -vldrh.u16 Q1, [r1, #128] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #144] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #160] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #176] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #192] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #208] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #224] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #240] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -vldrw.u32 Q5, [Q2, #252] -strh r10, [r0,#+244] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #18] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #34] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #50] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #66] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #82] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #98] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #114] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #130] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #146] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #162] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #178] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #194] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #210] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #226] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #242] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #4] -vldrw.u32 Q1, [Q2, #252] -strh r8, [r0,#+246] -vmladavx.s16 r8, Q0, Q1 
-vldrh.u16 Q3, [r1, #20] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #36] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #52] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #68] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #84] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #100] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #116] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #132] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #148] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #164] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #180] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #196] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #212] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #228] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #244] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #252] -strh r6, [r0,#+248] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #38] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #54] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #70] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #86] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #102] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #118] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #134] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #150] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #166] -vldrw.u32 Q5, [Q2, 
#92] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #182] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #198] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #214] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #230] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #246] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #8] -vldrw.u32 Q3, [Q2, #252] -strh r4, [r0,#+250] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #24] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #40] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #56] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #72] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #88] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #104] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #120] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #136] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #152] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #168] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #184] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #200] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #216] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #232] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #248] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #10] -vldrw.u32 Q7, [Q2, #252] -strh r14, [r0,#+252] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #26] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #42] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #58] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r14, 
Q5, Q6 -vldrh.u16 Q7, [r1, #74] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #90] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #106] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #122] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #138] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #154] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #170] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #186] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #202] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #218] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #234] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #250] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #12] -vldrw.u32 Q4, [Q2, #252] -strh r12, [r0,#+254] -vmladavx.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #28] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #60] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #76] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #92] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #108] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #124] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #140] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #156] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #172] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #188] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #204] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 
Q0, [r1, #220] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #236] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #252] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #14] -vldrw.u32 Q0, [Q2, #252] -strh r10, [r0,#+256] -vmladavx.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #62] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #78] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #94] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #110] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #142] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #158] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #174] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #190] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #206] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #222] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #238] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #254] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #16] -vldrw.u32 Q5, [Q2, #252] -strh r8, [r0,#+258] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #32] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #48] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #64] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #80] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #96] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, 
#112] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #128] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #144] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #160] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #176] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #192] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #208] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #224] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #240] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #18] -vldrw.u32 Q7, [Q2, #252] -strh r6, [r0,#+260] -vmladavx.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #34] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #50] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #66] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #82] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #98] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #114] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #130] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #146] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #162] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #178] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #194] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #210] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #226] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #242] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #20] -vldrw.u32 Q1, [Q2, #252] -strh r4, [r0,#+262] -vmladavx.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #36] -vldrw.u32 Q4, [Q2, 
#236] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #52] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #68] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #84] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #100] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #116] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #132] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #148] -vldrw.u32 Q4, [Q2, #124] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #164] -vldrw.u32 Q6, [Q2, #108] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #180] -vldrw.u32 Q0, [Q2, #92] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #196] -vldrw.u32 Q3, [Q2, #76] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #212] -vldrw.u32 Q5, [Q2, #60] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #228] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #244] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #22] -vldrw.u32 Q4, [Q2, #252] -strh r14, [r0,#+264] -vmladavx.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #38] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #54] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #70] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #86] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #102] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #118] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #134] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #150] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #166] -vldrw.u32 Q0, [Q2, #108] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #182] -vldrw.u32 Q3, [Q2, #92] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #198] -vldrw.u32 Q5, [Q2, #76] -vmladavax.s16 r14, Q4, Q5 
-vldrh.u16 Q6, [r1, #214] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #230] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #246] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #252] -strh r12, [r0,#+266] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #72] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #88] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #104] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #120] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #136] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #152] -vldrw.u32 Q0, [Q2, #124] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #168] -vldrw.u32 Q3, [Q2, #108] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #184] -vldrw.u32 Q5, [Q2, #92] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #200] -vldrw.u32 Q7, [Q2, #76] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #216] -vldrw.u32 Q1, [Q2, #60] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #232] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #248] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #26] -vldrw.u32 Q0, [Q2, #252] -strh r10, [r0,#+268] -vmladavx.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #42] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #58] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #74] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #90] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #106] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #122] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r10, Q5, Q6 
-vldrh.u16 Q7, [r1, #138] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #154] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #170] -vldrw.u32 Q5, [Q2, #108] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #186] -vldrw.u32 Q7, [Q2, #92] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #202] -vldrw.u32 Q1, [Q2, #76] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #218] -vldrw.u32 Q4, [Q2, #60] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #234] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #250] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vldrw.u32 Q3, [Q2, #252] -strh r8, [r0,#+270] -vmladavx.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #60] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #76] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #92] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #108] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #124] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #140] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #156] -vldrw.u32 Q5, [Q2, #124] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #172] -vldrw.u32 Q7, [Q2, #108] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #188] -vldrw.u32 Q1, [Q2, #92] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #204] -vldrw.u32 Q4, [Q2, #76] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #220] -vldrw.u32 Q6, [Q2, #60] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #236] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #252] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #30] -vldrw.u32 Q5, [Q2, #252] -strh r6, [r0,#+272] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #46] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, 
[r1, #62] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #78] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #94] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #110] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #126] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #142] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #158] -vldrw.u32 Q7, [Q2, #124] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #174] -vldrw.u32 Q1, [Q2, #108] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #190] -vldrw.u32 Q4, [Q2, #92] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #206] -vldrw.u32 Q6, [Q2, #76] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #222] -vldrw.u32 Q0, [Q2, #60] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #238] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #254] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #32] -vldrw.u32 Q7, [Q2, #252] -strh r4, [r0,#+274] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #48] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #64] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #80] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #96] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #112] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #128] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #144] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #160] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #176] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #192] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #208] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #224] -vldrw.u32 Q3, [Q2, #60] 
-vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #240] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #34] -vldrw.u32 Q7, [Q2, #252] -strh r14, [r0,#+276] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #50] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #66] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #82] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #98] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #114] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #130] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #146] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #162] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #178] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #194] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #210] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #226] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #242] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #36] -vldrw.u32 Q7, [Q2, #252] -strh r12, [r0,#+278] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #52] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #68] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #84] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #100] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #116] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #132] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #148] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #164] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #180] -vldrw.u32 Q4, [Q2, 
#108] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #196] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #212] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #228] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #244] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #38] -vldrw.u32 Q7, [Q2, #252] -strh r10, [r0,#+280] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #54] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #70] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #86] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #102] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #118] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #134] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #150] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #166] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #182] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #198] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #214] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #230] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #246] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #40] -vldrw.u32 Q7, [Q2, #252] -strh r8, [r0,#+282] -vmladavx.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #56] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #72] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #88] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #104] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #120] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #136] -vldrw.u32 Q5, [Q2, #156] 
-vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #152] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #168] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #184] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #200] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #216] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #232] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #248] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #252] -strh r6, [r0,#+284] -vmladavx.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #74] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #90] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #106] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #122] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #138] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #154] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #170] -vldrw.u32 Q1, [Q2, #124] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #186] -vldrw.u32 Q4, [Q2, #108] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #202] -vldrw.u32 Q6, [Q2, #92] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #218] -vldrw.u32 Q0, [Q2, #76] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #234] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #250] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #44] -vldrw.u32 Q7, [Q2, #252] -strh r4, [r0,#+286] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #60] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #76] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #92] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r4, Q5, Q6 
-vldrh.u16 Q7, [r1, #108]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #124]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #140]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #156]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #172]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #188]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #204]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #220]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #236]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #252]
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #46]
-vldrw.u32 Q7, [Q2, #252]
-strh r14, [r0,#+288]
-vmladavx.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #62]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #78]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #94]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #110]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #126]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #142]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #158]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #174]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #190]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #206]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #222]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #238]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #254]
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #48]
-vldrw.u32 Q7, [Q2, #252]
-strh r12, [r0,#+290]
-vmladavx.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #64]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #80]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #96]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #112]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #128]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #144]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #160]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #176]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #192]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #208]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #224]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #240]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #50]
-vldrw.u32 Q5, [Q2, #252]
-strh r10, [r0,#+292]
-vmladavx.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #66]
-vldrw.u32 Q7, [Q2, #236]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #82]
-vldrw.u32 Q1, [Q2, #220]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #98]
-vldrw.u32 Q4, [Q2, #204]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #114]
-vldrw.u32 Q6, [Q2, #188]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #130]
-vldrw.u32 Q0, [Q2, #172]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #146]
-vldrw.u32 Q3, [Q2, #156]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #162]
-vldrw.u32 Q5, [Q2, #140]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #178]
-vldrw.u32 Q7, [Q2, #124]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #194]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #210]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #226]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #242]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #52]
-vldrw.u32 Q3, [Q2, #252]
-strh r8, [r0,#+294]
-vmladavx.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #68]
-vldrw.u32 Q5, [Q2, #236]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #84]
-vldrw.u32 Q7, [Q2, #220]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #100]
-vldrw.u32 Q1, [Q2, #204]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #116]
-vldrw.u32 Q4, [Q2, #188]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #132]
-vldrw.u32 Q6, [Q2, #172]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #148]
-vldrw.u32 Q0, [Q2, #156]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #164]
-vldrw.u32 Q3, [Q2, #140]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #180]
-vldrw.u32 Q5, [Q2, #124]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #196]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #212]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #228]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #244]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #54]
-vldrw.u32 Q0, [Q2, #252]
-strh r6, [r0,#+296]
-vmladavx.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #70]
-vldrw.u32 Q3, [Q2, #236]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #86]
-vldrw.u32 Q5, [Q2, #220]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #102]
-vldrw.u32 Q7, [Q2, #204]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #118]
-vldrw.u32 Q1, [Q2, #188]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #134]
-vldrw.u32 Q4, [Q2, #172]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #150]
-vldrw.u32 Q6, [Q2, #156]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #166]
-vldrw.u32 Q0, [Q2, #140]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #182]
-vldrw.u32 Q3, [Q2, #124]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #198]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #214]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #230]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #246]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #56]
-vldrw.u32 Q6, [Q2, #252]
-strh r4, [r0,#+298]
-vmladavx.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #72]
-vldrw.u32 Q0, [Q2, #236]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #88]
-vldrw.u32 Q3, [Q2, #220]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #104]
-vldrw.u32 Q5, [Q2, #204]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #120]
-vldrw.u32 Q7, [Q2, #188]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #136]
-vldrw.u32 Q1, [Q2, #172]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #152]
-vldrw.u32 Q4, [Q2, #156]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #168]
-vldrw.u32 Q6, [Q2, #140]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #184]
-vldrw.u32 Q0, [Q2, #124]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #200]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #216]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #232]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #248]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #58]
-vldrw.u32 Q4, [Q2, #252]
-strh r14, [r0,#+300]
-vmladavx.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #74]
-vldrw.u32 Q6, [Q2, #236]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #90]
-vldrw.u32 Q0, [Q2, #220]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #106]
-vldrw.u32 Q3, [Q2, #204]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #122]
-vldrw.u32 Q5, [Q2, #188]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #138]
-vldrw.u32 Q7, [Q2, #172]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #154]
-vldrw.u32 Q1, [Q2, #156]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #170]
-vldrw.u32 Q4, [Q2, #140]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #186]
-vldrw.u32 Q6, [Q2, #124]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #202]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #218]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #234]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #250]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #60]
-vldrw.u32 Q1, [Q2, #252]
-strh r12, [r0,#+302]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #76]
-vldrw.u32 Q4, [Q2, #236]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #92]
-vldrw.u32 Q6, [Q2, #220]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #108]
-vldrw.u32 Q0, [Q2, #204]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #124]
-vldrw.u32 Q3, [Q2, #188]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #140]
-vldrw.u32 Q5, [Q2, #172]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #156]
-vldrw.u32 Q7, [Q2, #156]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #172]
-vldrw.u32 Q1, [Q2, #140]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #188]
-vldrw.u32 Q4, [Q2, #124]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #204]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #220]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #236]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #252]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #62]
-vldrw.u32 Q7, [Q2, #252]
-strh r10, [r0,#+304]
-vmladavx.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #78]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #94]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #110]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #126]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #142]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #158]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #174]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #190]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #206]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #222]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #238]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #254]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #64]
-vldrw.u32 Q5, [Q2, #252]
-strh r8, [r0,#+306]
-vmladavx.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #80]
-vldrw.u32 Q7, [Q2, #236]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #96]
-vldrw.u32 Q1, [Q2, #220]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #112]
-vldrw.u32 Q4, [Q2, #204]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #128]
-vldrw.u32 Q6, [Q2, #188]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #144]
-vldrw.u32 Q0, [Q2, #172]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #160]
-vldrw.u32 Q3, [Q2, #156]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #176]
-vldrw.u32 Q5, [Q2, #140]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #192]
-vldrw.u32 Q7, [Q2, #124]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #208]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #224]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #240]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #66]
-vldrw.u32 Q0, [Q2, #252]
-strh r6, [r0,#+308]
-vmladavx.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #82]
-vldrw.u32 Q3, [Q2, #236]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #98]
-vldrw.u32 Q5, [Q2, #220]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #114]
-vldrw.u32 Q7, [Q2, #204]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #130]
-vldrw.u32 Q1, [Q2, #188]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #146]
-vldrw.u32 Q4, [Q2, #172]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #162]
-vldrw.u32 Q6, [Q2, #156]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #178]
-vldrw.u32 Q0, [Q2, #140]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #194]
-vldrw.u32 Q3, [Q2, #124]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #210]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #226]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #242]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #68]
-vldrw.u32 Q4, [Q2, #252]
-strh r4, [r0,#+310]
-vmladavx.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #84]
-vldrw.u32 Q6, [Q2, #236]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #100]
-vldrw.u32 Q0, [Q2, #220]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #116]
-vldrw.u32 Q3, [Q2, #204]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #132]
-vldrw.u32 Q5, [Q2, #188]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #148]
-vldrw.u32 Q7, [Q2, #172]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #164]
-vldrw.u32 Q1, [Q2, #156]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #180]
-vldrw.u32 Q4, [Q2, #140]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #196]
-vldrw.u32 Q6, [Q2, #124]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #212]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #228]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #244]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #70]
-vldrw.u32 Q7, [Q2, #252]
-strh r14, [r0,#+312]
-vmladavx.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #86]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #102]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #118]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #134]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #150]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #166]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #182]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #198]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #214]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #230]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #246]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #72]
-vldrw.u32 Q3, [Q2, #252]
-strh r12, [r0,#+314]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #88]
-vldrw.u32 Q5, [Q2, #236]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #104]
-vldrw.u32 Q7, [Q2, #220]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #120]
-vldrw.u32 Q1, [Q2, #204]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #136]
-vldrw.u32 Q4, [Q2, #188]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #152]
-vldrw.u32 Q6, [Q2, #172]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #168]
-vldrw.u32 Q0, [Q2, #156]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #184]
-vldrw.u32 Q3, [Q2, #140]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #200]
-vldrw.u32 Q5, [Q2, #124]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #216]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #232]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #248]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #74]
-vldrw.u32 Q6, [Q2, #252]
-strh r10, [r0,#+316]
-vmladavx.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #90]
-vldrw.u32 Q0, [Q2, #236]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #106]
-vldrw.u32 Q3, [Q2, #220]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #122]
-vldrw.u32 Q5, [Q2, #204]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #138]
-vldrw.u32 Q7, [Q2, #188]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #154]
-vldrw.u32 Q1, [Q2, #172]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #170]
-vldrw.u32 Q4, [Q2, #156]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #186]
-vldrw.u32 Q6, [Q2, #140]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #202]
-vldrw.u32 Q0, [Q2, #124]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #218]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #234]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #250]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #76]
-vldrw.u32 Q1, [Q2, #252]
-strh r8, [r0,#+318]
-vmladavx.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #92]
-vldrw.u32 Q4, [Q2, #236]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #108]
-vldrw.u32 Q6, [Q2, #220]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #124]
-vldrw.u32 Q0, [Q2, #204]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #140]
-vldrw.u32 Q3, [Q2, #188]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #156]
-vldrw.u32 Q5, [Q2, #172]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #172]
-vldrw.u32 Q7, [Q2, #156]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #188]
-vldrw.u32 Q1, [Q2, #140]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #204]
-vldrw.u32 Q4, [Q2, #124]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #220]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #236]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #252]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #78]
-vldrw.u32 Q5, [Q2, #252]
-strh r6, [r0,#+320]
-vmladavx.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #94]
-vldrw.u32 Q7, [Q2, #236]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #110]
-vldrw.u32 Q1, [Q2, #220]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #126]
-vldrw.u32 Q4, [Q2, #204]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #142]
-vldrw.u32 Q6, [Q2, #188]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #158]
-vldrw.u32 Q0, [Q2, #172]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #174]
-vldrw.u32 Q3, [Q2, #156]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #190]
-vldrw.u32 Q5, [Q2, #140]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #206]
-vldrw.u32 Q7, [Q2, #124]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #222]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #238]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #254]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #80]
-vldrw.u32 Q0, [Q2, #252]
-strh r4, [r0,#+322]
-vmladavx.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #96]
-vldrw.u32 Q3, [Q2, #236]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #112]
-vldrw.u32 Q5, [Q2, #220]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #128]
-vldrw.u32 Q7, [Q2, #204]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #144]
-vldrw.u32 Q1, [Q2, #188]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #160]
-vldrw.u32 Q4, [Q2, #172]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #176]
-vldrw.u32 Q6, [Q2, #156]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #192]
-vldrw.u32 Q0, [Q2, #140]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #208]
-vldrw.u32 Q3, [Q2, #124]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #224]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #240]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #82]
-vldrw.u32 Q1, [Q2, #252]
-strh r14, [r0,#+324]
-vmladavx.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #98]
-vldrw.u32 Q4, [Q2, #236]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #114]
-vldrw.u32 Q6, [Q2, #220]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #130]
-vldrw.u32 Q0, [Q2, #204]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #146]
-vldrw.u32 Q3, [Q2, #188]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #162]
-vldrw.u32 Q5, [Q2, #172]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #178]
-vldrw.u32 Q7, [Q2, #156]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #194]
-vldrw.u32 Q1, [Q2, #140]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #210]
-vldrw.u32 Q4, [Q2, #124]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #226]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #242]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #84]
-vldrw.u32 Q3, [Q2, #252]
-strh r12, [r0,#+326]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #100]
-vldrw.u32 Q5, [Q2, #236]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #116]
-vldrw.u32 Q7, [Q2, #220]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #132]
-vldrw.u32 Q1, [Q2, #204]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #148]
-vldrw.u32 Q4, [Q2, #188]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #164]
-vldrw.u32 Q6, [Q2, #172]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #180]
-vldrw.u32 Q0, [Q2, #156]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #196]
-vldrw.u32 Q3, [Q2, #140]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #212]
-vldrw.u32 Q5, [Q2, #124]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #228]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #244]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #86]
-vldrw.u32 Q4, [Q2, #252]
-strh r10, [r0,#+328]
-vmladavx.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #102]
-vldrw.u32 Q6, [Q2, #236]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #118]
-vldrw.u32 Q0, [Q2, #220]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #134]
-vldrw.u32 Q3, [Q2, #204]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #150]
-vldrw.u32 Q5, [Q2, #188]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #166]
-vldrw.u32 Q7, [Q2, #172]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #182]
-vldrw.u32 Q1, [Q2, #156]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #198]
-vldrw.u32 Q4, [Q2, #140]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #214]
-vldrw.u32 Q6, [Q2, #124]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #230]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #246]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #88]
-vldrw.u32 Q5, [Q2, #252]
-strh r8, [r0,#+330]
-vmladavx.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #104]
-vldrw.u32 Q7, [Q2, #236]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #120]
-vldrw.u32 Q1, [Q2, #220]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #136]
-vldrw.u32 Q4, [Q2, #204]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #152]
-vldrw.u32 Q6, [Q2, #188]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #168]
-vldrw.u32 Q0, [Q2, #172]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #184]
-vldrw.u32 Q3, [Q2, #156]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #200]
-vldrw.u32 Q5, [Q2, #140]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #216]
-vldrw.u32 Q7, [Q2, #124]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #232]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #248]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #90]
-vldrw.u32 Q6, [Q2, #252]
-strh r6, [r0,#+332]
-vmladavx.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #106]
-vldrw.u32 Q0, [Q2, #236]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #122]
-vldrw.u32 Q3, [Q2, #220]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #138]
-vldrw.u32 Q5, [Q2, #204]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #154]
-vldrw.u32 Q7, [Q2, #188]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #170]
-vldrw.u32 Q1, [Q2, #172]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #186]
-vldrw.u32 Q4, [Q2, #156]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #202]
-vldrw.u32 Q6, [Q2, #140]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #218]
-vldrw.u32 Q0, [Q2, #124]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #234]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #250]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #92]
-vldrw.u32 Q7, [Q2, #252]
-strh r4, [r0,#+334]
-vmladavx.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #108]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #124]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #140]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #156]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #172]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #188]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #204]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #220]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #236]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #252]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #94]
-vldrw.u32 Q0, [Q2, #252]
-strh r14, [r0,#+336]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #110]
-vldrw.u32 Q3, [Q2, #236]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #126]
-vldrw.u32 Q5, [Q2, #220]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #142]
-vldrw.u32 Q7, [Q2, #204]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #158]
-vldrw.u32 Q1, [Q2, #188]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #174]
-vldrw.u32 Q4, [Q2, #172]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #190]
-vldrw.u32 Q6, [Q2, #156]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #206]
-vldrw.u32 Q0, [Q2, #140]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #222]
-vldrw.u32 Q3, [Q2, #124]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #238]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #254]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #96]
-vldrw.u32 Q1, [Q2, #252]
-strh r12, [r0,#+338]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #112]
-vldrw.u32 Q4, [Q2, #236]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #128]
-vldrw.u32 Q6, [Q2, #220]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #144]
-vldrw.u32 Q0, [Q2, #204]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #160]
-vldrw.u32 Q3, [Q2, #188]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #176]
-vldrw.u32 Q5, [Q2, #172]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #192]
-vldrw.u32 Q7, [Q2, #156]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #208]
-vldrw.u32 Q1, [Q2, #140]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #224]
-vldrw.u32 Q4, [Q2, #124]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #240]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #98]
-vldrw.u32 Q0, [Q2, #252]
-strh r10, [r0,#+340]
-vmladavx.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #114]
-vldrw.u32 Q3, [Q2, #236]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #130]
-vldrw.u32 Q5, [Q2, #220]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #146]
-vldrw.u32 Q7, [Q2, #204]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #162]
-vldrw.u32 Q1, [Q2, #188]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #178]
-vldrw.u32 Q4, [Q2, #172]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #194]
-vldrw.u32 Q6, [Q2, #156]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #210]
-vldrw.u32 Q0, [Q2, #140]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #226]
-vldrw.u32 Q3, [Q2, #124]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #242]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #100]
-vldrw.u32 Q7, [Q2, #252]
-strh r8, [r0,#+342]
-vmladavx.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #116]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #132]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #148]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #164]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #180]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #196]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #212]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #228]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #244]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #102]
-vldrw.u32 Q6, [Q2, #252]
-strh r6, [r0,#+344]
-vmladavx.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #118]
-vldrw.u32 Q0, [Q2, #236]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #134]
-vldrw.u32 Q3, [Q2, #220]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #150]
-vldrw.u32 Q5, [Q2, #204]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #166]
-vldrw.u32 Q7, [Q2, #188]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #182]
-vldrw.u32 Q1, [Q2, #172]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #198]
-vldrw.u32 Q4, [Q2, #156]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #214]
-vldrw.u32 Q6, [Q2, #140]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #230]
-vldrw.u32 Q0, [Q2, #124]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #246]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #104]
-vldrw.u32 Q5, [Q2, #252]
-strh r4, [r0,#+346]
-vmladavx.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #120]
-vldrw.u32 Q7, [Q2, #236]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #136]
-vldrw.u32 Q1, [Q2, #220]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #152]
-vldrw.u32 Q4, [Q2, #204]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #168]
-vldrw.u32 Q6, [Q2, #188]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #184]
-vldrw.u32 Q0, [Q2, #172]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #200]
-vldrw.u32 Q3, [Q2, #156]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #216]
-vldrw.u32 Q5, [Q2, #140]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #232]
-vldrw.u32 Q7, [Q2, #124]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #248]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #106]
-vldrw.u32 Q4, [Q2, #252]
-strh r14, [r0,#+348]
-vmladavx.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #122]
-vldrw.u32 Q6, [Q2, #236]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #138]
-vldrw.u32 Q0, [Q2, #220]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #154]
-vldrw.u32 Q3, [Q2, #204]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #170]
-vldrw.u32 Q5, [Q2, #188]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #186]
-vldrw.u32 Q7, [Q2, #172]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #202]
-vldrw.u32 Q1, [Q2, #156]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #218]
-vldrw.u32 Q4, [Q2, #140]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #234]
-vldrw.u32 Q6, [Q2, #124]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #250]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #108]
-vldrw.u32 Q3, [Q2, #252]
-strh r12, [r0,#+350]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #124]
-vldrw.u32 Q5, [Q2, #236]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #140]
-vldrw.u32 Q7, [Q2, #220]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #156]
-vldrw.u32 Q1, [Q2, #204]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #172]
-vldrw.u32 Q4, [Q2, #188]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #188]
-vldrw.u32 Q6, [Q2, #172]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #204]
-vldrw.u32 Q0, [Q2, #156]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #220]
-vldrw.u32 Q3, [Q2, #140]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #236]
-vldrw.u32 Q5, [Q2, #124]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #252]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #110]
-vldrw.u32 Q1, [Q2, #252]
-strh r10, [r0,#+352]
-vmladavx.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #126]
-vldrw.u32 Q4, [Q2, #236]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #142]
-vldrw.u32 Q6, [Q2, #220]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #158]
-vldrw.u32 Q0, [Q2, #204]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #174]
-vldrw.u32 Q3, [Q2, #188]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #190]
-vldrw.u32 Q5, [Q2, #172]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #206]
-vldrw.u32 Q7, [Q2, #156]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #222]
-vldrw.u32 Q1, [Q2, #140]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #238]
-vldrw.u32 Q4, [Q2, #124]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #254]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #112]
-vldrw.u32 Q0, [Q2, #252]
-strh r8, [r0,#+354]
-vmladavx.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #128]
-vldrw.u32 Q3, [Q2, #236]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #144]
-vldrw.u32 Q5, [Q2, #220]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #160]
-vldrw.u32 Q7, [Q2, #204]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #176]
-vldrw.u32 Q1, [Q2, #188]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #192]
-vldrw.u32 Q4, [Q2, #172]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #208]
-vldrw.u32 Q6, [Q2, #156]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #224]
-vldrw.u32 Q0, [Q2, #140]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #240]
-vldrw.u32 Q3, [Q2, #124]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #114]
-vldrw.u32 Q5, [Q2, #252]
-strh r6, [r0,#+356]
-vmladavx.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #130]
-vldrw.u32 Q7, [Q2, #236]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #146]
-vldrw.u32 Q1, [Q2, #220]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #162]
-vldrw.u32 Q4, [Q2, #204]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #178]
-vldrw.u32 Q6, [Q2, #188]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #194]
-vldrw.u32 Q0, [Q2, #172]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #210]
-vldrw.u32 Q3, [Q2, #156]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #226]
-vldrw.u32 Q5, [Q2, #140]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #242]
-vldrw.u32 Q7, [Q2, #124]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #116]
-vldrw.u32 Q1, [Q2, #252]
-strh r4, [r0,#+358]
-vmladavx.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #132]
-vldrw.u32 Q4, [Q2, #236]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #148]
-vldrw.u32 Q6, [Q2, #220]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #164]
-vldrw.u32 Q0, [Q2, #204]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #180]
-vldrw.u32 Q3, [Q2, #188]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #196]
-vldrw.u32 Q5, [Q2, #172]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #212]
-vldrw.u32 Q7, [Q2, #156]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #228]
-vldrw.u32 Q1, [Q2, #140]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #244]
-vldrw.u32 Q4, [Q2, #124]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #118]
-vldrw.u32 Q6, [Q2, #252]
-strh r14, [r0,#+360]
-vmladavx.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #134]
-vldrw.u32 Q0, [Q2, #236]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #150]
-vldrw.u32 Q3, [Q2, #220]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #166]
-vldrw.u32 Q5, [Q2, #204]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #182]
-vldrw.u32 Q7, [Q2, #188]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #198]
-vldrw.u32 Q1, [Q2, #172]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #214]
-vldrw.u32 Q4, [Q2, #156]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #230]
-vldrw.u32 Q6, [Q2, #140]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #246]
-vldrw.u32 Q0, [Q2, #124]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #120]
-vldrw.u32 Q3, [Q2, #252]
-strh r12, [r0,#+362]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #136]
-vldrw.u32 Q5, [Q2, #236]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #152]
-vldrw.u32 Q7, [Q2, #220]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #168]
-vldrw.u32 Q1, [Q2, #204]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #184]
-vldrw.u32 Q4, [Q2, #188]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #200]
-vldrw.u32 Q6, [Q2, #172]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #216]
-vldrw.u32 Q0, [Q2, #156]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #232]
-vldrw.u32 Q3, [Q2, #140]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #248]
-vldrw.u32 Q5, [Q2, #124]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #122]
-vldrw.u32 Q7, [Q2, #252]
-strh r10, [r0,#+364]
-vmladavx.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #138]
-vldrw.u32 Q1, [Q2, #236]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #154]
-vldrw.u32 Q4, [Q2, #220]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #170]
-vldrw.u32 Q6, [Q2, #204]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #186]
-vldrw.u32 Q0, [Q2, #188]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #202]
-vldrw.u32 Q3, [Q2, #172]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #218]
-vldrw.u32 Q5, [Q2, #156]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #234]
-vldrw.u32 Q7, [Q2, #140]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #250]
-vldrw.u32 Q1, [Q2, #124]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #124]
-vldrw.u32 Q4, [Q2, #252]
-strh r8, [r0,#+366]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #140]
-vldrw.u32 Q6, [Q2, #236]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #156]
-vldrw.u32 Q0,
[Q2, #220] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #172] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #188] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #204] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #220] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #236] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #252] -vldrw.u32 Q6, [Q2, #124] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #252] -strh r6, [r0,#+368] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #142] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #158] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #174] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #190] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #206] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #222] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #238] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #254] -vldrw.u32 Q3, [Q2, #124] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #128] -vldrw.u32 Q5, [Q2, #252] -strh r4, [r0,#+370] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #144] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #160] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #176] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #192] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #208] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #224] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #240] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #130] -vldrw.u32 Q7, [Q2, #252] -strh r14, [r0,#+372] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #146] 
-vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #162] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #178] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #194] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #210] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #226] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #242] -vldrw.u32 Q7, [Q2, #140] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #132] -vldrw.u32 Q1, [Q2, #252] -strh r12, [r0,#+374] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #148] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #164] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #180] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #196] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #212] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #228] -vldrw.u32 Q7, [Q2, #156] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #244] -vldrw.u32 Q1, [Q2, #140] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #134] -vldrw.u32 Q4, [Q2, #252] -strh r10, [r0,#+376] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #150] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #166] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #182] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #198] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #214] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #230] -vldrw.u32 Q1, [Q2, #156] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #246] -vldrw.u32 Q4, [Q2, #140] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #136] -vldrw.u32 Q6, [Q2, #252] -strh r8, [r0,#+378] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #152] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 
r8, Q7, Q0 -vldrh.u16 Q1, [r1, #168] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #184] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #200] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #216] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #232] -vldrw.u32 Q4, [Q2, #156] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #248] -vldrw.u32 Q6, [Q2, #140] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #138] -vldrw.u32 Q0, [Q2, #252] -strh r6, [r0,#+380] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #154] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #170] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #186] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #202] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #218] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #234] -vldrw.u32 Q6, [Q2, #156] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #250] -vldrw.u32 Q0, [Q2, #140] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #140] -vldrw.u32 Q3, [Q2, #252] -strh r4, [r0,#+382] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #156] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #172] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #188] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #204] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #220] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #236] -vldrw.u32 Q0, [Q2, #156] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #252] -vldrw.u32 Q3, [Q2, #140] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #142] -vldrw.u32 Q5, [Q2, #252] -strh r14, [r0,#+384] -vmladavx.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #158] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #174] -vldrw.u32 Q1, [Q2, #220] 
-vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #190] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #206] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #222] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #238] -vldrw.u32 Q3, [Q2, #156] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #254] -vldrw.u32 Q5, [Q2, #140] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #144] -vldrw.u32 Q7, [Q2, #252] -strh r12, [r0,#+386] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #160] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #176] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #192] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #208] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #224] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #240] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #146] -vldrw.u32 Q7, [Q2, #252] -strh r10, [r0,#+388] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #162] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #178] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #194] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #210] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #226] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #242] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #148] -vldrw.u32 Q7, [Q2, #252] -strh r8, [r0,#+390] -vmladavx.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #164] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #180] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #196] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #212] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, 
#228] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #244] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #150] -vldrw.u32 Q7, [Q2, #252] -strh r6, [r0,#+392] -vmladavx.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #166] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #182] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #198] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #214] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #230] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #246] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #152] -vldrw.u32 Q7, [Q2, #252] -strh r4, [r0,#+394] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #168] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #184] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #200] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #216] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #232] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #248] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #154] -vldrw.u32 Q7, [Q2, #252] -strh r14, [r0,#+396] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #170] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #186] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #202] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #218] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #234] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #250] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #156] -vldrw.u32 Q7, [Q2, #252] -strh r12, [r0,#+398] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #172] -vldrw.u32 Q1, [Q2, #236] 
-vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #188] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #204] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #220] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #236] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #252] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #158] -vldrw.u32 Q7, [Q2, #252] -strh r10, [r0,#+400] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #174] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #190] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #206] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #222] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #238] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #254] -vldrw.u32 Q5, [Q2, #156] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #160] -vldrw.u32 Q7, [Q2, #252] -strh r8, [r0,#+402] -vmladavx.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #176] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #192] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #208] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #224] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #240] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #162] -vldrw.u32 Q5, [Q2, #252] -strh r6, [r0,#+404] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #178] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #194] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #210] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #226] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #242] -vldrw.u32 Q0, [Q2, #172] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #164] 
-vldrw.u32 Q3, [Q2, #252] -strh r4, [r0,#+406] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #180] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #196] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #212] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #228] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #244] -vldrw.u32 Q6, [Q2, #172] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #166] -vldrw.u32 Q0, [Q2, #252] -strh r14, [r0,#+408] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #182] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #198] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #214] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #230] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #246] -vldrw.u32 Q4, [Q2, #172] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #168] -vldrw.u32 Q6, [Q2, #252] -strh r12, [r0,#+410] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #184] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #200] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #216] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #232] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #248] -vldrw.u32 Q1, [Q2, #172] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #170] -vldrw.u32 Q4, [Q2, #252] -strh r10, [r0,#+412] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #186] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #202] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #218] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #234] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #250] -vldrw.u32 Q7, [Q2, #172] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #172] -vldrw.u32 Q1, [Q2, #252] 
-strh r8, [r0,#+414] -vmladavx.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #188] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #204] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #220] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #236] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #252] -vldrw.u32 Q5, [Q2, #172] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #174] -vldrw.u32 Q7, [Q2, #252] -strh r6, [r0,#+416] -vmladavx.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #190] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #206] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #222] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #238] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #254] -vldrw.u32 Q3, [Q2, #172] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #176] -vldrw.u32 Q5, [Q2, #252] -strh r4, [r0,#+418] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #192] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #208] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #224] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #240] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #178] -vldrw.u32 Q0, [Q2, #252] -strh r14, [r0,#+420] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #194] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #210] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #226] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #242] -vldrw.u32 Q1, [Q2, #188] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #180] -vldrw.u32 Q4, [Q2, #252] -strh r12, [r0,#+422] -vmladavx.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #196] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #212] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 
r12, Q7, Q0 -vldrh.u16 Q1, [r1, #228] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #244] -vldrw.u32 Q5, [Q2, #188] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #182] -vldrw.u32 Q7, [Q2, #252] -strh r10, [r0,#+424] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #198] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #214] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #230] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #246] -vldrw.u32 Q0, [Q2, #188] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #184] -vldrw.u32 Q3, [Q2, #252] -strh r8, [r0,#+426] -vmladavx.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #200] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #216] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #232] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #248] -vldrw.u32 Q4, [Q2, #188] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #186] -vldrw.u32 Q6, [Q2, #252] -strh r6, [r0,#+428] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #202] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #218] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #234] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #250] -vldrw.u32 Q7, [Q2, #188] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #188] -vldrw.u32 Q1, [Q2, #252] -strh r4, [r0,#+430] -vmladavx.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #204] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #220] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #236] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #252] -vldrw.u32 Q3, [Q2, #188] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #190] -vldrw.u32 Q5, [Q2, #252] -strh r14, [r0,#+432] -vmladavx.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #206] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r14, Q6, Q7 
-vldrh.u16 Q0, [r1, #222] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #238] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #254] -vldrw.u32 Q6, [Q2, #188] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #192] -vldrw.u32 Q0, [Q2, #252] -strh r12, [r0,#+434] -vmladavx.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #208] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #224] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #240] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #194] -vldrw.u32 Q1, [Q2, #252] -strh r10, [r0,#+436] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #210] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #226] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #242] -vldrw.u32 Q0, [Q2, #204] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #196] -vldrw.u32 Q3, [Q2, #252] -strh r8, [r0,#+438] -vmladavx.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #212] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #228] -vldrw.u32 Q7, [Q2, #220] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #244] -vldrw.u32 Q1, [Q2, #204] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #198] -vldrw.u32 Q4, [Q2, #252] -strh r6, [r0,#+440] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #214] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #230] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #246] -vldrw.u32 Q3, [Q2, #204] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #200] -vldrw.u32 Q5, [Q2, #252] -strh r4, [r0,#+442] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #216] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #232] -vldrw.u32 Q1, [Q2, #220] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #248] -vldrw.u32 Q4, [Q2, #204] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #202] -vldrw.u32 Q6, [Q2, #252] -strh r14, [r0,#+444] -vmladavx.s16 r14, 
Q5, Q6 -vldrh.u16 Q7, [r1, #218] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #234] -vldrw.u32 Q3, [Q2, #220] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #250] -vldrw.u32 Q5, [Q2, #204] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #204] -vldrw.u32 Q7, [Q2, #252] -strh r12, [r0,#+446] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #220] -vldrw.u32 Q1, [Q2, #236] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #236] -vldrw.u32 Q4, [Q2, #220] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #252] -vldrw.u32 Q6, [Q2, #204] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #206] -vldrw.u32 Q0, [Q2, #252] -strh r10, [r0,#+448] -vmladavx.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #222] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #238] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #254] -vldrw.u32 Q7, [Q2, #204] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #208] -vldrw.u32 Q1, [Q2, #252] -strh r8, [r0,#+450] -vmladavx.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #224] -vldrw.u32 Q4, [Q2, #236] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #240] -vldrw.u32 Q6, [Q2, #220] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #210] -strh r6, [r0,#+452] -vmladavx.s16 r6, Q7, Q1 -vldrh.u16 Q0, [r1, #226] -vmladavax.s16 r6, Q0, Q4 -vldrh.u16 Q1, [r1, #242] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #212] -vldrw.u32 Q4, [Q2, #252] -strh r4, [r0,#+454] -vmladavx.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #228] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #244] -vldrw.u32 Q0, [Q2, #220] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #214] -strh r14, [r0,#+456] -vmladavx.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #230] -vmladavax.s16 r14, Q3, Q6 -vldrh.u16 Q4, [r1, #246] -vmladavax.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #216] -vldrw.u32 Q6, [Q2, #252] -strh r12, [r0,#+458] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #232] -vldrw.u32 Q0, [Q2, #236] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #248] -vldrw.u32 
Q3, [Q2, #220] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #218] -strh r10, [r0,#+460] -vmladavx.s16 r10, Q4, Q6 -vldrh.u16 Q5, [r1, #234] -vmladavax.s16 r10, Q5, Q0 -vldrh.u16 Q6, [r1, #250] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #220] -vldrw.u32 Q0, [Q2, #252] -strh r8, [r0,#+462] -vmladavx.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #236] -vldrw.u32 Q3, [Q2, #236] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #252] -vldrw.u32 Q5, [Q2, #220] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #222] -strh r6, [r0,#+464] -vmladavx.s16 r6, Q6, Q0 -vldrh.u16 Q7, [r1, #238] -vmladavax.s16 r6, Q7, Q3 -vldrh.u16 Q0, [r1, #254] -vmladavax.s16 r6, Q0, Q5 -vldrh.u16 Q1, [r1, #224] -vldrw.u32 Q3, [Q2, #252] -strh r4, [r0,#+466] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #240] -vldrw.u32 Q5, [Q2, #236] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #226] -strh r14, [r0,#+468] -vmladavx.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #242] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #228] -strh r12, [r0,#+470] -vmladavx.s16 r12, Q0, Q3 -vldrh.u16 Q1, [r1, #244] -vmladavax.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #230] -vldrw.u32 Q4, [Q2, #252] -strh r10, [r0,#+472] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #246] -vldrw.u32 Q6, [Q2, #236] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #232] -strh r8, [r0,#+474] -vmladavx.s16 r8, Q7, Q4 -vldrh.u16 Q0, [r1, #248] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #234] -strh r6, [r0,#+476] -vmladavx.s16 r6, Q1, Q4 -vldrh.u16 Q3, [r1, #250] -vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #236] -vldrw.u32 Q5, [Q2, #252] -strh r4, [r0,#+478] -vmladavx.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #252] -vldrw.u32 Q7, [Q2, #236] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #238] -strh r14, [r0,#+480] -vmladavx.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #254] -vmladavax.s16 r14, Q1, Q7 -vldrh.u16 Q3, [r1, #240] -strh r12, [r0,#+482] -vmladavx.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #242] -strh r10, [r0,#+484] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q5, [r1, #244] -vldrw.u32 
Q6, [Q2, #252] -strh r8, [r0,#+486] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #246] -strh r6, [r0,#+488] -vmladavx.s16 r6, Q7, Q6 -vldrh.u16 Q0, [r1, #248] -strh r4, [r0,#+490] -vmladavx.s16 r4, Q0, Q6 -vldrh.u16 Q1, [r1, #250] -strh r14, [r0,#+492] -vmladavx.s16 r14, Q1, Q6 -vldrh.u16 Q3, [r1, #252] -strh r12, [r0,#+494] -vmladavx.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #254] -strh r10, [r0,#+496] -vmladavx.s16 r10, Q4, Q6 -strh r10, [r0,#+508] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_128_mve_schoolbook.s b/tests/schoolbook/auto/poly_u16_mul_128_mve_schoolbook.s deleted file mode 100644 index f94a673..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_128_mve_schoolbook.s +++ /dev/null @@ -1,8713 +0,0 @@ -.syntax unified -.type poly_u16_mul_128_schoolbook_mve, %function -.global poly_u16_mul_128_schoolbook_mve -poly_u16_mul_128_schoolbook_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #4] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #6] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #8] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #10] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #12] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #12] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r8, 
[r0,#+18] -vmladavx.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #18] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #20] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #22] -strh r14, [r0,#+24] -vmladavx.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #24] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #26] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #28] -vldrw.u32 Q5, [Q2, #12] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #32] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #34] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #36] -strh r12, [r0,#+38] -vmladavx.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #38] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #40] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #12] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #46] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #48] -strh r12, [r0,#+50] -vmladavx.s16 r12, Q1, Q6 -vldrh.u16 Q3, [r1, #50] -strh r10, [r0,#+52] -vmladavx.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #52] -strh r8, [r0,#+54] -vmladavx.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #54] -strh r6, [r0,#+56] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #12] -strh r4, [r0,#+58] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -strh r14, [r0,#+60] -vmladavx.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #60] -strh r12, [r0,#+62] -vmladavx.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #62] -strh r10, [r0,#+64] -vmladavx.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #64] -strh r8, [r0,#+66] -vmladavx.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #66] -strh r6, [r0,#+68] -vmladavx.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #68] -strh r4, [r0,#+70] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q7, [r1, #70] -vldrw.u32 Q0, [Q2, 
#12]
-strh r14, [r0,#+72]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #72]
-strh r12, [r0,#+74]
-vmladavx.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #74]
-strh r10, [r0,#+76]
-vmladavx.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #76]
-strh r8, [r0,#+78]
-vmladavx.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #78]
-strh r6, [r0,#+80]
-vmladavx.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #80]
-strh r4, [r0,#+82]
-vmladavx.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #82]
-strh r14, [r0,#+84]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #84]
-vldrw.u32 Q1, [Q2, #12]
-strh r12, [r0,#+86]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #86]
-strh r10, [r0,#+88]
-vmladavx.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #88]
-strh r8, [r0,#+90]
-vmladavx.s16 r8, Q4, Q1
-vldrh.u16 Q5, [r1, #90]
-strh r6, [r0,#+92]
-vmladavx.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #92]
-strh r4, [r0,#+94]
-vmladavx.s16 r4, Q6, Q1
-vldrh.u16 Q7, [r1, #94]
-strh r14, [r0,#+96]
-vmladavx.s16 r14, Q7, Q1
-vldrh.u16 Q0, [r1, #96]
-strh r12, [r0,#+98]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q1, [r1, #98]
-vldrw.u32 Q3, [Q2, #12]
-strh r10, [r0,#+100]
-vmladavx.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #100]
-strh r8, [r0,#+102]
-vmladavx.s16 r8, Q4, Q3
-vldrh.u16 Q5, [r1, #102]
-strh r6, [r0,#+104]
-vmladavx.s16 r6, Q5, Q3
-vldrh.u16 Q6, [r1, #104]
-strh r4, [r0,#+106]
-vmladavx.s16 r4, Q6, Q3
-vldrh.u16 Q7, [r1, #106]
-strh r14, [r0,#+108]
-vmladavx.s16 r14, Q7, Q3
-vldrh.u16 Q0, [r1, #108]
-strh r12, [r0,#+110]
-vmladavx.s16 r12, Q0, Q3
-vldrh.u16 Q1, [r1, #110]
-strh r10, [r0,#+112]
-vmladavx.s16 r10, Q1, Q3
-vldrh.u16 Q3, [r1, #112]
-vldrw.u32 Q4, [Q2, #12]
-strh r8, [r0,#+114]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #114]
-strh r6, [r0,#+116]
-vmladavx.s16 r6, Q5, Q4
-vldrh.u16 Q6, [r1, #116]
-strh r4, [r0,#+118]
-vmladavx.s16 r4, Q6, Q4
-vldrh.u16 Q7, [r1, #118]
-strh r14, [r0,#+120]
-vmladavx.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #120]
-strh r12, [r0,#+122]
-vmladavx.s16 r12, Q0, Q4
-vldrh.u16 Q1, [r1, #122]
-strh r10, [r0,#+124]
-vmladavx.s16 r10, Q1, Q4
-vldrh.u16 Q3, [r1, #124]
-strh r8, [r0,#+126]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q4, [r1, #126]
-vldrw.u32 Q5, [Q2, #12]
-strh r6, [r0,#+128]
-vmladavx.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #128]
-strh r4, [r0,#+130]
-vmladavx.s16 r4, Q6, Q5
-vldrh.u16 Q7, [r1, #130]
-strh r14, [r0,#+132]
-vmladavx.s16 r14, Q7, Q5
-vldrh.u16 Q0, [r1, #132]
-strh r12, [r0,#+134]
-vmladavx.s16 r12, Q0, Q5
-vldrh.u16 Q1, [r1, #134]
-strh r10, [r0,#+136]
-vmladavx.s16 r10, Q1, Q5
-vldrh.u16 Q3, [r1, #136]
-strh r8, [r0,#+138]
-vmladavx.s16 r8, Q3, Q5
-vldrh.u16 Q4, [r1, #138]
-strh r6, [r0,#+140]
-vmladavx.s16 r6, Q4, Q5
-vldrh.u16 Q5, [r1, #140]
-vldrw.u32 Q6, [Q2, #12]
-strh r4, [r0,#+142]
-vmladavx.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #142]
-strh r14, [r0,#+144]
-vmladavx.s16 r14, Q7, Q6
-vldrh.u16 Q0, [r1, #144]
-strh r12, [r0,#+146]
-vmladavx.s16 r12, Q0, Q6
-vldrh.u16 Q1, [r1, #146]
-strh r10, [r0,#+148]
-vmladavx.s16 r10, Q1, Q6
-vldrh.u16 Q3, [r1, #148]
-strh r8, [r0,#+150]
-vmladavx.s16 r8, Q3, Q6
-vldrh.u16 Q4, [r1, #150]
-strh r6, [r0,#+152]
-vmladavx.s16 r6, Q4, Q6
-vldrh.u16 Q5, [r1, #152]
-strh r4, [r0,#+154]
-vmladavx.s16 r4, Q5, Q6
-vldrh.u16 Q6, [r1, #154]
-vldrw.u32 Q7, [Q2, #12]
-strh r14, [r0,#+156]
-vmladavx.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #156]
-strh r12, [r0,#+158]
-vmladavx.s16 r12, Q0, Q7
-vldrh.u16 Q1, [r1, #158]
-strh r10, [r0,#+160]
-vmladavx.s16 r10, Q1, Q7
-vldrh.u16 Q3, [r1, #160]
-strh r8, [r0,#+162]
-vmladavx.s16 r8, Q3, Q7
-vldrh.u16 Q4, [r1, #162]
-strh r6, [r0,#+164]
-vmladavx.s16 r6, Q4, Q7
-vldrh.u16 Q5, [r1, #164]
-strh r4, [r0,#+166]
-vmladavx.s16 r4, Q5, Q7
-vldrh.u16 Q6, [r1, #166]
-strh r14, [r0,#+168]
-vmladavx.s16 r14, Q6, Q7
-vldrh.u16 Q7, [r1, #168]
-vldrw.u32 Q0, [Q2, #12]
-strh r12, [r0,#+170]
-vmladavx.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #170]
-strh r10, [r0,#+172]
-vmladavx.s16 r10, Q1, Q0
-vldrh.u16 Q3, [r1, #172]
-strh r8, [r0,#+174]
-vmladavx.s16 r8, Q3, Q0
-vldrh.u16 Q4, [r1, #174]
-strh r6, [r0,#+176]
-vmladavx.s16 r6, Q4, Q0
-vldrh.u16 Q5, [r1, #176]
-strh r4, [r0,#+178]
-vmladavx.s16 r4, Q5, Q0
-vldrh.u16 Q6, [r1, #178]
-strh r14, [r0,#+180]
-vmladavx.s16 r14, Q6, Q0
-vldrh.u16 Q7, [r1, #180]
-strh r12, [r0,#+182]
-vmladavx.s16 r12, Q7, Q0
-vldrh.u16 Q0, [r1, #182]
-vldrw.u32 Q1, [Q2, #12]
-strh r10, [r0,#+184]
-vmladavx.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #184]
-strh r8, [r0,#+186]
-vmladavx.s16 r8, Q3, Q1
-vldrh.u16 Q4, [r1, #186]
-strh r6, [r0,#+188]
-vmladavx.s16 r6, Q4, Q1
-vldrh.u16 Q5, [r1, #188]
-strh r4, [r0,#+190]
-vmladavx.s16 r4, Q5, Q1
-vldrh.u16 Q6, [r1, #190]
-strh r14, [r0,#+192]
-vmladavx.s16 r14, Q6, Q1
-vldrh.u16 Q7, [r1, #192]
-strh r12, [r0,#+194]
-vmladavx.s16 r12, Q7, Q1
-vldrh.u16 Q0, [r1, #194]
-strh r10, [r0,#+196]
-vmladavx.s16 r10, Q0, Q1
-vldrh.u16 Q1, [r1, #196]
-vldrw.u32 Q3, [Q2, #12]
-strh r8, [r0,#+198]
-vmladavx.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #198]
-strh r6, [r0,#+200]
-vmladavx.s16 r6, Q4, Q3
-vldrh.u16 Q5, [r1, #200]
-strh r4, [r0,#+202]
-vmladavx.s16 r4, Q5, Q3
-vldrh.u16 Q6, [r1, #202]
-strh r14, [r0,#+204]
-vmladavx.s16 r14, Q6, Q3
-vldrh.u16 Q7, [r1, #204]
-strh r12, [r0,#+206]
-vmladavx.s16 r12, Q7, Q3
-vldrh.u16 Q0, [r1, #206]
-strh r10, [r0,#+208]
-vmladavx.s16 r10, Q0, Q3
-vldrh.u16 Q1, [r1, #208]
-strh r8, [r0,#+210]
-vmladavx.s16 r8, Q1, Q3
-vldrh.u16 Q3, [r1, #210]
-vldrw.u32 Q4, [Q2, #12]
-strh r6, [r0,#+212]
-vmladavx.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #212]
-strh r4, [r0,#+214]
-vmladavx.s16 r4, Q5, Q4
-vldrh.u16 Q6, [r1, #214]
-strh r14, [r0,#+216]
-vmladavx.s16 r14, Q6, Q4
-vldrh.u16 Q7, [r1, #216]
-strh r12, [r0,#+218]
-vmladavx.s16 r12, Q7, Q4
-vldrh.u16 Q0, [r1, #218]
-strh r10, [r0,#+220]
-vmladavx.s16 r10, Q0, Q4
-vldrh.u16 Q1, [r1, #220]
-strh r8, [r0,#+222]
-vmladavx.s16 r8, Q1, Q4
-vldrh.u16 Q3, [r1, #222]
-strh r6, [r0,#+224]
-vmladavx.s16 r6, Q3, Q4
-vldrh.u16 Q4, [r1, #224]
-vldrw.u32 Q5, [Q2, #12]
-strh r4, [r0,#+226]
-vmladavx.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #226]
-strh r14, [r0,#+228]
-vmladavx.s16 r14, Q6, Q5
-vldrh.u16 Q7, [r1, #228]
-strh r12, [r0,#+230]
-vmladavx.s16 r12, Q7, Q5
-vldrh.u16 Q0, [r1, #230]
-strh r10, [r0,#+232]
-vmladavx.s16 r10, Q0, Q5
-vldrh.u16 Q1, [r1, #232]
-strh r8, [r0,#+234]
-vmladavx.s16 r8, Q1, Q5
-vldrh.u16 Q3, [r1, #234]
-strh r6, [r0,#+236]
-vmladavx.s16 r6, Q3, Q5
-vldrh.u16 Q4, [r1, #236]
-strh r4, [r0,#+238]
-vmladavx.s16 r4, Q4, Q5
-vldrh.u16 Q5, [r1, #238]
-vldrw.u32 Q6, [Q2, #12]
-strh r14, [r0,#+240]
-vmladavx.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #240]
-strh r12, [r0,#+242]
-vmladavx.s16 r12, Q7, Q6
-vldrh.u16 Q0, [r1, #242]
-strh r10, [r0,#+244]
-vmladavx.s16 r10, Q0, Q6
-vldrh.u16 Q1, [r1, #244]
-strh r8, [r0,#+246]
-vmladavx.s16 r8, Q1, Q6
-vldrh.u16 Q3, [r1, #246]
-strh r6, [r0,#+248]
-vmladavx.s16 r6, Q3, Q6
-vldrh.u16 Q4, [r1, #248]
-strh r4, [r0,#+250]
-vmladavx.s16 r4, Q4, Q6
-vldrh.u16 Q5, [r1, #250]
-strh r14, [r0,#+252]
-vmladavx.s16 r14, Q5, Q6
-vldrh.u16 Q6, [r1, #252]
-vldrw.u32 Q7, [Q2, #12]
-strh r12, [r0,#+254]
-vmladavx.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #254]
-strh r10, [r0,#+256]
-vmladavx.s16 r10, Q0, Q7
-vldrh.u16 Q1, [r1, #-14]
-vldrw.u32 Q3, [Q2, #28]
-strh r8, [r0,#+258]
-ldrh r8, [r0,#+16]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #-12]
-strh r6, [r0,#+260]
-ldrh r6, [r0,#+18]
-vmladavax.s16 r6, Q4, Q3
-vldrh.u16 Q5, [r1, #-10]
-strh r4, [r0,#+262]
-ldrh r4, [r0,#+20]
-vmladavax.s16 r4, Q5, Q3
-vldrh.u16 Q6, [r1, #-8]
-strh r14, [r0,#+264]
-ldrh r14, [r0,#+22]
-vmladavax.s16 r14, Q6, Q3
-vldrh.u16 Q7, [r1, #-6]
-strh r12, [r0,#+266]
-ldrh r12, [r0,#+24]
-vmladavax.s16 r12, Q7, Q3
-vldrh.u16 Q0, [r1, #-4]
-strh r10, [r0,#+268]
-ldrh r10, [r0,#+26]
-vmladavax.s16 r10, Q0, Q3
-vldrh.u16 Q1, [r1, #-2]
-strh r8, [r0,#+16]
-ldrh r8, [r0,#+28]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q3, [r1, #0]
-vldrw.u32 Q4, [Q2, #28]
-strh r6, [r0,#+18]
-ldrh r6, [r0,#+30]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #2]
-strh r4, [r0,#+20]
-ldrh r4, [r0,#+32]
-vmladavax.s16 r4, Q5, Q4
-vldrh.u16 Q6, [r1, #4]
-strh r14, [r0,#+22]
-ldrh r14, [r0,#+34]
-vmladavax.s16 r14, Q6, Q4
-vldrh.u16 Q7, [r1, #6]
-strh r12, [r0,#+24]
-ldrh r12, [r0,#+36]
-vmladavax.s16 r12, Q7, Q4
-vldrh.u16 Q0, [r1, #8]
-strh r10, [r0,#+26]
-ldrh r10, [r0,#+38]
-vmladavax.s16 r10, Q0, Q4
-vldrh.u16 Q1, [r1, #10]
-strh r8, [r0,#+28]
-ldrh r8, [r0,#+40]
-vmladavax.s16 r8, Q1, Q4
-vldrh.u16 Q3, [r1, #12]
-strh r6, [r0,#+30]
-ldrh r6, [r0,#+42]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q4, [r1, #14]
-vldrw.u32 Q5, [Q2, #28]
-strh r4, [r0,#+32]
-ldrh r4, [r0,#+44]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #16]
-strh r14, [r0,#+34]
-ldrh r14, [r0,#+46]
-vmladavax.s16 r14, Q6, Q5
-vldrh.u16 Q7, [r1, #18]
-strh r12, [r0,#+36]
-ldrh r12, [r0,#+48]
-vmladavax.s16 r12, Q7, Q5
-vldrh.u16 Q0, [r1, #20]
-strh r10, [r0,#+38]
-ldrh r10, [r0,#+50]
-vmladavax.s16 r10, Q0, Q5
-vldrh.u16 Q1, [r1, #22]
-strh r8, [r0,#+40]
-ldrh r8, [r0,#+52]
-vmladavax.s16 r8, Q1, Q5
-vldrh.u16 Q3, [r1, #24]
-strh r6, [r0,#+42]
-ldrh r6, [r0,#+54]
-vmladavax.s16 r6, Q3, Q5
-vldrh.u16 Q4, [r1, #26]
-strh r4, [r0,#+44]
-ldrh r4, [r0,#+56]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q5, [r1, #28]
-vldrw.u32 Q6, [Q2, #28]
-strh r14, [r0,#+46]
-ldrh r14, [r0,#+58]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #30]
-strh r12, [r0,#+48]
-ldrh r12, [r0,#+60]
-vmladavax.s16 r12, Q7, Q6
-vldrh.u16 Q0, [r1, #32]
-strh r10, [r0,#+50]
-ldrh r10, [r0,#+62]
-vmladavax.s16 r10, Q0, Q6
-vldrh.u16 Q1, [r1, #34]
-strh r8, [r0,#+52]
-ldrh r8, [r0,#+64]
-vmladavax.s16 r8, Q1, Q6
-vldrh.u16 Q3, [r1, #36]
-strh r6, [r0,#+54]
-ldrh r6, [r0,#+66]
-vmladavax.s16 r6, Q3, Q6
-vldrh.u16 Q4, [r1, #38]
-strh r4, [r0,#+56]
-ldrh r4, [r0,#+68]
-vmladavax.s16 r4, Q4, Q6
-vldrh.u16 Q5, [r1, #40]
-strh r14, [r0,#+58]
-ldrh r14, [r0,#+70]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q6, [r1, #42]
-vldrw.u32 Q7, [Q2, #28]
-strh r12, [r0,#+60]
-ldrh r12, [r0,#+72]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #44]
-strh r10, [r0,#+62]
-ldrh r10, [r0,#+74]
-vmladavax.s16 r10, Q0, Q7
-vldrh.u16 Q1, [r1, #46]
-strh r8, [r0,#+64]
-ldrh r8, [r0,#+76]
-vmladavax.s16 r8, Q1, Q7
-vldrh.u16 Q3, [r1, #48]
-strh r6, [r0,#+66]
-ldrh r6, [r0,#+78]
-vmladavax.s16 r6, Q3, Q7
-vldrh.u16 Q4, [r1, #50]
-strh r4, [r0,#+68]
-ldrh r4, [r0,#+80]
-vmladavax.s16 r4, Q4, Q7
-vldrh.u16 Q5, [r1, #52]
-strh r14, [r0,#+70]
-ldrh r14, [r0,#+82]
-vmladavax.s16 r14, Q5, Q7
-vldrh.u16 Q6, [r1, #54]
-strh r12, [r0,#+72]
-ldrh r12, [r0,#+84]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q7, [r1, #56]
-vldrw.u32 Q0, [Q2, #28]
-strh r10, [r0,#+74]
-ldrh r10, [r0,#+86]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #58]
-strh r8, [r0,#+76]
-ldrh r8, [r0,#+88]
-vmladavax.s16 r8, Q1, Q0
-vldrh.u16 Q3, [r1, #60]
-strh r6, [r0,#+78]
-ldrh r6, [r0,#+90]
-vmladavax.s16 r6, Q3, Q0
-vldrh.u16 Q4, [r1, #62]
-strh r4, [r0,#+80]
-ldrh r4, [r0,#+92]
-vmladavax.s16 r4, Q4, Q0
-vldrh.u16 Q5, [r1, #64]
-strh r14, [r0,#+82]
-ldrh r14, [r0,#+94]
-vmladavax.s16 r14, Q5, Q0
-vldrh.u16 Q6, [r1, #66]
-strh r12, [r0,#+84]
-ldrh r12, [r0,#+96]
-vmladavax.s16 r12, Q6, Q0
-vldrh.u16 Q7, [r1, #68]
-strh r10, [r0,#+86]
-ldrh r10, [r0,#+98]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q0, [r1, #70]
-vldrw.u32 Q1, [Q2, #28]
-strh r8, [r0,#+88]
-ldrh r8, [r0,#+100]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #72]
-strh r6, [r0,#+90]
-ldrh r6, [r0,#+102]
-vmladavax.s16 r6, Q3, Q1
-vldrh.u16 Q4, [r1, #74]
-strh r4, [r0,#+92]
-ldrh r4, [r0,#+104]
-vmladavax.s16 r4, Q4, Q1
-vldrh.u16 Q5, [r1, #76]
-strh r14, [r0,#+94]
-ldrh r14, [r0,#+106]
-vmladavax.s16 r14, Q5, Q1
-vldrh.u16 Q6, [r1, #78]
-strh r12, [r0,#+96]
-ldrh r12, [r0,#+108]
-vmladavax.s16 r12, Q6, Q1
-vldrh.u16 Q7, [r1, #80]
-strh r10, [r0,#+98]
-ldrh r10, [r0,#+110]
-vmladavax.s16 r10, Q7, Q1
-vldrh.u16 Q0, [r1, #82]
-strh r8, [r0,#+100]
-ldrh r8, [r0,#+112]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q1, [r1, #84]
-vldrw.u32 Q3, [Q2, #28]
-strh r6, [r0,#+102]
-ldrh r6, [r0,#+114]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #86]
-strh r4, [r0,#+104]
-ldrh r4, [r0,#+116]
-vmladavax.s16 r4, Q4, Q3
-vldrh.u16 Q5, [r1, #88]
-strh r14, [r0,#+106]
-ldrh r14, [r0,#+118]
-vmladavax.s16 r14, Q5, Q3
-vldrh.u16 Q6, [r1, #90]
-strh r12, [r0,#+108]
-ldrh r12, [r0,#+120]
-vmladavax.s16 r12, Q6, Q3
-vldrh.u16 Q7, [r1, #92]
-strh r10, [r0,#+110]
-ldrh r10, [r0,#+122]
-vmladavax.s16 r10, Q7, Q3
-vldrh.u16 Q0, [r1, #94]
-strh r8, [r0,#+112]
-ldrh r8, [r0,#+124]
-vmladavax.s16 r8, Q0, Q3
-vldrh.u16 Q1, [r1, #96]
-strh r6, [r0,#+114]
-ldrh r6, [r0,#+126]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q3, [r1, #98]
-vldrw.u32 Q4, [Q2, #28]
-strh r4, [r0,#+116]
-ldrh r4, [r0,#+128]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #100]
-strh r14, [r0,#+118]
-ldrh r14, [r0,#+130]
-vmladavax.s16 r14, Q5, Q4
-vldrh.u16 Q6, [r1, #102]
-strh r12, [r0,#+120]
-ldrh r12, [r0,#+132]
-vmladavax.s16 r12, Q6, Q4
-vldrh.u16 Q7, [r1, #104]
-strh r10, [r0,#+122]
-ldrh r10, [r0,#+134]
-vmladavax.s16 r10, Q7, Q4
-vldrh.u16 Q0, [r1, #106]
-strh r8, [r0,#+124]
-ldrh r8, [r0,#+136]
-vmladavax.s16 r8, Q0, Q4
-vldrh.u16 Q1, [r1, #108]
-strh r6, [r0,#+126]
-ldrh r6, [r0,#+138]
-vmladavax.s16 r6, Q1, Q4
-vldrh.u16 Q3, [r1, #110]
-strh r4, [r0,#+128]
-ldrh r4, [r0,#+140]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q4, [r1, #112]
-vldrw.u32 Q5, [Q2, #28]
-strh r14, [r0,#+130]
-ldrh r14, [r0,#+142]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #114]
-strh r12, [r0,#+132]
-ldrh r12, [r0,#+144]
-vmladavax.s16 r12, Q6, Q5
-vldrh.u16 Q7, [r1, #116]
-strh r10, [r0,#+134]
-ldrh r10, [r0,#+146]
-vmladavax.s16 r10, Q7, Q5
-vldrh.u16 Q0, [r1, #118]
-strh r8, [r0,#+136]
-ldrh r8, [r0,#+148]
-vmladavax.s16 r8, Q0, Q5
-vldrh.u16 Q1, [r1, #120]
-strh r6, [r0,#+138]
-ldrh r6, [r0,#+150]
-vmladavax.s16 r6, Q1, Q5
-vldrh.u16 Q3, [r1, #122]
-strh r4, [r0,#+140]
-ldrh r4, [r0,#+152]
-vmladavax.s16 r4, Q3, Q5
-vldrh.u16 Q4, [r1, #124]
-strh r14, [r0,#+142]
-ldrh r14, [r0,#+154]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q5, [r1, #126]
-vldrw.u32 Q6, [Q2, #28]
-strh r12, [r0,#+144]
-ldrh r12, [r0,#+156]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #128]
-strh r10, [r0,#+146]
-ldrh r10, [r0,#+158]
-vmladavax.s16 r10, Q7, Q6
-vldrh.u16 Q0, [r1, #130]
-strh r8, [r0,#+148]
-ldrh r8, [r0,#+160]
-vmladavax.s16 r8, Q0, Q6
-vldrh.u16 Q1, [r1, #132]
-strh r6, [r0,#+150]
-ldrh r6, [r0,#+162]
-vmladavax.s16 r6, Q1, Q6
-vldrh.u16 Q3, [r1, #134]
-strh r4, [r0,#+152]
-ldrh r4, [r0,#+164]
-vmladavax.s16 r4, Q3, Q6
-vldrh.u16 Q4, [r1, #136]
-strh r14, [r0,#+154]
-ldrh r14, [r0,#+166]
-vmladavax.s16 r14, Q4, Q6
-vldrh.u16 Q5, [r1, #138]
-strh r12, [r0,#+156]
-ldrh r12, [r0,#+168]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q6, [r1, #140]
-vldrw.u32 Q7, [Q2, #28]
-strh r10, [r0,#+158]
-ldrh r10, [r0,#+170]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #142]
-strh r8, [r0,#+160]
-ldrh r8, [r0,#+172]
-vmladavax.s16 r8, Q0, Q7
-vldrh.u16 Q1, [r1, #144]
-strh r6, [r0,#+162]
-ldrh r6, [r0,#+174]
-vmladavax.s16 r6, Q1, Q7
-vldrh.u16 Q3, [r1, #146]
-strh r4, [r0,#+164]
-ldrh r4, [r0,#+176]
-vmladavax.s16 r4, Q3, Q7
-vldrh.u16 Q4, [r1, #148]
-strh r14, [r0,#+166]
-ldrh r14, [r0,#+178]
-vmladavax.s16 r14, Q4, Q7
-vldrh.u16 Q5, [r1, #150]
-strh r12, [r0,#+168]
-ldrh r12, [r0,#+180]
-vmladavax.s16 r12, Q5, Q7
-vldrh.u16 Q6, [r1, #152]
-strh r10, [r0,#+170]
-ldrh r10, [r0,#+182]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q7, [r1, #154]
-vldrw.u32 Q0, [Q2, #28]
-strh r8, [r0,#+172]
-ldrh r8, [r0,#+184]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #156]
-strh r6, [r0,#+174]
-ldrh r6, [r0,#+186]
-vmladavax.s16 r6, Q1, Q0
-vldrh.u16 Q3, [r1, #158]
-strh r4, [r0,#+176]
-ldrh r4, [r0,#+188]
-vmladavax.s16 r4, Q3, Q0
-vldrh.u16 Q4, [r1, #160]
-strh r14, [r0,#+178]
-ldrh r14, [r0,#+190]
-vmladavax.s16 r14, Q4, Q0
-vldrh.u16 Q5, [r1, #162]
-strh r12, [r0,#+180]
-ldrh r12, [r0,#+192]
-vmladavax.s16 r12, Q5, Q0
-vldrh.u16 Q6, [r1, #164]
-strh r10, [r0,#+182]
-ldrh r10, [r0,#+194]
-vmladavax.s16 r10, Q6, Q0
-vldrh.u16 Q7, [r1, #166]
-strh r8, [r0,#+184]
-ldrh r8, [r0,#+196]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q0, [r1, #168]
-vldrw.u32 Q1, [Q2, #28]
-strh r6, [r0,#+186]
-ldrh r6, [r0,#+198]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #170]
-strh r4, [r0,#+188]
-ldrh r4, [r0,#+200]
-vmladavax.s16 r4, Q3, Q1
-vldrh.u16 Q4, [r1, #172]
-strh r14, [r0,#+190]
-ldrh r14, [r0,#+202]
-vmladavax.s16 r14, Q4, Q1
-vldrh.u16 Q5, [r1, #174]
-strh r12, [r0,#+192]
-ldrh r12, [r0,#+204]
-vmladavax.s16 r12, Q5, Q1
-vldrh.u16 Q6, [r1, #176]
-strh r10, [r0,#+194]
-ldrh r10, [r0,#+206]
-vmladavax.s16 r10, Q6, Q1
-vldrh.u16 Q7, [r1, #178]
-strh r8, [r0,#+196]
-ldrh r8, [r0,#+208]
-vmladavax.s16 r8, Q7, Q1
-vldrh.u16 Q0, [r1, #180]
-strh r6, [r0,#+198]
-ldrh r6, [r0,#+210]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q1, [r1, #182]
-vldrw.u32 Q3, [Q2, #28]
-strh r4, [r0,#+200]
-ldrh r4, [r0,#+212]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #184]
-strh r14, [r0,#+202]
-ldrh r14, [r0,#+214]
-vmladavax.s16 r14, Q4, Q3
-vldrh.u16 Q5, [r1, #186]
-strh r12, [r0,#+204]
-ldrh r12, [r0,#+216]
-vmladavax.s16 r12, Q5, Q3
-vldrh.u16 Q6, [r1, #188]
-strh r10, [r0,#+206]
-ldrh r10, [r0,#+218]
-vmladavax.s16 r10, Q6, Q3
-vldrh.u16 Q7, [r1, #190]
-strh r8, [r0,#+208]
-ldrh r8, [r0,#+220]
-vmladavax.s16 r8, Q7, Q3
-vldrh.u16 Q0, [r1, #192]
-strh r6, [r0,#+210]
-ldrh r6, [r0,#+222]
-vmladavax.s16 r6, Q0, Q3
-vldrh.u16 Q1, [r1, #194]
-strh r4, [r0,#+212]
-ldrh r4, [r0,#+224]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q3, [r1, #196]
-vldrw.u32 Q4, [Q2, #28]
-strh r14, [r0,#+214]
-ldrh r14, [r0,#+226]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #198]
-strh r12, [r0,#+216]
-ldrh r12, [r0,#+228]
-vmladavax.s16 r12, Q5, Q4
-vldrh.u16 Q6, [r1, #200]
-strh r10, [r0,#+218]
-ldrh r10, [r0,#+230]
-vmladavax.s16 r10, Q6, Q4
-vldrh.u16 Q7, [r1, #202]
-strh r8, [r0,#+220]
-ldrh r8, [r0,#+232]
-vmladavax.s16 r8, Q7, Q4
-vldrh.u16 Q0, [r1, #204]
-strh r6, [r0,#+222]
-ldrh r6, [r0,#+234]
-vmladavax.s16 r6, Q0, Q4
-vldrh.u16 Q1, [r1, #206]
-strh r4, [r0,#+224]
-ldrh r4, [r0,#+236]
-vmladavax.s16 r4, Q1, Q4
-vldrh.u16 Q3, [r1, #208]
-strh r14, [r0,#+226]
-ldrh r14, [r0,#+238]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q4, [r1, #210]
-vldrw.u32 Q5, [Q2, #28]
-strh r12, [r0,#+228]
-ldrh r12, [r0,#+240]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #212]
-strh r10, [r0,#+230]
-ldrh r10, [r0,#+242]
-vmladavax.s16 r10, Q6, Q5
-vldrh.u16 Q7, [r1, #214]
-strh r8, [r0,#+232]
-ldrh r8, [r0,#+244]
-vmladavax.s16 r8, Q7, Q5
-vldrh.u16 Q0, [r1, #216]
-strh r6, [r0,#+234]
-ldrh r6, [r0,#+246]
-vmladavax.s16 r6, Q0, Q5
-vldrh.u16 Q1, [r1, #218]
-strh r4, [r0,#+236]
-ldrh r4, [r0,#+248]
-vmladavax.s16 r4, Q1, Q5
-vldrh.u16 Q3, [r1, #220]
-strh r14, [r0,#+238]
-ldrh r14, [r0,#+250]
-vmladavax.s16 r14, Q3, Q5
-vldrh.u16 Q4, [r1, #222]
-strh r12, [r0,#+240]
-ldrh r12, [r0,#+252]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q5, [r1, #224]
-vldrw.u32 Q6, [Q2, #28]
-strh r10, [r0,#+242]
-ldrh r10, [r0,#+254]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #226]
-strh r8, [r0,#+244]
-ldrh r8, [r0,#+256]
-vmladavax.s16 r8, Q7, Q6
-vldrh.u16 Q0, [r1, #228]
-strh r6, [r0,#+246]
-ldrh r6, [r0,#+258]
-vmladavax.s16 r6, Q0, Q6
-vldrh.u16 Q1, [r1, #230]
-strh r4, [r0,#+248]
-ldrh r4, [r0,#+260]
-vmladavax.s16 r4, Q1, Q6
-vldrh.u16 Q3, [r1, #232]
-strh r14, [r0,#+250]
-ldrh r14, [r0,#+262]
-vmladavax.s16 r14, Q3, Q6
-vldrh.u16 Q4, [r1, #234]
-strh r12, [r0,#+252]
-ldrh r12, [r0,#+264]
-vmladavax.s16 r12, Q4, Q6
-vldrh.u16 Q5, [r1, #236]
-strh r10, [r0,#+254]
-ldrh r10, [r0,#+266]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q6, [r1, #238]
-vldrw.u32 Q7, [Q2, #28]
-strh r8, [r0,#+256]
-ldrh r8, [r0,#+268]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #240]
-strh r6, [r0,#+258]
-vmladavx.s16 r6, Q0, Q7
-vldrh.u16 Q1, [r1, #242]
-strh r4, [r0,#+260]
-vmladavx.s16 r4, Q1, Q7
-vldrh.u16 Q3, [r1, #244]
-strh r14, [r0,#+262]
-vmladavx.s16 r14, Q3, Q7
-vldrh.u16 Q4, [r1, #246]
-strh r12, [r0,#+264]
-vmladavx.s16 r12, Q4, Q7
-vldrh.u16 Q5, [r1, #248]
-strh r10, [r0,#+266]
-vmladavx.s16 r10, Q5, Q7
-vldrh.u16 Q6, [r1, #250]
-strh r8, [r0,#+268]
-vmladavx.s16 r8, Q6, Q7
-vldrh.u16 Q7, [r1, #252]
-vldrw.u32 Q0, [Q2, #28]
-strh r6, [r0,#+270]
-vmladavx.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #254]
-strh r4, [r0,#+272]
-vmladavx.s16 r4, Q1, Q0
-vldrh.u16 Q3, [r1, #-14]
-vldrw.u32 Q4, [Q2, #44]
-strh r14, [r0,#+274]
-ldrh r14, [r0,#+32]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #-12]
-strh r12, [r0,#+276]
-ldrh r12, [r0,#+34]
-vmladavax.s16 r12, Q5, Q4
-vldrh.u16 Q6, [r1, #-10]
-strh r10, [r0,#+278]
-ldrh r10, [r0,#+36]
-vmladavax.s16 r10, Q6, Q4
-vldrh.u16 Q7, [r1, #-8]
-strh r8, [r0,#+280]
-ldrh r8, [r0,#+38]
-vmladavax.s16 r8, Q7, Q4
-vldrh.u16 Q0, [r1, #-6]
-strh r6, [r0,#+282]
-ldrh r6, [r0,#+40]
-vmladavax.s16 r6, Q0, Q4
-vldrh.u16 Q1, [r1, #-4]
-strh r4, [r0,#+284]
-ldrh r4, [r0,#+42]
-vmladavax.s16 r4, Q1, Q4
-vldrh.u16 Q3, [r1, #-2]
-strh r14, [r0,#+32]
-ldrh r14, [r0,#+44]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q4, [r1, #0]
-vldrw.u32 Q5, [Q2, #44]
-strh r12, [r0,#+34]
-ldrh r12, [r0,#+46]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #2]
-strh r10, [r0,#+36]
-ldrh r10, [r0,#+48]
-vmladavax.s16 r10, Q6, Q5
-vldrh.u16 Q7, [r1, #4]
-strh r8, [r0,#+38]
-ldrh r8, [r0,#+50]
-vmladavax.s16 r8, Q7, Q5
-vldrh.u16 Q0, [r1, #6]
-strh r6, [r0,#+40]
-ldrh r6, [r0,#+52]
-vmladavax.s16 r6, Q0, Q5
-vldrh.u16 Q1, [r1, #8]
-strh r4, [r0,#+42]
-ldrh r4, [r0,#+54]
-vmladavax.s16 r4, Q1, Q5
-vldrh.u16 Q3, [r1, #10]
-strh r14, [r0,#+44]
-ldrh r14, [r0,#+56]
-vmladavax.s16 r14, Q3, Q5
-vldrh.u16 Q4, [r1, #12]
-strh r12, [r0,#+46]
-ldrh r12, [r0,#+58]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q5, [r1, #14]
-vldrw.u32 Q6, [Q2, #44]
-strh r10, [r0,#+48]
-ldrh r10, [r0,#+60]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #16]
-strh r8, [r0,#+50]
-ldrh r8, [r0,#+62]
-vmladavax.s16 r8, Q7, Q6
-vldrh.u16 Q0, [r1, #18]
-strh r6, [r0,#+52]
-ldrh r6, [r0,#+64]
-vmladavax.s16 r6, Q0, Q6
-vldrh.u16 Q1, [r1, #20]
-strh r4, [r0,#+54]
-ldrh r4, [r0,#+66]
-vmladavax.s16 r4, Q1, Q6
-vldrh.u16 Q3, [r1, #22]
-strh r14, [r0,#+56]
-ldrh r14, [r0,#+68]
-vmladavax.s16 r14, Q3, Q6
-vldrh.u16 Q4, [r1, #24]
-strh r12, [r0,#+58]
-ldrh r12, [r0,#+70]
-vmladavax.s16 r12, Q4, Q6
-vldrh.u16 Q5, [r1, #26]
-strh r10, [r0,#+60]
-ldrh r10, [r0,#+72]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q6, [r1, #28]
-vldrw.u32 Q7, [Q2, #44]
-strh r8, [r0,#+62]
-ldrh r8, [r0,#+74]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #30]
-strh r6, [r0,#+64]
-ldrh r6, [r0,#+76]
-vmladavax.s16 r6, Q0, Q7
-vldrh.u16 Q1, [r1, #32]
-strh r4, [r0,#+66]
-ldrh r4, [r0,#+78]
-vmladavax.s16 r4, Q1, Q7
-vldrh.u16 Q3, [r1, #34]
-strh r14, [r0,#+68]
-ldrh r14, [r0,#+80]
-vmladavax.s16 r14, Q3, Q7
-vldrh.u16 Q4, [r1, #36]
-strh r12, [r0,#+70]
-ldrh r12, [r0,#+82]
-vmladavax.s16 r12, Q4, Q7
-vldrh.u16 Q5, [r1, #38]
-strh r10, [r0,#+72]
-ldrh r10, [r0,#+84]
-vmladavax.s16 r10, Q5, Q7
-vldrh.u16 Q6, [r1, #40]
-strh r8, [r0,#+74]
-ldrh r8, [r0,#+86]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q7, [r1, #42]
-vldrw.u32 Q0, [Q2, #44]
-strh r6, [r0,#+76]
-ldrh r6, [r0,#+88]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #44]
-strh r4, [r0,#+78]
-ldrh r4, [r0,#+90]
-vmladavax.s16 r4, Q1, Q0
-vldrh.u16 Q3, [r1, #46]
-strh r14, [r0,#+80]
-ldrh r14, [r0,#+92]
-vmladavax.s16 r14, Q3, Q0
-vldrh.u16 Q4, [r1, #48]
-strh r12, [r0,#+82]
-ldrh r12, [r0,#+94]
-vmladavax.s16 r12, Q4, Q0
-vldrh.u16 Q5, [r1, #50]
-strh r10, [r0,#+84]
-ldrh r10, [r0,#+96]
-vmladavax.s16 r10, Q5, Q0
-vldrh.u16 Q6, [r1, #52]
-strh r8, [r0,#+86]
-ldrh r8, [r0,#+98]
-vmladavax.s16 r8, Q6, Q0
-vldrh.u16 Q7, [r1, #54]
-strh r6, [r0,#+88]
-ldrh r6, [r0,#+100]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q0, [r1, #56]
-vldrw.u32 Q1, [Q2, #44]
-strh r4, [r0,#+90]
-ldrh r4, [r0,#+102]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #58]
-strh r14, [r0,#+92]
-ldrh r14, [r0,#+104]
-vmladavax.s16 r14, Q3, Q1
-vldrh.u16 Q4, [r1, #60]
-strh r12, [r0,#+94]
-ldrh r12, [r0,#+106]
-vmladavax.s16 r12, Q4, Q1
-vldrh.u16 Q5, [r1, #62]
-strh r10, [r0,#+96]
-ldrh r10, [r0,#+108]
-vmladavax.s16 r10, Q5, Q1
-vldrh.u16 Q6, [r1, #64]
-strh r8, [r0,#+98]
-ldrh r8, [r0,#+110]
-vmladavax.s16 r8, Q6, Q1
-vldrh.u16 Q7, [r1, #66]
-strh r6, [r0,#+100]
-ldrh r6, [r0,#+112]
-vmladavax.s16 r6, Q7, Q1
-vldrh.u16 Q0, [r1, #68]
-strh r4, [r0,#+102]
-ldrh r4, [r0,#+114]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q1, [r1, #70]
-vldrw.u32 Q3, [Q2, #44]
-strh r14, [r0,#+104]
-ldrh r14, [r0,#+116]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #72]
-strh r12, [r0,#+106]
-ldrh r12, [r0,#+118]
-vmladavax.s16 r12, Q4, Q3
-vldrh.u16 Q5, [r1, #74]
-strh r10, [r0,#+108]
-ldrh r10, [r0,#+120]
-vmladavax.s16 r10, Q5, Q3
-vldrh.u16 Q6, [r1, #76]
-strh r8, [r0,#+110]
-ldrh r8, [r0,#+122]
-vmladavax.s16 r8, Q6, Q3
-vldrh.u16 Q7, [r1, #78]
-strh r6, [r0,#+112]
-ldrh r6, [r0,#+124]
-vmladavax.s16 r6, Q7, Q3
-vldrh.u16 Q0, [r1, #80]
-strh r4, [r0,#+114]
-ldrh r4, [r0,#+126]
-vmladavax.s16 r4, Q0, Q3
-vldrh.u16 Q1, [r1, #82]
-strh r14, [r0,#+116]
-ldrh r14, [r0,#+128]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q3, [r1, #84]
-vldrw.u32 Q4, [Q2, #44]
-strh r12, [r0,#+118]
-ldrh r12, [r0,#+130]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #86]
-strh r10, [r0,#+120]
-ldrh r10, [r0,#+132]
-vmladavax.s16 r10, Q5, Q4
-vldrh.u16 Q6, [r1, #88]
-strh r8, [r0,#+122]
-ldrh r8, [r0,#+134]
-vmladavax.s16 r8, Q6, Q4
-vldrh.u16 Q7, [r1, #90]
-strh r6, [r0,#+124]
-ldrh r6, [r0,#+136]
-vmladavax.s16 r6, Q7, Q4
-vldrh.u16 Q0, [r1, #92]
-strh r4, [r0,#+126]
-ldrh r4, [r0,#+138]
-vmladavax.s16 r4, Q0, Q4
-vldrh.u16 Q1, [r1, #94]
-strh r14, [r0,#+128]
-ldrh r14, [r0,#+140]
-vmladavax.s16 r14, Q1, Q4
-vldrh.u16 Q3, [r1, #96]
-strh r12, [r0,#+130]
-ldrh r12, [r0,#+142]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q4, [r1, #98]
-vldrw.u32 Q5, [Q2, #44]
-strh r10, [r0,#+132]
-ldrh r10, [r0,#+144]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #100]
-strh r8, [r0,#+134]
-ldrh r8, [r0,#+146]
-vmladavax.s16 r8, Q6, Q5
-vldrh.u16 Q7, [r1, #102]
-strh r6, [r0,#+136]
-ldrh r6, [r0,#+148]
-vmladavax.s16 r6, Q7, Q5
-vldrh.u16 Q0, [r1, #104]
-strh r4, [r0,#+138]
-ldrh r4, [r0,#+150]
-vmladavax.s16 r4, Q0, Q5
-vldrh.u16 Q1, [r1, #106]
-strh r14, [r0,#+140]
-ldrh r14, [r0,#+152]
-vmladavax.s16 r14, Q1, Q5
-vldrh.u16 Q3, [r1, #108]
-strh r12, [r0,#+142]
-ldrh r12, [r0,#+154]
-vmladavax.s16 r12, Q3, Q5
-vldrh.u16 Q4, [r1, #110]
-strh r10, [r0,#+144]
-ldrh r10, [r0,#+156]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q5, [r1, #112]
-vldrw.u32 Q6, [Q2, #44]
-strh r8, [r0,#+146]
-ldrh r8, [r0,#+158]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #114]
-strh r6, [r0,#+148]
-ldrh r6, [r0,#+160]
-vmladavax.s16 r6, Q7, Q6
-vldrh.u16 Q0, [r1, #116]
-strh r4, [r0,#+150]
-ldrh r4, [r0,#+162]
-vmladavax.s16 r4, Q0, Q6
-vldrh.u16 Q1, [r1, #118]
-strh r14, [r0,#+152]
-ldrh r14, [r0,#+164]
-vmladavax.s16 r14, Q1, Q6
-vldrh.u16 Q3, [r1, #120]
-strh r12, [r0,#+154]
-ldrh r12, [r0,#+166]
-vmladavax.s16 r12, Q3, Q6
-vldrh.u16 Q4, [r1, #122]
-strh r10, [r0,#+156]
-ldrh r10, [r0,#+168]
-vmladavax.s16 r10, Q4, Q6
-vldrh.u16 Q5, [r1, #124]
-strh r8, [r0,#+158]
-ldrh r8, [r0,#+170]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q6, [r1, #126]
-vldrw.u32 Q7, [Q2, #44]
-strh r6, [r0,#+160]
-ldrh r6, [r0,#+172]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #128]
-strh r4, [r0,#+162]
-ldrh r4, [r0,#+174]
-vmladavax.s16 r4, Q0, Q7
-vldrh.u16 Q1, [r1, #130]
-strh r14, [r0,#+164]
-ldrh r14, [r0,#+176]
-vmladavax.s16 r14, Q1, Q7
-vldrh.u16 Q3, [r1, #132]
-strh r12, [r0,#+166]
-ldrh r12, [r0,#+178]
-vmladavax.s16 r12, Q3, Q7
-vldrh.u16 Q4, [r1, #134]
-strh r10, [r0,#+168]
-ldrh r10, [r0,#+180]
-vmladavax.s16 r10, Q4, Q7
-vldrh.u16 Q5, [r1, #136]
-strh r8, [r0,#+170]
-ldrh r8, [r0,#+182]
-vmladavax.s16 r8, Q5, Q7
-vldrh.u16 Q6, [r1, #138]
-strh r6, [r0,#+172]
-ldrh r6, [r0,#+184]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q7, [r1, #140]
-vldrw.u32 Q0, [Q2, #44]
-strh r4, [r0,#+174]
-ldrh r4, [r0,#+186]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #142]
-strh r14, [r0,#+176]
-ldrh r14, [r0,#+188]
-vmladavax.s16 r14, Q1, Q0
-vldrh.u16 Q3, [r1, #144]
-strh r12, [r0,#+178]
-ldrh r12, [r0,#+190]
-vmladavax.s16 r12, Q3, Q0
-vldrh.u16 Q4, [r1, #146]
-strh r10, [r0,#+180]
-ldrh r10, [r0,#+192]
-vmladavax.s16 r10, Q4, Q0
-vldrh.u16 Q5, [r1, #148]
-strh r8, [r0,#+182]
-ldrh r8, [r0,#+194]
-vmladavax.s16 r8, Q5, Q0
-vldrh.u16 Q6, [r1, #150]
-strh r6, [r0,#+184]
-ldrh r6, [r0,#+196]
-vmladavax.s16 r6, Q6, Q0
-vldrh.u16 Q7, [r1, #152]
-strh r4, [r0,#+186]
-ldrh r4, [r0,#+198]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q0, [r1, #154]
-vldrw.u32 Q1, [Q2, #44]
-strh r14, [r0,#+188]
-ldrh r14, [r0,#+200]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #156]
-strh r12, [r0,#+190]
-ldrh r12, [r0,#+202]
-vmladavax.s16 r12, Q3, Q1
-vldrh.u16 Q4, [r1, #158]
-strh r10, [r0,#+192]
-ldrh r10, [r0,#+204]
-vmladavax.s16 r10, Q4, Q1
-vldrh.u16 Q5, [r1, #160]
-strh r8, [r0,#+194]
-ldrh r8, [r0,#+206]
-vmladavax.s16 r8, Q5, Q1
-vldrh.u16 Q6, [r1, #162]
-strh r6, [r0,#+196]
-ldrh r6, [r0,#+208]
-vmladavax.s16 r6, Q6, Q1
-vldrh.u16 Q7, [r1, #164]
-strh r4, [r0,#+198]
-ldrh r4, [r0,#+210]
-vmladavax.s16 r4, Q7, Q1
-vldrh.u16 Q0, [r1, #166]
-strh r14, [r0,#+200]
-ldrh r14, [r0,#+212]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q1, [r1, #168]
-vldrw.u32 Q3, [Q2, #44]
-strh r12, [r0,#+202]
-ldrh r12, [r0,#+214]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #170]
-strh r10, [r0,#+204]
-ldrh r10, [r0,#+216]
-vmladavax.s16 r10, Q4, Q3
-vldrh.u16 Q5, [r1, #172]
-strh r8, [r0,#+206]
-ldrh r8, [r0,#+218]
-vmladavax.s16 r8, Q5, Q3
-vldrh.u16 Q6, [r1, #174]
-strh r6, [r0,#+208]
-ldrh r6, [r0,#+220]
-vmladavax.s16 r6, Q6, Q3
-vldrh.u16 Q7, [r1, #176]
-strh r4, [r0,#+210]
-ldrh r4, [r0,#+222]
-vmladavax.s16 r4, Q7, Q3
-vldrh.u16 Q0, [r1, #178]
-strh r14, [r0,#+212]
-ldrh r14, [r0,#+224]
-vmladavax.s16 r14, Q0, Q3
-vldrh.u16 Q1, [r1, #180]
-strh r12, [r0,#+214]
-ldrh r12, [r0,#+226]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q3, [r1, #182]
-vldrw.u32 Q4, [Q2, #44]
-strh r10, [r0,#+216]
-ldrh r10, [r0,#+228]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #184]
-strh r8, [r0,#+218]
-ldrh r8, [r0,#+230]
-vmladavax.s16 r8, Q5, Q4
-vldrh.u16 Q6, [r1, #186]
-strh r6, [r0,#+220]
-ldrh r6, [r0,#+232]
-vmladavax.s16 r6, Q6, Q4
-vldrh.u16 Q7, [r1, #188]
-strh r4, [r0,#+222]
-ldrh r4, [r0,#+234]
-vmladavax.s16 r4, Q7, Q4
-vldrh.u16 Q0, [r1, #190]
-strh r14, [r0,#+224]
-ldrh r14, [r0,#+236]
-vmladavax.s16 r14, Q0, Q4
-vldrh.u16 Q1, [r1, #192]
-strh r12, [r0,#+226]
-ldrh r12, [r0,#+238]
-vmladavax.s16 r12, Q1, Q4
-vldrh.u16 Q3, [r1, #194]
-strh r10, [r0,#+228]
-ldrh r10, [r0,#+240]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q4, [r1, #196]
-vldrw.u32 Q5, [Q2, #44]
-strh r8, [r0,#+230]
-ldrh r8, [r0,#+242]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #198]
-strh r6, [r0,#+232]
-ldrh r6, [r0,#+244]
-vmladavax.s16 r6, Q6, Q5
-vldrh.u16 Q7, [r1, #200]
-strh r4, [r0,#+234]
-ldrh r4, [r0,#+246]
-vmladavax.s16 r4, Q7, Q5
-vldrh.u16 Q0, [r1, #202]
-strh r14, [r0,#+236]
-ldrh r14, [r0,#+248]
-vmladavax.s16 r14, Q0, Q5
-vldrh.u16 Q1, [r1, #204]
-strh r12, [r0,#+238]
-ldrh r12, [r0,#+250]
-vmladavax.s16 r12, Q1, Q5
-vldrh.u16 Q3, [r1, #206]
-strh r10, [r0,#+240]
-ldrh r10, [r0,#+252]
-vmladavax.s16 r10, Q3, Q5
-vldrh.u16 Q4, [r1, #208]
-strh r8, [r0,#+242]
-ldrh r8, [r0,#+254]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q5, [r1, #210]
-vldrw.u32 Q6, [Q2, #44]
-strh r6, [r0,#+244]
-ldrh r6, [r0,#+256]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #212]
-strh r4, [r0,#+246]
-ldrh r4, [r0,#+258]
-vmladavax.s16 r4, Q7, Q6
-vldrh.u16 Q0, [r1, #214]
-strh r14, [r0,#+248]
-ldrh r14, [r0,#+260]
-vmladavax.s16 r14, Q0, Q6
-vldrh.u16 Q1, [r1, #216]
-strh r12, [r0,#+250]
-ldrh r12, [r0,#+262]
-vmladavax.s16 r12, Q1, Q6
-vldrh.u16 Q3, [r1, #218]
-strh r10, [r0,#+252]
-ldrh r10, [r0,#+264]
-vmladavax.s16 r10, Q3, Q6
-vldrh.u16 Q4, [r1, #220]
-strh r8, [r0,#+254]
-ldrh r8, [r0,#+266]
-vmladavax.s16 r8, Q4, Q6
-vldrh.u16 Q5, [r1, #222]
-strh r6, [r0,#+256]
-ldrh r6, [r0,#+268]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q6, [r1, #224]
-vldrw.u32 Q7, [Q2, #44]
-strh r4, [r0,#+258]
-ldrh r4, [r0,#+270]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #226]
-strh r14, [r0,#+260]
-ldrh r14, [r0,#+272]
-vmladavax.s16 r14, Q0, Q7
-vldrh.u16 Q1, [r1, #228]
-strh r12, [r0,#+262]
-ldrh r12, [r0,#+274]
-vmladavax.s16 r12, Q1, Q7
-vldrh.u16 Q3, [r1, #230]
-strh r10, [r0,#+264]
-ldrh r10, [r0,#+276]
-vmladavax.s16 r10, Q3, Q7
-vldrh.u16 Q4, [r1, #232]
-strh r8, [r0,#+266]
-ldrh r8, [r0,#+278]
-vmladavax.s16 r8, Q4, Q7
-vldrh.u16 Q5, [r1, #234]
-strh r6, [r0,#+268]
-ldrh r6, [r0,#+280]
-vmladavax.s16 r6, Q5, Q7
-vldrh.u16 Q6, [r1, #236]
-strh r4, [r0,#+270]
-ldrh r4, [r0,#+282]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q7, [r1, #238]
-vldrw.u32 Q0, [Q2, #44]
-strh r14, [r0,#+272]
-ldrh r14, [r0,#+284]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #240]
-strh r12, [r0,#+274]
-vmladavx.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #242]
-strh r10, [r0,#+276]
-vmladavx.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #244]
-strh r8, [r0,#+278]
-vmladavx.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #246]
-strh r6, [r0,#+280]
-vmladavx.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #248]
-strh r4, [r0,#+282]
-vmladavx.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #250]
-strh r14, [r0,#+284]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #252]
-vldrw.u32 Q1, [Q2, #44]
-strh r12, [r0,#+286]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #254]
-strh r10, [r0,#+288]
-vmladavx.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #-14]
-vldrw.u32 Q5, [Q2, #60]
-strh r8, [r0,#+290]
-ldrh r8, [r0,#+48]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #-12]
-strh r6, [r0,#+292]
-ldrh r6, [r0,#+50]
-vmladavax.s16 r6, Q6, Q5
-vldrh.u16 Q7, [r1, #-10]
-strh r4, [r0,#+294]
-ldrh r4, [r0,#+52]
-vmladavax.s16 r4, Q7, Q5
-vldrh.u16 Q0, [r1, #-8]
-strh r14, [r0,#+296]
-ldrh r14, [r0,#+54]
-vmladavax.s16 r14, Q0, Q5
-vldrh.u16 Q1, [r1, #-6]
-strh r12, [r0,#+298]
-ldrh r12, [r0,#+56]
-vmladavax.s16 r12, Q1, Q5
-vldrh.u16 Q3, [r1, #-4]
-strh r10, [r0,#+300]
-ldrh r10, [r0,#+58]
-vmladavax.s16 r10, Q3, Q5
-vldrh.u16 Q4, [r1, #-2]
-strh r8, [r0,#+48]
-ldrh r8, [r0,#+60]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q5, [r1, #0]
-vldrw.u32 Q6, [Q2, #60]
-strh r6, [r0,#+50]
-ldrh r6, [r0,#+62]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #2]
-strh r4, [r0,#+52]
-ldrh r4, [r0,#+64]
-vmladavax.s16 r4, Q7, Q6
-vldrh.u16 Q0, [r1, #4]
-strh r14, [r0,#+54]
-ldrh r14, [r0,#+66]
-vmladavax.s16 r14, Q0, Q6
-vldrh.u16 Q1, [r1, #6]
-strh r12, [r0,#+56]
-ldrh r12, [r0,#+68]
-vmladavax.s16 r12, Q1, Q6
-vldrh.u16 Q3, [r1, #8]
-strh r10, [r0,#+58]
-ldrh r10, [r0,#+70]
-vmladavax.s16 r10, Q3, Q6
-vldrh.u16 Q4, [r1, #10]
-strh r8, [r0,#+60]
-ldrh r8, [r0,#+72]
-vmladavax.s16 r8, Q4, Q6
-vldrh.u16 Q5, [r1, #12]
-strh r6, [r0,#+62]
-ldrh r6, [r0,#+74]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q6, [r1, #14]
-vldrw.u32 Q7, [Q2, #60]
-strh r4, [r0,#+64]
-ldrh r4, [r0,#+76]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #16]
-strh r14, [r0,#+66]
-ldrh r14, [r0,#+78]
-vmladavax.s16 r14, Q0, Q7
-vldrh.u16 Q1, [r1, #18]
-strh r12, [r0,#+68]
-ldrh r12, [r0,#+80]
-vmladavax.s16 r12, Q1, Q7
-vldrh.u16 Q3, [r1, #20]
-strh r10, [r0,#+70]
-ldrh r10, [r0,#+82]
-vmladavax.s16 r10, Q3, Q7
-vldrh.u16 Q4, [r1, #22]
-strh r8, [r0,#+72]
-ldrh r8, [r0,#+84]
-vmladavax.s16 r8, Q4, Q7
-vldrh.u16 Q5, [r1, #24]
-strh r6, [r0,#+74]
-ldrh r6, [r0,#+86]
-vmladavax.s16 r6, Q5, Q7
-vldrh.u16 Q6, [r1, #26]
-strh r4, [r0,#+76]
-ldrh r4, [r0,#+88]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q7, [r1, #28]
-vldrw.u32 Q0, [Q2, #60]
-strh r14, [r0,#+78]
-ldrh r14, [r0,#+90]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #30]
-strh r12, [r0,#+80]
-ldrh r12, [r0,#+92]
-vmladavax.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #32]
-strh r10, [r0,#+82]
-ldrh r10, [r0,#+94]
-vmladavax.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #34]
-strh r8, [r0,#+84]
-ldrh r8, [r0,#+96]
-vmladavax.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #36]
-strh r6, [r0,#+86]
-ldrh r6, [r0,#+98]
-vmladavax.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #38]
-strh r4, [r0,#+88]
-ldrh r4, [r0,#+100]
-vmladavax.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #40]
-strh r14, [r0,#+90]
-ldrh r14, [r0,#+102]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #42]
-vldrw.u32 Q1, [Q2, #60]
-strh r12, [r0,#+92]
-ldrh r12, [r0,#+104]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #44]
-strh r10, [r0,#+94]
-ldrh r10, [r0,#+106]
-vmladavax.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #46]
-strh r8, [r0,#+96]
-ldrh r8, [r0,#+108]
-vmladavax.s16 r8, Q4, Q1
-vldrh.u16 Q5, [r1, #48]
-strh r6, [r0,#+98]
-ldrh r6, [r0,#+110]
-vmladavax.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #50]
-strh r4, [r0,#+100]
-ldrh r4, [r0,#+112]
-vmladavax.s16 r4, Q6, Q1
-vldrh.u16 Q7, [r1, #52]
-strh r14, [r0,#+102]
-ldrh r14, [r0,#+114]
-vmladavax.s16 r14, Q7, Q1
-vldrh.u16 Q0, [r1, #54]
-strh r12, [r0,#+104]
-ldrh r12, [r0,#+116]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q1, [r1, #56]
-vldrw.u32 Q3, [Q2, #60]
-strh r10, [r0,#+106]
-ldrh r10, [r0,#+118]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #58]
-strh r8, [r0,#+108]
-ldrh r8, [r0,#+120]
-vmladavax.s16 r8, Q4, Q3
-vldrh.u16 Q5, [r1, #60]
-strh r6, [r0,#+110]
-ldrh r6, [r0,#+122]
-vmladavax.s16 r6, Q5, Q3
-vldrh.u16 Q6, [r1, #62]
-strh r4, [r0,#+112]
-ldrh r4, [r0,#+124]
-vmladavax.s16 r4, Q6, Q3
-vldrh.u16 Q7, [r1, #64]
-strh r14, [r0,#+114]
-ldrh r14, [r0,#+126]
-vmladavax.s16 r14, Q7, Q3
-vldrh.u16 Q0, [r1, #66]
-strh r12, [r0,#+116]
-ldrh r12, [r0,#+128]
-vmladavax.s16 r12, Q0, Q3
-vldrh.u16 Q1, [r1, #68]
-strh r10, [r0,#+118]
-ldrh r10, [r0,#+130]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q3, [r1, #70]
-vldrw.u32 Q4, [Q2, #60]
-strh r8, [r0,#+120]
-ldrh r8, [r0,#+132]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #72]
-strh r6, [r0,#+122]
-ldrh r6, [r0,#+134]
-vmladavax.s16 r6, Q5, Q4
-vldrh.u16 Q6, [r1, #74]
-strh r4, [r0,#+124]
-ldrh r4, [r0,#+136]
-vmladavax.s16 r4, Q6, Q4
-vldrh.u16 Q7, [r1, #76]
-strh r14, [r0,#+126]
-ldrh r14, [r0,#+138]
-vmladavax.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #78]
-strh r12, [r0,#+128]
-ldrh r12, [r0,#+140]
-vmladavax.s16 r12, Q0, Q4
-vldrh.u16 Q1, [r1, #80]
-strh r10, [r0,#+130]
-ldrh r10, [r0,#+142]
-vmladavax.s16 r10, Q1, Q4
-vldrh.u16 Q3, [r1, #82]
-strh r8, [r0,#+132]
-ldrh r8, [r0,#+144]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q4, [r1, #84]
-vldrw.u32 Q5, [Q2, #60]
-strh r6, [r0,#+134]
-ldrh r6, [r0,#+146]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #86]
-strh r4, [r0,#+136]
-ldrh r4, [r0,#+148]
-vmladavax.s16 r4, Q6, Q5
-vldrh.u16 Q7, [r1, #88]
-strh r14, [r0,#+138]
-ldrh r14, [r0,#+150]
-vmladavax.s16 r14, Q7, Q5
-vldrh.u16 Q0, [r1, #90]
-strh r12, [r0,#+140]
-ldrh r12, [r0,#+152]
-vmladavax.s16 r12, Q0, Q5
-vldrh.u16 Q1, [r1, #92]
-strh r10, [r0,#+142]
-ldrh r10, [r0,#+154]
-vmladavax.s16 r10, Q1, Q5
-vldrh.u16 Q3, [r1, #94]
-strh r8, [r0,#+144]
-ldrh r8, [r0,#+156]
-vmladavax.s16 r8, Q3, Q5
-vldrh.u16 Q4, [r1, #96]
-strh r6, [r0,#+146]
-ldrh r6, [r0,#+158]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q5, [r1, #98]
-vldrw.u32 Q6, [Q2, #60]
-strh r4, [r0,#+148]
-ldrh r4, [r0,#+160]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #100]
-strh r14, [r0,#+150]
-ldrh r14, [r0,#+162]
-vmladavax.s16 r14, Q7, Q6
-vldrh.u16 Q0, [r1, #102]
-strh r12, [r0,#+152]
-ldrh r12, [r0,#+164]
-vmladavax.s16 r12, Q0, Q6
-vldrh.u16 Q1, [r1, #104]
-strh r10, [r0,#+154]
-ldrh r10, [r0,#+166]
-vmladavax.s16 r10, Q1, Q6
-vldrh.u16 Q3, [r1, #106]
-strh r8, [r0,#+156]
-ldrh r8, [r0,#+168]
-vmladavax.s16 r8, Q3, Q6
-vldrh.u16 Q4, [r1, #108]
-strh r6, [r0,#+158]
-ldrh r6, [r0,#+170]
-vmladavax.s16 r6, Q4, Q6
-vldrh.u16 Q5, [r1, #110]
-strh r4, [r0,#+160]
-ldrh r4, [r0,#+172]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q6, [r1, #112]
-vldrw.u32 Q7, [Q2, #60]
-strh r14, [r0,#+162]
-ldrh r14, [r0,#+174]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #114]
-strh r12, [r0,#+164]
-ldrh r12, [r0,#+176]
-vmladavax.s16 r12, Q0, Q7
-vldrh.u16 Q1, [r1, #116]
-strh r10, [r0,#+166]
-ldrh r10, [r0,#+178]
-vmladavax.s16 r10, Q1, Q7
-vldrh.u16 Q3, [r1, #118]
-strh r8, [r0,#+168]
-ldrh r8, [r0,#+180]
-vmladavax.s16 r8, Q3, Q7
-vldrh.u16 Q4, [r1, #120]
-strh r6, [r0,#+170]
-ldrh r6, [r0,#+182]
-vmladavax.s16 r6, Q4, Q7
-vldrh.u16 Q5, [r1, #122]
-strh r4, [r0,#+172]
-ldrh r4, [r0,#+184]
-vmladavax.s16
r4, Q5, Q7 -vldrh.u16 Q6, [r1, #124] -strh r14, [r0,#+174] -ldrh r14, [r0,#+186] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #60] -strh r12, [r0,#+176] -ldrh r12, [r0,#+188] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #128] -strh r10, [r0,#+178] -ldrh r10, [r0,#+190] -vmladavax.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #130] -strh r8, [r0,#+180] -ldrh r8, [r0,#+192] -vmladavax.s16 r8, Q3, Q0 -vldrh.u16 Q4, [r1, #132] -strh r6, [r0,#+182] -ldrh r6, [r0,#+194] -vmladavax.s16 r6, Q4, Q0 -vldrh.u16 Q5, [r1, #134] -strh r4, [r0,#+184] -ldrh r4, [r0,#+196] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #136] -strh r14, [r0,#+186] -ldrh r14, [r0,#+198] -vmladavax.s16 r14, Q6, Q0 -vldrh.u16 Q7, [r1, #138] -strh r12, [r0,#+188] -ldrh r12, [r0,#+200] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #140] -vldrw.u32 Q1, [Q2, #60] -strh r10, [r0,#+190] -ldrh r10, [r0,#+202] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #142] -strh r8, [r0,#+192] -ldrh r8, [r0,#+204] -vmladavax.s16 r8, Q3, Q1 -vldrh.u16 Q4, [r1, #144] -strh r6, [r0,#+194] -ldrh r6, [r0,#+206] -vmladavax.s16 r6, Q4, Q1 -vldrh.u16 Q5, [r1, #146] -strh r4, [r0,#+196] -ldrh r4, [r0,#+208] -vmladavax.s16 r4, Q5, Q1 -vldrh.u16 Q6, [r1, #148] -strh r14, [r0,#+198] -ldrh r14, [r0,#+210] -vmladavax.s16 r14, Q6, Q1 -vldrh.u16 Q7, [r1, #150] -strh r12, [r0,#+200] -ldrh r12, [r0,#+212] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #152] -strh r10, [r0,#+202] -ldrh r10, [r0,#+214] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q1, [r1, #154] -vldrw.u32 Q3, [Q2, #60] -strh r8, [r0,#+204] -ldrh r8, [r0,#+216] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #156] -strh r6, [r0,#+206] -ldrh r6, [r0,#+218] -vmladavax.s16 r6, Q4, Q3 -vldrh.u16 Q5, [r1, #158] -strh r4, [r0,#+208] -ldrh r4, [r0,#+220] -vmladavax.s16 r4, Q5, Q3 -vldrh.u16 Q6, [r1, #160] -strh r14, [r0,#+210] -ldrh r14, [r0,#+222] -vmladavax.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #162] -strh r12, [r0,#+212] -ldrh r12, [r0,#+224] -vmladavax.s16 r12, Q7, Q3 
-vldrh.u16 Q0, [r1, #164] -strh r10, [r0,#+214] -ldrh r10, [r0,#+226] -vmladavax.s16 r10, Q0, Q3 -vldrh.u16 Q1, [r1, #166] -strh r8, [r0,#+216] -ldrh r8, [r0,#+228] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q3, [r1, #168] -vldrw.u32 Q4, [Q2, #60] -strh r6, [r0,#+218] -ldrh r6, [r0,#+230] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #170] -strh r4, [r0,#+220] -ldrh r4, [r0,#+232] -vmladavax.s16 r4, Q5, Q4 -vldrh.u16 Q6, [r1, #172] -strh r14, [r0,#+222] -ldrh r14, [r0,#+234] -vmladavax.s16 r14, Q6, Q4 -vldrh.u16 Q7, [r1, #174] -strh r12, [r0,#+224] -ldrh r12, [r0,#+236] -vmladavax.s16 r12, Q7, Q4 -vldrh.u16 Q0, [r1, #176] -strh r10, [r0,#+226] -ldrh r10, [r0,#+238] -vmladavax.s16 r10, Q0, Q4 -vldrh.u16 Q1, [r1, #178] -strh r8, [r0,#+228] -ldrh r8, [r0,#+240] -vmladavax.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #180] -strh r6, [r0,#+230] -ldrh r6, [r0,#+242] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q4, [r1, #182] -vldrw.u32 Q5, [Q2, #60] -strh r4, [r0,#+232] -ldrh r4, [r0,#+244] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #184] -strh r14, [r0,#+234] -ldrh r14, [r0,#+246] -vmladavax.s16 r14, Q6, Q5 -vldrh.u16 Q7, [r1, #186] -strh r12, [r0,#+236] -ldrh r12, [r0,#+248] -vmladavax.s16 r12, Q7, Q5 -vldrh.u16 Q0, [r1, #188] -strh r10, [r0,#+238] -ldrh r10, [r0,#+250] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #190] -strh r8, [r0,#+240] -ldrh r8, [r0,#+252] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #192] -strh r6, [r0,#+242] -ldrh r6, [r0,#+254] -vmladavax.s16 r6, Q3, Q5 -vldrh.u16 Q4, [r1, #194] -strh r4, [r0,#+244] -ldrh r4, [r0,#+256] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q5, [r1, #196] -vldrw.u32 Q6, [Q2, #60] -strh r14, [r0,#+246] -ldrh r14, [r0,#+258] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #198] -strh r12, [r0,#+248] -ldrh r12, [r0,#+260] -vmladavax.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #200] -strh r10, [r0,#+250] -ldrh r10, [r0,#+262] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #202] -strh r8, [r0,#+252] -ldrh r8, [r0,#+264] -vmladavax.s16 r8, Q1, Q6 -vldrh.u16 Q3, 
[r1, #204] -strh r6, [r0,#+254] -ldrh r6, [r0,#+266] -vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #206] -strh r4, [r0,#+256] -ldrh r4, [r0,#+268] -vmladavax.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #208] -strh r14, [r0,#+258] -ldrh r14, [r0,#+270] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #210] -vldrw.u32 Q7, [Q2, #60] -strh r12, [r0,#+260] -ldrh r12, [r0,#+272] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #212] -strh r10, [r0,#+262] -ldrh r10, [r0,#+274] -vmladavax.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #214] -strh r8, [r0,#+264] -ldrh r8, [r0,#+276] -vmladavax.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #216] -strh r6, [r0,#+266] -ldrh r6, [r0,#+278] -vmladavax.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #218] -strh r4, [r0,#+268] -ldrh r4, [r0,#+280] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #220] -strh r14, [r0,#+270] -ldrh r14, [r0,#+282] -vmladavax.s16 r14, Q5, Q7 -vldrh.u16 Q6, [r1, #222] -strh r12, [r0,#+272] -ldrh r12, [r0,#+284] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #224] -vldrw.u32 Q0, [Q2, #60] -strh r10, [r0,#+274] -ldrh r10, [r0,#+286] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #226] -strh r8, [r0,#+276] -ldrh r8, [r0,#+288] -vmladavax.s16 r8, Q1, Q0 -vldrh.u16 Q3, [r1, #228] -strh r6, [r0,#+278] -ldrh r6, [r0,#+290] -vmladavax.s16 r6, Q3, Q0 -vldrh.u16 Q4, [r1, #230] -strh r4, [r0,#+280] -ldrh r4, [r0,#+292] -vmladavax.s16 r4, Q4, Q0 -vldrh.u16 Q5, [r1, #232] -strh r14, [r0,#+282] -ldrh r14, [r0,#+294] -vmladavax.s16 r14, Q5, Q0 -vldrh.u16 Q6, [r1, #234] -strh r12, [r0,#+284] -ldrh r12, [r0,#+296] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #236] -strh r10, [r0,#+286] -ldrh r10, [r0,#+298] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #238] -vldrw.u32 Q1, [Q2, #60] -strh r8, [r0,#+288] -ldrh r8, [r0,#+300] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #240] -strh r6, [r0,#+290] -vmladavx.s16 r6, Q3, Q1 -vldrh.u16 Q4, [r1, #242] -strh r4, [r0,#+292] -vmladavx.s16 r4, Q4, Q1 -vldrh.u16 Q5, [r1, #244] -strh r14, [r0,#+294] -vmladavx.s16 r14, Q5, Q1 
-vldrh.u16 Q6, [r1, #246] -strh r12, [r0,#+296] -vmladavx.s16 r12, Q6, Q1 -vldrh.u16 Q7, [r1, #248] -strh r10, [r0,#+298] -vmladavx.s16 r10, Q7, Q1 -vldrh.u16 Q0, [r1, #250] -strh r8, [r0,#+300] -vmladavx.s16 r8, Q0, Q1 -vldrh.u16 Q1, [r1, #252] -vldrw.u32 Q3, [Q2, #60] -strh r6, [r0,#+302] -vmladavx.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #254] -strh r4, [r0,#+304] -vmladavx.s16 r4, Q4, Q3 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #76] -strh r14, [r0,#+306] -ldrh r14, [r0,#+64] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -strh r12, [r0,#+308] -ldrh r12, [r0,#+66] -vmladavax.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #-10] -strh r10, [r0,#+310] -ldrh r10, [r0,#+68] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #-8] -strh r8, [r0,#+312] -ldrh r8, [r0,#+70] -vmladavax.s16 r8, Q1, Q6 -vldrh.u16 Q3, [r1, #-6] -strh r6, [r0,#+314] -ldrh r6, [r0,#+72] -vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #-4] -strh r4, [r0,#+316] -ldrh r4, [r0,#+74] -vmladavax.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #-2] -strh r14, [r0,#+64] -ldrh r14, [r0,#+76] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #0] -vldrw.u32 Q7, [Q2, #76] -strh r12, [r0,#+66] -ldrh r12, [r0,#+78] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #2] -strh r10, [r0,#+68] -ldrh r10, [r0,#+80] -vmladavax.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #4] -strh r8, [r0,#+70] -ldrh r8, [r0,#+82] -vmladavax.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #6] -strh r6, [r0,#+72] -ldrh r6, [r0,#+84] -vmladavax.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #8] -strh r4, [r0,#+74] -ldrh r4, [r0,#+86] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #10] -strh r14, [r0,#+76] -ldrh r14, [r0,#+88] -vmladavax.s16 r14, Q5, Q7 -vldrh.u16 Q6, [r1, #12] -strh r12, [r0,#+78] -ldrh r12, [r0,#+90] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #14] -vldrw.u32 Q0, [Q2, #76] -strh r10, [r0,#+80] -ldrh r10, [r0,#+92] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -strh r8, [r0,#+82] -ldrh r8, [r0,#+94] -vmladavax.s16 r8, Q1, Q0 -vldrh.u16 Q3, [r1, #18] -strh r6, [r0,#+84] 
-ldrh r6, [r0,#+96] -vmladavax.s16 r6, Q3, Q0 -vldrh.u16 Q4, [r1, #20] -strh r4, [r0,#+86] -ldrh r4, [r0,#+98] -vmladavax.s16 r4, Q4, Q0 -vldrh.u16 Q5, [r1, #22] -strh r14, [r0,#+88] -ldrh r14, [r0,#+100] -vmladavax.s16 r14, Q5, Q0 -vldrh.u16 Q6, [r1, #24] -strh r12, [r0,#+90] -ldrh r12, [r0,#+102] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #26] -strh r10, [r0,#+92] -ldrh r10, [r0,#+104] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #28] -vldrw.u32 Q1, [Q2, #76] -strh r8, [r0,#+94] -ldrh r8, [r0,#+106] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #30] -strh r6, [r0,#+96] -ldrh r6, [r0,#+108] -vmladavax.s16 r6, Q3, Q1 -vldrh.u16 Q4, [r1, #32] -strh r4, [r0,#+98] -ldrh r4, [r0,#+110] -vmladavax.s16 r4, Q4, Q1 -vldrh.u16 Q5, [r1, #34] -strh r14, [r0,#+100] -ldrh r14, [r0,#+112] -vmladavax.s16 r14, Q5, Q1 -vldrh.u16 Q6, [r1, #36] -strh r12, [r0,#+102] -ldrh r12, [r0,#+114] -vmladavax.s16 r12, Q6, Q1 -vldrh.u16 Q7, [r1, #38] -strh r10, [r0,#+104] -ldrh r10, [r0,#+116] -vmladavax.s16 r10, Q7, Q1 -vldrh.u16 Q0, [r1, #40] -strh r8, [r0,#+106] -ldrh r8, [r0,#+118] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q1, [r1, #42] -vldrw.u32 Q3, [Q2, #76] -strh r6, [r0,#+108] -ldrh r6, [r0,#+120] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -strh r4, [r0,#+110] -ldrh r4, [r0,#+122] -vmladavax.s16 r4, Q4, Q3 -vldrh.u16 Q5, [r1, #46] -strh r14, [r0,#+112] -ldrh r14, [r0,#+124] -vmladavax.s16 r14, Q5, Q3 -vldrh.u16 Q6, [r1, #48] -strh r12, [r0,#+114] -ldrh r12, [r0,#+126] -vmladavax.s16 r12, Q6, Q3 -vldrh.u16 Q7, [r1, #50] -strh r10, [r0,#+116] -ldrh r10, [r0,#+128] -vmladavax.s16 r10, Q7, Q3 -vldrh.u16 Q0, [r1, #52] -strh r8, [r0,#+118] -ldrh r8, [r0,#+130] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #54] -strh r6, [r0,#+120] -ldrh r6, [r0,#+132] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q3, [r1, #56] -vldrw.u32 Q4, [Q2, #76] -strh r4, [r0,#+122] -ldrh r4, [r0,#+134] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #58] -strh r14, [r0,#+124] -ldrh r14, [r0,#+136] -vmladavax.s16 r14, Q5, 
Q4 -vldrh.u16 Q6, [r1, #60] -strh r12, [r0,#+126] -ldrh r12, [r0,#+138] -vmladavax.s16 r12, Q6, Q4 -vldrh.u16 Q7, [r1, #62] -strh r10, [r0,#+128] -ldrh r10, [r0,#+140] -vmladavax.s16 r10, Q7, Q4 -vldrh.u16 Q0, [r1, #64] -strh r8, [r0,#+130] -ldrh r8, [r0,#+142] -vmladavax.s16 r8, Q0, Q4 -vldrh.u16 Q1, [r1, #66] -strh r6, [r0,#+132] -ldrh r6, [r0,#+144] -vmladavax.s16 r6, Q1, Q4 -vldrh.u16 Q3, [r1, #68] -strh r4, [r0,#+134] -ldrh r4, [r0,#+146] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q4, [r1, #70] -vldrw.u32 Q5, [Q2, #76] -strh r14, [r0,#+136] -ldrh r14, [r0,#+148] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #72] -strh r12, [r0,#+138] -ldrh r12, [r0,#+150] -vmladavax.s16 r12, Q6, Q5 -vldrh.u16 Q7, [r1, #74] -strh r10, [r0,#+140] -ldrh r10, [r0,#+152] -vmladavax.s16 r10, Q7, Q5 -vldrh.u16 Q0, [r1, #76] -strh r8, [r0,#+142] -ldrh r8, [r0,#+154] -vmladavax.s16 r8, Q0, Q5 -vldrh.u16 Q1, [r1, #78] -strh r6, [r0,#+144] -ldrh r6, [r0,#+156] -vmladavax.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #80] -strh r4, [r0,#+146] -ldrh r4, [r0,#+158] -vmladavax.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #82] -strh r14, [r0,#+148] -ldrh r14, [r0,#+160] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #84] -vldrw.u32 Q6, [Q2, #76] -strh r12, [r0,#+150] -ldrh r12, [r0,#+162] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #86] -strh r10, [r0,#+152] -ldrh r10, [r0,#+164] -vmladavax.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #88] -strh r8, [r0,#+154] -ldrh r8, [r0,#+166] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #90] -strh r6, [r0,#+156] -ldrh r6, [r0,#+168] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #92] -strh r4, [r0,#+158] -ldrh r4, [r0,#+170] -vmladavax.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #94] -strh r14, [r0,#+160] -ldrh r14, [r0,#+172] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #96] -strh r12, [r0,#+162] -ldrh r12, [r0,#+174] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #98] -vldrw.u32 Q7, [Q2, #76] -strh r10, [r0,#+164] -ldrh r10, [r0,#+176] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #100] 
-strh r8, [r0,#+166] -ldrh r8, [r0,#+178] -vmladavax.s16 r8, Q0, Q7 -vldrh.u16 Q1, [r1, #102] -strh r6, [r0,#+168] -ldrh r6, [r0,#+180] -vmladavax.s16 r6, Q1, Q7 -vldrh.u16 Q3, [r1, #104] -strh r4, [r0,#+170] -ldrh r4, [r0,#+182] -vmladavax.s16 r4, Q3, Q7 -vldrh.u16 Q4, [r1, #106] -strh r14, [r0,#+172] -ldrh r14, [r0,#+184] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #108] -strh r12, [r0,#+174] -ldrh r12, [r0,#+186] -vmladavax.s16 r12, Q5, Q7 -vldrh.u16 Q6, [r1, #110] -strh r10, [r0,#+176] -ldrh r10, [r0,#+188] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q7, [r1, #112] -vldrw.u32 Q0, [Q2, #76] -strh r8, [r0,#+178] -ldrh r8, [r0,#+190] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #114] -strh r6, [r0,#+180] -ldrh r6, [r0,#+192] -vmladavax.s16 r6, Q1, Q0 -vldrh.u16 Q3, [r1, #116] -strh r4, [r0,#+182] -ldrh r4, [r0,#+194] -vmladavax.s16 r4, Q3, Q0 -vldrh.u16 Q4, [r1, #118] -strh r14, [r0,#+184] -ldrh r14, [r0,#+196] -vmladavax.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #120] -strh r12, [r0,#+186] -ldrh r12, [r0,#+198] -vmladavax.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #122] -strh r10, [r0,#+188] -ldrh r10, [r0,#+200] -vmladavax.s16 r10, Q6, Q0 -vldrh.u16 Q7, [r1, #124] -strh r8, [r0,#+190] -ldrh r8, [r0,#+202] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q0, [r1, #126] -vldrw.u32 Q1, [Q2, #76] -strh r6, [r0,#+192] -ldrh r6, [r0,#+204] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #128] -strh r4, [r0,#+194] -ldrh r4, [r0,#+206] -vmladavax.s16 r4, Q3, Q1 -vldrh.u16 Q4, [r1, #130] -strh r14, [r0,#+196] -ldrh r14, [r0,#+208] -vmladavax.s16 r14, Q4, Q1 -vldrh.u16 Q5, [r1, #132] -strh r12, [r0,#+198] -ldrh r12, [r0,#+210] -vmladavax.s16 r12, Q5, Q1 -vldrh.u16 Q6, [r1, #134] -strh r10, [r0,#+200] -ldrh r10, [r0,#+212] -vmladavax.s16 r10, Q6, Q1 -vldrh.u16 Q7, [r1, #136] -strh r8, [r0,#+202] -ldrh r8, [r0,#+214] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #138] -strh r6, [r0,#+204] -ldrh r6, [r0,#+216] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q1, [r1, #140] -vldrw.u32 Q3, [Q2, #76] -strh r4, 
[r0,#+206] -ldrh r4, [r0,#+218] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #142] -strh r14, [r0,#+208] -ldrh r14, [r0,#+220] -vmladavax.s16 r14, Q4, Q3 -vldrh.u16 Q5, [r1, #144] -strh r12, [r0,#+210] -ldrh r12, [r0,#+222] -vmladavax.s16 r12, Q5, Q3 -vldrh.u16 Q6, [r1, #146] -strh r10, [r0,#+212] -ldrh r10, [r0,#+224] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #148] -strh r8, [r0,#+214] -ldrh r8, [r0,#+226] -vmladavax.s16 r8, Q7, Q3 -vldrh.u16 Q0, [r1, #150] -strh r6, [r0,#+216] -ldrh r6, [r0,#+228] -vmladavax.s16 r6, Q0, Q3 -vldrh.u16 Q1, [r1, #152] -strh r4, [r0,#+218] -ldrh r4, [r0,#+230] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #154] -vldrw.u32 Q4, [Q2, #76] -strh r14, [r0,#+220] -ldrh r14, [r0,#+232] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #156] -strh r12, [r0,#+222] -ldrh r12, [r0,#+234] -vmladavax.s16 r12, Q5, Q4 -vldrh.u16 Q6, [r1, #158] -strh r10, [r0,#+224] -ldrh r10, [r0,#+236] -vmladavax.s16 r10, Q6, Q4 -vldrh.u16 Q7, [r1, #160] -strh r8, [r0,#+226] -ldrh r8, [r0,#+238] -vmladavax.s16 r8, Q7, Q4 -vldrh.u16 Q0, [r1, #162] -strh r6, [r0,#+228] -ldrh r6, [r0,#+240] -vmladavax.s16 r6, Q0, Q4 -vldrh.u16 Q1, [r1, #164] -strh r4, [r0,#+230] -ldrh r4, [r0,#+242] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #166] -strh r14, [r0,#+232] -ldrh r14, [r0,#+244] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #168] -vldrw.u32 Q5, [Q2, #76] -strh r12, [r0,#+234] -ldrh r12, [r0,#+246] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #170] -strh r10, [r0,#+236] -ldrh r10, [r0,#+248] -vmladavax.s16 r10, Q6, Q5 -vldrh.u16 Q7, [r1, #172] -strh r8, [r0,#+238] -ldrh r8, [r0,#+250] -vmladavax.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #174] -strh r6, [r0,#+240] -ldrh r6, [r0,#+252] -vmladavax.s16 r6, Q0, Q5 -vldrh.u16 Q1, [r1, #176] -strh r4, [r0,#+242] -ldrh r4, [r0,#+254] -vmladavax.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #178] -strh r14, [r0,#+244] -ldrh r14, [r0,#+256] -vmladavax.s16 r14, Q3, Q5 -vldrh.u16 Q4, [r1, #180] -strh r12, [r0,#+246] -ldrh r12, [r0,#+258] 
-vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q5, [r1, #182] -vldrw.u32 Q6, [Q2, #76] -strh r10, [r0,#+248] -ldrh r10, [r0,#+260] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #184] -strh r8, [r0,#+250] -ldrh r8, [r0,#+262] -vmladavax.s16 r8, Q7, Q6 -vldrh.u16 Q0, [r1, #186] -strh r6, [r0,#+252] -ldrh r6, [r0,#+264] -vmladavax.s16 r6, Q0, Q6 -vldrh.u16 Q1, [r1, #188] -strh r4, [r0,#+254] -ldrh r4, [r0,#+266] -vmladavax.s16 r4, Q1, Q6 -vldrh.u16 Q3, [r1, #190] -strh r14, [r0,#+256] -ldrh r14, [r0,#+268] -vmladavax.s16 r14, Q3, Q6 -vldrh.u16 Q4, [r1, #192] -strh r12, [r0,#+258] -ldrh r12, [r0,#+270] -vmladavax.s16 r12, Q4, Q6 -vldrh.u16 Q5, [r1, #194] -strh r10, [r0,#+260] -ldrh r10, [r0,#+272] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q6, [r1, #196] -vldrw.u32 Q7, [Q2, #76] -strh r8, [r0,#+262] -ldrh r8, [r0,#+274] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #198] -strh r6, [r0,#+264] -ldrh r6, [r0,#+276] -vmladavax.s16 r6, Q0, Q7 -vldrh.u16 Q1, [r1, #200] -strh r4, [r0,#+266] -ldrh r4, [r0,#+278] -vmladavax.s16 r4, Q1, Q7 -vldrh.u16 Q3, [r1, #202] -strh r14, [r0,#+268] -ldrh r14, [r0,#+280] -vmladavax.s16 r14, Q3, Q7 -vldrh.u16 Q4, [r1, #204] -strh r12, [r0,#+270] -ldrh r12, [r0,#+282] -vmladavax.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #206] -strh r10, [r0,#+272] -ldrh r10, [r0,#+284] -vmladavax.s16 r10, Q5, Q7 -vldrh.u16 Q6, [r1, #208] -strh r8, [r0,#+274] -ldrh r8, [r0,#+286] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q7, [r1, #210] -vldrw.u32 Q0, [Q2, #76] -strh r6, [r0,#+276] -ldrh r6, [r0,#+288] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #212] -strh r4, [r0,#+278] -ldrh r4, [r0,#+290] -vmladavax.s16 r4, Q1, Q0 -vldrh.u16 Q3, [r1, #214] -strh r14, [r0,#+280] -ldrh r14, [r0,#+292] -vmladavax.s16 r14, Q3, Q0 -vldrh.u16 Q4, [r1, #216] -strh r12, [r0,#+282] -ldrh r12, [r0,#+294] -vmladavax.s16 r12, Q4, Q0 -vldrh.u16 Q5, [r1, #218] -strh r10, [r0,#+284] -ldrh r10, [r0,#+296] -vmladavax.s16 r10, Q5, Q0 -vldrh.u16 Q6, [r1, #220] -strh r8, [r0,#+286] -ldrh r8, [r0,#+298] 
-vmladavax.s16 r8, Q6, Q0 -vldrh.u16 Q7, [r1, #222] -strh r6, [r0,#+288] -ldrh r6, [r0,#+300] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q0, [r1, #224] -vldrw.u32 Q1, [Q2, #76] -strh r4, [r0,#+290] -ldrh r4, [r0,#+302] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #226] -strh r14, [r0,#+292] -ldrh r14, [r0,#+304] -vmladavax.s16 r14, Q3, Q1 -vldrh.u16 Q4, [r1, #228] -strh r12, [r0,#+294] -ldrh r12, [r0,#+306] -vmladavax.s16 r12, Q4, Q1 -vldrh.u16 Q5, [r1, #230] -strh r10, [r0,#+296] -ldrh r10, [r0,#+308] -vmladavax.s16 r10, Q5, Q1 -vldrh.u16 Q6, [r1, #232] -strh r8, [r0,#+298] -ldrh r8, [r0,#+310] -vmladavax.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #234] -strh r6, [r0,#+300] -ldrh r6, [r0,#+312] -vmladavax.s16 r6, Q7, Q1 -vldrh.u16 Q0, [r1, #236] -strh r4, [r0,#+302] -ldrh r4, [r0,#+314] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #238] -vldrw.u32 Q3, [Q2, #76] -strh r14, [r0,#+304] -ldrh r14, [r0,#+316] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #240] -strh r12, [r0,#+306] -vmladavx.s16 r12, Q4, Q3 -vldrh.u16 Q5, [r1, #242] -strh r10, [r0,#+308] -vmladavx.s16 r10, Q5, Q3 -vldrh.u16 Q6, [r1, #244] -strh r8, [r0,#+310] -vmladavx.s16 r8, Q6, Q3 -vldrh.u16 Q7, [r1, #246] -strh r6, [r0,#+312] -vmladavx.s16 r6, Q7, Q3 -vldrh.u16 Q0, [r1, #248] -strh r4, [r0,#+314] -vmladavx.s16 r4, Q0, Q3 -vldrh.u16 Q1, [r1, #250] -strh r14, [r0,#+316] -vmladavx.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #252] -vldrw.u32 Q4, [Q2, #76] -strh r12, [r0,#+318] -vmladavx.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #254] -strh r10, [r0,#+320] -vmladavx.s16 r10, Q5, Q4 -vldrh.u16 Q6, [r1, #-14] -vldrw.u32 Q7, [Q2, #92] -strh r8, [r0,#+322] -ldrh r8, [r0,#+80] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #-12] -strh r6, [r0,#+324] -ldrh r6, [r0,#+82] -vmladavax.s16 r6, Q0, Q7 -vldrh.u16 Q1, [r1, #-10] -strh r4, [r0,#+326] -ldrh r4, [r0,#+84] -vmladavax.s16 r4, Q1, Q7 -vldrh.u16 Q3, [r1, #-8] -strh r14, [r0,#+328] -ldrh r14, [r0,#+86] -vmladavax.s16 r14, Q3, Q7 -vldrh.u16 Q4, [r1, #-6] -strh r12, [r0,#+330] -ldrh 
r12, [r0,#+88] -vmladavax.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #-4] -strh r10, [r0,#+332] -ldrh r10, [r0,#+90] -vmladavax.s16 r10, Q5, Q7 -vldrh.u16 Q6, [r1, #-2] -strh r8, [r0,#+80] -ldrh r8, [r0,#+92] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q7, [r1, #0] -vldrw.u32 Q0, [Q2, #92] -strh r6, [r0,#+82] -ldrh r6, [r0,#+94] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #2] -strh r4, [r0,#+84] -ldrh r4, [r0,#+96] -vmladavax.s16 r4, Q1, Q0 -vldrh.u16 Q3, [r1, #4] -strh r14, [r0,#+86] -ldrh r14, [r0,#+98] -vmladavax.s16 r14, Q3, Q0 -vldrh.u16 Q4, [r1, #6] -strh r12, [r0,#+88] -ldrh r12, [r0,#+100] -vmladavax.s16 r12, Q4, Q0 -vldrh.u16 Q5, [r1, #8] -strh r10, [r0,#+90] -ldrh r10, [r0,#+102] -vmladavax.s16 r10, Q5, Q0 -vldrh.u16 Q6, [r1, #10] -strh r8, [r0,#+92] -ldrh r8, [r0,#+104] -vmladavax.s16 r8, Q6, Q0 -vldrh.u16 Q7, [r1, #12] -strh r6, [r0,#+94] -ldrh r6, [r0,#+106] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q0, [r1, #14] -vldrw.u32 Q1, [Q2, #92] -strh r4, [r0,#+96] -ldrh r4, [r0,#+108] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #16] -strh r14, [r0,#+98] -ldrh r14, [r0,#+110] -vmladavax.s16 r14, Q3, Q1 -vldrh.u16 Q4, [r1, #18] -strh r12, [r0,#+100] -ldrh r12, [r0,#+112] -vmladavax.s16 r12, Q4, Q1 -vldrh.u16 Q5, [r1, #20] -strh r10, [r0,#+102] -ldrh r10, [r0,#+114] -vmladavax.s16 r10, Q5, Q1 -vldrh.u16 Q6, [r1, #22] -strh r8, [r0,#+104] -ldrh r8, [r0,#+116] -vmladavax.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #24] -strh r6, [r0,#+106] -ldrh r6, [r0,#+118] -vmladavax.s16 r6, Q7, Q1 -vldrh.u16 Q0, [r1, #26] -strh r4, [r0,#+108] -ldrh r4, [r0,#+120] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #28] -vldrw.u32 Q3, [Q2, #92] -strh r14, [r0,#+110] -ldrh r14, [r0,#+122] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #30] -strh r12, [r0,#+112] -ldrh r12, [r0,#+124] -vmladavax.s16 r12, Q4, Q3 -vldrh.u16 Q5, [r1, #32] -strh r10, [r0,#+114] -ldrh r10, [r0,#+126] -vmladavax.s16 r10, Q5, Q3 -vldrh.u16 Q6, [r1, #34] -strh r8, [r0,#+116] -ldrh r8, [r0,#+128] -vmladavax.s16 r8, Q6, Q3 -vldrh.u16 
Q7, [r1, #36] -strh r6, [r0,#+118] -ldrh r6, [r0,#+130] -vmladavax.s16 r6, Q7, Q3 -vldrh.u16 Q0, [r1, #38] -strh r4, [r0,#+120] -ldrh r4, [r0,#+132] -vmladavax.s16 r4, Q0, Q3 -vldrh.u16 Q1, [r1, #40] -strh r14, [r0,#+122] -ldrh r14, [r0,#+134] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #42] -vldrw.u32 Q4, [Q2, #92] -strh r12, [r0,#+124] -ldrh r12, [r0,#+136] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -strh r10, [r0,#+126] -ldrh r10, [r0,#+138] -vmladavax.s16 r10, Q5, Q4 -vldrh.u16 Q6, [r1, #46] -strh r8, [r0,#+128] -ldrh r8, [r0,#+140] -vmladavax.s16 r8, Q6, Q4 -vldrh.u16 Q7, [r1, #48] -strh r6, [r0,#+130] -ldrh r6, [r0,#+142] -vmladavax.s16 r6, Q7, Q4 -vldrh.u16 Q0, [r1, #50] -strh r4, [r0,#+132] -ldrh r4, [r0,#+144] -vmladavax.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #52] -strh r14, [r0,#+134] -ldrh r14, [r0,#+146] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #54] -strh r12, [r0,#+136] -ldrh r12, [r0,#+148] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q4, [r1, #56] -vldrw.u32 Q5, [Q2, #92] -strh r10, [r0,#+138] -ldrh r10, [r0,#+150] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #58] -strh r8, [r0,#+140] -ldrh r8, [r0,#+152] -vmladavax.s16 r8, Q6, Q5 -vldrh.u16 Q7, [r1, #60] -strh r6, [r0,#+142] -ldrh r6, [r0,#+154] -vmladavax.s16 r6, Q7, Q5 -vldrh.u16 Q0, [r1, #62] -strh r4, [r0,#+144] -ldrh r4, [r0,#+156] -vmladavax.s16 r4, Q0, Q5 -vldrh.u16 Q1, [r1, #64] -strh r14, [r0,#+146] -ldrh r14, [r0,#+158] -vmladavax.s16 r14, Q1, Q5 -vldrh.u16 Q3, [r1, #66] -strh r12, [r0,#+148] -ldrh r12, [r0,#+160] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #68] -strh r10, [r0,#+150] -ldrh r10, [r0,#+162] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q5, [r1, #70] -vldrw.u32 Q6, [Q2, #92] -strh r8, [r0,#+152] -ldrh r8, [r0,#+164] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #72] -strh r6, [r0,#+154] -ldrh r6, [r0,#+166] -vmladavax.s16 r6, Q7, Q6 -vldrh.u16 Q0, [r1, #74] -strh r4, [r0,#+156] -ldrh r4, [r0,#+168] -vmladavax.s16 r4, Q0, Q6 -vldrh.u16 Q1, [r1, #76] -strh r14, [r0,#+158] 
-ldrh r14, [r0,#+170] -vmladavax.s16 r14, Q1, Q6 -vldrh.u16 Q3, [r1, #78] -strh r12, [r0,#+160] -ldrh r12, [r0,#+172] -vmladavax.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #80] -strh r10, [r0,#+162] -ldrh r10, [r0,#+174] -vmladavax.s16 r10, Q4, Q6 -vldrh.u16 Q5, [r1, #82] -strh r8, [r0,#+164] -ldrh r8, [r0,#+176] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q6, [r1, #84] -vldrw.u32 Q7, [Q2, #92] -strh r6, [r0,#+166] -ldrh r6, [r0,#+178] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #86] -strh r4, [r0,#+168] -ldrh r4, [r0,#+180] -vmladavax.s16 r4, Q0, Q7 -vldrh.u16 Q1, [r1, #88] -strh r14, [r0,#+170] -ldrh r14, [r0,#+182] -vmladavax.s16 r14, Q1, Q7 -vldrh.u16 Q3, [r1, #90] -strh r12, [r0,#+172] -ldrh r12, [r0,#+184] -vmladavax.s16 r12, Q3, Q7 -vldrh.u16 Q4, [r1, #92] -strh r10, [r0,#+174] -ldrh r10, [r0,#+186] -vmladavax.s16 r10, Q4, Q7 -vldrh.u16 Q5, [r1, #94] -strh r8, [r0,#+176] -ldrh r8, [r0,#+188] -vmladavax.s16 r8, Q5, Q7 -vldrh.u16 Q6, [r1, #96] -strh r6, [r0,#+178] -ldrh r6, [r0,#+190] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q7, [r1, #98] -vldrw.u32 Q0, [Q2, #92] -strh r4, [r0,#+180] -ldrh r4, [r0,#+192] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #100] -strh r14, [r0,#+182] -ldrh r14, [r0,#+194] -vmladavax.s16 r14, Q1, Q0 -vldrh.u16 Q3, [r1, #102] -strh r12, [r0,#+184] -ldrh r12, [r0,#+196] -vmladavax.s16 r12, Q3, Q0 -vldrh.u16 Q4, [r1, #104] -strh r10, [r0,#+186] -ldrh r10, [r0,#+198] -vmladavax.s16 r10, Q4, Q0 -vldrh.u16 Q5, [r1, #106] -strh r8, [r0,#+188] -ldrh r8, [r0,#+200] -vmladavax.s16 r8, Q5, Q0 -vldrh.u16 Q6, [r1, #108] -strh r6, [r0,#+190] -ldrh r6, [r0,#+202] -vmladavax.s16 r6, Q6, Q0 -vldrh.u16 Q7, [r1, #110] -strh r4, [r0,#+192] -ldrh r4, [r0,#+204] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q0, [r1, #112] -vldrw.u32 Q1, [Q2, #92] -strh r14, [r0,#+194] -ldrh r14, [r0,#+206] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #114] -strh r12, [r0,#+196] -ldrh r12, [r0,#+208] -vmladavax.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #116] -strh r10, [r0,#+198] -ldrh r10, [r0,#+210] 
-vmladavax.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #118] -strh r8, [r0,#+200] -ldrh r8, [r0,#+212] -vmladavax.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #120] -strh r6, [r0,#+202] -ldrh r6, [r0,#+214] -vmladavax.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #122] -strh r4, [r0,#+204] -ldrh r4, [r0,#+216] -vmladavax.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #124] -strh r14, [r0,#+206] -ldrh r14, [r0,#+218] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #126] -vldrw.u32 Q3, [Q2, #92] -strh r12, [r0,#+208] -ldrh r12, [r0,#+220] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #128] -strh r10, [r0,#+210] -ldrh r10, [r0,#+222] -vmladavax.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #130] -strh r8, [r0,#+212] -ldrh r8, [r0,#+224] -vmladavax.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #132] -strh r6, [r0,#+214] -ldrh r6, [r0,#+226] -vmladavax.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #134] -strh r4, [r0,#+216] -ldrh r4, [r0,#+228] -vmladavax.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #136] -strh r14, [r0,#+218] -ldrh r14, [r0,#+230] -vmladavax.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #138] -strh r12, [r0,#+220] -ldrh r12, [r0,#+232] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #140] -vldrw.u32 Q4, [Q2, #92] -strh r10, [r0,#+222] -ldrh r10, [r0,#+234] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #142] -strh r8, [r0,#+224] -ldrh r8, [r0,#+236] -vmladavax.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #144] -strh r6, [r0,#+226] -ldrh r6, [r0,#+238] -vmladavax.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #146] -strh r4, [r0,#+228] -ldrh r4, [r0,#+240] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #148] -strh r14, [r0,#+230] -ldrh r14, [r0,#+242] -vmladavax.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #150] -strh r12, [r0,#+232] -ldrh r12, [r0,#+244] -vmladavax.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #152] -strh r10, [r0,#+234] -ldrh r10, [r0,#+246] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #154] -vldrw.u32 Q5, [Q2, #92] -strh r8, [r0,#+236] -ldrh r8, [r0,#+248] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #156] -strh r6, [r0,#+238] -ldrh r6, [r0,#+250] -vmladavax.s16 
r6, Q6, Q5 -vldrh.u16 Q7, [r1, #158] -strh r4, [r0,#+240] -ldrh r4, [r0,#+252] -vmladavax.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #160] -strh r14, [r0,#+242] -ldrh r14, [r0,#+254] -vmladavax.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #162] -strh r12, [r0,#+244] -ldrh r12, [r0,#+256] -vmladavax.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #164] -strh r10, [r0,#+246] -ldrh r10, [r0,#+258] -vmladavax.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #166] -strh r8, [r0,#+248] -ldrh r8, [r0,#+260] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #168] -vldrw.u32 Q6, [Q2, #92] -strh r6, [r0,#+250] -ldrh r6, [r0,#+262] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #170] -strh r4, [r0,#+252] -ldrh r4, [r0,#+264] -vmladavax.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #172] -strh r14, [r0,#+254] -ldrh r14, [r0,#+266] -vmladavax.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #174] -strh r12, [r0,#+256] -ldrh r12, [r0,#+268] -vmladavax.s16 r12, Q1, Q6 -vldrh.u16 Q3, [r1, #176] -strh r10, [r0,#+258] -ldrh r10, [r0,#+270] -vmladavax.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #178] -strh r8, [r0,#+260] -ldrh r8, [r0,#+272] -vmladavax.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #180] -strh r6, [r0,#+262] -ldrh r6, [r0,#+274] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #182] -vldrw.u32 Q7, [Q2, #92] -strh r4, [r0,#+264] -ldrh r4, [r0,#+276] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #184] -strh r14, [r0,#+266] -ldrh r14, [r0,#+278] -vmladavax.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #186] -strh r12, [r0,#+268] -ldrh r12, [r0,#+280] -vmladavax.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #188] -strh r10, [r0,#+270] -ldrh r10, [r0,#+282] -vmladavax.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #190] -strh r8, [r0,#+272] -ldrh r8, [r0,#+284] -vmladavax.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #192] -strh r6, [r0,#+274] -ldrh r6, [r0,#+286] -vmladavax.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #194] -strh r4, [r0,#+276] -ldrh r4, [r0,#+288] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q7, [r1, #196] -vldrw.u32 Q0, [Q2, #92] -strh r14, [r0,#+278] -ldrh r14, [r0,#+290] -vmladavax.s16 r14, Q7, Q0 
-vldrh.u16 Q1, [r1, #198]
-strh r12, [r0,#+280]
-ldrh r12, [r0,#+292]
-vmladavax.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #200]
-strh r10, [r0,#+282]
-ldrh r10, [r0,#+294]
-vmladavax.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #202]
-strh r8, [r0,#+284]
-ldrh r8, [r0,#+296]
-vmladavax.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #204]
-strh r6, [r0,#+286]
-ldrh r6, [r0,#+298]
-vmladavax.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #206]
-strh r4, [r0,#+288]
-ldrh r4, [r0,#+300]
-vmladavax.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #208]
-strh r14, [r0,#+290]
-ldrh r14, [r0,#+302]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #210]
-vldrw.u32 Q1, [Q2, #92]
-strh r12, [r0,#+292]
-ldrh r12, [r0,#+304]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #212]
-strh r10, [r0,#+294]
-ldrh r10, [r0,#+306]
-vmladavax.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #214]
-strh r8, [r0,#+296]
-ldrh r8, [r0,#+308]
-vmladavax.s16 r8, Q4, Q1
-vldrh.u16 Q5, [r1, #216]
-strh r6, [r0,#+298]
-ldrh r6, [r0,#+310]
-vmladavax.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #218]
-strh r4, [r0,#+300]
-ldrh r4, [r0,#+312]
-vmladavax.s16 r4, Q6, Q1
-vldrh.u16 Q7, [r1, #220]
-strh r14, [r0,#+302]
-ldrh r14, [r0,#+314]
-vmladavax.s16 r14, Q7, Q1
-vldrh.u16 Q0, [r1, #222]
-strh r12, [r0,#+304]
-ldrh r12, [r0,#+316]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q1, [r1, #224]
-vldrw.u32 Q3, [Q2, #92]
-strh r10, [r0,#+306]
-ldrh r10, [r0,#+318]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #226]
-strh r8, [r0,#+308]
-ldrh r8, [r0,#+320]
-vmladavax.s16 r8, Q4, Q3
-vldrh.u16 Q5, [r1, #228]
-strh r6, [r0,#+310]
-ldrh r6, [r0,#+322]
-vmladavax.s16 r6, Q5, Q3
-vldrh.u16 Q6, [r1, #230]
-strh r4, [r0,#+312]
-ldrh r4, [r0,#+324]
-vmladavax.s16 r4, Q6, Q3
-vldrh.u16 Q7, [r1, #232]
-strh r14, [r0,#+314]
-ldrh r14, [r0,#+326]
-vmladavax.s16 r14, Q7, Q3
-vldrh.u16 Q0, [r1, #234]
-strh r12, [r0,#+316]
-ldrh r12, [r0,#+328]
-vmladavax.s16 r12, Q0, Q3
-vldrh.u16 Q1, [r1, #236]
-strh r10, [r0,#+318]
-ldrh r10, [r0,#+330]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q3, [r1, #238]
-vldrw.u32 Q4, [Q2, #92]
-strh r8, [r0,#+320]
-ldrh r8, [r0,#+332]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #240]
-strh r6, [r0,#+322]
-vmladavx.s16 r6, Q5, Q4
-vldrh.u16 Q6, [r1, #242]
-strh r4, [r0,#+324]
-vmladavx.s16 r4, Q6, Q4
-vldrh.u16 Q7, [r1, #244]
-strh r14, [r0,#+326]
-vmladavx.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #246]
-strh r12, [r0,#+328]
-vmladavx.s16 r12, Q0, Q4
-vldrh.u16 Q1, [r1, #248]
-strh r10, [r0,#+330]
-vmladavx.s16 r10, Q1, Q4
-vldrh.u16 Q3, [r1, #250]
-strh r8, [r0,#+332]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q4, [r1, #252]
-vldrw.u32 Q5, [Q2, #92]
-strh r6, [r0,#+334]
-vmladavx.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #254]
-strh r4, [r0,#+336]
-vmladavx.s16 r4, Q6, Q5
-vldrh.u16 Q7, [r1, #-14]
-vldrw.u32 Q0, [Q2, #108]
-strh r14, [r0,#+338]
-ldrh r14, [r0,#+96]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #-12]
-strh r12, [r0,#+340]
-ldrh r12, [r0,#+98]
-vmladavax.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #-10]
-strh r10, [r0,#+342]
-ldrh r10, [r0,#+100]
-vmladavax.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #-8]
-strh r8, [r0,#+344]
-ldrh r8, [r0,#+102]
-vmladavax.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #-6]
-strh r6, [r0,#+346]
-ldrh r6, [r0,#+104]
-vmladavax.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #-4]
-strh r4, [r0,#+348]
-ldrh r4, [r0,#+106]
-vmladavax.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #-2]
-strh r14, [r0,#+96]
-ldrh r14, [r0,#+108]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #0]
-vldrw.u32 Q1, [Q2, #108]
-strh r12, [r0,#+98]
-ldrh r12, [r0,#+110]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #2]
-strh r10, [r0,#+100]
-ldrh r10, [r0,#+112]
-vmladavax.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #4]
-strh r8, [r0,#+102]
-ldrh r8, [r0,#+114]
-vmladavax.s16 r8, Q4, Q1
-vldrh.u16 Q5, [r1, #6]
-strh r6, [r0,#+104]
-ldrh r6, [r0,#+116]
-vmladavax.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #8]
-strh r4, [r0,#+106]
-ldrh r4, [r0,#+118]
-vmladavax.s16 r4, Q6, Q1
-vldrh.u16 Q7, [r1, #10]
-strh r14, [r0,#+108]
-ldrh r14, [r0,#+120]
-vmladavax.s16 r14, Q7, Q1
-vldrh.u16 Q0, [r1, #12]
-strh r12, [r0,#+110]
-ldrh r12, [r0,#+122]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q1, [r1, #14]
-vldrw.u32 Q3, [Q2, #108]
-strh r10, [r0,#+112]
-ldrh r10, [r0,#+124]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #16]
-strh r8, [r0,#+114]
-ldrh r8, [r0,#+126]
-vmladavax.s16 r8, Q4, Q3
-vldrh.u16 Q5, [r1, #18]
-strh r6, [r0,#+116]
-ldrh r6, [r0,#+128]
-vmladavax.s16 r6, Q5, Q3
-vldrh.u16 Q6, [r1, #20]
-strh r4, [r0,#+118]
-ldrh r4, [r0,#+130]
-vmladavax.s16 r4, Q6, Q3
-vldrh.u16 Q7, [r1, #22]
-strh r14, [r0,#+120]
-ldrh r14, [r0,#+132]
-vmladavax.s16 r14, Q7, Q3
-vldrh.u16 Q0, [r1, #24]
-strh r12, [r0,#+122]
-ldrh r12, [r0,#+134]
-vmladavax.s16 r12, Q0, Q3
-vldrh.u16 Q1, [r1, #26]
-strh r10, [r0,#+124]
-ldrh r10, [r0,#+136]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q3, [r1, #28]
-vldrw.u32 Q4, [Q2, #108]
-strh r8, [r0,#+126]
-ldrh r8, [r0,#+138]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #30]
-strh r6, [r0,#+128]
-ldrh r6, [r0,#+140]
-vmladavax.s16 r6, Q5, Q4
-vldrh.u16 Q6, [r1, #32]
-strh r4, [r0,#+130]
-ldrh r4, [r0,#+142]
-vmladavax.s16 r4, Q6, Q4
-vldrh.u16 Q7, [r1, #34]
-strh r14, [r0,#+132]
-ldrh r14, [r0,#+144]
-vmladavax.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #36]
-strh r12, [r0,#+134]
-ldrh r12, [r0,#+146]
-vmladavax.s16 r12, Q0, Q4
-vldrh.u16 Q1, [r1, #38]
-strh r10, [r0,#+136]
-ldrh r10, [r0,#+148]
-vmladavax.s16 r10, Q1, Q4
-vldrh.u16 Q3, [r1, #40]
-strh r8, [r0,#+138]
-ldrh r8, [r0,#+150]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q4, [r1, #42]
-vldrw.u32 Q5, [Q2, #108]
-strh r6, [r0,#+140]
-ldrh r6, [r0,#+152]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #44]
-strh r4, [r0,#+142]
-ldrh r4, [r0,#+154]
-vmladavax.s16 r4, Q6, Q5
-vldrh.u16 Q7, [r1, #46]
-strh r14, [r0,#+144]
-ldrh r14, [r0,#+156]
-vmladavax.s16 r14, Q7, Q5
-vldrh.u16 Q0, [r1, #48]
-strh r12, [r0,#+146]
-ldrh r12, [r0,#+158]
-vmladavax.s16 r12, Q0, Q5
-vldrh.u16 Q1, [r1, #50]
-strh r10, [r0,#+148]
-ldrh r10, [r0,#+160]
-vmladavax.s16 r10, Q1, Q5
-vldrh.u16 Q3, [r1, #52]
-strh r8, [r0,#+150]
-ldrh r8, [r0,#+162]
-vmladavax.s16 r8, Q3, Q5
-vldrh.u16 Q4, [r1, #54]
-strh r6, [r0,#+152]
-ldrh r6, [r0,#+164]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q5, [r1, #56]
-vldrw.u32 Q6, [Q2, #108]
-strh r4, [r0,#+154]
-ldrh r4, [r0,#+166]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #58]
-strh r14, [r0,#+156]
-ldrh r14, [r0,#+168]
-vmladavax.s16 r14, Q7, Q6
-vldrh.u16 Q0, [r1, #60]
-strh r12, [r0,#+158]
-ldrh r12, [r0,#+170]
-vmladavax.s16 r12, Q0, Q6
-vldrh.u16 Q1, [r1, #62]
-strh r10, [r0,#+160]
-ldrh r10, [r0,#+172]
-vmladavax.s16 r10, Q1, Q6
-vldrh.u16 Q3, [r1, #64]
-strh r8, [r0,#+162]
-ldrh r8, [r0,#+174]
-vmladavax.s16 r8, Q3, Q6
-vldrh.u16 Q4, [r1, #66]
-strh r6, [r0,#+164]
-ldrh r6, [r0,#+176]
-vmladavax.s16 r6, Q4, Q6
-vldrh.u16 Q5, [r1, #68]
-strh r4, [r0,#+166]
-ldrh r4, [r0,#+178]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q6, [r1, #70]
-vldrw.u32 Q7, [Q2, #108]
-strh r14, [r0,#+168]
-ldrh r14, [r0,#+180]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #72]
-strh r12, [r0,#+170]
-ldrh r12, [r0,#+182]
-vmladavax.s16 r12, Q0, Q7
-vldrh.u16 Q1, [r1, #74]
-strh r10, [r0,#+172]
-ldrh r10, [r0,#+184]
-vmladavax.s16 r10, Q1, Q7
-vldrh.u16 Q3, [r1, #76]
-strh r8, [r0,#+174]
-ldrh r8, [r0,#+186]
-vmladavax.s16 r8, Q3, Q7
-vldrh.u16 Q4, [r1, #78]
-strh r6, [r0,#+176]
-ldrh r6, [r0,#+188]
-vmladavax.s16 r6, Q4, Q7
-vldrh.u16 Q5, [r1, #80]
-strh r4, [r0,#+178]
-ldrh r4, [r0,#+190]
-vmladavax.s16 r4, Q5, Q7
-vldrh.u16 Q6, [r1, #82]
-strh r14, [r0,#+180]
-ldrh r14, [r0,#+192]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q7, [r1, #84]
-vldrw.u32 Q0, [Q2, #108]
-strh r12, [r0,#+182]
-ldrh r12, [r0,#+194]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #86]
-strh r10, [r0,#+184]
-ldrh r10, [r0,#+196]
-vmladavax.s16 r10, Q1, Q0
-vldrh.u16 Q3, [r1, #88]
-strh r8, [r0,#+186]
-ldrh r8, [r0,#+198]
-vmladavax.s16 r8, Q3, Q0
-vldrh.u16 Q4, [r1, #90]
-strh r6, [r0,#+188]
-ldrh r6, [r0,#+200]
-vmladavax.s16 r6, Q4, Q0
-vldrh.u16 Q5, [r1, #92]
-strh r4, [r0,#+190]
-ldrh r4, [r0,#+202]
-vmladavax.s16 r4, Q5, Q0
-vldrh.u16 Q6, [r1, #94]
-strh r14, [r0,#+192]
-ldrh r14, [r0,#+204]
-vmladavax.s16 r14, Q6, Q0
-vldrh.u16 Q7, [r1, #96]
-strh r12, [r0,#+194]
-ldrh r12, [r0,#+206]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q0, [r1, #98]
-vldrw.u32 Q1, [Q2, #108]
-strh r10, [r0,#+196]
-ldrh r10, [r0,#+208]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #100]
-strh r8, [r0,#+198]
-ldrh r8, [r0,#+210]
-vmladavax.s16 r8, Q3, Q1
-vldrh.u16 Q4, [r1, #102]
-strh r6, [r0,#+200]
-ldrh r6, [r0,#+212]
-vmladavax.s16 r6, Q4, Q1
-vldrh.u16 Q5, [r1, #104]
-strh r4, [r0,#+202]
-ldrh r4, [r0,#+214]
-vmladavax.s16 r4, Q5, Q1
-vldrh.u16 Q6, [r1, #106]
-strh r14, [r0,#+204]
-ldrh r14, [r0,#+216]
-vmladavax.s16 r14, Q6, Q1
-vldrh.u16 Q7, [r1, #108]
-strh r12, [r0,#+206]
-ldrh r12, [r0,#+218]
-vmladavax.s16 r12, Q7, Q1
-vldrh.u16 Q0, [r1, #110]
-strh r10, [r0,#+208]
-ldrh r10, [r0,#+220]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q1, [r1, #112]
-vldrw.u32 Q3, [Q2, #108]
-strh r8, [r0,#+210]
-ldrh r8, [r0,#+222]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #114]
-strh r6, [r0,#+212]
-ldrh r6, [r0,#+224]
-vmladavax.s16 r6, Q4, Q3
-vldrh.u16 Q5, [r1, #116]
-strh r4, [r0,#+214]
-ldrh r4, [r0,#+226]
-vmladavax.s16 r4, Q5, Q3
-vldrh.u16 Q6, [r1, #118]
-strh r14, [r0,#+216]
-ldrh r14, [r0,#+228]
-vmladavax.s16 r14, Q6, Q3
-vldrh.u16 Q7, [r1, #120]
-strh r12, [r0,#+218]
-ldrh r12, [r0,#+230]
-vmladavax.s16 r12, Q7, Q3
-vldrh.u16 Q0, [r1, #122]
-strh r10, [r0,#+220]
-ldrh r10, [r0,#+232]
-vmladavax.s16 r10, Q0, Q3
-vldrh.u16 Q1, [r1, #124]
-strh r8, [r0,#+222]
-ldrh r8, [r0,#+234]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q3, [r1, #126]
-vldrw.u32 Q4, [Q2, #108]
-strh r6, [r0,#+224]
-ldrh r6, [r0,#+236]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #128]
-strh r4, [r0,#+226]
-ldrh r4, [r0,#+238]
-vmladavax.s16 r4, Q5, Q4
-vldrh.u16 Q6, [r1, #130]
-strh r14, [r0,#+228]
-ldrh r14, [r0,#+240]
-vmladavax.s16 r14, Q6, Q4
-vldrh.u16 Q7, [r1, #132]
-strh r12, [r0,#+230]
-ldrh r12, [r0,#+242]
-vmladavax.s16 r12, Q7, Q4
-vldrh.u16 Q0, [r1, #134]
-strh r10, [r0,#+232]
-ldrh r10, [r0,#+244]
-vmladavax.s16 r10, Q0, Q4
-vldrh.u16 Q1, [r1, #136]
-strh r8, [r0,#+234]
-ldrh r8, [r0,#+246]
-vmladavax.s16 r8, Q1, Q4
-vldrh.u16 Q3, [r1, #138]
-strh r6, [r0,#+236]
-ldrh r6, [r0,#+248]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q4, [r1, #140]
-vldrw.u32 Q5, [Q2, #108]
-strh r4, [r0,#+238]
-ldrh r4, [r0,#+250]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #142]
-strh r14, [r0,#+240]
-ldrh r14, [r0,#+252]
-vmladavax.s16 r14, Q6, Q5
-vldrh.u16 Q7, [r1, #144]
-strh r12, [r0,#+242]
-ldrh r12, [r0,#+254]
-vmladavax.s16 r12, Q7, Q5
-vldrh.u16 Q0, [r1, #146]
-strh r10, [r0,#+244]
-ldrh r10, [r0,#+256]
-vmladavax.s16 r10, Q0, Q5
-vldrh.u16 Q1, [r1, #148]
-strh r8, [r0,#+246]
-ldrh r8, [r0,#+258]
-vmladavax.s16 r8, Q1, Q5
-vldrh.u16 Q3, [r1, #150]
-strh r6, [r0,#+248]
-ldrh r6, [r0,#+260]
-vmladavax.s16 r6, Q3, Q5
-vldrh.u16 Q4, [r1, #152]
-strh r4, [r0,#+250]
-ldrh r4, [r0,#+262]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q5, [r1, #154]
-vldrw.u32 Q6, [Q2, #108]
-strh r14, [r0,#+252]
-ldrh r14, [r0,#+264]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #156]
-strh r12, [r0,#+254]
-ldrh r12, [r0,#+266]
-vmladavax.s16 r12, Q7, Q6
-vldrh.u16 Q0, [r1, #158]
-strh r10, [r0,#+256]
-ldrh r10, [r0,#+268]
-vmladavax.s16 r10, Q0, Q6
-vldrh.u16 Q1, [r1, #160]
-strh r8, [r0,#+258]
-ldrh r8, [r0,#+270]
-vmladavax.s16 r8, Q1, Q6
-vldrh.u16 Q3, [r1, #162]
-strh r6, [r0,#+260]
-ldrh r6, [r0,#+272]
-vmladavax.s16 r6, Q3, Q6
-vldrh.u16 Q4, [r1, #164]
-strh r4, [r0,#+262]
-ldrh r4, [r0,#+274]
-vmladavax.s16 r4, Q4, Q6
-vldrh.u16 Q5, [r1, #166]
-strh r14, [r0,#+264]
-ldrh r14, [r0,#+276]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q6, [r1, #168]
-vldrw.u32 Q7, [Q2, #108]
-strh r12, [r0,#+266]
-ldrh r12, [r0,#+278]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #170]
-strh r10, [r0,#+268]
-ldrh r10, [r0,#+280]
-vmladavax.s16 r10, Q0, Q7
-vldrh.u16 Q1, [r1, #172]
-strh r8, [r0,#+270]
-ldrh r8, [r0,#+282]
-vmladavax.s16 r8, Q1, Q7
-vldrh.u16 Q3, [r1, #174]
-strh r6, [r0,#+272]
-ldrh r6, [r0,#+284]
-vmladavax.s16 r6, Q3, Q7
-vldrh.u16 Q4, [r1, #176]
-strh r4, [r0,#+274]
-ldrh r4, [r0,#+286]
-vmladavax.s16 r4, Q4, Q7
-vldrh.u16 Q5, [r1, #178]
-strh r14, [r0,#+276]
-ldrh r14, [r0,#+288]
-vmladavax.s16 r14, Q5, Q7
-vldrh.u16 Q6, [r1, #180]
-strh r12, [r0,#+278]
-ldrh r12, [r0,#+290]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q7, [r1, #182]
-vldrw.u32 Q0, [Q2, #108]
-strh r10, [r0,#+280]
-ldrh r10, [r0,#+292]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #184]
-strh r8, [r0,#+282]
-ldrh r8, [r0,#+294]
-vmladavax.s16 r8, Q1, Q0
-vldrh.u16 Q3, [r1, #186]
-strh r6, [r0,#+284]
-ldrh r6, [r0,#+296]
-vmladavax.s16 r6, Q3, Q0
-vldrh.u16 Q4, [r1, #188]
-strh r4, [r0,#+286]
-ldrh r4, [r0,#+298]
-vmladavax.s16 r4, Q4, Q0
-vldrh.u16 Q5, [r1, #190]
-strh r14, [r0,#+288]
-ldrh r14, [r0,#+300]
-vmladavax.s16 r14, Q5, Q0
-vldrh.u16 Q6, [r1, #192]
-strh r12, [r0,#+290]
-ldrh r12, [r0,#+302]
-vmladavax.s16 r12, Q6, Q0
-vldrh.u16 Q7, [r1, #194]
-strh r10, [r0,#+292]
-ldrh r10, [r0,#+304]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q0, [r1, #196]
-vldrw.u32 Q1, [Q2, #108]
-strh r8, [r0,#+294]
-ldrh r8, [r0,#+306]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #198]
-strh r6, [r0,#+296]
-ldrh r6, [r0,#+308]
-vmladavax.s16 r6, Q3, Q1
-vldrh.u16 Q4, [r1, #200]
-strh r4, [r0,#+298]
-ldrh r4, [r0,#+310]
-vmladavax.s16 r4, Q4, Q1
-vldrh.u16 Q5, [r1, #202]
-strh r14, [r0,#+300]
-ldrh r14, [r0,#+312]
-vmladavax.s16 r14, Q5, Q1
-vldrh.u16 Q6, [r1, #204]
-strh r12, [r0,#+302]
-ldrh r12, [r0,#+314]
-vmladavax.s16 r12, Q6, Q1
-vldrh.u16 Q7, [r1, #206]
-strh r10, [r0,#+304]
-ldrh r10, [r0,#+316]
-vmladavax.s16 r10, Q7, Q1
-vldrh.u16 Q0, [r1, #208]
-strh r8, [r0,#+306]
-ldrh r8, [r0,#+318]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q1, [r1, #210]
-vldrw.u32 Q3, [Q2, #108]
-strh r6, [r0,#+308]
-ldrh r6, [r0,#+320]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #212]
-strh r4, [r0,#+310]
-ldrh r4, [r0,#+322]
-vmladavax.s16 r4, Q4, Q3
-vldrh.u16 Q5, [r1, #214]
-strh r14, [r0,#+312]
-ldrh r14, [r0,#+324]
-vmladavax.s16 r14, Q5, Q3
-vldrh.u16 Q6, [r1, #216]
-strh r12, [r0,#+314]
-ldrh r12, [r0,#+326]
-vmladavax.s16 r12, Q6, Q3
-vldrh.u16 Q7, [r1, #218]
-strh r10, [r0,#+316]
-ldrh r10, [r0,#+328]
-vmladavax.s16 r10, Q7, Q3
-vldrh.u16 Q0, [r1, #220]
-strh r8, [r0,#+318]
-ldrh r8, [r0,#+330]
-vmladavax.s16 r8, Q0, Q3
-vldrh.u16 Q1, [r1, #222]
-strh r6, [r0,#+320]
-ldrh r6, [r0,#+332]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q3, [r1, #224]
-vldrw.u32 Q4, [Q2, #108]
-strh r4, [r0,#+322]
-ldrh r4, [r0,#+334]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #226]
-strh r14, [r0,#+324]
-ldrh r14, [r0,#+336]
-vmladavax.s16 r14, Q5, Q4
-vldrh.u16 Q6, [r1, #228]
-strh r12, [r0,#+326]
-ldrh r12, [r0,#+338]
-vmladavax.s16 r12, Q6, Q4
-vldrh.u16 Q7, [r1, #230]
-strh r10, [r0,#+328]
-ldrh r10, [r0,#+340]
-vmladavax.s16 r10, Q7, Q4
-vldrh.u16 Q0, [r1, #232]
-strh r8, [r0,#+330]
-ldrh r8, [r0,#+342]
-vmladavax.s16 r8, Q0, Q4
-vldrh.u16 Q1, [r1, #234]
-strh r6, [r0,#+332]
-ldrh r6, [r0,#+344]
-vmladavax.s16 r6, Q1, Q4
-vldrh.u16 Q3, [r1, #236]
-strh r4, [r0,#+334]
-ldrh r4, [r0,#+346]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q4, [r1, #238]
-vldrw.u32 Q5, [Q2, #108]
-strh r14, [r0,#+336]
-ldrh r14, [r0,#+348]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #240]
-strh r12, [r0,#+338]
-vmladavx.s16 r12, Q6, Q5
-vldrh.u16 Q7, [r1, #242]
-strh r10, [r0,#+340]
-vmladavx.s16 r10, Q7, Q5
-vldrh.u16 Q0, [r1, #244]
-strh r8, [r0,#+342]
-vmladavx.s16 r8, Q0, Q5
-vldrh.u16 Q1, [r1, #246]
-strh r6, [r0,#+344]
-vmladavx.s16 r6, Q1, Q5
-vldrh.u16 Q3, [r1, #248]
-strh r4, [r0,#+346]
-vmladavx.s16 r4, Q3, Q5
-vldrh.u16 Q4, [r1, #250]
-strh r14, [r0,#+348]
-vmladavx.s16 r14, Q4, Q5
-vldrh.u16 Q5, [r1, #252]
-vldrw.u32 Q6, [Q2, #108]
-strh r12, [r0,#+350]
-vmladavx.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #254]
-strh r10, [r0,#+352]
-vmladavx.s16 r10, Q7, Q6
-vldrh.u16 Q0, [r1, #-14]
-vldrw.u32 Q1, [Q2, #124]
-strh r8, [r0,#+354]
-ldrh r8, [r0,#+112]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #-12]
-strh r6, [r0,#+356]
-ldrh r6, [r0,#+114]
-vmladavax.s16 r6, Q3, Q1
-vldrh.u16 Q4, [r1, #-10]
-strh r4, [r0,#+358]
-ldrh r4, [r0,#+116]
-vmladavax.s16 r4, Q4, Q1
-vldrh.u16 Q5, [r1, #-8]
-strh r14, [r0,#+360]
-ldrh r14, [r0,#+118]
-vmladavax.s16 r14, Q5, Q1
-vldrh.u16 Q6, [r1, #-6]
-strh r12, [r0,#+362]
-ldrh r12, [r0,#+120]
-vmladavax.s16 r12, Q6, Q1
-vldrh.u16 Q7, [r1, #-4]
-strh r10, [r0,#+364]
-ldrh r10, [r0,#+122]
-vmladavax.s16 r10, Q7, Q1
-vldrh.u16 Q0, [r1, #-2]
-strh r8, [r0,#+112]
-ldrh r8, [r0,#+124]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q1, [r1, #0]
-vldrw.u32 Q3, [Q2, #124]
-strh r6, [r0,#+114]
-ldrh r6, [r0,#+126]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #2]
-strh r4, [r0,#+116]
-ldrh r4, [r0,#+128]
-vmladavax.s16 r4, Q4, Q3
-vldrh.u16 Q5, [r1, #4]
-strh r14, [r0,#+118]
-ldrh r14, [r0,#+130]
-vmladavax.s16 r14, Q5, Q3
-vldrh.u16 Q6, [r1, #6]
-strh r12, [r0,#+120]
-ldrh r12, [r0,#+132]
-vmladavax.s16 r12, Q6, Q3
-vldrh.u16 Q7, [r1, #8]
-strh r10, [r0,#+122]
-ldrh r10, [r0,#+134]
-vmladavax.s16 r10, Q7, Q3
-vldrh.u16 Q0, [r1, #10]
-strh r8, [r0,#+124]
-ldrh r8, [r0,#+136]
-vmladavax.s16 r8, Q0, Q3
-vldrh.u16 Q1, [r1, #12]
-strh r6, [r0,#+126]
-ldrh r6, [r0,#+138]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q3, [r1, #14]
-vldrw.u32 Q4, [Q2, #124]
-strh r4, [r0,#+128]
-ldrh r4, [r0,#+140]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #16]
-strh r14, [r0,#+130]
-ldrh r14, [r0,#+142]
-vmladavax.s16 r14, Q5, Q4
-vldrh.u16 Q6, [r1, #18]
-strh r12, [r0,#+132]
-ldrh r12, [r0,#+144]
-vmladavax.s16 r12, Q6, Q4
-vldrh.u16 Q7, [r1, #20]
-strh r10, [r0,#+134]
-ldrh r10, [r0,#+146]
-vmladavax.s16 r10, Q7, Q4
-vldrh.u16 Q0, [r1, #22]
-strh r8, [r0,#+136]
-ldrh r8, [r0,#+148]
-vmladavax.s16 r8, Q0, Q4
-vldrh.u16 Q1, [r1, #24]
-strh r6, [r0,#+138]
-ldrh r6, [r0,#+150]
-vmladavax.s16 r6, Q1, Q4
-vldrh.u16 Q3, [r1, #26]
-strh r4, [r0,#+140]
-ldrh r4, [r0,#+152]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q4, [r1, #28]
-vldrw.u32 Q5, [Q2, #124]
-strh r14, [r0,#+142]
-ldrh r14, [r0,#+154]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #30]
-strh r12, [r0,#+144]
-ldrh r12, [r0,#+156]
-vmladavax.s16 r12, Q6, Q5
-vldrh.u16 Q7, [r1, #32]
-strh r10, [r0,#+146]
-ldrh r10, [r0,#+158]
-vmladavax.s16 r10, Q7, Q5
-vldrh.u16 Q0, [r1, #34]
-strh r8, [r0,#+148]
-ldrh r8, [r0,#+160]
-vmladavax.s16 r8, Q0, Q5
-vldrh.u16 Q1, [r1, #36]
-strh r6, [r0,#+150]
-ldrh r6, [r0,#+162]
-vmladavax.s16 r6, Q1, Q5
-vldrh.u16 Q3, [r1, #38]
-strh r4, [r0,#+152]
-ldrh r4, [r0,#+164]
-vmladavax.s16 r4, Q3, Q5
-vldrh.u16 Q4, [r1, #40]
-strh r14, [r0,#+154]
-ldrh r14, [r0,#+166]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q5, [r1, #42]
-vldrw.u32 Q6, [Q2, #124]
-strh r12, [r0,#+156]
-ldrh r12, [r0,#+168]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #44]
-strh r10, [r0,#+158]
-ldrh r10, [r0,#+170]
-vmladavax.s16 r10, Q7, Q6
-vldrh.u16 Q0, [r1, #46]
-strh r8, [r0,#+160]
-ldrh r8, [r0,#+172]
-vmladavax.s16 r8, Q0, Q6
-vldrh.u16 Q1, [r1, #48]
-strh r6, [r0,#+162]
-ldrh r6, [r0,#+174]
-vmladavax.s16 r6, Q1, Q6
-vldrh.u16 Q3, [r1, #50]
-strh r4, [r0,#+164]
-ldrh r4, [r0,#+176]
-vmladavax.s16 r4, Q3, Q6
-vldrh.u16 Q4, [r1, #52]
-strh r14, [r0,#+166]
-ldrh r14, [r0,#+178]
-vmladavax.s16 r14, Q4, Q6
-vldrh.u16 Q5, [r1, #54]
-strh r12, [r0,#+168]
-ldrh r12, [r0,#+180]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q6, [r1, #56]
-vldrw.u32 Q7, [Q2, #124]
-strh r10, [r0,#+170]
-ldrh r10, [r0,#+182]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #58]
-strh r8, [r0,#+172]
-ldrh r8, [r0,#+184]
-vmladavax.s16 r8, Q0, Q7
-vldrh.u16 Q1, [r1, #60]
-strh r6, [r0,#+174]
-ldrh r6, [r0,#+186]
-vmladavax.s16 r6, Q1, Q7
-vldrh.u16 Q3, [r1, #62]
-strh r4, [r0,#+176]
-ldrh r4, [r0,#+188]
-vmladavax.s16 r4, Q3, Q7
-vldrh.u16 Q4, [r1, #64]
-strh r14, [r0,#+178]
-ldrh r14, [r0,#+190]
-vmladavax.s16 r14, Q4, Q7
-vldrh.u16 Q5, [r1, #66]
-strh r12, [r0,#+180]
-ldrh r12, [r0,#+192]
-vmladavax.s16 r12, Q5, Q7
-vldrh.u16 Q6, [r1, #68]
-strh r10, [r0,#+182]
-ldrh r10, [r0,#+194]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q7, [r1, #70]
-vldrw.u32 Q0, [Q2, #124]
-strh r8, [r0,#+184]
-ldrh r8, [r0,#+196]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #72]
-strh r6, [r0,#+186]
-ldrh r6, [r0,#+198]
-vmladavax.s16 r6, Q1, Q0
-vldrh.u16 Q3, [r1, #74]
-strh r4, [r0,#+188]
-ldrh r4, [r0,#+200]
-vmladavax.s16 r4, Q3, Q0
-vldrh.u16 Q4, [r1, #76]
-strh r14, [r0,#+190]
-ldrh r14, [r0,#+202]
-vmladavax.s16 r14, Q4, Q0
-vldrh.u16 Q5, [r1, #78]
-strh r12, [r0,#+192]
-ldrh r12, [r0,#+204]
-vmladavax.s16 r12, Q5, Q0
-vldrh.u16 Q6, [r1, #80]
-strh r10, [r0,#+194]
-ldrh r10, [r0,#+206]
-vmladavax.s16 r10, Q6, Q0
-vldrh.u16 Q7, [r1, #82]
-strh r8, [r0,#+196]
-ldrh r8, [r0,#+208]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q0, [r1, #84]
-vldrw.u32 Q1, [Q2, #124]
-strh r6, [r0,#+198]
-ldrh r6, [r0,#+210]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #86]
-strh r4, [r0,#+200]
-ldrh r4, [r0,#+212]
-vmladavax.s16 r4, Q3, Q1
-vldrh.u16 Q4, [r1, #88]
-strh r14, [r0,#+202]
-ldrh r14, [r0,#+214]
-vmladavax.s16 r14, Q4, Q1
-vldrh.u16 Q5, [r1, #90]
-strh r12, [r0,#+204]
-ldrh r12, [r0,#+216]
-vmladavax.s16 r12, Q5, Q1
-vldrh.u16 Q6, [r1, #92]
-strh r10, [r0,#+206]
-ldrh r10, [r0,#+218]
-vmladavax.s16 r10, Q6, Q1
-vldrh.u16 Q7, [r1, #94]
-strh r8, [r0,#+208]
-ldrh r8, [r0,#+220]
-vmladavax.s16 r8, Q7, Q1
-vldrh.u16 Q0, [r1, #96]
-strh r6, [r0,#+210]
-ldrh r6, [r0,#+222]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q1, [r1, #98]
-vldrw.u32 Q3, [Q2, #124]
-strh r4, [r0,#+212]
-ldrh r4, [r0,#+224]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #100]
-strh r14, [r0,#+214]
-ldrh r14, [r0,#+226]
-vmladavax.s16 r14, Q4, Q3
-vldrh.u16 Q5, [r1, #102]
-strh r12, [r0,#+216]
-ldrh r12, [r0,#+228]
-vmladavax.s16 r12, Q5, Q3
-vldrh.u16 Q6, [r1, #104]
-strh r10, [r0,#+218]
-ldrh r10, [r0,#+230]
-vmladavax.s16 r10, Q6, Q3
-vldrh.u16 Q7, [r1, #106]
-strh r8, [r0,#+220]
-ldrh r8, [r0,#+232]
-vmladavax.s16 r8, Q7, Q3
-vldrh.u16 Q0, [r1, #108]
-strh r6, [r0,#+222]
-ldrh r6, [r0,#+234]
-vmladavax.s16 r6, Q0, Q3
-vldrh.u16 Q1, [r1, #110]
-strh r4, [r0,#+224]
-ldrh r4, [r0,#+236]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q3, [r1, #112]
-vldrw.u32 Q4, [Q2, #124]
-strh r14, [r0,#+226]
-ldrh r14, [r0,#+238]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #114]
-strh r12, [r0,#+228]
-ldrh r12, [r0,#+240]
-vmladavax.s16 r12, Q5, Q4
-vldrh.u16 Q6, [r1, #116]
-strh r10, [r0,#+230]
-ldrh r10, [r0,#+242]
-vmladavax.s16 r10, Q6, Q4
-vldrh.u16 Q7, [r1, #118]
-strh r8, [r0,#+232]
-ldrh r8, [r0,#+244]
-vmladavax.s16 r8, Q7, Q4
-vldrh.u16 Q0, [r1, #120]
-strh r6, [r0,#+234]
-ldrh r6, [r0,#+246]
-vmladavax.s16 r6, Q0, Q4
-vldrh.u16 Q1, [r1, #122]
-strh r4, [r0,#+236]
-ldrh r4, [r0,#+248]
-vmladavax.s16 r4, Q1, Q4
-vldrh.u16 Q3, [r1, #124]
-strh r14, [r0,#+238]
-ldrh r14, [r0,#+250]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q4, [r1, #126]
-vldrw.u32 Q5, [Q2, #124]
-strh r12, [r0,#+240]
-ldrh r12, [r0,#+252]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #128]
-strh r10, [r0,#+242]
-ldrh r10, [r0,#+254]
-vmladavax.s16 r10, Q6, Q5
-vldrh.u16 Q7, [r1, #130]
-strh r8, [r0,#+244]
-ldrh r8, [r0,#+256]
-vmladavax.s16 r8, Q7, Q5
-vldrh.u16 Q0, [r1, #132]
-strh r6, [r0,#+246]
-ldrh r6, [r0,#+258]
-vmladavax.s16 r6, Q0, Q5
-vldrh.u16 Q1, [r1, #134]
-strh r4, [r0,#+248]
-ldrh r4, [r0,#+260]
-vmladavax.s16 r4, Q1, Q5
-vldrh.u16 Q3, [r1, #136]
-strh r14, [r0,#+250]
-ldrh r14, [r0,#+262]
-vmladavax.s16 r14, Q3, Q5
-vldrh.u16 Q4, [r1, #138]
-strh r12, [r0,#+252]
-ldrh r12, [r0,#+264]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q5, [r1, #140]
-vldrw.u32 Q6, [Q2, #124]
-strh r10, [r0,#+254]
-ldrh r10, [r0,#+266]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #142]
-strh r8, [r0,#+256]
-ldrh r8, [r0,#+268]
-vmladavax.s16 r8, Q7, Q6
-vldrh.u16 Q0, [r1, #144]
-strh r6, [r0,#+258]
-ldrh r6, [r0,#+270]
-vmladavax.s16 r6, Q0, Q6
-vldrh.u16 Q1, [r1, #146]
-strh r4, [r0,#+260]
-ldrh r4, [r0,#+272]
-vmladavax.s16 r4, Q1, Q6
-vldrh.u16 Q3, [r1, #148]
-strh r14, [r0,#+262]
-ldrh r14, [r0,#+274]
-vmladavax.s16 r14, Q3, Q6
-vldrh.u16 Q4, [r1, #150]
-strh r12, [r0,#+264]
-ldrh r12, [r0,#+276]
-vmladavax.s16 r12, Q4, Q6
-vldrh.u16 Q5, [r1, #152]
-strh r10, [r0,#+266]
-ldrh r10, [r0,#+278]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q6, [r1, #154]
-vldrw.u32 Q7, [Q2, #124]
-strh r8, [r0,#+268]
-ldrh r8, [r0,#+280]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #156]
-strh r6, [r0,#+270]
-ldrh r6, [r0,#+282]
-vmladavax.s16 r6, Q0, Q7
-vldrh.u16 Q1, [r1, #158]
-strh r4, [r0,#+272]
-ldrh r4, [r0,#+284]
-vmladavax.s16 r4, Q1, Q7
-vldrh.u16 Q3, [r1, #160]
-strh r14, [r0,#+274]
-ldrh r14, [r0,#+286]
-vmladavax.s16 r14, Q3, Q7
-vldrh.u16 Q4, [r1, #162]
-strh r12, [r0,#+276]
-ldrh r12, [r0,#+288]
-vmladavax.s16 r12, Q4, Q7
-vldrh.u16 Q5, [r1, #164]
-strh r10, [r0,#+278]
-ldrh r10, [r0,#+290]
-vmladavax.s16 r10, Q5, Q7
-vldrh.u16 Q6, [r1, #166]
-strh r8, [r0,#+280]
-ldrh r8, [r0,#+292]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q7, [r1, #168]
-vldrw.u32 Q0, [Q2, #124]
-strh r6, [r0,#+282]
-ldrh r6, [r0,#+294]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #170]
-strh r4, [r0,#+284]
-ldrh r4, [r0,#+296]
-vmladavax.s16 r4, Q1, Q0
-vldrh.u16 Q3, [r1, #172]
-strh r14, [r0,#+286]
-ldrh r14, [r0,#+298]
-vmladavax.s16 r14, Q3, Q0
-vldrh.u16 Q4, [r1, #174]
-strh r12, [r0,#+288]
-ldrh r12, [r0,#+300]
-vmladavax.s16 r12, Q4, Q0
-vldrh.u16 Q5, [r1, #176]
-strh r10, [r0,#+290]
-ldrh r10, [r0,#+302]
-vmladavax.s16 r10, Q5, Q0
-vldrh.u16 Q6, [r1, #178]
-strh r8, [r0,#+292]
-ldrh r8, [r0,#+304]
-vmladavax.s16 r8, Q6, Q0
-vldrh.u16 Q7, [r1, #180]
-strh r6, [r0,#+294]
-ldrh r6, [r0,#+306]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q0, [r1, #182]
-vldrw.u32 Q1, [Q2, #124]
-strh r4, [r0,#+296]
-ldrh r4, [r0,#+308]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #184]
-strh r14, [r0,#+298]
-ldrh r14, [r0,#+310]
-vmladavax.s16 r14, Q3, Q1
-vldrh.u16 Q4, [r1, #186]
-strh r12, [r0,#+300]
-ldrh r12, [r0,#+312]
-vmladavax.s16 r12, Q4, Q1
-vldrh.u16 Q5, [r1, #188]
-strh r10, [r0,#+302]
-ldrh r10, [r0,#+314]
-vmladavax.s16 r10, Q5, Q1
-vldrh.u16 Q6, [r1, #190]
-strh r8, [r0,#+304]
-ldrh r8, [r0,#+316]
-vmladavax.s16 r8, Q6, Q1
-vldrh.u16 Q7, [r1, #192]
-strh r6, [r0,#+306]
-ldrh r6, [r0,#+318]
-vmladavax.s16 r6, Q7, Q1
-vldrh.u16 Q0, [r1, #194]
-strh r4, [r0,#+308]
-ldrh r4, [r0,#+320]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q1, [r1, #196]
-vldrw.u32 Q3, [Q2, #124]
-strh r14, [r0,#+310]
-ldrh r14, [r0,#+322]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #198]
-strh r12, [r0,#+312]
-ldrh r12, [r0,#+324]
-vmladavax.s16 r12, Q4, Q3
-vldrh.u16 Q5, [r1, #200]
-strh r10, [r0,#+314]
-ldrh r10, [r0,#+326]
-vmladavax.s16 r10, Q5, Q3
-vldrh.u16 Q6, [r1, #202]
-strh r8, [r0,#+316]
-ldrh r8, [r0,#+328]
-vmladavax.s16 r8, Q6, Q3
-vldrh.u16 Q7, [r1, #204]
-strh r6, [r0,#+318]
-ldrh r6, [r0,#+330]
-vmladavax.s16 r6, Q7, Q3
-vldrh.u16 Q0, [r1, #206]
-strh r4, [r0,#+320]
-ldrh r4, [r0,#+332]
-vmladavax.s16 r4, Q0, Q3
-vldrh.u16 Q1, [r1, #208]
-strh r14, [r0,#+322]
-ldrh r14, [r0,#+334]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q3, [r1, #210]
-vldrw.u32 Q4, [Q2, #124]
-strh r12, [r0,#+324]
-ldrh r12, [r0,#+336]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #212]
-strh r10, [r0,#+326]
-ldrh r10, [r0,#+338]
-vmladavax.s16 r10, Q5, Q4
-vldrh.u16 Q6, [r1, #214]
-strh r8, [r0,#+328]
-ldrh r8, [r0,#+340]
-vmladavax.s16 r8, Q6, Q4
-vldrh.u16 Q7, [r1, #216]
-strh r6, [r0,#+330]
-ldrh r6, [r0,#+342]
-vmladavax.s16 r6, Q7, Q4
-vldrh.u16 Q0, [r1, #218]
-strh r4, [r0,#+332]
-ldrh r4, [r0,#+344]
-vmladavax.s16 r4, Q0, Q4
-vldrh.u16 Q1, [r1, #220]
-strh r14, [r0,#+334]
-ldrh r14, [r0,#+346]
-vmladavax.s16 r14, Q1, Q4
-vldrh.u16 Q3, [r1, #222]
-strh r12, [r0,#+336]
-ldrh r12, [r0,#+348]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q4, [r1, #224]
-vldrw.u32 Q5, [Q2, #124]
-strh r10, [r0,#+338]
-ldrh r10, [r0,#+350]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #226]
-strh r8, [r0,#+340]
-ldrh r8, [r0,#+352]
-vmladavax.s16 r8, Q6, Q5
-vldrh.u16 Q7, [r1, #228]
-strh r6, [r0,#+342]
-ldrh r6, [r0,#+354]
-vmladavax.s16 r6, Q7, Q5
-vldrh.u16 Q0, [r1, #230]
-strh r4, [r0,#+344]
-ldrh r4, [r0,#+356]
-vmladavax.s16 r4, Q0, Q5
-vldrh.u16 Q1, [r1, #232]
-strh r14, [r0,#+346]
-ldrh r14, [r0,#+358]
-vmladavax.s16 r14, Q1, Q5
-vldrh.u16 Q3, [r1, #234]
-strh r12, [r0,#+348]
-ldrh r12, [r0,#+360]
-vmladavax.s16 r12, Q3, Q5
-vldrh.u16 Q4, [r1, #236]
-strh r10, [r0,#+350]
-ldrh r10, [r0,#+362]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q5, [r1, #238]
-vldrw.u32 Q6, [Q2, #124]
-strh r8, [r0,#+352]
-ldrh r8, [r0,#+364]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #240]
-strh r6, [r0,#+354]
-vmladavx.s16 r6, Q7, Q6
-vldrh.u16 Q0, [r1, #242]
-strh r4, [r0,#+356]
-vmladavx.s16 r4, Q0, Q6
-vldrh.u16 Q1, [r1, #244]
-strh r14, [r0,#+358]
-vmladavx.s16 r14, Q1, Q6
-vldrh.u16 Q3, [r1, #246]
-strh r12, [r0,#+360]
-vmladavx.s16 r12, Q3, Q6
-vldrh.u16 Q4, [r1, #248]
-strh r10, [r0,#+362]
-vmladavx.s16 r10, Q4, Q6
-vldrh.u16 Q5, [r1, #250]
-strh r8, [r0,#+364]
-vmladavx.s16 r8, Q5, Q6
-vldrh.u16 Q6, [r1, #252]
-vldrw.u32 Q7, [Q2, #124]
-strh r6, [r0,#+366]
-vmladavx.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #254]
-strh r4, [r0,#+368]
-vmladavx.s16 r4, Q0, Q7
-vldrh.u16 Q1, [r1, #-14]
-vldrw.u32 Q3, [Q2, #140]
-strh r14, [r0,#+370]
-ldrh r14, [r0,#+128]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #-12]
-strh r12, [r0,#+372]
-ldrh r12, [r0,#+130]
-vmladavax.s16 r12, Q4, Q3
-vldrh.u16 Q5, [r1, #-10]
-strh r10, [r0,#+374]
-ldrh r10, [r0,#+132]
-vmladavax.s16 r10, Q5, Q3
-vldrh.u16 Q6, [r1, #-8]
-strh r8, [r0,#+376]
-ldrh r8, [r0,#+134]
-vmladavax.s16 r8, Q6, Q3
-vldrh.u16 Q7, [r1, #-6]
-strh r6, [r0,#+378]
-ldrh r6, [r0,#+136]
-vmladavax.s16 r6, Q7, Q3
-vldrh.u16 Q0, [r1, #-4]
-strh r4, [r0,#+380]
-ldrh r4, [r0,#+138]
-vmladavax.s16 r4, Q0, Q3
-vldrh.u16 Q1, [r1, #-2]
-strh r14, [r0,#+128]
-ldrh r14, [r0,#+140]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q3, [r1, #0]
-vldrw.u32 Q4, [Q2, #140]
-strh r12, [r0,#+130]
-ldrh r12, [r0,#+142]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #2]
-strh r10, [r0,#+132]
-ldrh r10, [r0,#+144]
-vmladavax.s16 r10, Q5, Q4
-vldrh.u16 Q6, [r1, #4]
-strh r8, [r0,#+134]
-ldrh r8, [r0,#+146]
-vmladavax.s16 r8, Q6, Q4
-vldrh.u16 Q7, [r1, #6]
-strh r6, [r0,#+136]
-ldrh r6, [r0,#+148]
-vmladavax.s16 r6, Q7, Q4
-vldrh.u16 Q0, [r1, #8]
-strh r4, [r0,#+138]
-ldrh r4, [r0,#+150]
-vmladavax.s16 r4, Q0, Q4
-vldrh.u16 Q1, [r1, #10]
-strh r14, [r0,#+140]
-ldrh r14, [r0,#+152]
-vmladavax.s16 r14, Q1, Q4
-vldrh.u16 Q3, [r1, #12]
-strh r12, [r0,#+142]
-ldrh r12, [r0,#+154]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q4, [r1, #14]
-vldrw.u32 Q5, [Q2, #140]
-strh r10, [r0,#+144]
-ldrh r10, [r0,#+156]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #16]
-strh r8, [r0,#+146]
-ldrh r8, [r0,#+158]
-vmladavax.s16 r8, Q6, Q5
-vldrh.u16 Q7, [r1, #18]
-strh r6, [r0,#+148]
-ldrh r6, [r0,#+160]
-vmladavax.s16 r6, Q7, Q5
-vldrh.u16 Q0, [r1, #20]
-strh r4, [r0,#+150]
-ldrh r4, [r0,#+162]
-vmladavax.s16 r4, Q0, Q5
-vldrh.u16 Q1, [r1, #22]
-strh r14, [r0,#+152]
-ldrh r14, [r0,#+164]
-vmladavax.s16 r14, Q1, Q5
-vldrh.u16 Q3, [r1, #24]
-strh r12, [r0,#+154]
-ldrh r12, [r0,#+166]
-vmladavax.s16 r12, Q3, Q5
-vldrh.u16 Q4, [r1, #26]
-strh r10, [r0,#+156]
-ldrh r10, [r0,#+168]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q5, [r1, #28]
-vldrw.u32 Q6, [Q2, #140]
-strh r8, [r0,#+158]
-ldrh r8, [r0,#+170]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #30]
-strh r6, [r0,#+160]
-ldrh r6, [r0,#+172]
-vmladavax.s16 r6, Q7, Q6
-vldrh.u16 Q0, [r1, #32]
-strh r4, [r0,#+162]
-ldrh r4, [r0,#+174]
-vmladavax.s16 r4, Q0, Q6
-vldrh.u16 Q1, [r1, #34]
-strh r14, [r0,#+164]
-ldrh r14, [r0,#+176]
-vmladavax.s16 r14, Q1, Q6
-vldrh.u16 Q3, [r1, #36]
-strh r12, [r0,#+166]
-ldrh r12, [r0,#+178]
-vmladavax.s16 r12, Q3, Q6
-vldrh.u16 Q4, [r1, #38]
-strh r10, [r0,#+168]
-ldrh r10, [r0,#+180]
-vmladavax.s16 r10, Q4, Q6
-vldrh.u16 Q5, [r1, #40]
-strh r8, [r0,#+170]
-ldrh r8, [r0,#+182]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q6, [r1, #42]
-vldrw.u32 Q7, [Q2, #140]
-strh r6, [r0,#+172]
-ldrh r6, [r0,#+184]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #44]
-strh r4, [r0,#+174]
-ldrh r4, [r0,#+186]
-vmladavax.s16 r4, Q0, Q7
-vldrh.u16 Q1, [r1, #46]
-strh r14, [r0,#+176]
-ldrh r14, [r0,#+188]
-vmladavax.s16 r14, Q1, Q7
-vldrh.u16 Q3, [r1, #48]
-strh r12, [r0,#+178]
-ldrh r12, [r0,#+190]
-vmladavax.s16 r12, Q3, Q7
-vldrh.u16 Q4, [r1, #50]
-strh r10, [r0,#+180]
-ldrh r10, [r0,#+192]
-vmladavax.s16 r10, Q4, Q7
-vldrh.u16 Q5, [r1, #52]
-strh r8, [r0,#+182]
-ldrh r8, [r0,#+194]
-vmladavax.s16 r8, Q5, Q7
-vldrh.u16 Q6, [r1, #54]
-strh r6, [r0,#+184]
-ldrh r6, [r0,#+196]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q7, [r1, #56]
-vldrw.u32 Q0, [Q2, #140]
-strh r4, [r0,#+186]
-ldrh r4, [r0,#+198]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #58]
-strh r14, [r0,#+188]
-ldrh r14, [r0,#+200]
-vmladavax.s16 r14, Q1, Q0
-vldrh.u16 Q3, [r1, #60]
-strh r12, [r0,#+190]
-ldrh r12, [r0,#+202]
-vmladavax.s16 r12, Q3, Q0
-vldrh.u16 Q4, [r1, #62]
-strh r10, [r0,#+192]
-ldrh r10, [r0,#+204]
-vmladavax.s16 r10, Q4, Q0
-vldrh.u16 Q5, [r1, #64]
-strh r8, [r0,#+194]
-ldrh r8, [r0,#+206]
-vmladavax.s16 r8, Q5, Q0
-vldrh.u16 Q6, [r1, #66]
-strh r6, [r0,#+196]
-ldrh r6, [r0,#+208]
-vmladavax.s16 r6, Q6, Q0
-vldrh.u16 Q7, [r1, #68]
-strh r4, [r0,#+198]
-ldrh r4, [r0,#+210]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q0, [r1, #70]
-vldrw.u32 Q1, [Q2, #140]
-strh r14, [r0,#+200]
-ldrh r14, [r0,#+212]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #72]
-strh r12, [r0,#+202]
-ldrh r12, [r0,#+214]
-vmladavax.s16 r12, Q3, Q1
-vldrh.u16 Q4, [r1, #74]
-strh r10, [r0,#+204]
-ldrh r10, [r0,#+216]
-vmladavax.s16 r10, Q4, Q1
-vldrh.u16 Q5, [r1, #76]
-strh r8, [r0,#+206]
-ldrh r8, [r0,#+218]
-vmladavax.s16 r8, Q5, Q1
-vldrh.u16 Q6, [r1, #78]
-strh r6, [r0,#+208]
-ldrh r6, [r0,#+220]
-vmladavax.s16 r6, Q6, Q1
-vldrh.u16 Q7, [r1, #80]
-strh r4, [r0,#+210]
-ldrh r4, [r0,#+222]
-vmladavax.s16 r4, Q7, Q1
-vldrh.u16 Q0, [r1, #82]
-strh r14, [r0,#+212]
-ldrh r14, [r0,#+224]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q1, [r1, #84]
-vldrw.u32 Q3, [Q2, #140]
-strh r12, [r0,#+214]
-ldrh r12, [r0,#+226]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #86]
-strh r10, [r0,#+216]
-ldrh r10, [r0,#+228]
-vmladavax.s16 r10, Q4, Q3
-vldrh.u16 Q5, [r1, #88]
-strh r8, [r0,#+218]
-ldrh r8, [r0,#+230]
-vmladavax.s16 r8, Q5, Q3
-vldrh.u16 Q6, [r1, #90]
-strh r6, [r0,#+220]
-ldrh r6, [r0,#+232]
-vmladavax.s16 r6, Q6, Q3
-vldrh.u16 Q7, [r1, #92]
-strh r4, [r0,#+222]
-ldrh r4, [r0,#+234]
-vmladavax.s16 r4, Q7, Q3
-vldrh.u16 Q0, [r1, #94]
-strh r14, [r0,#+224]
-ldrh r14, [r0,#+236]
-vmladavax.s16 r14, Q0, Q3
-vldrh.u16 Q1, [r1, #96]
-strh r12, [r0,#+226]
-ldrh r12, [r0,#+238]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q3, [r1, #98]
-vldrw.u32 Q4, [Q2, #140]
-strh r10, [r0,#+228]
-ldrh r10, [r0,#+240]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #100]
-strh r8, [r0,#+230]
-ldrh r8, [r0,#+242]
-vmladavax.s16 r8, Q5, Q4
-vldrh.u16 Q6, [r1, #102]
-strh r6, [r0,#+232]
-ldrh r6, [r0,#+244]
-vmladavax.s16 r6, Q6, Q4
-vldrh.u16 Q7, [r1, #104]
-strh r4, [r0,#+234]
-ldrh r4, [r0,#+246]
-vmladavax.s16 r4, Q7, Q4
-vldrh.u16 Q0, [r1, #106]
-strh r14, [r0,#+236]
-ldrh r14, [r0,#+248]
-vmladavax.s16 r14, Q0, Q4
-vldrh.u16 Q1, [r1, #108]
-strh r12, [r0,#+238]
-ldrh r12, [r0,#+250]
-vmladavax.s16 r12, Q1, Q4
-vldrh.u16 Q3, [r1, #110]
-strh r10, [r0,#+240]
-ldrh r10, [r0,#+252]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q4, [r1, #112]
-vldrw.u32 Q5, [Q2, #140]
-strh r8, [r0,#+242]
-ldrh r8, [r0,#+254]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #114]
-strh r6, [r0,#+244]
-ldrh r6, [r0,#+256]
-vmladavax.s16 r6, Q6, Q5
-vldrh.u16 Q7, [r1, #116]
-strh r4, [r0,#+246]
-ldrh r4, [r0,#+258]
-vmladavax.s16 r4, Q7, Q5
-vldrh.u16 Q0, [r1, #118]
-strh r14, [r0,#+248]
-ldrh r14, [r0,#+260]
-vmladavax.s16 r14, Q0, Q5
-vldrh.u16 Q1, [r1, #120]
-strh r12, [r0,#+250]
-ldrh r12, [r0,#+262]
-vmladavax.s16 r12, Q1, Q5
-vldrh.u16 Q3, [r1, #122]
-strh r10, [r0,#+252]
-ldrh r10, [r0,#+264]
-vmladavax.s16 r10, Q3, Q5
-vldrh.u16 Q4, [r1, #124]
-strh r8, [r0,#+254]
-ldrh r8, [r0,#+266]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q5, [r1, #126]
-vldrw.u32 Q6, [Q2, #140]
-strh r6, [r0,#+256]
-ldrh r6, [r0,#+268]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #128]
-strh r4, [r0,#+258]
-ldrh r4, [r0,#+270]
-vmladavax.s16 r4, Q7, Q6
-vldrh.u16 Q0, [r1, #130]
-strh r14, [r0,#+260]
-ldrh r14, [r0,#+272]
-vmladavax.s16 r14, Q0, Q6
-vldrh.u16 Q1, [r1, #132]
-strh r12, [r0,#+262]
-ldrh r12, [r0,#+274]
-vmladavax.s16 r12, Q1, Q6
-vldrh.u16 Q3, [r1, #134]
-strh r10, [r0,#+264]
-ldrh r10, [r0,#+276]
-vmladavax.s16 r10, Q3, Q6
-vldrh.u16 Q4, [r1, #136]
-strh r8, [r0,#+266]
-ldrh r8, [r0,#+278]
-vmladavax.s16 r8, Q4, Q6
-vldrh.u16 Q5, [r1, #138]
-strh r6, [r0,#+268]
-ldrh r6, [r0,#+280]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q6, [r1, #140]
-vldrw.u32 Q7, [Q2, #140]
-strh r4, [r0,#+270]
-ldrh r4, [r0,#+282]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #142]
-strh r14, [r0,#+272]
-ldrh r14, [r0,#+284]
-vmladavax.s16 r14, Q0, Q7
-vldrh.u16 Q1, [r1, #144]
-strh r12, [r0,#+274]
-ldrh r12, [r0,#+286]
-vmladavax.s16 r12, Q1, Q7
-vldrh.u16 Q3, [r1, #146]
-strh r10, [r0,#+276]
-ldrh r10, [r0,#+288]
-vmladavax.s16 r10, Q3, Q7
-vldrh.u16 Q4, [r1, #148]
-strh r8, [r0,#+278]
-ldrh r8, [r0,#+290]
-vmladavax.s16 r8, Q4, Q7
-vldrh.u16 Q5, [r1, #150]
-strh r6, [r0,#+280]
-ldrh r6, [r0,#+292]
-vmladavax.s16 r6, Q5, Q7
-vldrh.u16 Q6, [r1, #152]
-strh r4, [r0,#+282]
-ldrh r4, [r0,#+294]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q7, [r1, #154]
-vldrw.u32 Q0, [Q2, #140]
-strh r14, [r0,#+284]
-ldrh r14, [r0,#+296]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #156]
-strh r12, [r0,#+286]
-ldrh r12, [r0,#+298]
-vmladavax.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #158]
-strh r10, [r0,#+288]
-ldrh r10, [r0,#+300]
-vmladavax.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #160]
-strh r8, [r0,#+290]
-ldrh r8, [r0,#+302]
-vmladavax.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #162]
-strh r6, [r0,#+292]
-ldrh r6, [r0,#+304]
-vmladavax.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #164]
-strh r4, [r0,#+294]
-ldrh r4, [r0,#+306]
-vmladavax.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #166]
-strh r14, [r0,#+296]
-ldrh r14, [r0,#+308]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #168]
-vldrw.u32 Q1, [Q2, #140]
-strh r12, [r0,#+298]
-ldrh r12, [r0,#+310]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #170]
-strh r10, [r0,#+300]
-ldrh r10, [r0,#+312]
-vmladavax.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #172]
-strh r8, [r0,#+302]
-ldrh r8, [r0,#+314]
-vmladavax.s16 r8, Q4, Q1
-vldrh.u16 Q5, [r1, #174]
-strh r6, [r0,#+304]
-ldrh r6, [r0,#+316]
-vmladavax.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #176]
-strh r4, [r0,#+306]
-ldrh r4, [r0,#+318]
-vmladavax.s16 r4, Q6, Q1
-vldrh.u16 Q7, [r1, #178]
-strh r14, [r0,#+308]
-ldrh r14, [r0,#+320]
-vmladavax.s16 r14, Q7, Q1
-vldrh.u16 Q0, [r1, #180]
-strh r12, [r0,#+310]
-ldrh r12, [r0,#+322]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q1, [r1, #182]
-vldrw.u32 Q3, [Q2, #140]
-strh r10, [r0,#+312]
-ldrh r10, [r0,#+324]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #184]
-strh r8, [r0,#+314]
-ldrh r8, [r0,#+326]
-vmladavax.s16 r8, Q4, Q3
-vldrh.u16 Q5, [r1, #186]
-strh r6, [r0,#+316]
-ldrh r6, [r0,#+328]
-vmladavax.s16 r6, Q5, Q3
-vldrh.u16 Q6, [r1, #188]
-strh r4, [r0,#+318]
-ldrh r4, [r0,#+330]
-vmladavax.s16 r4, Q6, Q3
-vldrh.u16 Q7, [r1, #190]
-strh r14, [r0,#+320]
-ldrh r14, [r0,#+332]
-vmladavax.s16 r14, Q7, Q3
-vldrh.u16 Q0, [r1, #192]
-strh r12, [r0,#+322]
-ldrh r12, [r0,#+334]
-vmladavax.s16 r12, Q0, Q3
-vldrh.u16 Q1, [r1, #194]
-strh r10, [r0,#+324]
-ldrh r10, [r0,#+336]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q3, [r1, #196]
-vldrw.u32 Q4, [Q2, #140]
-strh r8, [r0,#+326]
-ldrh r8, [r0,#+338]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #198]
-strh r6, [r0,#+328]
-ldrh r6, [r0,#+340]
-vmladavax.s16 r6, Q5, Q4
-vldrh.u16 Q6, [r1, #200]
-strh r4, [r0,#+330]
-ldrh r4, [r0,#+342]
-vmladavax.s16 r4, Q6, Q4
-vldrh.u16 Q7, [r1, #202]
-strh r14, [r0,#+332]
-ldrh r14, [r0,#+344]
-vmladavax.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #204]
-strh r12, [r0,#+334]
-ldrh r12, [r0,#+346]
-vmladavax.s16 r12, Q0, Q4
-vldrh.u16 Q1, [r1, #206]
-strh r10,
[r0,#+336] -ldrh r10, [r0,#+348] -vmladavax.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #208] -strh r8, [r0,#+338] -ldrh r8, [r0,#+350] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #210] -vldrw.u32 Q5, [Q2, #140] -strh r6, [r0,#+340] -ldrh r6, [r0,#+352] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #212] -strh r4, [r0,#+342] -ldrh r4, [r0,#+354] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #214] -strh r14, [r0,#+344] -ldrh r14, [r0,#+356] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #216] -strh r12, [r0,#+346] -ldrh r12, [r0,#+358] -vmladavax.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #218] -strh r10, [r0,#+348] -ldrh r10, [r0,#+360] -vmladavax.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #220] -strh r8, [r0,#+350] -ldrh r8, [r0,#+362] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #222] -strh r6, [r0,#+352] -ldrh r6, [r0,#+364] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q5, [r1, #224] -vldrw.u32 Q6, [Q2, #140] -strh r4, [r0,#+354] -ldrh r4, [r0,#+366] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #226] -strh r14, [r0,#+356] -ldrh r14, [r0,#+368] -vmladavax.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #228] -strh r12, [r0,#+358] -ldrh r12, [r0,#+370] -vmladavax.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #230] -strh r10, [r0,#+360] -ldrh r10, [r0,#+372] -vmladavax.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #232] -strh r8, [r0,#+362] -ldrh r8, [r0,#+374] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #234] -strh r6, [r0,#+364] -ldrh r6, [r0,#+376] -vmladavax.s16 r6, Q4, Q6 -vldrh.u16 Q5, [r1, #236] -strh r4, [r0,#+366] -ldrh r4, [r0,#+378] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q6, [r1, #238] -vldrw.u32 Q7, [Q2, #140] -strh r14, [r0,#+368] -ldrh r14, [r0,#+380] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #240] -strh r12, [r0,#+370] -vmladavx.s16 r12, Q0, Q7 -vldrh.u16 Q1, [r1, #242] -strh r10, [r0,#+372] -vmladavx.s16 r10, Q1, Q7 -vldrh.u16 Q3, [r1, #244] -strh r8, [r0,#+374] -vmladavx.s16 r8, Q3, Q7 -vldrh.u16 Q4, [r1, #246] -strh r6, [r0,#+376] -vmladavx.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #248] -strh r4, 
[r0,#+378] -vmladavx.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #250] -strh r14, [r0,#+380] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #252] -vldrw.u32 Q0, [Q2, #140] -strh r12, [r0,#+382] -vmladavx.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #254] -strh r10, [r0,#+384] -vmladavx.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #-14] -vldrw.u32 Q4, [Q2, #156] -strh r8, [r0,#+386] -ldrh r8, [r0,#+144] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -strh r6, [r0,#+388] -ldrh r6, [r0,#+146] -vmladavax.s16 r6, Q5, Q4 -vldrh.u16 Q6, [r1, #-10] -strh r4, [r0,#+390] -ldrh r4, [r0,#+148] -vmladavax.s16 r4, Q6, Q4 -vldrh.u16 Q7, [r1, #-8] -strh r14, [r0,#+392] -ldrh r14, [r0,#+150] -vmladavax.s16 r14, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r12, [r0,#+394] -ldrh r12, [r0,#+152] -vmladavax.s16 r12, Q0, Q4 -vldrh.u16 Q1, [r1, #-4] -strh r10, [r0,#+396] -ldrh r10, [r0,#+154] -vmladavax.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #-2] -strh r8, [r0,#+144] -ldrh r8, [r0,#+156] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #0] -vldrw.u32 Q5, [Q2, #156] -strh r6, [r0,#+146] -ldrh r6, [r0,#+158] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -strh r4, [r0,#+148] -ldrh r4, [r0,#+160] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #4] -strh r14, [r0,#+150] -ldrh r14, [r0,#+162] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #6] -strh r12, [r0,#+152] -ldrh r12, [r0,#+164] -vmladavax.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #8] -strh r10, [r0,#+154] -ldrh r10, [r0,#+166] -vmladavax.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #10] -strh r8, [r0,#+156] -ldrh r8, [r0,#+168] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #12] -strh r6, [r0,#+158] -ldrh r6, [r0,#+170] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #156] -strh r4, [r0,#+160] -ldrh r4, [r0,#+172] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -strh r14, [r0,#+162] -ldrh r14, [r0,#+174] -vmladavax.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #18] -strh r12, [r0,#+164] -ldrh r12, [r0,#+176] -vmladavax.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #20] -strh 
r10, [r0,#+166] -ldrh r10, [r0,#+178] -vmladavax.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #22] -strh r8, [r0,#+168] -ldrh r8, [r0,#+180] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #24] -strh r6, [r0,#+170] -ldrh r6, [r0,#+182] -vmladavax.s16 r6, Q4, Q6 -vldrh.u16 Q5, [r1, #26] -strh r4, [r0,#+172] -ldrh r4, [r0,#+184] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q6, [r1, #28] -vldrw.u32 Q7, [Q2, #156] -strh r14, [r0,#+174] -ldrh r14, [r0,#+186] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #30] -strh r12, [r0,#+176] -ldrh r12, [r0,#+188] -vmladavax.s16 r12, Q0, Q7 -vldrh.u16 Q1, [r1, #32] -strh r10, [r0,#+178] -ldrh r10, [r0,#+190] -vmladavax.s16 r10, Q1, Q7 -vldrh.u16 Q3, [r1, #34] -strh r8, [r0,#+180] -ldrh r8, [r0,#+192] -vmladavax.s16 r8, Q3, Q7 -vldrh.u16 Q4, [r1, #36] -strh r6, [r0,#+182] -ldrh r6, [r0,#+194] -vmladavax.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #38] -strh r4, [r0,#+184] -ldrh r4, [r0,#+196] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #40] -strh r14, [r0,#+186] -ldrh r14, [r0,#+198] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #42] -vldrw.u32 Q0, [Q2, #156] -strh r12, [r0,#+188] -ldrh r12, [r0,#+200] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #44] -strh r10, [r0,#+190] -ldrh r10, [r0,#+202] -vmladavax.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #46] -strh r8, [r0,#+192] -ldrh r8, [r0,#+204] -vmladavax.s16 r8, Q3, Q0 -vldrh.u16 Q4, [r1, #48] -strh r6, [r0,#+194] -ldrh r6, [r0,#+206] -vmladavax.s16 r6, Q4, Q0 -vldrh.u16 Q5, [r1, #50] -strh r4, [r0,#+196] -ldrh r4, [r0,#+208] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #52] -strh r14, [r0,#+198] -ldrh r14, [r0,#+210] -vmladavax.s16 r14, Q6, Q0 -vldrh.u16 Q7, [r1, #54] -strh r12, [r0,#+200] -ldrh r12, [r0,#+212] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #56] -vldrw.u32 Q1, [Q2, #156] -strh r10, [r0,#+202] -ldrh r10, [r0,#+214] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -strh r8, [r0,#+204] -ldrh r8, [r0,#+216] -vmladavax.s16 r8, Q3, Q1 -vldrh.u16 Q4, [r1, #60] -strh r6, [r0,#+206] -ldrh r6, 
[r0,#+218] -vmladavax.s16 r6, Q4, Q1 -vldrh.u16 Q5, [r1, #62] -strh r4, [r0,#+208] -ldrh r4, [r0,#+220] -vmladavax.s16 r4, Q5, Q1 -vldrh.u16 Q6, [r1, #64] -strh r14, [r0,#+210] -ldrh r14, [r0,#+222] -vmladavax.s16 r14, Q6, Q1 -vldrh.u16 Q7, [r1, #66] -strh r12, [r0,#+212] -ldrh r12, [r0,#+224] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #68] -strh r10, [r0,#+214] -ldrh r10, [r0,#+226] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q1, [r1, #70] -vldrw.u32 Q3, [Q2, #156] -strh r8, [r0,#+216] -ldrh r8, [r0,#+228] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #72] -strh r6, [r0,#+218] -ldrh r6, [r0,#+230] -vmladavax.s16 r6, Q4, Q3 -vldrh.u16 Q5, [r1, #74] -strh r4, [r0,#+220] -ldrh r4, [r0,#+232] -vmladavax.s16 r4, Q5, Q3 -vldrh.u16 Q6, [r1, #76] -strh r14, [r0,#+222] -ldrh r14, [r0,#+234] -vmladavax.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #78] -strh r12, [r0,#+224] -ldrh r12, [r0,#+236] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #80] -strh r10, [r0,#+226] -ldrh r10, [r0,#+238] -vmladavax.s16 r10, Q0, Q3 -vldrh.u16 Q1, [r1, #82] -strh r8, [r0,#+228] -ldrh r8, [r0,#+240] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q3, [r1, #84] -vldrw.u32 Q4, [Q2, #156] -strh r6, [r0,#+230] -ldrh r6, [r0,#+242] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #86] -strh r4, [r0,#+232] -ldrh r4, [r0,#+244] -vmladavax.s16 r4, Q5, Q4 -vldrh.u16 Q6, [r1, #88] -strh r14, [r0,#+234] -ldrh r14, [r0,#+246] -vmladavax.s16 r14, Q6, Q4 -vldrh.u16 Q7, [r1, #90] -strh r12, [r0,#+236] -ldrh r12, [r0,#+248] -vmladavax.s16 r12, Q7, Q4 -vldrh.u16 Q0, [r1, #92] -strh r10, [r0,#+238] -ldrh r10, [r0,#+250] -vmladavax.s16 r10, Q0, Q4 -vldrh.u16 Q1, [r1, #94] -strh r8, [r0,#+240] -ldrh r8, [r0,#+252] -vmladavax.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #96] -strh r6, [r0,#+242] -ldrh r6, [r0,#+254] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q4, [r1, #98] -vldrw.u32 Q5, [Q2, #156] -strh r4, [r0,#+244] -ldrh r4, [r0,#+256] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #100] -strh r14, [r0,#+246] -ldrh r14, [r0,#+258] -vmladavax.s16 r14, 
Q6, Q5 -vldrh.u16 Q7, [r1, #102] -strh r12, [r0,#+248] -ldrh r12, [r0,#+260] -vmladavax.s16 r12, Q7, Q5 -vldrh.u16 Q0, [r1, #104] -strh r10, [r0,#+250] -ldrh r10, [r0,#+262] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #106] -strh r8, [r0,#+252] -ldrh r8, [r0,#+264] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #108] -strh r6, [r0,#+254] -ldrh r6, [r0,#+266] -vmladavax.s16 r6, Q3, Q5 -vldrh.u16 Q4, [r1, #110] -strh r4, [r0,#+256] -ldrh r4, [r0,#+268] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q5, [r1, #112] -vldrw.u32 Q6, [Q2, #156] -strh r14, [r0,#+258] -ldrh r14, [r0,#+270] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #114] -strh r12, [r0,#+260] -ldrh r12, [r0,#+272] -vmladavax.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #116] -strh r10, [r0,#+262] -ldrh r10, [r0,#+274] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #118] -strh r8, [r0,#+264] -ldrh r8, [r0,#+276] -vmladavax.s16 r8, Q1, Q6 -vldrh.u16 Q3, [r1, #120] -strh r6, [r0,#+266] -ldrh r6, [r0,#+278] -vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #122] -strh r4, [r0,#+268] -ldrh r4, [r0,#+280] -vmladavax.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #124] -strh r14, [r0,#+270] -ldrh r14, [r0,#+282] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #126] -vldrw.u32 Q7, [Q2, #156] -strh r12, [r0,#+272] -ldrh r12, [r0,#+284] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #128] -strh r10, [r0,#+274] -ldrh r10, [r0,#+286] -vmladavax.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #130] -strh r8, [r0,#+276] -ldrh r8, [r0,#+288] -vmladavax.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #132] -strh r6, [r0,#+278] -ldrh r6, [r0,#+290] -vmladavax.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #134] -strh r4, [r0,#+280] -ldrh r4, [r0,#+292] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #136] -strh r14, [r0,#+282] -ldrh r14, [r0,#+294] -vmladavax.s16 r14, Q5, Q7 -vldrh.u16 Q6, [r1, #138] -strh r12, [r0,#+284] -ldrh r12, [r0,#+296] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #140] -vldrw.u32 Q0, [Q2, #156] -strh r10, [r0,#+286] -ldrh r10, [r0,#+298] -vmladavax.s16 r10, Q7, Q0 
-vldrh.u16 Q1, [r1, #142] -strh r8, [r0,#+288] -ldrh r8, [r0,#+300] -vmladavax.s16 r8, Q1, Q0 -vldrh.u16 Q3, [r1, #144] -strh r6, [r0,#+290] -ldrh r6, [r0,#+302] -vmladavax.s16 r6, Q3, Q0 -vldrh.u16 Q4, [r1, #146] -strh r4, [r0,#+292] -ldrh r4, [r0,#+304] -vmladavax.s16 r4, Q4, Q0 -vldrh.u16 Q5, [r1, #148] -strh r14, [r0,#+294] -ldrh r14, [r0,#+306] -vmladavax.s16 r14, Q5, Q0 -vldrh.u16 Q6, [r1, #150] -strh r12, [r0,#+296] -ldrh r12, [r0,#+308] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #152] -strh r10, [r0,#+298] -ldrh r10, [r0,#+310] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #154] -vldrw.u32 Q1, [Q2, #156] -strh r8, [r0,#+300] -ldrh r8, [r0,#+312] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #156] -strh r6, [r0,#+302] -ldrh r6, [r0,#+314] -vmladavax.s16 r6, Q3, Q1 -vldrh.u16 Q4, [r1, #158] -strh r4, [r0,#+304] -ldrh r4, [r0,#+316] -vmladavax.s16 r4, Q4, Q1 -vldrh.u16 Q5, [r1, #160] -strh r14, [r0,#+306] -ldrh r14, [r0,#+318] -vmladavax.s16 r14, Q5, Q1 -vldrh.u16 Q6, [r1, #162] -strh r12, [r0,#+308] -ldrh r12, [r0,#+320] -vmladavax.s16 r12, Q6, Q1 -vldrh.u16 Q7, [r1, #164] -strh r10, [r0,#+310] -ldrh r10, [r0,#+322] -vmladavax.s16 r10, Q7, Q1 -vldrh.u16 Q0, [r1, #166] -strh r8, [r0,#+312] -ldrh r8, [r0,#+324] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q1, [r1, #168] -vldrw.u32 Q3, [Q2, #156] -strh r6, [r0,#+314] -ldrh r6, [r0,#+326] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #170] -strh r4, [r0,#+316] -ldrh r4, [r0,#+328] -vmladavax.s16 r4, Q4, Q3 -vldrh.u16 Q5, [r1, #172] -strh r14, [r0,#+318] -ldrh r14, [r0,#+330] -vmladavax.s16 r14, Q5, Q3 -vldrh.u16 Q6, [r1, #174] -strh r12, [r0,#+320] -ldrh r12, [r0,#+332] -vmladavax.s16 r12, Q6, Q3 -vldrh.u16 Q7, [r1, #176] -strh r10, [r0,#+322] -ldrh r10, [r0,#+334] -vmladavax.s16 r10, Q7, Q3 -vldrh.u16 Q0, [r1, #178] -strh r8, [r0,#+324] -ldrh r8, [r0,#+336] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #180] -strh r6, [r0,#+326] -ldrh r6, [r0,#+338] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q3, [r1, #182] -vldrw.u32 Q4, 
[Q2, #156] -strh r4, [r0,#+328] -ldrh r4, [r0,#+340] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #184] -strh r14, [r0,#+330] -ldrh r14, [r0,#+342] -vmladavax.s16 r14, Q5, Q4 -vldrh.u16 Q6, [r1, #186] -strh r12, [r0,#+332] -ldrh r12, [r0,#+344] -vmladavax.s16 r12, Q6, Q4 -vldrh.u16 Q7, [r1, #188] -strh r10, [r0,#+334] -ldrh r10, [r0,#+346] -vmladavax.s16 r10, Q7, Q4 -vldrh.u16 Q0, [r1, #190] -strh r8, [r0,#+336] -ldrh r8, [r0,#+348] -vmladavax.s16 r8, Q0, Q4 -vldrh.u16 Q1, [r1, #192] -strh r6, [r0,#+338] -ldrh r6, [r0,#+350] -vmladavax.s16 r6, Q1, Q4 -vldrh.u16 Q3, [r1, #194] -strh r4, [r0,#+340] -ldrh r4, [r0,#+352] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q4, [r1, #196] -vldrw.u32 Q5, [Q2, #156] -strh r14, [r0,#+342] -ldrh r14, [r0,#+354] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #198] -strh r12, [r0,#+344] -ldrh r12, [r0,#+356] -vmladavax.s16 r12, Q6, Q5 -vldrh.u16 Q7, [r1, #200] -strh r10, [r0,#+346] -ldrh r10, [r0,#+358] -vmladavax.s16 r10, Q7, Q5 -vldrh.u16 Q0, [r1, #202] -strh r8, [r0,#+348] -ldrh r8, [r0,#+360] -vmladavax.s16 r8, Q0, Q5 -vldrh.u16 Q1, [r1, #204] -strh r6, [r0,#+350] -ldrh r6, [r0,#+362] -vmladavax.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #206] -strh r4, [r0,#+352] -ldrh r4, [r0,#+364] -vmladavax.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #208] -strh r14, [r0,#+354] -ldrh r14, [r0,#+366] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #210] -vldrw.u32 Q6, [Q2, #156] -strh r12, [r0,#+356] -ldrh r12, [r0,#+368] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #212] -strh r10, [r0,#+358] -ldrh r10, [r0,#+370] -vmladavax.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #214] -strh r8, [r0,#+360] -ldrh r8, [r0,#+372] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #216] -strh r6, [r0,#+362] -ldrh r6, [r0,#+374] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #218] -strh r4, [r0,#+364] -ldrh r4, [r0,#+376] -vmladavax.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #220] -strh r14, [r0,#+366] -ldrh r14, [r0,#+378] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #222] -strh r12, [r0,#+368] 
-ldrh r12, [r0,#+380] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #224] -vldrw.u32 Q7, [Q2, #156] -strh r10, [r0,#+370] -ldrh r10, [r0,#+382] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #226] -strh r8, [r0,#+372] -ldrh r8, [r0,#+384] -vmladavax.s16 r8, Q0, Q7 -vldrh.u16 Q1, [r1, #228] -strh r6, [r0,#+374] -ldrh r6, [r0,#+386] -vmladavax.s16 r6, Q1, Q7 -vldrh.u16 Q3, [r1, #230] -strh r4, [r0,#+376] -ldrh r4, [r0,#+388] -vmladavax.s16 r4, Q3, Q7 -vldrh.u16 Q4, [r1, #232] -strh r14, [r0,#+378] -ldrh r14, [r0,#+390] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #234] -strh r12, [r0,#+380] -ldrh r12, [r0,#+392] -vmladavax.s16 r12, Q5, Q7 -vldrh.u16 Q6, [r1, #236] -strh r10, [r0,#+382] -ldrh r10, [r0,#+394] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q7, [r1, #238] -vldrw.u32 Q0, [Q2, #156] -strh r8, [r0,#+384] -ldrh r8, [r0,#+396] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #240] -strh r6, [r0,#+386] -vmladavx.s16 r6, Q1, Q0 -vldrh.u16 Q3, [r1, #242] -strh r4, [r0,#+388] -vmladavx.s16 r4, Q3, Q0 -vldrh.u16 Q4, [r1, #244] -strh r14, [r0,#+390] -vmladavx.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #246] -strh r12, [r0,#+392] -vmladavx.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #248] -strh r10, [r0,#+394] -vmladavx.s16 r10, Q6, Q0 -vldrh.u16 Q7, [r1, #250] -strh r8, [r0,#+396] -vmladavx.s16 r8, Q7, Q0 -vldrh.u16 Q0, [r1, #252] -vldrw.u32 Q1, [Q2, #156] -strh r6, [r0,#+398] -vmladavx.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #254] -strh r4, [r0,#+400] -vmladavx.s16 r4, Q3, Q1 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #172] -strh r14, [r0,#+402] -ldrh r14, [r0,#+160] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #-12] -strh r12, [r0,#+404] -ldrh r12, [r0,#+162] -vmladavax.s16 r12, Q6, Q5 -vldrh.u16 Q7, [r1, #-10] -strh r10, [r0,#+406] -ldrh r10, [r0,#+164] -vmladavax.s16 r10, Q7, Q5 -vldrh.u16 Q0, [r1, #-8] -strh r8, [r0,#+408] -ldrh r8, [r0,#+166] -vmladavax.s16 r8, Q0, Q5 -vldrh.u16 Q1, [r1, #-6] -strh r6, [r0,#+410] -ldrh r6, [r0,#+168] -vmladavax.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, 
#-4] -strh r4, [r0,#+412] -ldrh r4, [r0,#+170] -vmladavax.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #-2] -strh r14, [r0,#+160] -ldrh r14, [r0,#+172] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #172] -strh r12, [r0,#+162] -ldrh r12, [r0,#+174] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -strh r10, [r0,#+164] -ldrh r10, [r0,#+176] -vmladavax.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #4] -strh r8, [r0,#+166] -ldrh r8, [r0,#+178] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #6] -strh r6, [r0,#+168] -ldrh r6, [r0,#+180] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #8] -strh r4, [r0,#+170] -ldrh r4, [r0,#+182] -vmladavax.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #10] -strh r14, [r0,#+172] -ldrh r14, [r0,#+184] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #12] -strh r12, [r0,#+174] -ldrh r12, [r0,#+186] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #14] -vldrw.u32 Q7, [Q2, #172] -strh r10, [r0,#+176] -ldrh r10, [r0,#+188] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #16] -strh r8, [r0,#+178] -ldrh r8, [r0,#+190] -vmladavax.s16 r8, Q0, Q7 -vldrh.u16 Q1, [r1, #18] -strh r6, [r0,#+180] -ldrh r6, [r0,#+192] -vmladavax.s16 r6, Q1, Q7 -vldrh.u16 Q3, [r1, #20] -strh r4, [r0,#+182] -ldrh r4, [r0,#+194] -vmladavax.s16 r4, Q3, Q7 -vldrh.u16 Q4, [r1, #22] -strh r14, [r0,#+184] -ldrh r14, [r0,#+196] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #24] -strh r12, [r0,#+186] -ldrh r12, [r0,#+198] -vmladavax.s16 r12, Q5, Q7 -vldrh.u16 Q6, [r1, #26] -strh r10, [r0,#+188] -ldrh r10, [r0,#+200] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q7, [r1, #28] -vldrw.u32 Q0, [Q2, #172] -strh r8, [r0,#+190] -ldrh r8, [r0,#+202] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -strh r6, [r0,#+192] -ldrh r6, [r0,#+204] -vmladavax.s16 r6, Q1, Q0 -vldrh.u16 Q3, [r1, #32] -strh r4, [r0,#+194] -ldrh r4, [r0,#+206] -vmladavax.s16 r4, Q3, Q0 -vldrh.u16 Q4, [r1, #34] -strh r14, [r0,#+196] -ldrh r14, [r0,#+208] -vmladavax.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #36] -strh r12, [r0,#+198] -ldrh 
r12, [r0,#+210] -vmladavax.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #38] -strh r10, [r0,#+200] -ldrh r10, [r0,#+212] -vmladavax.s16 r10, Q6, Q0 -vldrh.u16 Q7, [r1, #40] -strh r8, [r0,#+202] -ldrh r8, [r0,#+214] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #172] -strh r6, [r0,#+204] -ldrh r6, [r0,#+216] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #44] -strh r4, [r0,#+206] -ldrh r4, [r0,#+218] -vmladavax.s16 r4, Q3, Q1 -vldrh.u16 Q4, [r1, #46] -strh r14, [r0,#+208] -ldrh r14, [r0,#+220] -vmladavax.s16 r14, Q4, Q1 -vldrh.u16 Q5, [r1, #48] -strh r12, [r0,#+210] -ldrh r12, [r0,#+222] -vmladavax.s16 r12, Q5, Q1 -vldrh.u16 Q6, [r1, #50] -strh r10, [r0,#+212] -ldrh r10, [r0,#+224] -vmladavax.s16 r10, Q6, Q1 -vldrh.u16 Q7, [r1, #52] -strh r8, [r0,#+214] -ldrh r8, [r0,#+226] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #54] -strh r6, [r0,#+216] -ldrh r6, [r0,#+228] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #172] -strh r4, [r0,#+218] -ldrh r4, [r0,#+230] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #58] -strh r14, [r0,#+220] -ldrh r14, [r0,#+232] -vmladavax.s16 r14, Q4, Q3 -vldrh.u16 Q5, [r1, #60] -strh r12, [r0,#+222] -ldrh r12, [r0,#+234] -vmladavax.s16 r12, Q5, Q3 -vldrh.u16 Q6, [r1, #62] -strh r10, [r0,#+224] -ldrh r10, [r0,#+236] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #64] -strh r8, [r0,#+226] -ldrh r8, [r0,#+238] -vmladavax.s16 r8, Q7, Q3 -vldrh.u16 Q0, [r1, #66] -strh r6, [r0,#+228] -ldrh r6, [r0,#+240] -vmladavax.s16 r6, Q0, Q3 -vldrh.u16 Q1, [r1, #68] -strh r4, [r0,#+230] -ldrh r4, [r0,#+242] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #70] -vldrw.u32 Q4, [Q2, #172] -strh r14, [r0,#+232] -ldrh r14, [r0,#+244] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #72] -strh r12, [r0,#+234] -ldrh r12, [r0,#+246] -vmladavax.s16 r12, Q5, Q4 -vldrh.u16 Q6, [r1, #74] -strh r10, [r0,#+236] -ldrh r10, [r0,#+248] -vmladavax.s16 r10, Q6, Q4 -vldrh.u16 Q7, [r1, #76] -strh r8, [r0,#+238] -ldrh r8, [r0,#+250] -vmladavax.s16 
r8, Q7, Q4 -vldrh.u16 Q0, [r1, #78] -strh r6, [r0,#+240] -ldrh r6, [r0,#+252] -vmladavax.s16 r6, Q0, Q4 -vldrh.u16 Q1, [r1, #80] -strh r4, [r0,#+242] -ldrh r4, [r0,#+254] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #82] -strh r14, [r0,#+244] -ldrh r14, [r0,#+256] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #84] -vldrw.u32 Q5, [Q2, #172] -strh r12, [r0,#+246] -ldrh r12, [r0,#+258] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #86] -strh r10, [r0,#+248] -ldrh r10, [r0,#+260] -vmladavax.s16 r10, Q6, Q5 -vldrh.u16 Q7, [r1, #88] -strh r8, [r0,#+250] -ldrh r8, [r0,#+262] -vmladavax.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #90] -strh r6, [r0,#+252] -ldrh r6, [r0,#+264] -vmladavax.s16 r6, Q0, Q5 -vldrh.u16 Q1, [r1, #92] -strh r4, [r0,#+254] -ldrh r4, [r0,#+266] -vmladavax.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #94] -strh r14, [r0,#+256] -ldrh r14, [r0,#+268] -vmladavax.s16 r14, Q3, Q5 -vldrh.u16 Q4, [r1, #96] -strh r12, [r0,#+258] -ldrh r12, [r0,#+270] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q5, [r1, #98] -vldrw.u32 Q6, [Q2, #172] -strh r10, [r0,#+260] -ldrh r10, [r0,#+272] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #100] -strh r8, [r0,#+262] -ldrh r8, [r0,#+274] -vmladavax.s16 r8, Q7, Q6 -vldrh.u16 Q0, [r1, #102] -strh r6, [r0,#+264] -ldrh r6, [r0,#+276] -vmladavax.s16 r6, Q0, Q6 -vldrh.u16 Q1, [r1, #104] -strh r4, [r0,#+266] -ldrh r4, [r0,#+278] -vmladavax.s16 r4, Q1, Q6 -vldrh.u16 Q3, [r1, #106] -strh r14, [r0,#+268] -ldrh r14, [r0,#+280] -vmladavax.s16 r14, Q3, Q6 -vldrh.u16 Q4, [r1, #108] -strh r12, [r0,#+270] -ldrh r12, [r0,#+282] -vmladavax.s16 r12, Q4, Q6 -vldrh.u16 Q5, [r1, #110] -strh r10, [r0,#+272] -ldrh r10, [r0,#+284] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q6, [r1, #112] -vldrw.u32 Q7, [Q2, #172] -strh r8, [r0,#+274] -ldrh r8, [r0,#+286] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #114] -strh r6, [r0,#+276] -ldrh r6, [r0,#+288] -vmladavax.s16 r6, Q0, Q7 -vldrh.u16 Q1, [r1, #116] -strh r4, [r0,#+278] -ldrh r4, [r0,#+290] -vmladavax.s16 r4, Q1, Q7 -vldrh.u16 Q3, 
[r1, #118] -strh r14, [r0,#+280] -ldrh r14, [r0,#+292] -vmladavax.s16 r14, Q3, Q7 -vldrh.u16 Q4, [r1, #120] -strh r12, [r0,#+282] -ldrh r12, [r0,#+294] -vmladavax.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #122] -strh r10, [r0,#+284] -ldrh r10, [r0,#+296] -vmladavax.s16 r10, Q5, Q7 -vldrh.u16 Q6, [r1, #124] -strh r8, [r0,#+286] -ldrh r8, [r0,#+298] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #172] -strh r6, [r0,#+288] -ldrh r6, [r0,#+300] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #128] -strh r4, [r0,#+290] -ldrh r4, [r0,#+302] -vmladavax.s16 r4, Q1, Q0 -vldrh.u16 Q3, [r1, #130] -strh r14, [r0,#+292] -ldrh r14, [r0,#+304] -vmladavax.s16 r14, Q3, Q0 -vldrh.u16 Q4, [r1, #132] -strh r12, [r0,#+294] -ldrh r12, [r0,#+306] -vmladavax.s16 r12, Q4, Q0 -vldrh.u16 Q5, [r1, #134] -strh r10, [r0,#+296] -ldrh r10, [r0,#+308] -vmladavax.s16 r10, Q5, Q0 -vldrh.u16 Q6, [r1, #136] -strh r8, [r0,#+298] -ldrh r8, [r0,#+310] -vmladavax.s16 r8, Q6, Q0 -vldrh.u16 Q7, [r1, #138] -strh r6, [r0,#+300] -ldrh r6, [r0,#+312] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q0, [r1, #140] -vldrw.u32 Q1, [Q2, #172] -strh r4, [r0,#+302] -ldrh r4, [r0,#+314] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #142] -strh r14, [r0,#+304] -ldrh r14, [r0,#+316] -vmladavax.s16 r14, Q3, Q1 -vldrh.u16 Q4, [r1, #144] -strh r12, [r0,#+306] -ldrh r12, [r0,#+318] -vmladavax.s16 r12, Q4, Q1 -vldrh.u16 Q5, [r1, #146] -strh r10, [r0,#+308] -ldrh r10, [r0,#+320] -vmladavax.s16 r10, Q5, Q1 -vldrh.u16 Q6, [r1, #148] -strh r8, [r0,#+310] -ldrh r8, [r0,#+322] -vmladavax.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #150] -strh r6, [r0,#+312] -ldrh r6, [r0,#+324] -vmladavax.s16 r6, Q7, Q1 -vldrh.u16 Q0, [r1, #152] -strh r4, [r0,#+314] -ldrh r4, [r0,#+326] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #154] -vldrw.u32 Q3, [Q2, #172] -strh r14, [r0,#+316] -ldrh r14, [r0,#+328] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #156] -strh r12, [r0,#+318] -ldrh r12, [r0,#+330] -vmladavax.s16 r12, Q4, Q3 -vldrh.u16 Q5, [r1, 
#158] -strh r10, [r0,#+320] -ldrh r10, [r0,#+332] -vmladavax.s16 r10, Q5, Q3 -vldrh.u16 Q6, [r1, #160] -strh r8, [r0,#+322] -ldrh r8, [r0,#+334] -vmladavax.s16 r8, Q6, Q3 -vldrh.u16 Q7, [r1, #162] -strh r6, [r0,#+324] -ldrh r6, [r0,#+336] -vmladavax.s16 r6, Q7, Q3 -vldrh.u16 Q0, [r1, #164] -strh r4, [r0,#+326] -ldrh r4, [r0,#+338] -vmladavax.s16 r4, Q0, Q3 -vldrh.u16 Q1, [r1, #166] -strh r14, [r0,#+328] -ldrh r14, [r0,#+340] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #168] -vldrw.u32 Q4, [Q2, #172] -strh r12, [r0,#+330] -ldrh r12, [r0,#+342] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #170] -strh r10, [r0,#+332] -ldrh r10, [r0,#+344] -vmladavax.s16 r10, Q5, Q4 -vldrh.u16 Q6, [r1, #172] -strh r8, [r0,#+334] -ldrh r8, [r0,#+346] -vmladavax.s16 r8, Q6, Q4 -vldrh.u16 Q7, [r1, #174] -strh r6, [r0,#+336] -ldrh r6, [r0,#+348] -vmladavax.s16 r6, Q7, Q4 -vldrh.u16 Q0, [r1, #176] -strh r4, [r0,#+338] -ldrh r4, [r0,#+350] -vmladavax.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #178] -strh r14, [r0,#+340] -ldrh r14, [r0,#+352] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #180] -strh r12, [r0,#+342] -ldrh r12, [r0,#+354] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q4, [r1, #182] -vldrw.u32 Q5, [Q2, #172] -strh r10, [r0,#+344] -ldrh r10, [r0,#+356] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #184] -strh r8, [r0,#+346] -ldrh r8, [r0,#+358] -vmladavax.s16 r8, Q6, Q5 -vldrh.u16 Q7, [r1, #186] -strh r6, [r0,#+348] -ldrh r6, [r0,#+360] -vmladavax.s16 r6, Q7, Q5 -vldrh.u16 Q0, [r1, #188] -strh r4, [r0,#+350] -ldrh r4, [r0,#+362] -vmladavax.s16 r4, Q0, Q5 -vldrh.u16 Q1, [r1, #190] -strh r14, [r0,#+352] -ldrh r14, [r0,#+364] -vmladavax.s16 r14, Q1, Q5 -vldrh.u16 Q3, [r1, #192] -strh r12, [r0,#+354] -ldrh r12, [r0,#+366] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #194] -strh r10, [r0,#+356] -ldrh r10, [r0,#+368] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q5, [r1, #196] -vldrw.u32 Q6, [Q2, #172] -strh r8, [r0,#+358] -ldrh r8, [r0,#+370] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #198] -strh 
r6, [r0,#+360]
-ldrh r6, [r0,#+372]
-vmladavax.s16 r6, Q7, Q6
-vldrh.u16 Q0, [r1, #200]
-strh r4, [r0,#+362]
-ldrh r4, [r0,#+374]
-vmladavax.s16 r4, Q0, Q6
-vldrh.u16 Q1, [r1, #202]
-strh r14, [r0,#+364]
-ldrh r14, [r0,#+376]
-vmladavax.s16 r14, Q1, Q6
-vldrh.u16 Q3, [r1, #204]
-strh r12, [r0,#+366]
-ldrh r12, [r0,#+378]
-vmladavax.s16 r12, Q3, Q6
-vldrh.u16 Q4, [r1, #206]
-strh r10, [r0,#+368]
-ldrh r10, [r0,#+380]
-vmladavax.s16 r10, Q4, Q6
-vldrh.u16 Q5, [r1, #208]
-strh r8, [r0,#+370]
-ldrh r8, [r0,#+382]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q6, [r1, #210]
-vldrw.u32 Q7, [Q2, #172]
-strh r6, [r0,#+372]
-ldrh r6, [r0,#+384]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #212]
-strh r4, [r0,#+374]
-ldrh r4, [r0,#+386]
-vmladavax.s16 r4, Q0, Q7
-vldrh.u16 Q1, [r1, #214]
-strh r14, [r0,#+376]
-ldrh r14, [r0,#+388]
-vmladavax.s16 r14, Q1, Q7
-vldrh.u16 Q3, [r1, #216]
-strh r12, [r0,#+378]
-ldrh r12, [r0,#+390]
-vmladavax.s16 r12, Q3, Q7
-vldrh.u16 Q4, [r1, #218]
-strh r10, [r0,#+380]
-ldrh r10, [r0,#+392]
-vmladavax.s16 r10, Q4, Q7
-vldrh.u16 Q5, [r1, #220]
-strh r8, [r0,#+382]
-ldrh r8, [r0,#+394]
-vmladavax.s16 r8, Q5, Q7
-vldrh.u16 Q6, [r1, #222]
-strh r6, [r0,#+384]
-ldrh r6, [r0,#+396]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q7, [r1, #224]
-vldrw.u32 Q0, [Q2, #172]
-strh r4, [r0,#+386]
-ldrh r4, [r0,#+398]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #226]
-strh r14, [r0,#+388]
-ldrh r14, [r0,#+400]
-vmladavax.s16 r14, Q1, Q0
-vldrh.u16 Q3, [r1, #228]
-strh r12, [r0,#+390]
-ldrh r12, [r0,#+402]
-vmladavax.s16 r12, Q3, Q0
-vldrh.u16 Q4, [r1, #230]
-strh r10, [r0,#+392]
-ldrh r10, [r0,#+404]
-vmladavax.s16 r10, Q4, Q0
-vldrh.u16 Q5, [r1, #232]
-strh r8, [r0,#+394]
-ldrh r8, [r0,#+406]
-vmladavax.s16 r8, Q5, Q0
-vldrh.u16 Q6, [r1, #234]
-strh r6, [r0,#+396]
-ldrh r6, [r0,#+408]
-vmladavax.s16 r6, Q6, Q0
-vldrh.u16 Q7, [r1, #236]
-strh r4, [r0,#+398]
-ldrh r4, [r0,#+410]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q0, [r1, #238]
-vldrw.u32 Q1, [Q2, #172]
-strh r14, [r0,#+400]
-ldrh r14, [r0,#+412]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #240]
-strh r12, [r0,#+402]
-vmladavx.s16 r12, Q3, Q1
-vldrh.u16 Q4, [r1, #242]
-strh r10, [r0,#+404]
-vmladavx.s16 r10, Q4, Q1
-vldrh.u16 Q5, [r1, #244]
-strh r8, [r0,#+406]
-vmladavx.s16 r8, Q5, Q1
-vldrh.u16 Q6, [r1, #246]
-strh r6, [r0,#+408]
-vmladavx.s16 r6, Q6, Q1
-vldrh.u16 Q7, [r1, #248]
-strh r4, [r0,#+410]
-vmladavx.s16 r4, Q7, Q1
-vldrh.u16 Q0, [r1, #250]
-strh r14, [r0,#+412]
-vmladavx.s16 r14, Q0, Q1
-vldrh.u16 Q1, [r1, #252]
-vldrw.u32 Q3, [Q2, #172]
-strh r12, [r0,#+414]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #254]
-strh r10, [r0,#+416]
-vmladavx.s16 r10, Q4, Q3
-vldrh.u16 Q5, [r1, #-14]
-vldrw.u32 Q6, [Q2, #188]
-strh r8, [r0,#+418]
-ldrh r8, [r0,#+176]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #-12]
-strh r6, [r0,#+420]
-ldrh r6, [r0,#+178]
-vmladavax.s16 r6, Q7, Q6
-vldrh.u16 Q0, [r1, #-10]
-strh r4, [r0,#+422]
-ldrh r4, [r0,#+180]
-vmladavax.s16 r4, Q0, Q6
-vldrh.u16 Q1, [r1, #-8]
-strh r14, [r0,#+424]
-ldrh r14, [r0,#+182]
-vmladavax.s16 r14, Q1, Q6
-vldrh.u16 Q3, [r1, #-6]
-strh r12, [r0,#+426]
-ldrh r12, [r0,#+184]
-vmladavax.s16 r12, Q3, Q6
-vldrh.u16 Q4, [r1, #-4]
-strh r10, [r0,#+428]
-ldrh r10, [r0,#+186]
-vmladavax.s16 r10, Q4, Q6
-vldrh.u16 Q5, [r1, #-2]
-strh r8, [r0,#+176]
-ldrh r8, [r0,#+188]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q6, [r1, #0]
-vldrw.u32 Q7, [Q2, #188]
-strh r6, [r0,#+178]
-ldrh r6, [r0,#+190]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #2]
-strh r4, [r0,#+180]
-ldrh r4, [r0,#+192]
-vmladavax.s16 r4, Q0, Q7
-vldrh.u16 Q1, [r1, #4]
-strh r14, [r0,#+182]
-ldrh r14, [r0,#+194]
-vmladavax.s16 r14, Q1, Q7
-vldrh.u16 Q3, [r1, #6]
-strh r12, [r0,#+184]
-ldrh r12, [r0,#+196]
-vmladavax.s16 r12, Q3, Q7
-vldrh.u16 Q4, [r1, #8]
-strh r10, [r0,#+186]
-ldrh r10, [r0,#+198]
-vmladavax.s16 r10, Q4, Q7
-vldrh.u16 Q5, [r1, #10]
-strh r8, [r0,#+188]
-ldrh r8, [r0,#+200]
-vmladavax.s16 r8, Q5, Q7
-vldrh.u16 Q6, [r1, #12]
-strh r6, [r0,#+190]
-ldrh r6, [r0,#+202]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q7, [r1, #14]
-vldrw.u32 Q0, [Q2, #188]
-strh r4, [r0,#+192]
-ldrh r4, [r0,#+204]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #16]
-strh r14, [r0,#+194]
-ldrh r14, [r0,#+206]
-vmladavax.s16 r14, Q1, Q0
-vldrh.u16 Q3, [r1, #18]
-strh r12, [r0,#+196]
-ldrh r12, [r0,#+208]
-vmladavax.s16 r12, Q3, Q0
-vldrh.u16 Q4, [r1, #20]
-strh r10, [r0,#+198]
-ldrh r10, [r0,#+210]
-vmladavax.s16 r10, Q4, Q0
-vldrh.u16 Q5, [r1, #22]
-strh r8, [r0,#+200]
-ldrh r8, [r0,#+212]
-vmladavax.s16 r8, Q5, Q0
-vldrh.u16 Q6, [r1, #24]
-strh r6, [r0,#+202]
-ldrh r6, [r0,#+214]
-vmladavax.s16 r6, Q6, Q0
-vldrh.u16 Q7, [r1, #26]
-strh r4, [r0,#+204]
-ldrh r4, [r0,#+216]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q0, [r1, #28]
-vldrw.u32 Q1, [Q2, #188]
-strh r14, [r0,#+206]
-ldrh r14, [r0,#+218]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #30]
-strh r12, [r0,#+208]
-ldrh r12, [r0,#+220]
-vmladavax.s16 r12, Q3, Q1
-vldrh.u16 Q4, [r1, #32]
-strh r10, [r0,#+210]
-ldrh r10, [r0,#+222]
-vmladavax.s16 r10, Q4, Q1
-vldrh.u16 Q5, [r1, #34]
-strh r8, [r0,#+212]
-ldrh r8, [r0,#+224]
-vmladavax.s16 r8, Q5, Q1
-vldrh.u16 Q6, [r1, #36]
-strh r6, [r0,#+214]
-ldrh r6, [r0,#+226]
-vmladavax.s16 r6, Q6, Q1
-vldrh.u16 Q7, [r1, #38]
-strh r4, [r0,#+216]
-ldrh r4, [r0,#+228]
-vmladavax.s16 r4, Q7, Q1
-vldrh.u16 Q0, [r1, #40]
-strh r14, [r0,#+218]
-ldrh r14, [r0,#+230]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q1, [r1, #42]
-vldrw.u32 Q3, [Q2, #188]
-strh r12, [r0,#+220]
-ldrh r12, [r0,#+232]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #44]
-strh r10, [r0,#+222]
-ldrh r10, [r0,#+234]
-vmladavax.s16 r10, Q4, Q3
-vldrh.u16 Q5, [r1, #46]
-strh r8, [r0,#+224]
-ldrh r8, [r0,#+236]
-vmladavax.s16 r8, Q5, Q3
-vldrh.u16 Q6, [r1, #48]
-strh r6, [r0,#+226]
-ldrh r6, [r0,#+238]
-vmladavax.s16 r6, Q6, Q3
-vldrh.u16 Q7, [r1, #50]
-strh r4, [r0,#+228]
-ldrh r4, [r0,#+240]
-vmladavax.s16 r4, Q7, Q3
-vldrh.u16 Q0, [r1, #52]
-strh r14, [r0,#+230]
-ldrh r14, [r0,#+242]
-vmladavax.s16 r14, Q0, Q3
-vldrh.u16 Q1, [r1, #54]
-strh r12, [r0,#+232]
-ldrh r12, [r0,#+244]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q3, [r1, #56]
-vldrw.u32 Q4, [Q2, #188]
-strh r10, [r0,#+234]
-ldrh r10, [r0,#+246]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #58]
-strh r8, [r0,#+236]
-ldrh r8, [r0,#+248]
-vmladavax.s16 r8, Q5, Q4
-vldrh.u16 Q6, [r1, #60]
-strh r6, [r0,#+238]
-ldrh r6, [r0,#+250]
-vmladavax.s16 r6, Q6, Q4
-vldrh.u16 Q7, [r1, #62]
-strh r4, [r0,#+240]
-ldrh r4, [r0,#+252]
-vmladavax.s16 r4, Q7, Q4
-vldrh.u16 Q0, [r1, #64]
-strh r14, [r0,#+242]
-ldrh r14, [r0,#+254]
-vmladavax.s16 r14, Q0, Q4
-vldrh.u16 Q1, [r1, #66]
-strh r12, [r0,#+244]
-ldrh r12, [r0,#+256]
-vmladavax.s16 r12, Q1, Q4
-vldrh.u16 Q3, [r1, #68]
-strh r10, [r0,#+246]
-ldrh r10, [r0,#+258]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q4, [r1, #70]
-vldrw.u32 Q5, [Q2, #188]
-strh r8, [r0,#+248]
-ldrh r8, [r0,#+260]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #72]
-strh r6, [r0,#+250]
-ldrh r6, [r0,#+262]
-vmladavax.s16 r6, Q6, Q5
-vldrh.u16 Q7, [r1, #74]
-strh r4, [r0,#+252]
-ldrh r4, [r0,#+264]
-vmladavax.s16 r4, Q7, Q5
-vldrh.u16 Q0, [r1, #76]
-strh r14, [r0,#+254]
-ldrh r14, [r0,#+266]
-vmladavax.s16 r14, Q0, Q5
-vldrh.u16 Q1, [r1, #78]
-strh r12, [r0,#+256]
-ldrh r12, [r0,#+268]
-vmladavax.s16 r12, Q1, Q5
-vldrh.u16 Q3, [r1, #80]
-strh r10, [r0,#+258]
-ldrh r10, [r0,#+270]
-vmladavax.s16 r10, Q3, Q5
-vldrh.u16 Q4, [r1, #82]
-strh r8, [r0,#+260]
-ldrh r8, [r0,#+272]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q5, [r1, #84]
-vldrw.u32 Q6, [Q2, #188]
-strh r6, [r0,#+262]
-ldrh r6, [r0,#+274]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #86]
-strh r4, [r0,#+264]
-ldrh r4, [r0,#+276]
-vmladavax.s16 r4, Q7, Q6
-vldrh.u16 Q0, [r1, #88]
-strh r14, [r0,#+266]
-ldrh r14, [r0,#+278]
-vmladavax.s16 r14, Q0, Q6
-vldrh.u16 Q1, [r1, #90]
-strh r12, [r0,#+268]
-ldrh r12, [r0,#+280]
-vmladavax.s16 r12, Q1, Q6
-vldrh.u16 Q3, [r1, #92]
-strh r10, [r0,#+270]
-ldrh r10, [r0,#+282]
-vmladavax.s16 r10, Q3, Q6
-vldrh.u16 Q4, [r1, #94]
-strh r8, [r0,#+272]
-ldrh r8, [r0,#+284]
-vmladavax.s16 r8, Q4, Q6
-vldrh.u16 Q5, [r1, #96]
-strh r6, [r0,#+274]
-ldrh r6, [r0,#+286]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q6, [r1, #98]
-vldrw.u32 Q7, [Q2, #188]
-strh r4, [r0,#+276]
-ldrh r4, [r0,#+288]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #100]
-strh r14, [r0,#+278]
-ldrh r14, [r0,#+290]
-vmladavax.s16 r14, Q0, Q7
-vldrh.u16 Q1, [r1, #102]
-strh r12, [r0,#+280]
-ldrh r12, [r0,#+292]
-vmladavax.s16 r12, Q1, Q7
-vldrh.u16 Q3, [r1, #104]
-strh r10, [r0,#+282]
-ldrh r10, [r0,#+294]
-vmladavax.s16 r10, Q3, Q7
-vldrh.u16 Q4, [r1, #106]
-strh r8, [r0,#+284]
-ldrh r8, [r0,#+296]
-vmladavax.s16 r8, Q4, Q7
-vldrh.u16 Q5, [r1, #108]
-strh r6, [r0,#+286]
-ldrh r6, [r0,#+298]
-vmladavax.s16 r6, Q5, Q7
-vldrh.u16 Q6, [r1, #110]
-strh r4, [r0,#+288]
-ldrh r4, [r0,#+300]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q7, [r1, #112]
-vldrw.u32 Q0, [Q2, #188]
-strh r14, [r0,#+290]
-ldrh r14, [r0,#+302]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #114]
-strh r12, [r0,#+292]
-ldrh r12, [r0,#+304]
-vmladavax.s16 r12, Q1, Q0
-vldrh.u16 Q3, [r1, #116]
-strh r10, [r0,#+294]
-ldrh r10, [r0,#+306]
-vmladavax.s16 r10, Q3, Q0
-vldrh.u16 Q4, [r1, #118]
-strh r8, [r0,#+296]
-ldrh r8, [r0,#+308]
-vmladavax.s16 r8, Q4, Q0
-vldrh.u16 Q5, [r1, #120]
-strh r6, [r0,#+298]
-ldrh r6, [r0,#+310]
-vmladavax.s16 r6, Q5, Q0
-vldrh.u16 Q6, [r1, #122]
-strh r4, [r0,#+300]
-ldrh r4, [r0,#+312]
-vmladavax.s16 r4, Q6, Q0
-vldrh.u16 Q7, [r1, #124]
-strh r14, [r0,#+302]
-ldrh r14, [r0,#+314]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q0, [r1, #126]
-vldrw.u32 Q1, [Q2, #188]
-strh r12, [r0,#+304]
-ldrh r12, [r0,#+316]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #128]
-strh r10, [r0,#+306]
-ldrh r10, [r0,#+318]
-vmladavax.s16 r10, Q3, Q1
-vldrh.u16 Q4, [r1, #130]
-strh r8, [r0,#+308]
-ldrh r8, [r0,#+320]
-vmladavax.s16 r8, Q4, Q1
-vldrh.u16 Q5, [r1, #132]
-strh r6, [r0,#+310]
-ldrh r6, [r0,#+322]
-vmladavax.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #134]
-strh r4, [r0,#+312]
-ldrh r4, [r0,#+324]
-vmladavax.s16 r4, Q6, Q1
-vldrh.u16 Q7, [r1, #136]
-strh r14, [r0,#+314]
-ldrh r14, [r0,#+326]
-vmladavax.s16 r14, Q7, Q1
-vldrh.u16 Q0, [r1, #138]
-strh r12, [r0,#+316]
-ldrh r12, [r0,#+328]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q1, [r1, #140]
-vldrw.u32 Q3, [Q2, #188]
-strh r10, [r0,#+318]
-ldrh r10, [r0,#+330]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #142]
-strh r8, [r0,#+320]
-ldrh r8, [r0,#+332]
-vmladavax.s16 r8, Q4, Q3
-vldrh.u16 Q5, [r1, #144]
-strh r6, [r0,#+322]
-ldrh r6, [r0,#+334]
-vmladavax.s16 r6, Q5, Q3
-vldrh.u16 Q6, [r1, #146]
-strh r4, [r0,#+324]
-ldrh r4, [r0,#+336]
-vmladavax.s16 r4, Q6, Q3
-vldrh.u16 Q7, [r1, #148]
-strh r14, [r0,#+326]
-ldrh r14, [r0,#+338]
-vmladavax.s16 r14, Q7, Q3
-vldrh.u16 Q0, [r1, #150]
-strh r12, [r0,#+328]
-ldrh r12, [r0,#+340]
-vmladavax.s16 r12, Q0, Q3
-vldrh.u16 Q1, [r1, #152]
-strh r10, [r0,#+330]
-ldrh r10, [r0,#+342]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q3, [r1, #154]
-vldrw.u32 Q4, [Q2, #188]
-strh r8, [r0,#+332]
-ldrh r8, [r0,#+344]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #156]
-strh r6, [r0,#+334]
-ldrh r6, [r0,#+346]
-vmladavax.s16 r6, Q5, Q4
-vldrh.u16 Q6, [r1, #158]
-strh r4, [r0,#+336]
-ldrh r4, [r0,#+348]
-vmladavax.s16 r4, Q6, Q4
-vldrh.u16 Q7, [r1, #160]
-strh r14, [r0,#+338]
-ldrh r14, [r0,#+350]
-vmladavax.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #162]
-strh r12, [r0,#+340]
-ldrh r12, [r0,#+352]
-vmladavax.s16 r12, Q0, Q4
-vldrh.u16 Q1, [r1, #164]
-strh r10, [r0,#+342]
-ldrh r10, [r0,#+354]
-vmladavax.s16 r10, Q1, Q4
-vldrh.u16 Q3, [r1, #166]
-strh r8, [r0,#+344]
-ldrh r8, [r0,#+356]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q4, [r1, #168]
-vldrw.u32 Q5, [Q2, #188]
-strh r6, [r0,#+346]
-ldrh r6, [r0,#+358]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #170]
-strh r4, [r0,#+348]
-ldrh r4, [r0,#+360]
-vmladavax.s16 r4, Q6, Q5
-vldrh.u16 Q7, [r1, #172]
-strh r14, [r0,#+350]
-ldrh r14, [r0,#+362]
-vmladavax.s16 r14, Q7, Q5
-vldrh.u16 Q0, [r1, #174]
-strh r12, [r0,#+352]
-ldrh r12, [r0,#+364]
-vmladavax.s16 r12, Q0, Q5
-vldrh.u16 Q1, [r1, #176]
-strh r10, [r0,#+354]
-ldrh r10, [r0,#+366]
-vmladavax.s16 r10, Q1, Q5
-vldrh.u16 Q3, [r1, #178]
-strh r8, [r0,#+356]
-ldrh r8, [r0,#+368]
-vmladavax.s16 r8, Q3, Q5
-vldrh.u16 Q4, [r1, #180]
-strh r6, [r0,#+358]
-ldrh r6, [r0,#+370]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q5, [r1, #182]
-vldrw.u32 Q6, [Q2, #188]
-strh r4, [r0,#+360]
-ldrh r4, [r0,#+372]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #184]
-strh r14, [r0,#+362]
-ldrh r14, [r0,#+374]
-vmladavax.s16 r14, Q7, Q6
-vldrh.u16 Q0, [r1, #186]
-strh r12, [r0,#+364]
-ldrh r12, [r0,#+376]
-vmladavax.s16 r12, Q0, Q6
-vldrh.u16 Q1, [r1, #188]
-strh r10, [r0,#+366]
-ldrh r10, [r0,#+378]
-vmladavax.s16 r10, Q1, Q6
-vldrh.u16 Q3, [r1, #190]
-strh r8, [r0,#+368]
-ldrh r8, [r0,#+380]
-vmladavax.s16 r8, Q3, Q6
-vldrh.u16 Q4, [r1, #192]
-strh r6, [r0,#+370]
-ldrh r6, [r0,#+382]
-vmladavax.s16 r6, Q4, Q6
-vldrh.u16 Q5, [r1, #194]
-strh r4, [r0,#+372]
-ldrh r4, [r0,#+384]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q6, [r1, #196]
-vldrw.u32 Q7, [Q2, #188]
-strh r14, [r0,#+374]
-ldrh r14, [r0,#+386]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #198]
-strh r12, [r0,#+376]
-ldrh r12, [r0,#+388]
-vmladavax.s16 r12, Q0, Q7
-vldrh.u16 Q1, [r1, #200]
-strh r10, [r0,#+378]
-ldrh r10, [r0,#+390]
-vmladavax.s16 r10, Q1, Q7
-vldrh.u16 Q3, [r1, #202]
-strh r8, [r0,#+380]
-ldrh r8, [r0,#+392]
-vmladavax.s16 r8, Q3, Q7
-vldrh.u16 Q4, [r1, #204]
-strh r6, [r0,#+382]
-ldrh r6, [r0,#+394]
-vmladavax.s16 r6, Q4, Q7
-vldrh.u16 Q5, [r1, #206]
-strh r4, [r0,#+384]
-ldrh r4, [r0,#+396]
-vmladavax.s16 r4, Q5, Q7
-vldrh.u16 Q6, [r1, #208]
-strh r14, [r0,#+386]
-ldrh r14, [r0,#+398]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q7, [r1, #210]
-vldrw.u32 Q0, [Q2, #188]
-strh r12, [r0,#+388]
-ldrh r12, [r0,#+400]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #212]
-strh r10, [r0,#+390]
-ldrh r10, [r0,#+402]
-vmladavax.s16 r10, Q1, Q0
-vldrh.u16 Q3, [r1, #214]
-strh r8, [r0,#+392]
-ldrh r8, [r0,#+404]
-vmladavax.s16 r8, Q3, Q0
-vldrh.u16 Q4, [r1, #216]
-strh r6, [r0,#+394]
-ldrh r6, [r0,#+406]
-vmladavax.s16 r6, Q4, Q0
-vldrh.u16 Q5, [r1, #218]
-strh r4, [r0,#+396]
-ldrh r4, [r0,#+408]
-vmladavax.s16 r4, Q5, Q0
-vldrh.u16 Q6, [r1, #220]
-strh r14, [r0,#+398]
-ldrh r14, [r0,#+410]
-vmladavax.s16 r14, Q6, Q0
-vldrh.u16 Q7, [r1, #222]
-strh r12, [r0,#+400]
-ldrh r12, [r0,#+412]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q0, [r1, #224]
-vldrw.u32 Q1, [Q2, #188]
-strh r10, [r0,#+402]
-ldrh r10, [r0,#+414]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #226]
-strh r8, [r0,#+404]
-ldrh r8, [r0,#+416]
-vmladavax.s16 r8, Q3, Q1
-vldrh.u16 Q4, [r1, #228]
-strh r6, [r0,#+406]
-ldrh r6, [r0,#+418]
-vmladavax.s16 r6, Q4, Q1
-vldrh.u16 Q5, [r1, #230]
-strh r4, [r0,#+408]
-ldrh r4, [r0,#+420]
-vmladavax.s16 r4, Q5, Q1
-vldrh.u16 Q6, [r1, #232]
-strh r14, [r0,#+410]
-ldrh r14, [r0,#+422]
-vmladavax.s16 r14, Q6, Q1
-vldrh.u16 Q7, [r1, #234]
-strh r12, [r0,#+412]
-ldrh r12, [r0,#+424]
-vmladavax.s16 r12, Q7, Q1
-vldrh.u16 Q0, [r1, #236]
-strh r10, [r0,#+414]
-ldrh r10, [r0,#+426]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q1, [r1, #238]
-vldrw.u32 Q3, [Q2, #188]
-strh r8, [r0,#+416]
-ldrh r8, [r0,#+428]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #240]
-strh r6, [r0,#+418]
-vmladavx.s16 r6, Q4, Q3
-vldrh.u16 Q5, [r1, #242]
-strh r4, [r0,#+420]
-vmladavx.s16 r4, Q5, Q3
-vldrh.u16 Q6, [r1, #244]
-strh r14, [r0,#+422]
-vmladavx.s16 r14, Q6, Q3
-vldrh.u16 Q7, [r1, #246]
-strh r12, [r0,#+424]
-vmladavx.s16 r12, Q7, Q3
-vldrh.u16 Q0, [r1, #248]
-strh r10, [r0,#+426]
-vmladavx.s16 r10, Q0, Q3
-vldrh.u16 Q1, [r1, #250]
-strh r8, [r0,#+428]
-vmladavx.s16 r8, Q1, Q3
-vldrh.u16 Q3, [r1, #252]
-vldrw.u32 Q4, [Q2, #188]
-strh r6, [r0,#+430]
-vmladavx.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #254]
-strh r4, [r0,#+432]
-vmladavx.s16 r4, Q5, Q4
-vldrh.u16 Q6, [r1, #-14]
-vldrw.u32 Q7, [Q2, #204]
-strh r14, [r0,#+434]
-ldrh r14, [r0,#+192]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #-12]
-strh r12, [r0,#+436]
-ldrh r12, [r0,#+194]
-vmladavax.s16 r12, Q0, Q7
-vldrh.u16 Q1, [r1, #-10]
-strh r10, [r0,#+438]
-ldrh r10, [r0,#+196]
-vmladavax.s16 r10, Q1, Q7
-vldrh.u16 Q3, [r1, #-8]
-strh r8, [r0,#+440]
-ldrh r8, [r0,#+198]
-vmladavax.s16 r8, Q3, Q7
-vldrh.u16 Q4, [r1, #-6]
-strh r6, [r0,#+442]
-ldrh r6, [r0,#+200]
-vmladavax.s16 r6, Q4, Q7
-vldrh.u16 Q5, [r1, #-4]
-strh r4, [r0,#+444]
-ldrh r4, [r0,#+202]
-vmladavax.s16 r4, Q5, Q7
-vldrh.u16 Q6, [r1, #-2]
-strh r14, [r0,#+192]
-ldrh r14, [r0,#+204]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q7, [r1, #0]
-vldrw.u32 Q0, [Q2, #204]
-strh r12, [r0,#+194]
-ldrh r12, [r0,#+206]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #2]
-strh r10, [r0,#+196]
-ldrh r10, [r0,#+208]
-vmladavax.s16 r10, Q1, Q0
-vldrh.u16 Q3, [r1, #4]
-strh r8, [r0,#+198]
-ldrh r8, [r0,#+210]
-vmladavax.s16 r8, Q3, Q0
-vldrh.u16 Q4, [r1, #6]
-strh r6, [r0,#+200]
-ldrh r6, [r0,#+212]
-vmladavax.s16 r6, Q4, Q0
-vldrh.u16 Q5, [r1, #8]
-strh r4, [r0,#+202]
-ldrh r4, [r0,#+214]
-vmladavax.s16 r4, Q5, Q0
-vldrh.u16 Q6, [r1, #10]
-strh r14, [r0,#+204]
-ldrh r14, [r0,#+216]
-vmladavax.s16 r14, Q6, Q0
-vldrh.u16 Q7, [r1, #12]
-strh r12, [r0,#+206]
-ldrh r12, [r0,#+218]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q0, [r1, #14]
-vldrw.u32 Q1, [Q2, #204]
-strh r10, [r0,#+208]
-ldrh r10, [r0,#+220]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #16]
-strh r8, [r0,#+210]
-ldrh r8, [r0,#+222]
-vmladavax.s16 r8, Q3, Q1
-vldrh.u16 Q4, [r1, #18]
-strh r6, [r0,#+212]
-ldrh r6, [r0,#+224]
-vmladavax.s16 r6, Q4, Q1
-vldrh.u16 Q5, [r1, #20]
-strh r4, [r0,#+214]
-ldrh r4, [r0,#+226]
-vmladavax.s16 r4, Q5, Q1
-vldrh.u16 Q6, [r1, #22]
-strh r14, [r0,#+216]
-ldrh r14, [r0,#+228]
-vmladavax.s16 r14, Q6, Q1
-vldrh.u16 Q7, [r1, #24]
-strh r12, [r0,#+218]
-ldrh r12, [r0,#+230]
-vmladavax.s16 r12, Q7, Q1
-vldrh.u16 Q0, [r1, #26]
-strh r10, [r0,#+220]
-ldrh r10, [r0,#+232]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q1, [r1, #28]
-vldrw.u32 Q3, [Q2, #204]
-strh r8, [r0,#+222]
-ldrh r8, [r0,#+234]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #30]
-strh r6, [r0,#+224]
-ldrh r6, [r0,#+236]
-vmladavax.s16 r6, Q4, Q3
-vldrh.u16 Q5, [r1, #32]
-strh r4, [r0,#+226]
-ldrh r4, [r0,#+238]
-vmladavax.s16 r4, Q5, Q3
-vldrh.u16 Q6, [r1, #34]
-strh r14, [r0,#+228]
-ldrh r14, [r0,#+240]
-vmladavax.s16 r14, Q6, Q3
-vldrh.u16 Q7, [r1, #36]
-strh r12, [r0,#+230]
-ldrh r12, [r0,#+242]
-vmladavax.s16 r12, Q7, Q3
-vldrh.u16 Q0, [r1, #38]
-strh r10, [r0,#+232]
-ldrh r10, [r0,#+244]
-vmladavax.s16 r10, Q0, Q3
-vldrh.u16 Q1, [r1, #40]
-strh r8, [r0,#+234]
-ldrh r8, [r0,#+246]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q3, [r1, #42]
-vldrw.u32 Q4, [Q2, #204]
-strh r6, [r0,#+236]
-ldrh r6, [r0,#+248]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #44]
-strh r4, [r0,#+238]
-ldrh r4, [r0,#+250]
-vmladavax.s16 r4, Q5, Q4
-vldrh.u16 Q6, [r1, #46]
-strh r14, [r0,#+240]
-ldrh r14, [r0,#+252]
-vmladavax.s16 r14, Q6, Q4
-vldrh.u16 Q7, [r1, #48]
-strh r12, [r0,#+242]
-ldrh r12, [r0,#+254]
-vmladavax.s16 r12, Q7, Q4
-vldrh.u16 Q0, [r1, #50]
-strh r10, [r0,#+244]
-ldrh r10, [r0,#+256]
-vmladavax.s16 r10, Q0, Q4
-vldrh.u16 Q1, [r1, #52]
-strh r8, [r0,#+246]
-ldrh r8, [r0,#+258]
-vmladavax.s16 r8, Q1, Q4
-vldrh.u16 Q3, [r1, #54]
-strh r6, [r0,#+248]
-ldrh r6, [r0,#+260]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q4, [r1, #56]
-vldrw.u32 Q5, [Q2, #204]
-strh r4, [r0,#+250]
-ldrh r4, [r0,#+262]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #58]
-strh r14, [r0,#+252]
-ldrh r14, [r0,#+264]
-vmladavax.s16 r14, Q6, Q5
-vldrh.u16 Q7, [r1, #60]
-strh r12, [r0,#+254]
-ldrh r12, [r0,#+266]
-vmladavax.s16 r12, Q7, Q5
-vldrh.u16 Q0, [r1, #62]
-strh r10, [r0,#+256]
-ldrh r10, [r0,#+268]
-vmladavax.s16 r10, Q0, Q5
-vldrh.u16 Q1, [r1, #64]
-strh r8, [r0,#+258]
-ldrh r8, [r0,#+270]
-vmladavax.s16 r8, Q1, Q5
-vldrh.u16 Q3, [r1, #66]
-strh r6, [r0,#+260]
-ldrh r6, [r0,#+272]
-vmladavax.s16 r6, Q3, Q5
-vldrh.u16 Q4, [r1, #68]
-strh r4, [r0,#+262]
-ldrh r4, [r0,#+274]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q5, [r1, #70]
-vldrw.u32 Q6, [Q2, #204]
-strh r14, [r0,#+264]
-ldrh r14, [r0,#+276]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #72]
-strh r12, [r0,#+266]
-ldrh r12, [r0,#+278]
-vmladavax.s16 r12, Q7, Q6
-vldrh.u16 Q0, [r1, #74]
-strh r10, [r0,#+268]
-ldrh r10, [r0,#+280]
-vmladavax.s16 r10, Q0, Q6
-vldrh.u16 Q1, [r1, #76]
-strh r8, [r0,#+270]
-ldrh r8, [r0,#+282]
-vmladavax.s16 r8, Q1, Q6
-vldrh.u16 Q3, [r1, #78]
-strh r6, [r0,#+272]
-ldrh r6, [r0,#+284]
-vmladavax.s16 r6, Q3, Q6
-vldrh.u16 Q4, [r1, #80]
-strh r4, [r0,#+274]
-ldrh r4, [r0,#+286]
-vmladavax.s16 r4, Q4, Q6
-vldrh.u16 Q5, [r1, #82]
-strh r14, [r0,#+276]
-ldrh r14, [r0,#+288]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q6, [r1, #84]
-vldrw.u32 Q7, [Q2, #204]
-strh r12, [r0,#+278]
-ldrh r12, [r0,#+290]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #86]
-strh r10, [r0,#+280]
-ldrh r10, [r0,#+292]
-vmladavax.s16 r10, Q0, Q7
-vldrh.u16 Q1, [r1, #88]
-strh r8, [r0,#+282]
-ldrh r8, [r0,#+294]
-vmladavax.s16 r8, Q1, Q7
-vldrh.u16 Q3, [r1, #90]
-strh r6, [r0,#+284]
-ldrh r6, [r0,#+296]
-vmladavax.s16 r6, Q3, Q7
-vldrh.u16 Q4, [r1, #92]
-strh r4, [r0,#+286]
-ldrh r4, [r0,#+298]
-vmladavax.s16 r4, Q4, Q7
-vldrh.u16 Q5, [r1, #94]
-strh r14, [r0,#+288]
-ldrh r14, [r0,#+300]
-vmladavax.s16 r14, Q5, Q7
-vldrh.u16 Q6, [r1, #96]
-strh r12, [r0,#+290]
-ldrh r12, [r0,#+302]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q7, [r1, #98]
-vldrw.u32 Q0, [Q2, #204]
-strh r10, [r0,#+292]
-ldrh r10, [r0,#+304]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #100]
-strh r8, [r0,#+294]
-ldrh r8, [r0,#+306]
-vmladavax.s16 r8, Q1, Q0
-vldrh.u16 Q3, [r1, #102]
-strh r6, [r0,#+296]
-ldrh r6, [r0,#+308]
-vmladavax.s16 r6, Q3, Q0
-vldrh.u16 Q4, [r1, #104]
-strh r4, [r0,#+298]
-ldrh r4, [r0,#+310]
-vmladavax.s16 r4, Q4, Q0
-vldrh.u16 Q5, [r1, #106]
-strh r14, [r0,#+300]
-ldrh r14, [r0,#+312]
-vmladavax.s16 r14, Q5, Q0
-vldrh.u16 Q6, [r1, #108]
-strh r12, [r0,#+302]
-ldrh r12, [r0,#+314]
-vmladavax.s16 r12, Q6, Q0
-vldrh.u16 Q7, [r1, #110]
-strh r10, [r0,#+304]
-ldrh r10, [r0,#+316]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q0, [r1, #112]
-vldrw.u32 Q1, [Q2, #204]
-strh r8, [r0,#+306]
-ldrh r8, [r0,#+318]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #114]
-strh r6, [r0,#+308]
-ldrh r6, [r0,#+320]
-vmladavax.s16 r6, Q3, Q1
-vldrh.u16 Q4, [r1, #116]
-strh r4, [r0,#+310]
-ldrh r4, [r0,#+322]
-vmladavax.s16 r4, Q4, Q1
-vldrh.u16 Q5, [r1, #118]
-strh r14, [r0,#+312]
-ldrh r14, [r0,#+324]
-vmladavax.s16 r14, Q5, Q1
-vldrh.u16 Q6, [r1, #120]
-strh r12, [r0,#+314]
-ldrh r12, [r0,#+326]
-vmladavax.s16 r12, Q6, Q1
-vldrh.u16 Q7, [r1, #122]
-strh r10, [r0,#+316]
-ldrh r10, [r0,#+328]
-vmladavax.s16 r10, Q7, Q1
-vldrh.u16 Q0, [r1, #124]
-strh r8, [r0,#+318]
-ldrh r8, [r0,#+330]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q1, [r1, #126]
-vldrw.u32 Q3, [Q2, #204]
-strh r6, [r0,#+320]
-ldrh r6, [r0,#+332]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #128]
-strh r4, [r0,#+322]
-ldrh r4, [r0,#+334]
-vmladavax.s16 r4, Q4, Q3
-vldrh.u16 Q5, [r1, #130]
-strh r14, [r0,#+324]
-ldrh r14, [r0,#+336]
-vmladavax.s16 r14, Q5, Q3
-vldrh.u16 Q6, [r1, #132]
-strh r12, [r0,#+326]
-ldrh r12, [r0,#+338]
-vmladavax.s16 r12, Q6, Q3
-vldrh.u16 Q7, [r1, #134]
-strh r10, [r0,#+328]
-ldrh r10, [r0,#+340]
-vmladavax.s16 r10, Q7, Q3
-vldrh.u16 Q0, [r1, #136]
-strh r8, [r0,#+330]
-ldrh r8, [r0,#+342]
-vmladavax.s16 r8, Q0, Q3
-vldrh.u16 Q1, [r1, #138]
-strh r6, [r0,#+332]
-ldrh r6, [r0,#+344]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q3, [r1, #140]
-vldrw.u32 Q4, [Q2, #204]
-strh r4, [r0,#+334]
-ldrh r4, [r0,#+346]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #142]
-strh r14, [r0,#+336]
-ldrh r14, [r0,#+348]
-vmladavax.s16 r14, Q5, Q4
-vldrh.u16 Q6, [r1, #144]
-strh r12, [r0,#+338]
-ldrh r12, [r0,#+350]
-vmladavax.s16 r12, Q6, Q4
-vldrh.u16 Q7, [r1, #146]
-strh r10, [r0,#+340]
-ldrh r10, [r0,#+352]
-vmladavax.s16 r10, Q7, Q4
-vldrh.u16 Q0, [r1, #148]
-strh r8, [r0,#+342]
-ldrh r8, [r0,#+354]
-vmladavax.s16 r8, Q0, Q4
-vldrh.u16 Q1, [r1, #150]
-strh r6, [r0,#+344]
-ldrh r6, [r0,#+356]
-vmladavax.s16 r6, Q1, Q4
-vldrh.u16 Q3, [r1, #152]
-strh r4, [r0,#+346]
-ldrh r4, [r0,#+358]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q4, [r1, #154]
-vldrw.u32 Q5, [Q2, #204]
-strh r14, [r0,#+348]
-ldrh r14, [r0,#+360]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #156]
-strh r12, [r0,#+350]
-ldrh r12, [r0,#+362]
-vmladavax.s16 r12, Q6, Q5
-vldrh.u16 Q7, [r1, #158]
-strh r10, [r0,#+352]
-ldrh r10, [r0,#+364]
-vmladavax.s16 r10, Q7, Q5
-vldrh.u16 Q0, [r1, #160]
-strh r8, [r0,#+354]
-ldrh r8, [r0,#+366]
-vmladavax.s16 r8, Q0, Q5
-vldrh.u16 Q1, [r1, #162]
-strh r6, [r0,#+356]
-ldrh r6, [r0,#+368]
-vmladavax.s16 r6, Q1, Q5
-vldrh.u16 Q3, [r1, #164]
-strh r4, [r0,#+358]
-ldrh r4, [r0,#+370]
-vmladavax.s16 r4, Q3, Q5
-vldrh.u16 Q4, [r1, #166]
-strh r14, [r0,#+360]
-ldrh r14, [r0,#+372]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q5, [r1, #168]
-vldrw.u32 Q6, [Q2, #204]
-strh r12, [r0,#+362]
-ldrh r12, [r0,#+374]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #170]
-strh r10, [r0,#+364]
-ldrh r10, [r0,#+376]
-vmladavax.s16 r10, Q7, Q6
-vldrh.u16 Q0, [r1, #172]
-strh r8, [r0,#+366]
-ldrh r8, [r0,#+378]
-vmladavax.s16 r8, Q0, Q6
-vldrh.u16 Q1, [r1, #174]
-strh r6, [r0,#+368]
-ldrh r6, [r0,#+380]
-vmladavax.s16 r6, Q1, Q6
-vldrh.u16 Q3, [r1, #176]
-strh r4, [r0,#+370]
-ldrh r4, [r0,#+382]
-vmladavax.s16 r4, Q3, Q6
-vldrh.u16 Q4, [r1, #178]
-strh r14, [r0,#+372]
-ldrh r14, [r0,#+384]
-vmladavax.s16 r14, Q4, Q6
-vldrh.u16 Q5, [r1, #180]
-strh r12, [r0,#+374]
-ldrh r12, [r0,#+386]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q6, [r1, #182]
-vldrw.u32 Q7, [Q2, #204]
-strh r10, [r0,#+376]
-ldrh r10, [r0,#+388]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #184]
-strh r8, [r0,#+378]
-ldrh r8, [r0,#+390]
-vmladavax.s16 r8, Q0, Q7
-vldrh.u16 Q1, [r1, #186]
-strh r6, [r0,#+380]
-ldrh r6, [r0,#+392]
-vmladavax.s16 r6, Q1, Q7
-vldrh.u16 Q3, [r1, #188]
-strh r4, [r0,#+382]
-ldrh r4, [r0,#+394]
-vmladavax.s16 r4, Q3, Q7
-vldrh.u16 Q4, [r1, #190]
-strh r14, [r0,#+384]
-ldrh r14, [r0,#+396]
-vmladavax.s16 r14, Q4, Q7
-vldrh.u16 Q5, [r1, #192]
-strh r12, [r0,#+386]
-ldrh r12, [r0,#+398]
-vmladavax.s16 r12, Q5, Q7
-vldrh.u16 Q6, [r1, #194]
-strh r10, [r0,#+388]
-ldrh r10, [r0,#+400]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q7, [r1, #196]
-vldrw.u32 Q0, [Q2, #204]
-strh r8, [r0,#+390]
-ldrh r8, [r0,#+402]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #198]
-strh r6, [r0,#+392]
-ldrh r6, [r0,#+404]
-vmladavax.s16 r6, Q1, Q0
-vldrh.u16 Q3, [r1, #200]
-strh r4, [r0,#+394]
-ldrh r4, [r0,#+406]
-vmladavax.s16 r4, Q3, Q0
-vldrh.u16 Q4, [r1, #202]
-strh r14, [r0,#+396]
-ldrh r14, [r0,#+408]
-vmladavax.s16 r14, Q4, Q0
-vldrh.u16 Q5, [r1, #204]
-strh r12, [r0,#+398]
-ldrh r12, [r0,#+410]
-vmladavax.s16 r12, Q5, Q0
-vldrh.u16 Q6, [r1, #206]
-strh r10, [r0,#+400]
-ldrh r10, [r0,#+412]
-vmladavax.s16 r10, Q6, Q0
-vldrh.u16 Q7, [r1, #208]
-strh r8, [r0,#+402]
-ldrh r8, [r0,#+414]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q0, [r1, #210]
-vldrw.u32 Q1, [Q2, #204]
-strh r6, [r0,#+404]
-ldrh r6, [r0,#+416]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #212]
-strh r4, [r0,#+406]
-ldrh r4, [r0,#+418]
-vmladavax.s16 r4, Q3, Q1
-vldrh.u16 Q4, [r1, #214]
-strh r14, [r0,#+408]
-ldrh r14, [r0,#+420]
-vmladavax.s16 r14, Q4, Q1
-vldrh.u16 Q5, [r1, #216]
-strh r12, [r0,#+410]
-ldrh r12, [r0,#+422]
-vmladavax.s16 r12, Q5, Q1
-vldrh.u16 Q6, [r1, #218]
-strh r10, [r0,#+412]
-ldrh r10, [r0,#+424]
-vmladavax.s16 r10, Q6, Q1
-vldrh.u16 Q7, [r1, #220]
-strh r8, [r0,#+414]
-ldrh r8, [r0,#+426]
-vmladavax.s16 r8, Q7, Q1
-vldrh.u16 Q0, [r1, #222]
-strh r6, [r0,#+416]
-ldrh r6, [r0,#+428]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q1, [r1, #224]
-vldrw.u32 Q3, [Q2, #204]
-strh r4, [r0,#+418]
-ldrh r4, [r0,#+430]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #226]
-strh r14, [r0,#+420]
-ldrh r14, [r0,#+432]
-vmladavax.s16 r14, Q4, Q3
-vldrh.u16 Q5, [r1, #228]
-strh r12, [r0,#+422]
-ldrh r12, [r0,#+434]
-vmladavax.s16 r12, Q5, Q3
-vldrh.u16 Q6, [r1, #230]
-strh r10, [r0,#+424]
-ldrh r10, [r0,#+436]
-vmladavax.s16 r10, Q6, Q3
-vldrh.u16 Q7, [r1, #232]
-strh r8, [r0,#+426]
-ldrh r8, [r0,#+438]
-vmladavax.s16 r8, Q7, Q3
-vldrh.u16 Q0, [r1, #234]
-strh r6, [r0,#+428]
-ldrh r6, [r0,#+440]
-vmladavax.s16 r6, Q0, Q3
-vldrh.u16 Q1, [r1, #236]
-strh r4, [r0,#+430]
-ldrh r4, [r0,#+442]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q3, [r1, #238]
-vldrw.u32 Q4, [Q2, #204]
-strh r14, [r0,#+432]
-ldrh r14, [r0,#+444]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #240]
-strh r12, [r0,#+434]
-vmladavx.s16 r12, Q5, Q4
-vldrh.u16 Q6, [r1, #242]
-strh r10, [r0,#+436]
-vmladavx.s16 r10, Q6, Q4
-vldrh.u16 Q7, [r1, #244]
-strh r8, [r0,#+438]
-vmladavx.s16 r8, Q7, Q4
-vldrh.u16 Q0, [r1, #246]
-strh r6, [r0,#+440]
-vmladavx.s16 r6, Q0, Q4
-vldrh.u16 Q1, [r1, #248]
-strh r4, [r0,#+442]
-vmladavx.s16 r4, Q1, Q4
-vldrh.u16 Q3, [r1, #250]
-strh r14, [r0,#+444]
-vmladavx.s16 r14, Q3, Q4
-vldrh.u16 Q4, [r1, #252]
-vldrw.u32 Q5, [Q2, #204]
-strh r12, [r0,#+446]
-vmladavx.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #254]
-strh r10, [r0,#+448]
-vmladavx.s16 r10, Q6, Q5
-vldrh.u16 Q7, [r1, #-14]
-vldrw.u32 Q0, [Q2, #220]
-strh r8, [r0,#+450]
-ldrh r8, [r0,#+208]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #-12]
-strh r6, [r0,#+452]
-ldrh r6, [r0,#+210]
-vmladavax.s16 r6, Q1, Q0
-vldrh.u16 Q3, [r1, #-10]
-strh r4, [r0,#+454]
-ldrh r4, [r0,#+212]
-vmladavax.s16 r4, Q3, Q0
-vldrh.u16 Q4, [r1, #-8]
-strh r14, [r0,#+456]
-ldrh r14, [r0,#+214]
-vmladavax.s16 r14, Q4, Q0
-vldrh.u16 Q5, [r1, #-6]
-strh r12, [r0,#+458]
-ldrh r12, [r0,#+216]
-vmladavax.s16 r12, Q5, Q0
-vldrh.u16 Q6, [r1, #-4]
-strh r10, [r0,#+460]
-ldrh r10, [r0,#+218]
-vmladavax.s16 r10, Q6, Q0
-vldrh.u16 Q7, [r1, #-2]
-strh r8, [r0,#+208]
-ldrh r8, [r0,#+220]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q0, [r1, #0]
-vldrw.u32 Q1, [Q2, #220]
-strh r6, [r0,#+210]
-ldrh r6, [r0,#+222]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #2]
-strh r4, [r0,#+212]
-ldrh r4, [r0,#+224]
-vmladavax.s16 r4, Q3, Q1
-vldrh.u16 Q4, [r1, #4]
-strh r14, [r0,#+214]
-ldrh r14, [r0,#+226]
-vmladavax.s16 r14, Q4, Q1
-vldrh.u16 Q5, [r1, #6]
-strh r12, [r0,#+216]
-ldrh r12, [r0,#+228]
-vmladavax.s16 r12, Q5, Q1
-vldrh.u16 Q6, [r1, #8]
-strh r10, [r0,#+218]
-ldrh r10, [r0,#+230]
-vmladavax.s16 r10, Q6, Q1
-vldrh.u16 Q7, [r1, #10]
-strh r8, [r0,#+220]
-ldrh r8, [r0,#+232]
-vmladavax.s16 r8, Q7, Q1
-vldrh.u16 Q0, [r1, #12]
-strh r6, [r0,#+222]
-ldrh r6, [r0,#+234]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q1, [r1, #14]
-vldrw.u32 Q3, [Q2, #220]
-strh r4, [r0,#+224]
-ldrh r4, [r0,#+236]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #16]
-strh r14, [r0,#+226]
-ldrh r14, [r0,#+238]
-vmladavax.s16 r14, Q4, Q3
-vldrh.u16 Q5, [r1, #18]
-strh r12, [r0,#+228]
-ldrh r12, [r0,#+240]
-vmladavax.s16 r12, Q5, Q3
-vldrh.u16 Q6, [r1, #20]
-strh r10, [r0,#+230]
-ldrh r10, [r0,#+242]
-vmladavax.s16 r10, Q6, Q3
-vldrh.u16 Q7, [r1, #22]
-strh r8, [r0,#+232]
-ldrh r8, [r0,#+244]
-vmladavax.s16 r8, Q7, Q3
-vldrh.u16 Q0, [r1, #24]
-strh r6, [r0,#+234]
-ldrh r6, [r0,#+246]
-vmladavax.s16 r6, Q0, Q3
-vldrh.u16 Q1, [r1, #26]
-strh r4, [r0,#+236]
-ldrh r4, [r0,#+248]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q3, [r1, #28]
-vldrw.u32 Q4, [Q2, #220]
-strh r14, [r0,#+238]
-ldrh r14, [r0,#+250]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #30]
-strh r12, [r0,#+240]
-ldrh r12, [r0,#+252]
-vmladavax.s16 r12, Q5, Q4
-vldrh.u16 Q6, [r1, #32]
-strh r10, [r0,#+242]
-ldrh r10, [r0,#+254]
-vmladavax.s16 r10, Q6, Q4
-vldrh.u16 Q7, [r1, #34]
-strh r8, [r0,#+244]
-ldrh r8, [r0,#+256]
-vmladavax.s16 r8, Q7, Q4
-vldrh.u16 Q0, [r1, #36]
-strh r6, [r0,#+246]
-ldrh r6, [r0,#+258]
-vmladavax.s16 r6, Q0, Q4
-vldrh.u16 Q1, [r1, #38]
-strh r4, [r0,#+248]
-ldrh r4, [r0,#+260]
-vmladavax.s16 r4, Q1, Q4
-vldrh.u16 Q3, [r1, #40]
-strh r14, [r0,#+250]
-ldrh r14, [r0,#+262]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q4, [r1, #42]
-vldrw.u32 Q5, [Q2, #220]
-strh r12, [r0,#+252]
-ldrh r12, [r0,#+264]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #44]
-strh r10, [r0,#+254]
-ldrh r10, [r0,#+266]
-vmladavax.s16 r10, Q6, Q5
-vldrh.u16 Q7, [r1, #46]
-strh r8, [r0,#+256]
-ldrh r8, [r0,#+268]
-vmladavax.s16 r8, Q7, Q5
-vldrh.u16 Q0, [r1, #48]
-strh r6, [r0,#+258]
-ldrh r6, [r0,#+270]
-vmladavax.s16 r6, Q0, Q5
-vldrh.u16 Q1, [r1, #50]
-strh r4, [r0,#+260]
-ldrh r4, [r0,#+272]
-vmladavax.s16 r4, Q1, Q5
-vldrh.u16 Q3, [r1, #52]
-strh r14, [r0,#+262]
-ldrh r14, [r0,#+274]
-vmladavax.s16 r14, Q3, Q5
-vldrh.u16 Q4, [r1, #54]
-strh r12, [r0,#+264]
-ldrh r12, [r0,#+276]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q5, [r1, #56]
-vldrw.u32 Q6, [Q2, #220]
-strh r10, [r0,#+266]
-ldrh r10, [r0,#+278]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #58]
-strh r8, [r0,#+268]
-ldrh r8, [r0,#+280]
-vmladavax.s16 r8, Q7, Q6
-vldrh.u16 Q0, [r1, #60]
-strh r6, [r0,#+270]
-ldrh r6, [r0,#+282]
-vmladavax.s16 r6, Q0, Q6
-vldrh.u16 Q1, [r1, #62]
-strh r4, [r0,#+272]
-ldrh r4, [r0,#+284]
-vmladavax.s16 r4, Q1, Q6
-vldrh.u16 Q3, [r1, #64]
-strh r14, [r0,#+274]
-ldrh r14, [r0,#+286]
-vmladavax.s16 r14, Q3, Q6
-vldrh.u16 Q4, [r1, #66]
-strh r12, [r0,#+276]
-ldrh r12, [r0,#+288]
-vmladavax.s16 r12, Q4, Q6
-vldrh.u16 Q5, [r1, #68]
-strh r10, [r0,#+278]
-ldrh r10, [r0,#+290]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q6, [r1, #70]
-vldrw.u32 Q7, [Q2, #220]
-strh r8, [r0,#+280]
-ldrh r8, [r0,#+292]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #72]
-strh r6, [r0,#+282]
-ldrh r6, [r0,#+294]
-vmladavax.s16 r6, Q0, Q7
-vldrh.u16 Q1, [r1, #74]
-strh r4, [r0,#+284]
-ldrh r4, [r0,#+296]
-vmladavax.s16 r4, Q1, Q7
-vldrh.u16 Q3, [r1, #76]
-strh r14, [r0,#+286]
-ldrh r14, [r0,#+298]
-vmladavax.s16 r14, Q3, Q7
-vldrh.u16 Q4, [r1, #78]
-strh r12, [r0,#+288]
-ldrh r12, [r0,#+300]
-vmladavax.s16 r12, Q4, Q7
-vldrh.u16 Q5, [r1, #80]
-strh r10, [r0,#+290]
-ldrh r10, [r0,#+302]
-vmladavax.s16 r10, Q5, Q7
-vldrh.u16 Q6, [r1, #82]
-strh r8, [r0,#+292]
-ldrh r8, [r0,#+304]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q7, [r1, #84]
-vldrw.u32 Q0, [Q2, #220]
-strh r6, [r0,#+294]
-ldrh r6, [r0,#+306]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #86]
-strh r4, [r0,#+296]
-ldrh r4, [r0,#+308]
-vmladavax.s16 r4, Q1, Q0
-vldrh.u16 Q3, [r1, #88]
-strh r14, [r0,#+298]
-ldrh r14, [r0,#+310]
-vmladavax.s16 r14, Q3, Q0
-vldrh.u16 Q4, [r1, #90]
-strh r12, [r0,#+300]
-ldrh r12, [r0,#+312]
-vmladavax.s16 r12, Q4, Q0
-vldrh.u16 Q5, [r1, #92]
-strh r10, [r0,#+302]
-ldrh r10, [r0,#+314]
-vmladavax.s16 r10, Q5, Q0
-vldrh.u16 Q6, [r1, #94]
-strh r8, [r0,#+304]
-ldrh r8, [r0,#+316]
-vmladavax.s16 r8, Q6, Q0
-vldrh.u16 Q7, [r1, #96]
-strh r6, [r0,#+306]
-ldrh r6, [r0,#+318]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q0, [r1, #98]
-vldrw.u32 Q1, [Q2, #220]
-strh r4, [r0,#+308]
-ldrh r4, [r0,#+320]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #100]
-strh r14, [r0,#+310]
-ldrh r14, [r0,#+322]
-vmladavax.s16 r14, Q3, Q1
-vldrh.u16 Q4, [r1, #102]
-strh r12, [r0,#+312]
-ldrh r12, [r0,#+324]
-vmladavax.s16 r12, Q4, Q1
-vldrh.u16 Q5, [r1, #104]
-strh r10, [r0,#+314]
-ldrh r10, [r0,#+326]
-vmladavax.s16 r10, Q5, Q1
-vldrh.u16 Q6, [r1, #106]
-strh r8, [r0,#+316]
-ldrh r8, [r0,#+328]
-vmladavax.s16 r8, Q6, Q1
-vldrh.u16 Q7, [r1, #108]
-strh r6, [r0,#+318]
-ldrh r6, [r0,#+330]
-vmladavax.s16 r6, Q7, Q1
-vldrh.u16 Q0, [r1, #110]
-strh r4, [r0,#+320]
-ldrh r4, [r0,#+332]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q1, [r1, #112]
-vldrw.u32 Q3, [Q2, #220]
-strh r14, [r0,#+322]
-ldrh r14, [r0,#+334]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #114]
-strh r12, [r0,#+324]
-ldrh r12, [r0,#+336]
-vmladavax.s16 r12, Q4, Q3
-vldrh.u16 Q5, [r1, #116]
-strh r10, [r0,#+326]
-ldrh r10, [r0,#+338]
-vmladavax.s16 r10, Q5, Q3
-vldrh.u16 Q6, [r1, #118]
-strh r8, [r0,#+328]
-ldrh r8, [r0,#+340]
-vmladavax.s16 r8, Q6, Q3
-vldrh.u16 Q7, [r1, #120]
-strh r6, [r0,#+330]
-ldrh r6, [r0,#+342]
-vmladavax.s16 r6, Q7, Q3
-vldrh.u16 Q0, [r1, #122]
-strh r4, [r0,#+332]
-ldrh r4, [r0,#+344]
-vmladavax.s16 r4, Q0, Q3
-vldrh.u16 Q1, [r1, #124]
-strh r14, [r0,#+334]
-ldrh r14, [r0,#+346]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q3, [r1, #126]
-vldrw.u32 Q4, [Q2, #220]
-strh r12, [r0,#+336]
-ldrh r12, [r0,#+348]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #128]
-strh r10, [r0,#+338]
-ldrh r10, [r0,#+350]
-vmladavax.s16 r10, Q5, Q4
-vldrh.u16 Q6, [r1, #130]
-strh r8, [r0,#+340]
-ldrh r8, [r0,#+352]
-vmladavax.s16 r8, Q6, Q4
-vldrh.u16 Q7, [r1, #132]
-strh r6, [r0,#+342]
-ldrh r6, [r0,#+354]
-vmladavax.s16 r6, Q7, Q4
-vldrh.u16 Q0, [r1, #134]
-strh r4, [r0,#+344]
-ldrh r4, [r0,#+356]
-vmladavax.s16 r4, Q0, Q4
-vldrh.u16 Q1, [r1, #136]
-strh r14, [r0,#+346]
-ldrh r14, [r0,#+358]
-vmladavax.s16 r14, Q1, Q4
-vldrh.u16 Q3, [r1, #138]
-strh r12, [r0,#+348]
-ldrh r12, [r0,#+360]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q4, [r1, #140]
-vldrw.u32 Q5, [Q2, #220]
-strh r10, [r0,#+350]
-ldrh r10, [r0,#+362]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #142]
-strh r8, [r0,#+352]
-ldrh r8, [r0,#+364]
-vmladavax.s16 r8, Q6, Q5
-vldrh.u16 Q7, [r1, #144]
-strh r6, [r0,#+354]
-ldrh r6, [r0,#+366]
-vmladavax.s16 r6, Q7, Q5
-vldrh.u16 Q0, [r1, #146]
-strh r4, [r0,#+356]
-ldrh r4, [r0,#+368]
-vmladavax.s16 r4, Q0, Q5
-vldrh.u16 Q1, [r1, #148]
-strh r14, [r0,#+358]
-ldrh r14, [r0,#+370]
-vmladavax.s16 r14, Q1, Q5
-vldrh.u16 Q3, [r1, #150]
-strh r12, [r0,#+360]
-ldrh r12, [r0,#+372]
-vmladavax.s16 r12, Q3, Q5
-vldrh.u16 Q4, [r1, #152]
-strh r10, [r0,#+362]
-ldrh r10, [r0,#+374]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q5, [r1, #154]
-vldrw.u32 Q6, [Q2, #220]
-strh r8, [r0,#+364]
-ldrh r8, [r0,#+376]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #156]
-strh r6, [r0,#+366]
-ldrh r6, [r0,#+378]
-vmladavax.s16 r6, Q7, Q6
-vldrh.u16 Q0, [r1, #158]
-strh r4, [r0,#+368]
-ldrh r4, [r0,#+380]
-vmladavax.s16 r4, Q0, Q6
-vldrh.u16 Q1, [r1, #160]
-strh r14, [r0,#+370]
-ldrh r14, [r0,#+382]
-vmladavax.s16 r14, Q1, Q6
-vldrh.u16 Q3, [r1, #162]
-strh r12, [r0,#+372]
-ldrh r12, [r0,#+384]
-vmladavax.s16 r12, Q3, Q6
-vldrh.u16 Q4, [r1, #164]
-strh r10, [r0,#+374]
-ldrh r10, [r0,#+386]
-vmladavax.s16 r10, Q4, Q6
-vldrh.u16 Q5, [r1, #166]
-strh r8, [r0,#+376]
-ldrh r8, [r0,#+388]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q6, [r1, #168]
-vldrw.u32 Q7, [Q2, #220]
-strh r6, [r0,#+378]
-ldrh r6, [r0,#+390]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #170]
-strh r4, [r0,#+380]
-ldrh r4, [r0,#+392]
-vmladavax.s16 r4, Q0, Q7
-vldrh.u16 Q1, [r1, #172]
-strh r14, [r0,#+382]
-ldrh r14, [r0,#+394]
-vmladavax.s16 r14, Q1, Q7
-vldrh.u16 Q3, [r1, #174]
-strh r12, [r0,#+384]
-ldrh r12, [r0,#+396]
-vmladavax.s16 r12, Q3, Q7
-vldrh.u16 Q4, [r1, #176]
-strh r10, [r0,#+386]
-ldrh r10, [r0,#+398]
-vmladavax.s16 r10, Q4, Q7
-vldrh.u16 Q5, [r1, #178]
-strh r8, [r0,#+388]
-ldrh r8, [r0,#+400]
-vmladavax.s16 r8, Q5, Q7
-vldrh.u16 Q6, [r1, #180]
-strh r6, [r0,#+390]
-ldrh r6, [r0,#+402]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q7, [r1, #182]
-vldrw.u32 Q0, [Q2, #220]
-strh r4, [r0,#+392]
-ldrh r4, [r0,#+404]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #184]
-strh r14, [r0,#+394]
-ldrh r14, [r0,#+406]
-vmladavax.s16 r14, Q1, Q0
-vldrh.u16 Q3, [r1, #186]
-strh r12, [r0,#+396]
-ldrh r12, [r0,#+408]
-vmladavax.s16 r12, Q3, Q0
-vldrh.u16 Q4, [r1, #188]
-strh r10, [r0,#+398]
-ldrh r10, [r0,#+410]
-vmladavax.s16 r10, Q4, Q0
-vldrh.u16 Q5, [r1, #190]
-strh r8, [r0,#+400]
-ldrh r8, [r0,#+412]
-vmladavax.s16 r8, Q5, Q0
-vldrh.u16 Q6, [r1, #192]
-strh r6, [r0,#+402]
-ldrh r6, [r0,#+414]
-vmladavax.s16 r6, Q6, Q0
-vldrh.u16 Q7, [r1, #194]
-strh r4, [r0,#+404]
-ldrh r4, [r0,#+416]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q0, [r1, #196]
-vldrw.u32 Q1, [Q2, #220]
-strh r14, [r0,#+406]
-ldrh r14, [r0,#+418]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #198]
-strh r12, [r0,#+408]
-ldrh r12, [r0,#+420]
-vmladavax.s16 r12, Q3, Q1
-vldrh.u16 Q4, [r1, #200]
-strh r10, [r0,#+410]
-ldrh r10, [r0,#+422]
-vmladavax.s16 r10, Q4, Q1
-vldrh.u16 Q5, [r1, #202]
-strh r8, [r0,#+412]
-ldrh r8, [r0,#+424]
-vmladavax.s16 r8, Q5, Q1
-vldrh.u16 Q6, [r1, #204]
-strh r6, [r0,#+414]
-ldrh r6, [r0,#+426]
-vmladavax.s16 r6, Q6, Q1
-vldrh.u16 Q7, [r1, #206]
-strh r4, [r0,#+416]
-ldrh r4, [r0,#+428] -vmladavax.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #208] -strh r14, [r0,#+418] -ldrh r14, [r0,#+430] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #210] -vldrw.u32 Q3, [Q2, #220] -strh r12, [r0,#+420] -ldrh r12, [r0,#+432] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #212] -strh r10, [r0,#+422] -ldrh r10, [r0,#+434] -vmladavax.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #214] -strh r8, [r0,#+424] -ldrh r8, [r0,#+436] -vmladavax.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #216] -strh r6, [r0,#+426] -ldrh r6, [r0,#+438] -vmladavax.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #218] -strh r4, [r0,#+428] -ldrh r4, [r0,#+440] -vmladavax.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #220] -strh r14, [r0,#+430] -ldrh r14, [r0,#+442] -vmladavax.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #222] -strh r12, [r0,#+432] -ldrh r12, [r0,#+444] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #224] -vldrw.u32 Q4, [Q2, #220] -strh r10, [r0,#+434] -ldrh r10, [r0,#+446] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #226] -strh r8, [r0,#+436] -ldrh r8, [r0,#+448] -vmladavax.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #228] -strh r6, [r0,#+438] -ldrh r6, [r0,#+450] -vmladavax.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #230] -strh r4, [r0,#+440] -ldrh r4, [r0,#+452] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #232] -strh r14, [r0,#+442] -ldrh r14, [r0,#+454] -vmladavax.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #234] -strh r12, [r0,#+444] -ldrh r12, [r0,#+456] -vmladavax.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #236] -strh r10, [r0,#+446] -ldrh r10, [r0,#+458] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #238] -vldrw.u32 Q5, [Q2, #220] -strh r8, [r0,#+448] -ldrh r8, [r0,#+460] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #240] -strh r6, [r0,#+450] -vmladavx.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #242] -strh r4, [r0,#+452] -vmladavx.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #244] -strh r14, [r0,#+454] -vmladavx.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #246] -strh r12, [r0,#+456] -vmladavx.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #248] -strh r10, [r0,#+458] 
-vmladavx.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #250] -strh r8, [r0,#+460] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #252] -vldrw.u32 Q6, [Q2, #220] -strh r6, [r0,#+462] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #254] -strh r4, [r0,#+464] -vmladavx.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #-14] -vldrw.u32 Q1, [Q2, #236] -strh r14, [r0,#+466] -ldrh r14, [r0,#+224] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -strh r12, [r0,#+468] -ldrh r12, [r0,#+226] -vmladavax.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -strh r10, [r0,#+470] -ldrh r10, [r0,#+228] -vmladavax.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -strh r8, [r0,#+472] -ldrh r8, [r0,#+230] -vmladavax.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -strh r6, [r0,#+474] -ldrh r6, [r0,#+232] -vmladavax.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -strh r4, [r0,#+476] -ldrh r4, [r0,#+234] -vmladavax.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+224] -ldrh r14, [r0,#+236] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #236] -strh r12, [r0,#+226] -ldrh r12, [r0,#+238] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -strh r10, [r0,#+228] -ldrh r10, [r0,#+240] -vmladavax.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #4] -strh r8, [r0,#+230] -ldrh r8, [r0,#+242] -vmladavax.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #6] -strh r6, [r0,#+232] -ldrh r6, [r0,#+244] -vmladavax.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #8] -strh r4, [r0,#+234] -ldrh r4, [r0,#+246] -vmladavax.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #10] -strh r14, [r0,#+236] -ldrh r14, [r0,#+248] -vmladavax.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #12] -strh r12, [r0,#+238] -ldrh r12, [r0,#+250] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #236] -strh r10, [r0,#+240] -ldrh r10, [r0,#+252] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r8, [r0,#+242] -ldrh r8, [r0,#+254] -vmladavax.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #18] -strh r6, [r0,#+244] -ldrh r6, [r0,#+256] -vmladavax.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #20] -strh r4, [r0,#+246] 
-ldrh r4, [r0,#+258] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #22] -strh r14, [r0,#+248] -ldrh r14, [r0,#+260] -vmladavax.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #24] -strh r12, [r0,#+250] -ldrh r12, [r0,#+262] -vmladavax.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #26] -strh r10, [r0,#+252] -ldrh r10, [r0,#+264] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #28] -vldrw.u32 Q5, [Q2, #236] -strh r8, [r0,#+254] -ldrh r8, [r0,#+266] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -strh r6, [r0,#+256] -ldrh r6, [r0,#+268] -vmladavax.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #32] -strh r4, [r0,#+258] -ldrh r4, [r0,#+270] -vmladavax.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #34] -strh r14, [r0,#+260] -ldrh r14, [r0,#+272] -vmladavax.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #36] -strh r12, [r0,#+262] -ldrh r12, [r0,#+274] -vmladavax.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #38] -strh r10, [r0,#+264] -ldrh r10, [r0,#+276] -vmladavax.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #40] -strh r8, [r0,#+266] -ldrh r8, [r0,#+278] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #236] -strh r6, [r0,#+268] -ldrh r6, [r0,#+280] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -strh r4, [r0,#+270] -ldrh r4, [r0,#+282] -vmladavax.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #46] -strh r14, [r0,#+272] -ldrh r14, [r0,#+284] -vmladavax.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #48] -strh r12, [r0,#+274] -ldrh r12, [r0,#+286] -vmladavax.s16 r12, Q1, Q6 -vldrh.u16 Q3, [r1, #50] -strh r10, [r0,#+276] -ldrh r10, [r0,#+288] -vmladavax.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #52] -strh r8, [r0,#+278] -ldrh r8, [r0,#+290] -vmladavax.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #54] -strh r6, [r0,#+280] -ldrh r6, [r0,#+292] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #236] -strh r4, [r0,#+282] -ldrh r4, [r0,#+294] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -strh r14, [r0,#+284] -ldrh r14, [r0,#+296] -vmladavax.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #60] -strh r12, [r0,#+286] -ldrh r12, [r0,#+298] 
-vmladavax.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #62] -strh r10, [r0,#+288] -ldrh r10, [r0,#+300] -vmladavax.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #64] -strh r8, [r0,#+290] -ldrh r8, [r0,#+302] -vmladavax.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #66] -strh r6, [r0,#+292] -ldrh r6, [r0,#+304] -vmladavax.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #68] -strh r4, [r0,#+294] -ldrh r4, [r0,#+306] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q7, [r1, #70] -vldrw.u32 Q0, [Q2, #236] -strh r14, [r0,#+296] -ldrh r14, [r0,#+308] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #72] -strh r12, [r0,#+298] -ldrh r12, [r0,#+310] -vmladavax.s16 r12, Q1, Q0 -vldrh.u16 Q3, [r1, #74] -strh r10, [r0,#+300] -ldrh r10, [r0,#+312] -vmladavax.s16 r10, Q3, Q0 -vldrh.u16 Q4, [r1, #76] -strh r8, [r0,#+302] -ldrh r8, [r0,#+314] -vmladavax.s16 r8, Q4, Q0 -vldrh.u16 Q5, [r1, #78] -strh r6, [r0,#+304] -ldrh r6, [r0,#+316] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #80] -strh r4, [r0,#+306] -ldrh r4, [r0,#+318] -vmladavax.s16 r4, Q6, Q0 -vldrh.u16 Q7, [r1, #82] -strh r14, [r0,#+308] -ldrh r14, [r0,#+320] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q0, [r1, #84] -vldrw.u32 Q1, [Q2, #236] -strh r12, [r0,#+310] -ldrh r12, [r0,#+322] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #86] -strh r10, [r0,#+312] -ldrh r10, [r0,#+324] -vmladavax.s16 r10, Q3, Q1 -vldrh.u16 Q4, [r1, #88] -strh r8, [r0,#+314] -ldrh r8, [r0,#+326] -vmladavax.s16 r8, Q4, Q1 -vldrh.u16 Q5, [r1, #90] -strh r6, [r0,#+316] -ldrh r6, [r0,#+328] -vmladavax.s16 r6, Q5, Q1 -vldrh.u16 Q6, [r1, #92] -strh r4, [r0,#+318] -ldrh r4, [r0,#+330] -vmladavax.s16 r4, Q6, Q1 -vldrh.u16 Q7, [r1, #94] -strh r14, [r0,#+320] -ldrh r14, [r0,#+332] -vmladavax.s16 r14, Q7, Q1 -vldrh.u16 Q0, [r1, #96] -strh r12, [r0,#+322] -ldrh r12, [r0,#+334] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q1, [r1, #98] -vldrw.u32 Q3, [Q2, #236] -strh r10, [r0,#+324] -ldrh r10, [r0,#+336] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #100] -strh r8, [r0,#+326] -ldrh r8, [r0,#+338] -vmladavax.s16 r8, Q4, Q3 
-vldrh.u16 Q5, [r1, #102] -strh r6, [r0,#+328] -ldrh r6, [r0,#+340] -vmladavax.s16 r6, Q5, Q3 -vldrh.u16 Q6, [r1, #104] -strh r4, [r0,#+330] -ldrh r4, [r0,#+342] -vmladavax.s16 r4, Q6, Q3 -vldrh.u16 Q7, [r1, #106] -strh r14, [r0,#+332] -ldrh r14, [r0,#+344] -vmladavax.s16 r14, Q7, Q3 -vldrh.u16 Q0, [r1, #108] -strh r12, [r0,#+334] -ldrh r12, [r0,#+346] -vmladavax.s16 r12, Q0, Q3 -vldrh.u16 Q1, [r1, #110] -strh r10, [r0,#+336] -ldrh r10, [r0,#+348] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q3, [r1, #112] -vldrw.u32 Q4, [Q2, #236] -strh r8, [r0,#+338] -ldrh r8, [r0,#+350] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #114] -strh r6, [r0,#+340] -ldrh r6, [r0,#+352] -vmladavax.s16 r6, Q5, Q4 -vldrh.u16 Q6, [r1, #116] -strh r4, [r0,#+342] -ldrh r4, [r0,#+354] -vmladavax.s16 r4, Q6, Q4 -vldrh.u16 Q7, [r1, #118] -strh r14, [r0,#+344] -ldrh r14, [r0,#+356] -vmladavax.s16 r14, Q7, Q4 -vldrh.u16 Q0, [r1, #120] -strh r12, [r0,#+346] -ldrh r12, [r0,#+358] -vmladavax.s16 r12, Q0, Q4 -vldrh.u16 Q1, [r1, #122] -strh r10, [r0,#+348] -ldrh r10, [r0,#+360] -vmladavax.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #124] -strh r8, [r0,#+350] -ldrh r8, [r0,#+362] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #126] -vldrw.u32 Q5, [Q2, #236] -strh r6, [r0,#+352] -ldrh r6, [r0,#+364] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #128] -strh r4, [r0,#+354] -ldrh r4, [r0,#+366] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #130] -strh r14, [r0,#+356] -ldrh r14, [r0,#+368] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #132] -strh r12, [r0,#+358] -ldrh r12, [r0,#+370] -vmladavax.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #134] -strh r10, [r0,#+360] -ldrh r10, [r0,#+372] -vmladavax.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #136] -strh r8, [r0,#+362] -ldrh r8, [r0,#+374] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #138] -strh r6, [r0,#+364] -ldrh r6, [r0,#+376] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q5, [r1, #140] -vldrw.u32 Q6, [Q2, #236] -strh r4, [r0,#+366] -ldrh r4, [r0,#+378] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, 
[r1, #142] -strh r14, [r0,#+368] -ldrh r14, [r0,#+380] -vmladavax.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #144] -strh r12, [r0,#+370] -ldrh r12, [r0,#+382] -vmladavax.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #146] -strh r10, [r0,#+372] -ldrh r10, [r0,#+384] -vmladavax.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #148] -strh r8, [r0,#+374] -ldrh r8, [r0,#+386] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #150] -strh r6, [r0,#+376] -ldrh r6, [r0,#+388] -vmladavax.s16 r6, Q4, Q6 -vldrh.u16 Q5, [r1, #152] -strh r4, [r0,#+378] -ldrh r4, [r0,#+390] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q6, [r1, #154] -vldrw.u32 Q7, [Q2, #236] -strh r14, [r0,#+380] -ldrh r14, [r0,#+392] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #156] -strh r12, [r0,#+382] -ldrh r12, [r0,#+394] -vmladavax.s16 r12, Q0, Q7 -vldrh.u16 Q1, [r1, #158] -strh r10, [r0,#+384] -ldrh r10, [r0,#+396] -vmladavax.s16 r10, Q1, Q7 -vldrh.u16 Q3, [r1, #160] -strh r8, [r0,#+386] -ldrh r8, [r0,#+398] -vmladavax.s16 r8, Q3, Q7 -vldrh.u16 Q4, [r1, #162] -strh r6, [r0,#+388] -ldrh r6, [r0,#+400] -vmladavax.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #164] -strh r4, [r0,#+390] -ldrh r4, [r0,#+402] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #166] -strh r14, [r0,#+392] -ldrh r14, [r0,#+404] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #168] -vldrw.u32 Q0, [Q2, #236] -strh r12, [r0,#+394] -ldrh r12, [r0,#+406] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #170] -strh r10, [r0,#+396] -ldrh r10, [r0,#+408] -vmladavax.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #172] -strh r8, [r0,#+398] -ldrh r8, [r0,#+410] -vmladavax.s16 r8, Q3, Q0 -vldrh.u16 Q4, [r1, #174] -strh r6, [r0,#+400] -ldrh r6, [r0,#+412] -vmladavax.s16 r6, Q4, Q0 -vldrh.u16 Q5, [r1, #176] -strh r4, [r0,#+402] -ldrh r4, [r0,#+414] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #178] -strh r14, [r0,#+404] -ldrh r14, [r0,#+416] -vmladavax.s16 r14, Q6, Q0 -vldrh.u16 Q7, [r1, #180] -strh r12, [r0,#+406] -ldrh r12, [r0,#+418] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #182] -vldrw.u32 Q1, [Q2, 
#236] -strh r10, [r0,#+408] -ldrh r10, [r0,#+420] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #184] -strh r8, [r0,#+410] -ldrh r8, [r0,#+422] -vmladavax.s16 r8, Q3, Q1 -vldrh.u16 Q4, [r1, #186] -strh r6, [r0,#+412] -ldrh r6, [r0,#+424] -vmladavax.s16 r6, Q4, Q1 -vldrh.u16 Q5, [r1, #188] -strh r4, [r0,#+414] -ldrh r4, [r0,#+426] -vmladavax.s16 r4, Q5, Q1 -vldrh.u16 Q6, [r1, #190] -strh r14, [r0,#+416] -ldrh r14, [r0,#+428] -vmladavax.s16 r14, Q6, Q1 -vldrh.u16 Q7, [r1, #192] -strh r12, [r0,#+418] -ldrh r12, [r0,#+430] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #194] -strh r10, [r0,#+420] -ldrh r10, [r0,#+432] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q1, [r1, #196] -vldrw.u32 Q3, [Q2, #236] -strh r8, [r0,#+422] -ldrh r8, [r0,#+434] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #198] -strh r6, [r0,#+424] -ldrh r6, [r0,#+436] -vmladavax.s16 r6, Q4, Q3 -vldrh.u16 Q5, [r1, #200] -strh r4, [r0,#+426] -ldrh r4, [r0,#+438] -vmladavax.s16 r4, Q5, Q3 -vldrh.u16 Q6, [r1, #202] -strh r14, [r0,#+428] -ldrh r14, [r0,#+440] -vmladavax.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #204] -strh r12, [r0,#+430] -ldrh r12, [r0,#+442] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #206] -strh r10, [r0,#+432] -ldrh r10, [r0,#+444] -vmladavax.s16 r10, Q0, Q3 -vldrh.u16 Q1, [r1, #208] -strh r8, [r0,#+434] -ldrh r8, [r0,#+446] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q3, [r1, #210] -vldrw.u32 Q4, [Q2, #236] -strh r6, [r0,#+436] -ldrh r6, [r0,#+448] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #212] -strh r4, [r0,#+438] -ldrh r4, [r0,#+450] -vmladavax.s16 r4, Q5, Q4 -vldrh.u16 Q6, [r1, #214] -strh r14, [r0,#+440] -ldrh r14, [r0,#+452] -vmladavax.s16 r14, Q6, Q4 -vldrh.u16 Q7, [r1, #216] -strh r12, [r0,#+442] -ldrh r12, [r0,#+454] -vmladavax.s16 r12, Q7, Q4 -vldrh.u16 Q0, [r1, #218] -strh r10, [r0,#+444] -ldrh r10, [r0,#+456] -vmladavax.s16 r10, Q0, Q4 -vldrh.u16 Q1, [r1, #220] -strh r8, [r0,#+446] -ldrh r8, [r0,#+458] -vmladavax.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #222] -strh r6, [r0,#+448] -ldrh r6, 
[r0,#+460] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q4, [r1, #224] -vldrw.u32 Q5, [Q2, #236] -strh r4, [r0,#+450] -ldrh r4, [r0,#+462] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #226] -strh r14, [r0,#+452] -ldrh r14, [r0,#+464] -vmladavax.s16 r14, Q6, Q5 -vldrh.u16 Q7, [r1, #228] -strh r12, [r0,#+454] -ldrh r12, [r0,#+466] -vmladavax.s16 r12, Q7, Q5 -vldrh.u16 Q0, [r1, #230] -strh r10, [r0,#+456] -ldrh r10, [r0,#+468] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #232] -strh r8, [r0,#+458] -ldrh r8, [r0,#+470] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #234] -strh r6, [r0,#+460] -ldrh r6, [r0,#+472] -vmladavax.s16 r6, Q3, Q5 -vldrh.u16 Q4, [r1, #236] -strh r4, [r0,#+462] -ldrh r4, [r0,#+474] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q5, [r1, #238] -vldrw.u32 Q6, [Q2, #236] -strh r14, [r0,#+464] -ldrh r14, [r0,#+476] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #240] -strh r12, [r0,#+466] -vmladavx.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #242] -strh r10, [r0,#+468] -vmladavx.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #244] -strh r8, [r0,#+470] -vmladavx.s16 r8, Q1, Q6 -vldrh.u16 Q3, [r1, #246] -strh r6, [r0,#+472] -vmladavx.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #248] -strh r4, [r0,#+474] -vmladavx.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #250] -strh r14, [r0,#+476] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #252] -vldrw.u32 Q7, [Q2, #236] -strh r12, [r0,#+478] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #254] -strh r10, [r0,#+480] -vmladavx.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #-14] -vldrw.u32 Q3, [Q2, #252] -strh r8, [r0,#+482] -ldrh r8, [r0,#+240] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #-12] -strh r6, [r0,#+484] -ldrh r6, [r0,#+242] -vmladavax.s16 r6, Q4, Q3 -vldrh.u16 Q5, [r1, #-10] -strh r4, [r0,#+486] -ldrh r4, [r0,#+244] -vmladavax.s16 r4, Q5, Q3 -vldrh.u16 Q6, [r1, #-8] -strh r14, [r0,#+488] -ldrh r14, [r0,#+246] -vmladavax.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #-6] -strh r12, [r0,#+490] -ldrh r12, [r0,#+248] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #-4] -strh 
r10, [r0,#+492] -ldrh r10, [r0,#+250] -vmladavax.s16 r10, Q0, Q3 -vldrh.u16 Q1, [r1, #-2] -strh r8, [r0,#+240] -ldrh r8, [r0,#+252] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q3, [r1, #0] -vldrw.u32 Q4, [Q2, #252] -strh r6, [r0,#+242] -ldrh r6, [r0,#+254] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #2] -strh r4, [r0,#+244] -ldrh r4, [r0,#+256] -vmladavax.s16 r4, Q5, Q4 -vldrh.u16 Q6, [r1, #4] -strh r14, [r0,#+246] -ldrh r14, [r0,#+258] -vmladavax.s16 r14, Q6, Q4 -vldrh.u16 Q7, [r1, #6] -strh r12, [r0,#+248] -ldrh r12, [r0,#+260] -vmladavax.s16 r12, Q7, Q4 -vldrh.u16 Q0, [r1, #8] -strh r10, [r0,#+250] -ldrh r10, [r0,#+262] -vmladavax.s16 r10, Q0, Q4 -vldrh.u16 Q1, [r1, #10] -strh r8, [r0,#+252] -ldrh r8, [r0,#+264] -vmladavax.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #12] -strh r6, [r0,#+254] -ldrh r6, [r0,#+266] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q4, [r1, #14] -vldrw.u32 Q5, [Q2, #252] -strh r4, [r0,#+256] -ldrh r4, [r0,#+268] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #16] -strh r14, [r0,#+258] -ldrh r14, [r0,#+270] -vmladavax.s16 r14, Q6, Q5 -vldrh.u16 Q7, [r1, #18] -strh r12, [r0,#+260] -ldrh r12, [r0,#+272] -vmladavax.s16 r12, Q7, Q5 -vldrh.u16 Q0, [r1, #20] -strh r10, [r0,#+262] -ldrh r10, [r0,#+274] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #22] -strh r8, [r0,#+264] -ldrh r8, [r0,#+276] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #24] -strh r6, [r0,#+266] -ldrh r6, [r0,#+278] -vmladavax.s16 r6, Q3, Q5 -vldrh.u16 Q4, [r1, #26] -strh r4, [r0,#+268] -ldrh r4, [r0,#+280] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q5, [r1, #28] -vldrw.u32 Q6, [Q2, #252] -strh r14, [r0,#+270] -ldrh r14, [r0,#+282] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -strh r12, [r0,#+272] -ldrh r12, [r0,#+284] -vmladavax.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #32] -strh r10, [r0,#+274] -ldrh r10, [r0,#+286] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #34] -strh r8, [r0,#+276] -ldrh r8, [r0,#+288] -vmladavax.s16 r8, Q1, Q6 -vldrh.u16 Q3, [r1, #36] -strh r6, [r0,#+278] -ldrh r6, [r0,#+290] 
-vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #38] -strh r4, [r0,#+280] -ldrh r4, [r0,#+292] -vmladavax.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #40] -strh r14, [r0,#+282] -ldrh r14, [r0,#+294] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #252] -strh r12, [r0,#+284] -ldrh r12, [r0,#+296] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #44] -strh r10, [r0,#+286] -ldrh r10, [r0,#+298] -vmladavax.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #46] -strh r8, [r0,#+288] -ldrh r8, [r0,#+300] -vmladavax.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #48] -strh r6, [r0,#+290] -ldrh r6, [r0,#+302] -vmladavax.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #50] -strh r4, [r0,#+292] -ldrh r4, [r0,#+304] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #52] -strh r14, [r0,#+294] -ldrh r14, [r0,#+306] -vmladavax.s16 r14, Q5, Q7 -vldrh.u16 Q6, [r1, #54] -strh r12, [r0,#+296] -ldrh r12, [r0,#+308] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #56] -vldrw.u32 Q0, [Q2, #252] -strh r10, [r0,#+298] -ldrh r10, [r0,#+310] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #58] -strh r8, [r0,#+300] -ldrh r8, [r0,#+312] -vmladavax.s16 r8, Q1, Q0 -vldrh.u16 Q3, [r1, #60] -strh r6, [r0,#+302] -ldrh r6, [r0,#+314] -vmladavax.s16 r6, Q3, Q0 -vldrh.u16 Q4, [r1, #62] -strh r4, [r0,#+304] -ldrh r4, [r0,#+316] -vmladavax.s16 r4, Q4, Q0 -vldrh.u16 Q5, [r1, #64] -strh r14, [r0,#+306] -ldrh r14, [r0,#+318] -vmladavax.s16 r14, Q5, Q0 -vldrh.u16 Q6, [r1, #66] -strh r12, [r0,#+308] -ldrh r12, [r0,#+320] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #68] -strh r10, [r0,#+310] -ldrh r10, [r0,#+322] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #70] -vldrw.u32 Q1, [Q2, #252] -strh r8, [r0,#+312] -ldrh r8, [r0,#+324] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #72] -strh r6, [r0,#+314] -ldrh r6, [r0,#+326] -vmladavax.s16 r6, Q3, Q1 -vldrh.u16 Q4, [r1, #74] -strh r4, [r0,#+316] -ldrh r4, [r0,#+328] -vmladavax.s16 r4, Q4, Q1 -vldrh.u16 Q5, [r1, #76] -strh r14, [r0,#+318] -ldrh r14, [r0,#+330] -vmladavax.s16 r14, Q5, Q1 
-vldrh.u16 Q6, [r1, #78] -strh r12, [r0,#+320] -ldrh r12, [r0,#+332] -vmladavax.s16 r12, Q6, Q1 -vldrh.u16 Q7, [r1, #80] -strh r10, [r0,#+322] -ldrh r10, [r0,#+334] -vmladavax.s16 r10, Q7, Q1 -vldrh.u16 Q0, [r1, #82] -strh r8, [r0,#+324] -ldrh r8, [r0,#+336] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q1, [r1, #84] -vldrw.u32 Q3, [Q2, #252] -strh r6, [r0,#+326] -ldrh r6, [r0,#+338] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #86] -strh r4, [r0,#+328] -ldrh r4, [r0,#+340] -vmladavax.s16 r4, Q4, Q3 -vldrh.u16 Q5, [r1, #88] -strh r14, [r0,#+330] -ldrh r14, [r0,#+342] -vmladavax.s16 r14, Q5, Q3 -vldrh.u16 Q6, [r1, #90] -strh r12, [r0,#+332] -ldrh r12, [r0,#+344] -vmladavax.s16 r12, Q6, Q3 -vldrh.u16 Q7, [r1, #92] -strh r10, [r0,#+334] -ldrh r10, [r0,#+346] -vmladavax.s16 r10, Q7, Q3 -vldrh.u16 Q0, [r1, #94] -strh r8, [r0,#+336] -ldrh r8, [r0,#+348] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #96] -strh r6, [r0,#+338] -ldrh r6, [r0,#+350] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q3, [r1, #98] -vldrw.u32 Q4, [Q2, #252] -strh r4, [r0,#+340] -ldrh r4, [r0,#+352] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #100] -strh r14, [r0,#+342] -ldrh r14, [r0,#+354] -vmladavax.s16 r14, Q5, Q4 -vldrh.u16 Q6, [r1, #102] -strh r12, [r0,#+344] -ldrh r12, [r0,#+356] -vmladavax.s16 r12, Q6, Q4 -vldrh.u16 Q7, [r1, #104] -strh r10, [r0,#+346] -ldrh r10, [r0,#+358] -vmladavax.s16 r10, Q7, Q4 -vldrh.u16 Q0, [r1, #106] -strh r8, [r0,#+348] -ldrh r8, [r0,#+360] -vmladavax.s16 r8, Q0, Q4 -vldrh.u16 Q1, [r1, #108] -strh r6, [r0,#+350] -ldrh r6, [r0,#+362] -vmladavax.s16 r6, Q1, Q4 -vldrh.u16 Q3, [r1, #110] -strh r4, [r0,#+352] -ldrh r4, [r0,#+364] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q4, [r1, #112] -vldrw.u32 Q5, [Q2, #252] -strh r14, [r0,#+354] -ldrh r14, [r0,#+366] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #114] -strh r12, [r0,#+356] -ldrh r12, [r0,#+368] -vmladavax.s16 r12, Q6, Q5 -vldrh.u16 Q7, [r1, #116] -strh r10, [r0,#+358] -ldrh r10, [r0,#+370] -vmladavax.s16 r10, Q7, Q5 -vldrh.u16 Q0, [r1, 
#118] -strh r8, [r0,#+360] -ldrh r8, [r0,#+372] -vmladavax.s16 r8, Q0, Q5 -vldrh.u16 Q1, [r1, #120] -strh r6, [r0,#+362] -ldrh r6, [r0,#+374] -vmladavax.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #122] -strh r4, [r0,#+364] -ldrh r4, [r0,#+376] -vmladavax.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #124] -strh r14, [r0,#+366] -ldrh r14, [r0,#+378] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #126] -vldrw.u32 Q6, [Q2, #252] -strh r12, [r0,#+368] -ldrh r12, [r0,#+380] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #128] -strh r10, [r0,#+370] -ldrh r10, [r0,#+382] -vmladavax.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #130] -strh r8, [r0,#+372] -ldrh r8, [r0,#+384] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #132] -strh r6, [r0,#+374] -ldrh r6, [r0,#+386] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #134] -strh r4, [r0,#+376] -ldrh r4, [r0,#+388] -vmladavax.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #136] -strh r14, [r0,#+378] -ldrh r14, [r0,#+390] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #138] -strh r12, [r0,#+380] -ldrh r12, [r0,#+392] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #140] -vldrw.u32 Q7, [Q2, #252] -strh r10, [r0,#+382] -ldrh r10, [r0,#+394] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #142] -strh r8, [r0,#+384] -ldrh r8, [r0,#+396] -vmladavax.s16 r8, Q0, Q7 -vldrh.u16 Q1, [r1, #144] -strh r6, [r0,#+386] -ldrh r6, [r0,#+398] -vmladavax.s16 r6, Q1, Q7 -vldrh.u16 Q3, [r1, #146] -strh r4, [r0,#+388] -ldrh r4, [r0,#+400] -vmladavax.s16 r4, Q3, Q7 -vldrh.u16 Q4, [r1, #148] -strh r14, [r0,#+390] -ldrh r14, [r0,#+402] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #150] -strh r12, [r0,#+392] -ldrh r12, [r0,#+404] -vmladavax.s16 r12, Q5, Q7 -vldrh.u16 Q6, [r1, #152] -strh r10, [r0,#+394] -ldrh r10, [r0,#+406] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q7, [r1, #154] -vldrw.u32 Q0, [Q2, #252] -strh r8, [r0,#+396] -ldrh r8, [r0,#+408] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #156] -strh r6, [r0,#+398] -ldrh r6, [r0,#+410] -vmladavax.s16 r6, Q1, Q0 -vldrh.u16 Q3, [r1, #158] -strh 
r4, [r0,#+400] -ldrh r4, [r0,#+412] -vmladavax.s16 r4, Q3, Q0 -vldrh.u16 Q4, [r1, #160] -strh r14, [r0,#+402] -ldrh r14, [r0,#+414] -vmladavax.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #162] -strh r12, [r0,#+404] -ldrh r12, [r0,#+416] -vmladavax.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #164] -strh r10, [r0,#+406] -ldrh r10, [r0,#+418] -vmladavax.s16 r10, Q6, Q0 -vldrh.u16 Q7, [r1, #166] -strh r8, [r0,#+408] -ldrh r8, [r0,#+420] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q0, [r1, #168] -vldrw.u32 Q1, [Q2, #252] -strh r6, [r0,#+410] -ldrh r6, [r0,#+422] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #170] -strh r4, [r0,#+412] -ldrh r4, [r0,#+424] -vmladavax.s16 r4, Q3, Q1 -vldrh.u16 Q4, [r1, #172] -strh r14, [r0,#+414] -ldrh r14, [r0,#+426] -vmladavax.s16 r14, Q4, Q1 -vldrh.u16 Q5, [r1, #174] -strh r12, [r0,#+416] -ldrh r12, [r0,#+428] -vmladavax.s16 r12, Q5, Q1 -vldrh.u16 Q6, [r1, #176] -strh r10, [r0,#+418] -ldrh r10, [r0,#+430] -vmladavax.s16 r10, Q6, Q1 -vldrh.u16 Q7, [r1, #178] -strh r8, [r0,#+420] -ldrh r8, [r0,#+432] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #180] -strh r6, [r0,#+422] -ldrh r6, [r0,#+434] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q1, [r1, #182] -vldrw.u32 Q3, [Q2, #252] -strh r4, [r0,#+424] -ldrh r4, [r0,#+436] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #184] -strh r14, [r0,#+426] -ldrh r14, [r0,#+438] -vmladavax.s16 r14, Q4, Q3 -vldrh.u16 Q5, [r1, #186] -strh r12, [r0,#+428] -ldrh r12, [r0,#+440] -vmladavax.s16 r12, Q5, Q3 -vldrh.u16 Q6, [r1, #188] -strh r10, [r0,#+430] -ldrh r10, [r0,#+442] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #190] -strh r8, [r0,#+432] -ldrh r8, [r0,#+444] -vmladavax.s16 r8, Q7, Q3 -vldrh.u16 Q0, [r1, #192] -strh r6, [r0,#+434] -ldrh r6, [r0,#+446] -vmladavax.s16 r6, Q0, Q3 -vldrh.u16 Q1, [r1, #194] -strh r4, [r0,#+436] -ldrh r4, [r0,#+448] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #196] -vldrw.u32 Q4, [Q2, #252] -strh r14, [r0,#+438] -ldrh r14, [r0,#+450] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #198] -strh r12, 
[r0,#+440] -ldrh r12, [r0,#+452] -vmladavax.s16 r12, Q5, Q4 -vldrh.u16 Q6, [r1, #200] -strh r10, [r0,#+442] -ldrh r10, [r0,#+454] -vmladavax.s16 r10, Q6, Q4 -vldrh.u16 Q7, [r1, #202] -strh r8, [r0,#+444] -ldrh r8, [r0,#+456] -vmladavax.s16 r8, Q7, Q4 -vldrh.u16 Q0, [r1, #204] -strh r6, [r0,#+446] -ldrh r6, [r0,#+458] -vmladavax.s16 r6, Q0, Q4 -vldrh.u16 Q1, [r1, #206] -strh r4, [r0,#+448] -ldrh r4, [r0,#+460] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #208] -strh r14, [r0,#+450] -ldrh r14, [r0,#+462] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #210] -vldrw.u32 Q5, [Q2, #252] -strh r12, [r0,#+452] -ldrh r12, [r0,#+464] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #212] -strh r10, [r0,#+454] -ldrh r10, [r0,#+466] -vmladavax.s16 r10, Q6, Q5 -vldrh.u16 Q7, [r1, #214] -strh r8, [r0,#+456] -ldrh r8, [r0,#+468] -vmladavax.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #216] -strh r6, [r0,#+458] -ldrh r6, [r0,#+470] -vmladavax.s16 r6, Q0, Q5 -vldrh.u16 Q1, [r1, #218] -strh r4, [r0,#+460] -ldrh r4, [r0,#+472] -vmladavax.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #220] -strh r14, [r0,#+462] -ldrh r14, [r0,#+474] -vmladavax.s16 r14, Q3, Q5 -vldrh.u16 Q4, [r1, #222] -strh r12, [r0,#+464] -ldrh r12, [r0,#+476] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q5, [r1, #224] -vldrw.u32 Q6, [Q2, #252] -strh r10, [r0,#+466] -ldrh r10, [r0,#+478] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #226] -strh r8, [r0,#+468] -ldrh r8, [r0,#+480] -vmladavax.s16 r8, Q7, Q6 -vldrh.u16 Q0, [r1, #228] -strh r6, [r0,#+470] -ldrh r6, [r0,#+482] -vmladavax.s16 r6, Q0, Q6 -vldrh.u16 Q1, [r1, #230] -strh r4, [r0,#+472] -ldrh r4, [r0,#+484] -vmladavax.s16 r4, Q1, Q6 -vldrh.u16 Q3, [r1, #232] -strh r14, [r0,#+474] -ldrh r14, [r0,#+486] -vmladavax.s16 r14, Q3, Q6 -vldrh.u16 Q4, [r1, #234] -strh r12, [r0,#+476] -ldrh r12, [r0,#+488] -vmladavax.s16 r12, Q4, Q6 -vldrh.u16 Q5, [r1, #236] -strh r10, [r0,#+478] -ldrh r10, [r0,#+490] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q6, [r1, #238] -vldrw.u32 Q7, [Q2, #252] -strh r8, 
[r0,#+480] -ldrh r8, [r0,#+492] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #240] -strh r6, [r0,#+482] -vmladavx.s16 r6, Q0, Q7 -vldrh.u16 Q1, [r1, #242] -strh r4, [r0,#+484] -vmladavx.s16 r4, Q1, Q7 -vldrh.u16 Q3, [r1, #244] -strh r14, [r0,#+486] -vmladavx.s16 r14, Q3, Q7 -vldrh.u16 Q4, [r1, #246] -strh r12, [r0,#+488] -vmladavx.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #248] -strh r10, [r0,#+490] -vmladavx.s16 r10, Q5, Q7 -vldrh.u16 Q6, [r1, #250] -strh r8, [r0,#+492] -vmladavx.s16 r8, Q6, Q7 -vldrh.u16 Q7, [r1, #252] -vldrw.u32 Q0, [Q2, #252] -strh r6, [r0,#+494] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #254] -strh r4, [r0,#+496] -vmladavx.s16 r4, Q1, Q0 -strh r14, [r0,#+498] -strh r12, [r0,#+500] -strh r10, [r0,#+502] -strh r8, [r0,#+504] -strh r6, [r0,#+506] -strh r4, [r0,#+508] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_16_anticyclic_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_16_anticyclic_mve_simd.s deleted file mode 100644 index cf128b4..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_16_anticyclic_mve_simd.s +++ /dev/null @@ -1,108 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_16_anticyclic_mve_simd, %function -.global poly_u16_mul_16_anticyclic_mve_simd -poly_u16_mul_16_anticyclic_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -vmov.u16 Q2, #0 -mov r12, #0 -vldrh.u16 Q3, [r2, #(2 * 0)] -vldrh.u16 Q4, [r2, #(2 * 8)] -vneg.s16 Q5, Q3 -ldrd r14, r11, [r1, #8] -ldrd r10, r9, [r1, #24] -vmul.u16 Q0, Q4, r14 -vmla.s16 Q0, Q3, r10 -vmul.u16 Q1, Q4, r10 -vmla.s16 Q1, Q5, r14 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -ldrd r8, r7, [r1, #0] -ldrd r6, r5, [r1, #16] -vmla.s16 Q0, Q4, r7 -vmla.s16 Q0, Q3, r5 -vmla.s16 Q1, Q4, r5 -vmla.s16 Q1, Q5, r7 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vmla.s16 Q0, Q4, r8 -vmla.s16 Q0, Q3, r6 -vmla.s16 Q1, Q4, r6 -vmla.s16 Q1, Q5, r8 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -vneg.s16 Q5, Q4 -vmla.s16 Q0, Q3, r11 -vmla.s16 Q0, Q5, r9 -vmla.s16 Q1, Q4, r11 -vmla.s16 Q1, Q3, r9 -vsub.u16 Q0, Q0, Q2 -vmov.u16 Q2, #0 -vshlc Q0, r12, #16 -vshlc Q1, r12, #16 -vshlc Q2, r12, #16 -asrl r14, r11, #16 -asrl r10, r9, #16 -vmla.s16 Q0, Q3, r14 -vmla.s16 Q0, Q5, r10 -vmla.s16 Q1, Q4, r14 -vmla.s16 Q1, Q3, r10 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -asrl r8, r7, #16 -asrl r6, r5, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q0, Q5, r5 -vmla.s16 Q1, Q4, r7 -vmla.s16 Q1, Q3, r5 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 
-vmla.s16 Q0, Q3, r8 -vmla.s16 Q0, Q5, r6 -vmla.s16 Q1, Q4, r8 -vmla.s16 Q1, Q3, r6 -vshlc Q0, r12, #32 -vshlc Q1, r12, #32 -vshlc Q2, r12, #32 -neg r9, r9 -vmla.s16 Q0, Q3, r9 -vmla.s16 Q0, Q5, r11 -vmla.s16 Q1, Q4, r9 -vmla.s16 Q1, Q3, r11 -vsub.u16 Q0, Q0, Q2 -vstrh.u16 Q0, [r0,#(0)] -vstrh.u16 Q1, [r0,#(16)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s deleted file mode 100644 index 4728c69..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_16_anticyclic_opt_mve_simd.s +++ /dev/null @@ -1,425 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_16_anticyclic_opt_mve_simd, %function -.global poly_u16_mul_16_anticyclic_opt_mve_simd -poly_u16_mul_16_anticyclic_opt_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -vldrh.u16 Q2, [r0, #(2 * 0)] -vldrh.u16 Q3, [r0, #(2 * 8)] -ldrd r10, r11, [r1, #0] -ldrd r8, r9, [r1, #16] -ldrd r6, r7, [r1, #24] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2, #(2 * 8)] -vmla.s16 Q3, Q1, r6 -ldrd r4, r5, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r9 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r11 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r7 -asrl r4, r5, #16 -asrl r6, r7, #16 -asrl r10, r11, #16 -asrl r8, r9, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r5 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r11 -vmla.s16 Q3, Q0, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r8 -vstrh.u16 Q3, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -mov r14, #30 -wls r14, r14, loop_end -loop_start: -mov r11, r11 -mov r11, r11 -mov r11, r11 -vldrh.u16 Q5, [r0, #(2 * 0)] -vldrh.u16 Q4, [r0, #(2 * 8)] -ldrd r10, r9, 
[r1, #0] -ldrd r8, r7, [r1, #16] -ldrd r6, r5, [r1, #24] -vldrh.u16 Q7, [r2, #(2 * 0)] -vmla.s16 Q5, Q7, r6 -vldrh.u16 Q6, [r2, #(2 * 8)] -vmla.s16 Q4, Q6, r6 -ldrd r4, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q5, Q6, r4 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r4 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r8 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r10 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r5 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r3 -vsub.u16 Q5, Q5, Q3 -vmla.s16 Q4, Q7, r5 -asrl r4, r3, #16 -asrl r6, r5, #16 -asrl r10, r9, #16 -asrl r8, r7, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r5 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r3 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r5 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r4 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r6 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r4 -vmla.s16 Q4, Q7, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r9 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0,#(0)] -vmla.s16 Q4, Q7, r8 -vstrh.u16 Q4, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -vldrh.u16 Q2, [r0, #(2 * 0)] -vldrh.u16 Q3, [r0, #(2 * 8)] -ldrd r10, r9, [r1, #0] -ldrd r8, r7, [r1, #16] -ldrd r6, r5, [r1, #24] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2, #(2 * 8)] -vmla.s16 Q3, Q1, r6 -ldrd r4, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r7 
-vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r5 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r5 -asrl r4, r3, #16 -asrl r6, r5, #16 -asrl r10, r9, #16 -asrl r8, r7, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r5 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r3 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r9 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r8 -vstrh.u16 Q3, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -le r14, loop_start -loop_end: -vldrh.u16 Q5, [r0, #(2 * 0)] -vldrh.u16 Q4, [r0, #(2 * 8)] -ldrd r14, r9, [r1, #0] -ldrd r10, r7, [r1, #16] -ldrd r8, r5, [r1, #24] -vldrh.u16 Q7, [r2, #(2 * 0)] -vmla.s16 Q5, Q7, r8 -vldrh.u16 Q6, [r2, #(2 * 8)] -vmla.s16 Q4, Q6, r8 -ldrd r6, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q5, Q6, r6 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r14 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r5 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r3 -vsub.u16 Q5, Q5, Q3 
-vmla.s16 Q4, Q7, r5 -asrl r6, r3, #16 -asrl r8, r5, #16 -asrl r14, r9, #16 -asrl r10, r7, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r5 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r3 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r5 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r6 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r6 -vmla.s16 Q4, Q7, r8 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r9 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r9 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r14 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0,#(0)] -vmla.s16 Q4, Q7, r10 -vstrh.u16 Q4, [r0,#(16)] -add r1, r1, #32 -add r2, r2, #32 -add r0, r0, #32 -vldrh.u16 Q2, [r0, #(2 * 0)] -vldrh.u16 Q3, [r0, #(2 * 8)] -ldrd r14, r9, [r1, #0] -ldrd r10, r7, [r1, #16] -ldrd r8, r5, [r1, #24] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmla.s16 Q2, Q0, r8 -vldrh.u16 Q1, [r2, #(2 * 8)] -vmla.s16 Q3, Q1, r8 -ldrd r6, r3, [r1, #8] -mov r12, #0 -vmla.s16 Q2, Q1, r6 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r14 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r5 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r5 -asrl r6, r3, #16 -asrl r8, r5, #16 -asrl r14, r9, #16 -asrl r10, r7, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r5 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r3 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, 
Q1, r6 -vmla.s16 Q3, Q0, r8 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r9 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r14 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0,#(0)] -vmla.s16 Q3, Q0, r10 -vstrh.u16 Q3, [r0,#(16)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_16_mve_comba.s b/tests/schoolbook/auto/poly_u16_mul_16_mve_comba.s deleted file mode 100644 index caaa67a..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_16_mve_comba.s +++ /dev/null @@ -1,140 +0,0 @@ -.syntax unified -.type poly_u16_mul_16_comba_mve, %function -.global poly_u16_mul_16_comba_mve -poly_u16_mul_16_comba_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #28] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #-12] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #4] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #-10] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #6] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #28] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q6 
-vldrh.u16 Q1, [r1, #10] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #-4] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #12] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #28] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #0] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #16] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #2] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #18] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #4] -vldrw.u32 Q0, [Q2, #28] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #20] -vmladavax.s16 r4, Q1, Q6 -vldrh.u16 Q3, [r1, #6] -strh r14, [r0,#+24] -vmladavx.s16 r14, Q3, Q0 -vldrh.u16 Q4, [r1, #22] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #8] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #24] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #10] -vldrw.u32 Q1, [Q2, #28] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #26] -vmladavax.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #12] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q4, Q1 -vldrh.u16 Q5, [r1, #28] -vmladavax.s16 r8, Q5, Q7 -vldrh.u16 Q6, [r1, #14] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #30] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #28] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #18] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q4, Q3 -vldrh.u16 Q5, [r1, #20] -strh r12, [r0,#+38] -vmladavx.s16 r12, Q5, Q3 -vldrh.u16 Q6, [r1, #22] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #24] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q7, Q3 -vldrh.u16 Q0, [r1, #26] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q0, Q3 -vldrh.u16 Q1, [r1, #28] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #30] 
-vldrw.u32 Q4, [Q2, #28] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q3, Q4 -strh r14, [r0,#+60] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_16_mve_schoolbook.s b/tests/schoolbook/auto/poly_u16_mul_16_mve_schoolbook.s deleted file mode 100644 index 3851c54..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_16_mve_schoolbook.s +++ /dev/null @@ -1,169 +0,0 @@ -.syntax unified -.type poly_u16_mul_16_schoolbook_mve, %function -.global poly_u16_mul_16_schoolbook_mve -poly_u16_mul_16_schoolbook_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #4] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #6] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #8] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #10] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #12] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #12] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #18] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #20] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #22] -strh r14, [r0,#+24] -vmladavx.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #24] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #26] -strh r10, 
[r0,#+28] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #28] -vldrw.u32 Q5, [Q2, #12] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #28] -strh r4, [r0,#+34] -ldrh r4, [r0,#+16] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #-12] -strh r14, [r0,#+36] -ldrh r14, [r0,#+18] -vmladavax.s16 r14, Q1, Q0 -vldrh.u16 Q3, [r1, #-10] -strh r12, [r0,#+38] -ldrh r12, [r0,#+20] -vmladavax.s16 r12, Q3, Q0 -vldrh.u16 Q4, [r1, #-8] -strh r10, [r0,#+40] -ldrh r10, [r0,#+22] -vmladavax.s16 r10, Q4, Q0 -vldrh.u16 Q5, [r1, #-6] -strh r8, [r0,#+42] -ldrh r8, [r0,#+24] -vmladavax.s16 r8, Q5, Q0 -vldrh.u16 Q6, [r1, #-4] -strh r6, [r0,#+44] -ldrh r6, [r0,#+26] -vmladavax.s16 r6, Q6, Q0 -vldrh.u16 Q7, [r1, #-2] -strh r4, [r0,#+16] -ldrh r4, [r0,#+28] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q0, [r1, #0] -vldrw.u32 Q1, [Q2, #28] -strh r14, [r0,#+18] -ldrh r14, [r0,#+30] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #2] -strh r12, [r0,#+20] -ldrh r12, [r0,#+32] -vmladavax.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #4] -strh r10, [r0,#+22] -ldrh r10, [r0,#+34] -vmladavax.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #6] -strh r8, [r0,#+24] -ldrh r8, [r0,#+36] -vmladavax.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #8] -strh r6, [r0,#+26] -ldrh r6, [r0,#+38] -vmladavax.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #10] -strh r4, [r0,#+28] -ldrh r4, [r0,#+40] -vmladavax.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #12] -strh r14, [r0,#+30] -ldrh r14, [r0,#+42] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #14] -vldrw.u32 Q3, [Q2, #28] -strh r12, [r0,#+32] -ldrh r12, [r0,#+44] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #16] -strh r10, [r0,#+34] -vmladavx.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #18] -strh r8, [r0,#+36] -vmladavx.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #20] -strh r6, [r0,#+38] -vmladavx.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #22] -strh r4, [r0,#+40] -vmladavx.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #24] -strh r14, [r0,#+42] 
-vmladavx.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #26] -strh r12, [r0,#+44] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #28] -strh r10, [r0,#+46] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #30] -strh r8, [r0,#+48] -vmladavx.s16 r8, Q5, Q4 -strh r6, [r0,#+50] -strh r4, [r0,#+52] -strh r14, [r0,#+54] -strh r12, [r0,#+56] -strh r10, [r0,#+58] -strh r8, [r0,#+60] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_16_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_16_mve_simd.s deleted file mode 100644 index cac2bab..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_16_mve_simd.s +++ /dev/null @@ -1,178 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_16_mve_simd, %function -.global poly_u16_mul_16_mve_simd -poly_u16_mul_16_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmul.u16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vmul.u16 Q1, Q0, r11 -vldrh.u16 Q2, [r2, #(2 * 8)] -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vmul.u16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vmov.u16 Q1, #0 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] 
-vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q2, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s deleted file mode 100644 index f0fd03e..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd.s +++ /dev/null @@ -1,773 +0,0 @@ -/// 
-/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd, %function -.global poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd -poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #224 -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add sp, sp, #224 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s deleted file mode 100644 index 43aefad..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_mve_simd.s +++ /dev/null @@ -1,749 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal 
-/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd, %function -.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd -poly_u16_mul_32_anticyclic_karatsuba_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #224 -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, 
r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, 
r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -mov r12, #0 -mov r11, sp -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, 
r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add sp, sp, #224 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s deleted file mode 100644 index 6724253..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_opt_mve_simd.s +++ /dev/null @@ -1,274 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit 
persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_opt_mve_simd, %function -.global poly_u16_mul_32_anticyclic_opt_mve_simd -poly_u16_mul_32_anticyclic_opt_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -ldrh r14, [r1, #30] -vldrh.u16 Q4, [r2, #(2 * 0)] -vmul.u16 Q1, Q4, r14 -ldrh r11, [r1, #46] -vldrh.u16 Q5, [r2, #(2 * 8)] -vmul.u16 Q2, Q4, r11 -ldrh r10, [r1, #62] -vldrh.u16 Q6, [r2, #(2 * 16)] -vmul.u16 Q3, Q4, r10 -ldrh r9, [r1, #14] -vldrh.u16 Q7, [r2, #(2 * 24)] -vmla.s16 Q3, Q5, r11 -neg r10, r10 -vmla.s16 Q2, Q5, r14 -neg r11, r11 -vmla.s16 Q3, Q6, r14 -neg r14, r14 -vmla.s16 Q3, Q7, r9 -ldrh r8, [r1, #12] -vmul.u16 Q0, Q7, r14 -ldrh r7, [r1, #28] -vmla.s16 Q0, Q6, r11 -ldrh r6, [r1, #44] -vmla.s16 Q0, Q5, r10 -ldrh r5, [r1, #60] -vmla.s16 Q0, Q4, r9 -vmla.s16 Q1, Q5, r9 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r11 -vmla.s16 Q1, Q6, r10 -neg r12, r12 -vmla.s16 Q2, Q6, r9 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, 
Q7, r8 -ldrh r14, [r1, #10] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #26] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #42] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #58] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r6 -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r5 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r9 -vmla.s16 Q3, Q5, r10 -neg r9, r9 -vmla.s16 Q2, Q5, r11 -neg r10, r10 -vmla.s16 Q3, Q6, r11 -neg r11, r11 -vmla.s16 Q3, Q7, r14 -ldrh r8, [r1, #8] -vmla.s16 Q0, Q7, r11 -ldrh r7, [r1, #24] -vmla.s16 Q0, Q6, r10 -ldrh r6, [r1, #40] -vmla.s16 Q0, Q5, r9 -ldrh r5, [r1, #56] -vmla.s16 Q0, Q4, r14 -vmla.s16 Q1, Q5, r14 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r10 -vmla.s16 Q1, Q6, r9 -neg r12, r12 -vmla.s16 Q2, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r9 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #6] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #22] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #38] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #54] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r6 -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r5 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r9 -vmla.s16 Q3, Q5, r10 -neg r9, r9 -vmla.s16 Q2, Q5, r11 -neg r10, r10 -vmla.s16 Q3, Q6, r11 -neg r11, r11 -vmla.s16 Q3, Q7, r14 -ldrh r8, [r1, #4] -vmla.s16 Q0, Q7, r11 -ldrh r7, [r1, #20] -vmla.s16 Q0, Q6, r10 -ldrh r6, [r1, #36] -vmla.s16 Q0, Q5, r9 -ldrh r5, [r1, #52] -vmla.s16 Q0, Q4, r14 -vmla.s16 Q1, Q5, r14 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r10 -vmla.s16 Q1, Q6, r9 -neg r12, 
r12 -vmla.s16 Q2, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r9 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #2] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #18] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #34] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #50] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r6 -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r5 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r9 -vmla.s16 Q3, Q5, r10 -neg r9, r9 -vmla.s16 Q2, Q5, r11 -neg r10, r10 -vmla.s16 Q3, Q6, r11 -neg r11, r11 -vmla.s16 Q3, Q7, r14 -ldrh r8, [r1, #0] -vmla.s16 Q0, Q7, r11 -ldrh r7, [r1, #16] -vmla.s16 Q0, Q6, r10 -ldrh r6, [r1, #32] -vmla.s16 Q0, Q5, r9 -ldrh r5, [r1, #48] -vmla.s16 Q0, Q4, r14 -vmla.s16 Q1, Q5, r14 -vshlc Q3, r12, #16 -vmla.s16 Q1, Q7, r10 -vmla.s16 Q1, Q6, r9 -neg r12, r12 -vmla.s16 Q2, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q7, r9 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vmov.u16 Q3[0], r12 -vmla.s16 Q3, Q4, r5 -vmla.s16 Q3, Q5, r6 -neg r5, r5 -vmla.s16 Q2, Q5, r7 -neg r6, r6 -vmla.s16 Q3, Q6, r7 -neg r7, r7 -vmla.s16 Q3, Q7, r8 -ldrh r14, [r1, #-2] -vmla.s16 Q0, Q7, r7 -ldrh r11, [r1, #14] -vmla.s16 Q0, Q6, r6 -ldrh r10, [r1, #30] -vmla.s16 Q0, Q5, r5 -ldrh r9, [r1, #46] -vmla.s16 Q0, Q4, r8 -vmla.s16 Q1, Q5, r8 -vstrh.u16 Q3, [r0,#(48)] -vmla.s16 Q1, Q7, r6 -vstrh.u16 Q0, [r0,#(0)] -vmla.s16 Q1, Q6, r5 -neg r12, r12 -vmla.s16 Q2, Q6, r8 -vstrh.u16 Q1, [r0,#(16)] -vmla.s16 Q2, Q7, r5 -vstrh.u16 Q2, [r0,#(32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_mve_comba.s 
b/tests/schoolbook/auto/poly_u16_mul_32_mve_comba.s deleted file mode 100644 index b05d3eb..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_mve_comba.s +++ /dev/null @@ -1,446 +0,0 @@ -.syntax unified -.type poly_u16_mul_32_comba_mve, %function -.global poly_u16_mul_32_comba_mve -poly_u16_mul_32_comba_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #28] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #-12] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #4] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #-10] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #6] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #28] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #10] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #-4] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #12] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #28] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #0] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #16] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #44] 
-strh r6, [r0,#+20] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q7, [r1, #18] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #-12] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #4] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #20] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #-10] -vldrw.u32 Q7, [Q2, #44] -strh r14, [r0,#+24] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #6] -vmladavax.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #22] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #-8] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #8] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #24] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #-6] -vldrw.u32 Q1, [Q2, #44] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #10] -vmladavax.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #26] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #-4] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, #44] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #30] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #0] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #60] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -strh r12, [r0,#+38] -vmladavx.s16 
r12, Q5, Q6 -vldrh.u16 Q6, [r1, #4] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #20] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #36] -vmladavax.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #-10] -vldrw.u32 Q4, [Q2, #60] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #38] -vldrw.u32 Q1, [Q2, #12] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #-8] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #8] -vmladavax.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #-6] -vldrw.u32 Q1, [Q2, #60] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #10] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #26] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #-4] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #12] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #60] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #30] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #0] -strh r12, [r0,#+50] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #16] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #32] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #48] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #2] -vldrw.u32 Q5, [Q2, #60] -strh r10, [r0,#+52] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #18] -vldrw.u32 Q7, [Q2, #44] 
-vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #34] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q1, [r1, #50] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #4] -strh r8, [r0,#+54] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #20] -vmladavax.s16 r8, Q5, Q7 -vldrh.u16 Q6, [r1, #36] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #52] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #6] -vldrw.u32 Q3, [Q2, #60] -strh r6, [r0,#+56] -vmladavx.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #22] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #38] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q7, [r1, #54] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #8] -strh r4, [r0,#+58] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #24] -vmladavax.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #40] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #56] -vmladavax.s16 r4, Q6, Q0 -vldrh.u16 Q7, [r1, #10] -vldrw.u32 Q0, [Q2, #60] -strh r14, [r0,#+60] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #26] -vldrw.u32 Q3, [Q2, #44] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #42] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #58] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #12] -strh r12, [r0,#+62] -vmladavx.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #28] -vmladavax.s16 r12, Q0, Q3 -vldrh.u16 Q1, [r1, #44] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #60] -vmladavax.s16 r12, Q4, Q6 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #60] -strh r10, [r0,#+64] -vmladavx.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #46] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q3, [r1, #62] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r8, [r0,#+66] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q6, [r1, #32] -vmladavax.s16 r8, Q6, Q0 -vldrh.u16 Q7, [r1, #48] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r8, Q7, 
Q0 -vldrh.u16 Q1, [r1, #18] -vldrw.u32 Q3, [Q2, #60] -strh r6, [r0,#+68] -vmladavx.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #34] -vldrw.u32 Q5, [Q2, #44] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #50] -vmladavax.s16 r6, Q6, Q0 -vldrh.u16 Q7, [r1, #20] -strh r4, [r0,#+70] -vmladavx.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #36] -vmladavax.s16 r4, Q0, Q5 -vldrh.u16 Q1, [r1, #52] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #22] -vldrw.u32 Q5, [Q2, #60] -strh r14, [r0,#+72] -vmladavx.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #38] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #54] -vmladavax.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #24] -strh r12, [r0,#+74] -vmladavx.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #40] -vmladavax.s16 r12, Q3, Q7 -vldrh.u16 Q4, [r1, #56] -vldrw.u32 Q5, [Q2, #28] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #26] -vldrw.u32 Q7, [Q2, #60] -strh r10, [r0,#+76] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -vmladavax.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #28] -strh r8, [r0,#+78] -vmladavx.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #44] -vmladavax.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #60] -vldrw.u32 Q7, [Q2, #28] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #30] -vldrw.u32 Q1, [Q2, #60] -strh r6, [r0,#+80] -vmladavx.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #46] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #62] -vmladavax.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #32] -strh r4, [r0,#+82] -vmladavx.s16 r4, Q6, Q1 -vldrh.u16 Q7, [r1, #48] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #34] -strh r14, [r0,#+84] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #50] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #36] -vldrw.u32 Q4, [Q2, #60] -strh r12, [r0,#+86] -vmladavx.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #52] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #38] -strh r10, [r0,#+88] -vmladavx.s16 r10, Q7, Q4 -vldrh.u16 Q0, 
[r1, #54] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #40] -strh r8, [r0,#+90] -vmladavx.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #56] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #42] -vldrw.u32 Q5, [Q2, #60] -strh r6, [r0,#+92] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #58] -vldrw.u32 Q7, [Q2, #44] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #44] -strh r4, [r0,#+94] -vmladavx.s16 r4, Q0, Q5 -vldrh.u16 Q1, [r1, #60] -vmladavax.s16 r4, Q1, Q7 -vldrh.u16 Q3, [r1, #46] -strh r14, [r0,#+96] -vmladavx.s16 r14, Q3, Q5 -vldrh.u16 Q4, [r1, #62] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #48] -vldrw.u32 Q6, [Q2, #60] -strh r12, [r0,#+98] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #50] -strh r10, [r0,#+100] -vmladavx.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #52] -strh r8, [r0,#+102] -vmladavx.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #54] -strh r6, [r0,#+104] -vmladavx.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #56] -strh r4, [r0,#+106] -vmladavx.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #58] -strh r14, [r0,#+108] -vmladavx.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #60] -strh r12, [r0,#+110] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #62] -vldrw.u32 Q7, [Q2, #60] -strh r10, [r0,#+112] -vmladavx.s16 r10, Q6, Q7 -strh r10, [r0,#+124] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_mve_schoolbook.s b/tests/schoolbook/auto/poly_u16_mul_32_mve_schoolbook.s deleted file mode 100644 index a8508b3..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_mve_schoolbook.s +++ /dev/null @@ -1,593 +0,0 @@ -.syntax unified -.type poly_u16_mul_32_schoolbook_mve, %function -.global poly_u16_mul_32_schoolbook_mve -poly_u16_mul_32_schoolbook_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 
-vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #4] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #6] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #8] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #10] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #12] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #12] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #18] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #20] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #22] -strh r14, [r0,#+24] -vmladavx.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #24] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #26] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #28] -vldrw.u32 Q5, [Q2, #12] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #32] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #34] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #36] -strh r12, [r0,#+38] -vmladavx.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #38] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #40] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #12] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #46] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #48] -strh r12, [r0,#+50] -vmladavx.s16 r12, Q1, Q6 -vldrh.u16 
Q3, [r1, #50] -strh r10, [r0,#+52] -vmladavx.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #52] -strh r8, [r0,#+54] -vmladavx.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #54] -strh r6, [r0,#+56] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #12] -strh r4, [r0,#+58] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -strh r14, [r0,#+60] -vmladavx.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #60] -strh r12, [r0,#+62] -vmladavx.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #62] -strh r10, [r0,#+64] -vmladavx.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #28] -strh r8, [r0,#+66] -ldrh r8, [r0,#+16] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #-12] -strh r6, [r0,#+68] -ldrh r6, [r0,#+18] -vmladavax.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #-10] -strh r4, [r0,#+70] -ldrh r4, [r0,#+20] -vmladavax.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #-8] -strh r14, [r0,#+72] -ldrh r14, [r0,#+22] -vmladavax.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #-6] -strh r12, [r0,#+74] -ldrh r12, [r0,#+24] -vmladavax.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #-4] -strh r10, [r0,#+76] -ldrh r10, [r0,#+26] -vmladavax.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #-2] -strh r8, [r0,#+16] -ldrh r8, [r0,#+28] -vmladavax.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #28] -strh r6, [r0,#+18] -ldrh r6, [r0,#+30] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -strh r4, [r0,#+20] -ldrh r4, [r0,#+32] -vmladavax.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #4] -strh r14, [r0,#+22] -ldrh r14, [r0,#+34] -vmladavax.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #6] -strh r12, [r0,#+24] -ldrh r12, [r0,#+36] -vmladavax.s16 r12, Q1, Q6 -vldrh.u16 Q3, [r1, #8] -strh r10, [r0,#+26] -ldrh r10, [r0,#+38] -vmladavax.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #10] -strh r8, [r0,#+28] -ldrh r8, [r0,#+40] -vmladavax.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #12] -strh r6, [r0,#+30] -ldrh r6, [r0,#+42] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #14] -vldrw.u32 Q7, [Q2, #28] -strh r4, [r0,#+32] -ldrh r4, [r0,#+44] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #16] -strh 
r14, [r0,#+34] -ldrh r14, [r0,#+46] -vmladavax.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #18] -strh r12, [r0,#+36] -ldrh r12, [r0,#+48] -vmladavax.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #20] -strh r10, [r0,#+38] -ldrh r10, [r0,#+50] -vmladavax.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #22] -strh r8, [r0,#+40] -ldrh r8, [r0,#+52] -vmladavax.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #24] -strh r6, [r0,#+42] -ldrh r6, [r0,#+54] -vmladavax.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #26] -strh r4, [r0,#+44] -ldrh r4, [r0,#+56] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q7, [r1, #28] -vldrw.u32 Q0, [Q2, #28] -strh r14, [r0,#+46] -ldrh r14, [r0,#+58] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -strh r12, [r0,#+48] -ldrh r12, [r0,#+60] -vmladavax.s16 r12, Q1, Q0 -vldrh.u16 Q3, [r1, #32] -strh r10, [r0,#+50] -ldrh r10, [r0,#+62] -vmladavax.s16 r10, Q3, Q0 -vldrh.u16 Q4, [r1, #34] -strh r8, [r0,#+52] -ldrh r8, [r0,#+64] -vmladavax.s16 r8, Q4, Q0 -vldrh.u16 Q5, [r1, #36] -strh r6, [r0,#+54] -ldrh r6, [r0,#+66] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #38] -strh r4, [r0,#+56] -ldrh r4, [r0,#+68] -vmladavax.s16 r4, Q6, Q0 -vldrh.u16 Q7, [r1, #40] -strh r14, [r0,#+58] -ldrh r14, [r0,#+70] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #28] -strh r12, [r0,#+60] -ldrh r12, [r0,#+72] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #44] -strh r10, [r0,#+62] -ldrh r10, [r0,#+74] -vmladavax.s16 r10, Q3, Q1 -vldrh.u16 Q4, [r1, #46] -strh r8, [r0,#+64] -ldrh r8, [r0,#+76] -vmladavax.s16 r8, Q4, Q1 -vldrh.u16 Q5, [r1, #48] -strh r6, [r0,#+66] -vmladavx.s16 r6, Q5, Q1 -vldrh.u16 Q6, [r1, #50] -strh r4, [r0,#+68] -vmladavx.s16 r4, Q6, Q1 -vldrh.u16 Q7, [r1, #52] -strh r14, [r0,#+70] -vmladavx.s16 r14, Q7, Q1 -vldrh.u16 Q0, [r1, #54] -strh r12, [r0,#+72] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #28] -strh r10, [r0,#+74] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #58] -strh r8, [r0,#+76] -vmladavx.s16 r8, Q4, Q3 -vldrh.u16 Q5, [r1, #60] -strh r6, [r0,#+78] 
-vmladavx.s16 r6, Q5, Q3 -vldrh.u16 Q6, [r1, #62] -strh r4, [r0,#+80] -vmladavx.s16 r4, Q6, Q3 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #44] -strh r14, [r0,#+82] -ldrh r14, [r0,#+32] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #-12] -strh r12, [r0,#+84] -ldrh r12, [r0,#+34] -vmladavax.s16 r12, Q1, Q0 -vldrh.u16 Q3, [r1, #-10] -strh r10, [r0,#+86] -ldrh r10, [r0,#+36] -vmladavax.s16 r10, Q3, Q0 -vldrh.u16 Q4, [r1, #-8] -strh r8, [r0,#+88] -ldrh r8, [r0,#+38] -vmladavax.s16 r8, Q4, Q0 -vldrh.u16 Q5, [r1, #-6] -strh r6, [r0,#+90] -ldrh r6, [r0,#+40] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #-4] -strh r4, [r0,#+92] -ldrh r4, [r0,#+42] -vmladavax.s16 r4, Q6, Q0 -vldrh.u16 Q7, [r1, #-2] -strh r14, [r0,#+32] -ldrh r14, [r0,#+44] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q0, [r1, #0] -vldrw.u32 Q1, [Q2, #44] -strh r12, [r0,#+34] -ldrh r12, [r0,#+46] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #2] -strh r10, [r0,#+36] -ldrh r10, [r0,#+48] -vmladavax.s16 r10, Q3, Q1 -vldrh.u16 Q4, [r1, #4] -strh r8, [r0,#+38] -ldrh r8, [r0,#+50] -vmladavax.s16 r8, Q4, Q1 -vldrh.u16 Q5, [r1, #6] -strh r6, [r0,#+40] -ldrh r6, [r0,#+52] -vmladavax.s16 r6, Q5, Q1 -vldrh.u16 Q6, [r1, #8] -strh r4, [r0,#+42] -ldrh r4, [r0,#+54] -vmladavax.s16 r4, Q6, Q1 -vldrh.u16 Q7, [r1, #10] -strh r14, [r0,#+44] -ldrh r14, [r0,#+56] -vmladavax.s16 r14, Q7, Q1 -vldrh.u16 Q0, [r1, #12] -strh r12, [r0,#+46] -ldrh r12, [r0,#+58] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q1, [r1, #14] -vldrw.u32 Q3, [Q2, #44] -strh r10, [r0,#+48] -ldrh r10, [r0,#+60] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #16] -strh r8, [r0,#+50] -ldrh r8, [r0,#+62] -vmladavax.s16 r8, Q4, Q3 -vldrh.u16 Q5, [r1, #18] -strh r6, [r0,#+52] -ldrh r6, [r0,#+64] -vmladavax.s16 r6, Q5, Q3 -vldrh.u16 Q6, [r1, #20] -strh r4, [r0,#+54] -ldrh r4, [r0,#+66] -vmladavax.s16 r4, Q6, Q3 -vldrh.u16 Q7, [r1, #22] -strh r14, [r0,#+56] -ldrh r14, [r0,#+68] -vmladavax.s16 r14, Q7, Q3 -vldrh.u16 Q0, [r1, #24] -strh r12, [r0,#+58] -ldrh r12, [r0,#+70] 
-vmladavax.s16 r12, Q0, Q3 -vldrh.u16 Q1, [r1, #26] -strh r10, [r0,#+60] -ldrh r10, [r0,#+72] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #44] -strh r8, [r0,#+62] -ldrh r8, [r0,#+74] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #30] -strh r6, [r0,#+64] -ldrh r6, [r0,#+76] -vmladavax.s16 r6, Q5, Q4 -vldrh.u16 Q6, [r1, #32] -strh r4, [r0,#+66] -ldrh r4, [r0,#+78] -vmladavax.s16 r4, Q6, Q4 -vldrh.u16 Q7, [r1, #34] -strh r14, [r0,#+68] -ldrh r14, [r0,#+80] -vmladavax.s16 r14, Q7, Q4 -vldrh.u16 Q0, [r1, #36] -strh r12, [r0,#+70] -ldrh r12, [r0,#+82] -vmladavax.s16 r12, Q0, Q4 -vldrh.u16 Q1, [r1, #38] -strh r10, [r0,#+72] -ldrh r10, [r0,#+84] -vmladavax.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #40] -strh r8, [r0,#+74] -ldrh r8, [r0,#+86] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #42] -vldrw.u32 Q5, [Q2, #44] -strh r6, [r0,#+76] -ldrh r6, [r0,#+88] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #44] -strh r4, [r0,#+78] -ldrh r4, [r0,#+90] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #46] -strh r14, [r0,#+80] -ldrh r14, [r0,#+92] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #48] -strh r12, [r0,#+82] -vmladavx.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #50] -strh r10, [r0,#+84] -vmladavx.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #52] -strh r8, [r0,#+86] -vmladavx.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #54] -strh r6, [r0,#+88] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q5, [r1, #56] -vldrw.u32 Q6, [Q2, #44] -strh r4, [r0,#+90] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #58] -strh r14, [r0,#+92] -vmladavx.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #60] -strh r12, [r0,#+94] -vmladavx.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #62] -strh r10, [r0,#+96] -vmladavx.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #-14] -vldrw.u32 Q4, [Q2, #60] -strh r8, [r0,#+98] -ldrh r8, [r0,#+48] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -strh r6, [r0,#+100] -ldrh r6, [r0,#+50] -vmladavax.s16 r6, Q5, Q4 -vldrh.u16 Q6, [r1, #-10] -strh r4, [r0,#+102] -ldrh r4, [r0,#+52] -vmladavax.s16 r4, Q6, Q4 
-vldrh.u16 Q7, [r1, #-8] -strh r14, [r0,#+104] -ldrh r14, [r0,#+54] -vmladavax.s16 r14, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r12, [r0,#+106] -ldrh r12, [r0,#+56] -vmladavax.s16 r12, Q0, Q4 -vldrh.u16 Q1, [r1, #-4] -strh r10, [r0,#+108] -ldrh r10, [r0,#+58] -vmladavax.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #-2] -strh r8, [r0,#+48] -ldrh r8, [r0,#+60] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #0] -vldrw.u32 Q5, [Q2, #60] -strh r6, [r0,#+50] -ldrh r6, [r0,#+62] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -strh r4, [r0,#+52] -ldrh r4, [r0,#+64] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #4] -strh r14, [r0,#+54] -ldrh r14, [r0,#+66] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #6] -strh r12, [r0,#+56] -ldrh r12, [r0,#+68] -vmladavax.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #8] -strh r10, [r0,#+58] -ldrh r10, [r0,#+70] -vmladavax.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #10] -strh r8, [r0,#+60] -ldrh r8, [r0,#+72] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #12] -strh r6, [r0,#+62] -ldrh r6, [r0,#+74] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #60] -strh r4, [r0,#+64] -ldrh r4, [r0,#+76] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -strh r14, [r0,#+66] -ldrh r14, [r0,#+78] -vmladavax.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #18] -strh r12, [r0,#+68] -ldrh r12, [r0,#+80] -vmladavax.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #20] -strh r10, [r0,#+70] -ldrh r10, [r0,#+82] -vmladavax.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #22] -strh r8, [r0,#+72] -ldrh r8, [r0,#+84] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #24] -strh r6, [r0,#+74] -ldrh r6, [r0,#+86] -vmladavax.s16 r6, Q4, Q6 -vldrh.u16 Q5, [r1, #26] -strh r4, [r0,#+76] -ldrh r4, [r0,#+88] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q6, [r1, #28] -vldrw.u32 Q7, [Q2, #60] -strh r14, [r0,#+78] -ldrh r14, [r0,#+90] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #30] -strh r12, [r0,#+80] -ldrh r12, [r0,#+92] -vmladavax.s16 r12, Q0, Q7 -vldrh.u16 Q1, [r1, #32] -strh r10, [r0,#+82] -ldrh r10, [r0,#+94] 
-vmladavax.s16 r10, Q1, Q7 -vldrh.u16 Q3, [r1, #34] -strh r8, [r0,#+84] -ldrh r8, [r0,#+96] -vmladavax.s16 r8, Q3, Q7 -vldrh.u16 Q4, [r1, #36] -strh r6, [r0,#+86] -ldrh r6, [r0,#+98] -vmladavax.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #38] -strh r4, [r0,#+88] -ldrh r4, [r0,#+100] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #40] -strh r14, [r0,#+90] -ldrh r14, [r0,#+102] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #42] -vldrw.u32 Q0, [Q2, #60] -strh r12, [r0,#+92] -ldrh r12, [r0,#+104] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #44] -strh r10, [r0,#+94] -ldrh r10, [r0,#+106] -vmladavax.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #46] -strh r8, [r0,#+96] -ldrh r8, [r0,#+108] -vmladavax.s16 r8, Q3, Q0 -vldrh.u16 Q4, [r1, #48] -strh r6, [r0,#+98] -vmladavx.s16 r6, Q4, Q0 -vldrh.u16 Q5, [r1, #50] -strh r4, [r0,#+100] -vmladavx.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #52] -strh r14, [r0,#+102] -vmladavx.s16 r14, Q6, Q0 -vldrh.u16 Q7, [r1, #54] -strh r12, [r0,#+104] -vmladavx.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #56] -vldrw.u32 Q1, [Q2, #60] -strh r10, [r0,#+106] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -strh r8, [r0,#+108] -vmladavx.s16 r8, Q3, Q1 -vldrh.u16 Q4, [r1, #60] -strh r6, [r0,#+110] -vmladavx.s16 r6, Q4, Q1 -vldrh.u16 Q5, [r1, #62] -strh r4, [r0,#+112] -vmladavx.s16 r4, Q5, Q1 -strh r14, [r0,#+114] -strh r12, [r0,#+116] -strh r10, [r0,#+118] -strh r8, [r0,#+120] -strh r6, [r0,#+122] -strh r4, [r0,#+124] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_64_mve_comba.s b/tests/schoolbook/auto/poly_u16_mul_64_mve_comba.s deleted file mode 100644 index 204cb7d..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_64_mve_comba.s +++ /dev/null @@ -1,1760 +0,0 @@ -.syntax unified -.type poly_u16_mul_64_comba_mve, %function -.global poly_u16_mul_64_comba_mve -poly_u16_mul_64_comba_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] 
-vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #28] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #-12] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #4] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #-10] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #6] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #-8] -vldrw.u32 Q6, [Q2, #28] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #8] -vmladavax.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #10] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #-4] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #12] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #28] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #0] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #16] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #44] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q7, [r1, #18] -vldrw.u32 Q0, [Q2, #12] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #-12] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #4] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #20] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #-10] -vldrw.u32 Q7, [Q2, #44] -strh r14, [r0,#+24] -vmladavx.s16 
r14, Q6, Q7 -vldrh.u16 Q0, [r1, #6] -vmladavax.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #22] -vldrw.u32 Q3, [Q2, #12] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #-8] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #8] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #24] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #-6] -vldrw.u32 Q1, [Q2, #44] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #10] -vmladavax.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #26] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #-4] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #12] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #28] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #-2] -vldrw.u32 Q4, [Q2, #44] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #14] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #30] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #0] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #16] -vldrw.u32 Q3, [Q2, #28] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #32] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #60] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -vldrw.u32 Q0, [Q2, #44] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #18] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #12] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -strh r12, [r0,#+38] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #4] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #20] -vldrw.u32 Q0, [Q2, #28] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #36] -vmladavax.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #-10] -vldrw.u32 Q4, [Q2, #60] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #6] -vldrw.u32 Q6, [Q2, #44] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #22] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #38] -vldrw.u32 
Q1, [Q2, #12] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #-8] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #8] -vmladavax.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #24] -vldrw.u32 Q6, [Q2, #28] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #40] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #-6] -vldrw.u32 Q1, [Q2, #60] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #10] -vldrw.u32 Q4, [Q2, #44] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #26] -vmladavax.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #12] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #-4] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #12] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #-2] -vldrw.u32 Q7, [Q2, #60] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #14] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #30] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #46] -vldrw.u32 Q5, [Q2, #12] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #0] -strh r12, [r0,#+50] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #16] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #32] -vldrw.u32 Q1, [Q2, #28] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #48] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #76] -strh r10, [r0,#+52] -vmladavx.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -vldrw.u32 Q7, [Q2, #60] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #18] -vldrw.u32 Q1, [Q2, #44] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #34] -vldrw.u32 Q4, [Q2, #28] -vmladavax.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #50] -vldrw.u32 Q6, [Q2, #12] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -vldrw.u32 Q0, [Q2, #76] -strh r8, [r0,#+54] -vmladavx.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #4] -vldrw.u32 Q3, [Q2, #60] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #20] 
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #36]
-vldrw.u32 Q7, [Q2, #28]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #52]
-vldrw.u32 Q1, [Q2, #12]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #-10]
-vldrw.u32 Q4, [Q2, #76]
-strh r6, [r0,#+56]
-vmladavx.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #6]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #22]
-vldrw.u32 Q0, [Q2, #44]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #38]
-vldrw.u32 Q3, [Q2, #28]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #54]
-vldrw.u32 Q5, [Q2, #12]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #-8]
-vldrw.u32 Q7, [Q2, #76]
-strh r4, [r0,#+58]
-vmladavx.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #8]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #24]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #40]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #56]
-vldrw.u32 Q0, [Q2, #12]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #-6]
-vldrw.u32 Q3, [Q2, #76]
-strh r14, [r0,#+60]
-vmladavx.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #10]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #26]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #42]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #58]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #-4]
-vldrw.u32 Q6, [Q2, #76]
-strh r12, [r0,#+62]
-vmladavx.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #12]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #28]
-vldrw.u32 Q3, [Q2, #44]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #44]
-vldrw.u32 Q5, [Q2, #28]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #60]
-vldrw.u32 Q7, [Q2, #12]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #-2]
-vldrw.u32 Q1, [Q2, #76]
-strh r10, [r0,#+64]
-vmladavx.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #14]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #30]
-vldrw.u32 Q6, [Q2, #44]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #46]
-vldrw.u32 Q0, [Q2, #28]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #62]
-vldrw.u32 Q3, [Q2, #12]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #0]
-vldrw.u32 Q5, [Q2, #76]
-strh r8, [r0,#+66]
-vmladavx.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #16]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #32]
-vldrw.u32 Q1, [Q2, #44]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #48]
-vldrw.u32 Q4, [Q2, #28]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #64]
-vldrw.u32 Q6, [Q2, #12]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #-14]
-vldrw.u32 Q0, [Q2, #92]
-strh r6, [r0,#+68]
-vmladavx.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #2]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #18]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #34]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #50]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #66]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #-12]
-vldrw.u32 Q6, [Q2, #92]
-strh r4, [r0,#+70]
-vmladavx.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #4]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #20]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #36]
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #52]
-vldrw.u32 Q7, [Q2, #28]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #68]
-vldrw.u32 Q1, [Q2, #12]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #-10]
-vldrw.u32 Q4, [Q2, #92]
-strh r14, [r0,#+72]
-vmladavx.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #6]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #22]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #38]
-vldrw.u32 Q3, [Q2, #44]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #54]
-vldrw.u32 Q5, [Q2, #28]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #70]
-vldrw.u32 Q7, [Q2, #12]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #-8]
-vldrw.u32 Q1, [Q2, #92]
-strh r12, [r0,#+74]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #8]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #24]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #40]
-vldrw.u32 Q0, [Q2, #44]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #56]
-vldrw.u32 Q3, [Q2, #28]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #72]
-vldrw.u32 Q5, [Q2, #12]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #-6]
-vldrw.u32 Q7, [Q2, #92]
-strh r10, [r0,#+76]
-vmladavx.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #10]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #26]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #42]
-vldrw.u32 Q6, [Q2, #44]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #58]
-vldrw.u32 Q0, [Q2, #28]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #74]
-vldrw.u32 Q3, [Q2, #12]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #-4]
-vldrw.u32 Q5, [Q2, #92]
-strh r8, [r0,#+78]
-vmladavx.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #12]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #28]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #44]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #60]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #76]
-vldrw.u32 Q0, [Q2, #12]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #-2]
-vldrw.u32 Q3, [Q2, #92]
-strh r6, [r0,#+80]
-vmladavx.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #14]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #30]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #46]
-vldrw.u32 Q1, [Q2, #44]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #62]
-vldrw.u32 Q4, [Q2, #28]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #78]
-vldrw.u32 Q6, [Q2, #12]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #0]
-vldrw.u32 Q0, [Q2, #92]
-strh r4, [r0,#+82]
-vmladavx.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #16]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #32]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #48]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #64]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #80]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #-14]
-vldrw.u32 Q6, [Q2, #108]
-strh r14, [r0,#+84]
-vmladavx.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #2]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #18]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #34]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #50]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #66]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #82]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #-12]
-vldrw.u32 Q6, [Q2, #108]
-strh r12, [r0,#+86]
-vmladavx.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #4]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #20]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #36]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #52]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #68]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #84]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #-10]
-vldrw.u32 Q6, [Q2, #108]
-strh r10, [r0,#+88]
-vmladavx.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #6]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #22]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #38]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #54]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #70]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #86]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #-8]
-vldrw.u32 Q6, [Q2, #108]
-strh r8, [r0,#+90]
-vmladavx.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #8]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #24]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #40]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #56]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #72]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #88]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #-6]
-vldrw.u32 Q6, [Q2, #108]
-strh r6, [r0,#+92]
-vmladavx.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #10]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #26]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #42]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #58]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #74]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #90]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #-4]
-vldrw.u32 Q6, [Q2, #108]
-strh r4, [r0,#+94]
-vmladavx.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #12]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #28]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #44]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #60]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #76]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #92]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #-2]
-vldrw.u32 Q6, [Q2, #108]
-strh r14, [r0,#+96]
-vmladavx.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #14]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #30]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #46]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #62]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #78]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #94]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #0]
-vldrw.u32 Q6, [Q2, #108]
-strh r12, [r0,#+98]
-vmladavx.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #16]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #32]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #48]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #64]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #80]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #96]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #-14]
-vldrw.u32 Q6, [Q2, #124]
-strh r10, [r0,#+100]
-vmladavx.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #2]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #18]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #34]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #50]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #66]
-vldrw.u32 Q1, [Q2, #44]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #82]
-vldrw.u32 Q4, [Q2, #28]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #98]
-vldrw.u32 Q6, [Q2, #12]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #-12]
-vldrw.u32 Q0, [Q2, #124]
-strh r8, [r0,#+102]
-vmladavx.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #4]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #20]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #36]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #52]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #68]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #84]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #100]
-vldrw.u32 Q0, [Q2, #12]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #-10]
-vldrw.u32 Q3, [Q2, #124]
-strh r6, [r0,#+104]
-vmladavx.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #6]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #22]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #38]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #54]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #70]
-vldrw.u32 Q6, [Q2, #44]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #86]
-vldrw.u32 Q0, [Q2, #28]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #102]
-vldrw.u32 Q3, [Q2, #12]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #-8]
-vldrw.u32 Q5, [Q2, #124]
-strh r4, [r0,#+106]
-vmladavx.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #8]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #24]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #40]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #56]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #72]
-vldrw.u32 Q0, [Q2, #44]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #88]
-vldrw.u32 Q3, [Q2, #28]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #104]
-vldrw.u32 Q5, [Q2, #12]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #-6]
-vldrw.u32 Q7, [Q2, #124]
-strh r14, [r0,#+108]
-vmladavx.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #10]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #26]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #42]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #58]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #74]
-vldrw.u32 Q3, [Q2, #44]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #90]
-vldrw.u32 Q5, [Q2, #28]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #106]
-vldrw.u32 Q7, [Q2, #12]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #-4]
-vldrw.u32 Q1, [Q2, #124]
-strh r12, [r0,#+110]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #12]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #28]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #44]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #60]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #76]
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #92]
-vldrw.u32 Q7, [Q2, #28]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #108]
-vldrw.u32 Q1, [Q2, #12]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #-2]
-vldrw.u32 Q4, [Q2, #124]
-strh r10, [r0,#+112]
-vmladavx.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #14]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #30]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #46]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #62]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #78]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #94]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #110]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #0]
-vldrw.u32 Q6, [Q2, #124]
-strh r8, [r0,#+114]
-vmladavx.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #16]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #32]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #48]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #64]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #80]
-vldrw.u32 Q1, [Q2, #44]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #96]
-vldrw.u32 Q4, [Q2, #28]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #112]
-vldrw.u32 Q6, [Q2, #12]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #2]
-vldrw.u32 Q0, [Q2, #124]
-strh r6, [r0,#+116]
-vmladavx.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #18]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #34]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #50]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #66]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #82]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #98]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #114]
-vldrw.u32 Q0, [Q2, #12]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #4]
-vldrw.u32 Q3, [Q2, #124]
-strh r4, [r0,#+118]
-vmladavx.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #20]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #36]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #52]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #68]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #84]
-vldrw.u32 Q6, [Q2, #44]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #100]
-vldrw.u32 Q0, [Q2, #28]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #116]
-vldrw.u32 Q3, [Q2, #12]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #6]
-vldrw.u32 Q5, [Q2, #124]
-strh r14, [r0,#+120]
-vmladavx.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #22]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #38]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #54]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #70]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #86]
-vldrw.u32 Q0, [Q2, #44]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #102]
-vldrw.u32 Q3, [Q2, #28]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #118]
-vldrw.u32 Q5, [Q2, #12]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #8]
-vldrw.u32 Q7, [Q2, #124]
-strh r12, [r0,#+122]
-vmladavx.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #24]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #40]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #56]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #72]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #88]
-vldrw.u32 Q3, [Q2, #44]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #104]
-vldrw.u32 Q5, [Q2, #28]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #120]
-vldrw.u32 Q7, [Q2, #12]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #10]
-vldrw.u32 Q1, [Q2, #124]
-strh r10, [r0,#+124]
-vmladavx.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #26]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #42]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #58]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #74]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #90]
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #106]
-vldrw.u32 Q7, [Q2, #28]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #122]
-vldrw.u32 Q1, [Q2, #12]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #12]
-vldrw.u32 Q4, [Q2, #124]
-strh r8, [r0,#+126]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #28]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #44]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #60]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #76]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #92]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #108]
-vldrw.u32 Q1, [Q2, #28]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #124]
-vldrw.u32 Q4, [Q2, #12]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #14]
-vldrw.u32 Q6, [Q2, #124]
-strh r6, [r0,#+128]
-vmladavx.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #30]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #46]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #62]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #78]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #94]
-vldrw.u32 Q1, [Q2, #44]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #110]
-vldrw.u32 Q4, [Q2, #28]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #126]
-vldrw.u32 Q6, [Q2, #12]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #16]
-vldrw.u32 Q0, [Q2, #124]
-strh r4, [r0,#+130]
-vmladavx.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #32]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #48]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #64]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #80]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #96]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #112]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #18]
-vldrw.u32 Q0, [Q2, #124]
-strh r14, [r0,#+132]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #34]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #50]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #66]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #82]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #98]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #114]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #20]
-vldrw.u32 Q0, [Q2, #124]
-strh r12, [r0,#+134]
-vmladavx.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #36]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #52]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #68]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #84]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #100]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #116]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #22]
-vldrw.u32 Q0, [Q2, #124]
-strh r10, [r0,#+136]
-vmladavx.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #38]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #54]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #70]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #86]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #102]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #118]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #24]
-vldrw.u32 Q0, [Q2, #124]
-strh r8, [r0,#+138]
-vmladavx.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #40]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #56]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #72]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #88]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #104]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #120]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #26]
-vldrw.u32 Q0, [Q2, #124]
-strh r6, [r0,#+140]
-vmladavx.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #42]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #58]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #74]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #90]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #106]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #122]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #28]
-vldrw.u32 Q0, [Q2, #124]
-strh r4, [r0,#+142]
-vmladavx.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #44]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #60]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #76]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #92]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #108]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #124]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #30]
-vldrw.u32 Q0, [Q2, #124]
-strh r14, [r0,#+144]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #46]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #62]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #78]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #94]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #110]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #126]
-vldrw.u32 Q6, [Q2, #28]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #32]
-vldrw.u32 Q0, [Q2, #124]
-strh r12, [r0,#+146]
-vmladavx.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #48]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #64]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #80]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #96]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #112]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #34]
-vldrw.u32 Q6, [Q2, #124]
-strh r10, [r0,#+148]
-vmladavx.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #50]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #66]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #82]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #98]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #114]
-vldrw.u32 Q1, [Q2, #44]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #36]
-vldrw.u32 Q4, [Q2, #124]
-strh r8, [r0,#+150]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #52]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #68]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #84]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #100]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #116]
-vldrw.u32 Q7, [Q2, #44]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #38]
-vldrw.u32 Q1, [Q2, #124]
-strh r6, [r0,#+152]
-vmladavx.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #54]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #70]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #86]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #102]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #118]
-vldrw.u32 Q5, [Q2, #44]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #40]
-vldrw.u32 Q7, [Q2, #124]
-strh r4, [r0,#+154]
-vmladavx.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #56]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #72]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #88]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #104]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #120]
-vldrw.u32 Q3, [Q2, #44]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #42]
-vldrw.u32 Q5, [Q2, #124]
-strh r14, [r0,#+156]
-vmladavx.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #58]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #74]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #90]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #106]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #122]
-vldrw.u32 Q0, [Q2, #44]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #44]
-vldrw.u32 Q3, [Q2, #124]
-strh r12, [r0,#+158]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #60]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #76]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #92]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #108]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #124]
-vldrw.u32 Q6, [Q2, #44]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #46]
-vldrw.u32 Q0, [Q2, #124]
-strh r10, [r0,#+160]
-vmladavx.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #62]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r10, Q1, Q3
-vldrh.u16 Q4, [r1, #78]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #94]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #110]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #126]
-vldrw.u32 Q4, [Q2, #44]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #48]
-vldrw.u32 Q6, [Q2, #124]
-strh r8, [r0,#+162]
-vmladavx.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #64]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #80]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #96]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #112]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #50]
-vldrw.u32 Q1, [Q2, #124]
-strh r6, [r0,#+164]
-vmladavx.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #66]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #82]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #98]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #114]
-vldrw.u32 Q3, [Q2, #60]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #52]
-vldrw.u32 Q5, [Q2, #124]
-strh r4, [r0,#+166]
-vmladavx.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #68]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #84]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #100]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #116]
-vldrw.u32 Q6, [Q2, #60]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #54]
-vldrw.u32 Q0, [Q2, #124]
-strh r14, [r0,#+168]
-vmladavx.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #70]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #86]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #102]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #118]
-vldrw.u32 Q1, [Q2, #60]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #56]
-vldrw.u32 Q4, [Q2, #124]
-strh r12, [r0,#+170]
-vmladavx.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #72]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #88]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #104]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #120]
-vldrw.u32 Q5, [Q2, #60]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #58]
-vldrw.u32 Q7, [Q2, #124]
-strh r10, [r0,#+172]
-vmladavx.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #74]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #90]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #106]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r10, Q5, Q6
-vldrh.u16 Q7, [r1, #122]
-vldrw.u32 Q0, [Q2, #60]
-vmladavax.s16 r10, Q7, Q0
-vldrh.u16 Q1, [r1, #60]
-vldrw.u32 Q3, [Q2, #124]
-strh r8, [r0,#+174]
-vmladavx.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #76]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #92]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #108]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #124]
-vldrw.u32 Q4, [Q2, #60]
-vmladavax.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #62]
-vldrw.u32 Q6, [Q2, #124]
-strh r6, [r0,#+176]
-vmladavx.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #78]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r6, Q7, Q0
-vldrh.u16 Q1, [r1, #94]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r6, Q1, Q3
-vldrh.u16 Q4, [r1, #110]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r6, Q4, Q5
-vldrh.u16 Q6, [r1, #126]
-vldrw.u32 Q7, [Q2, #60]
-vmladavax.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #64]
-vldrw.u32 Q1, [Q2, #124]
-strh r4, [r0,#+178]
-vmladavx.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #80]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #96]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r4, Q5, Q6
-vldrh.u16 Q7, [r1, #112]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #66]
-vldrw.u32 Q3, [Q2, #124]
-strh r14, [r0,#+180]
-vmladavx.s16 r14, Q1, Q3
-vldrh.u16 Q4, [r1, #82]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #98]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #114]
-vldrw.u32 Q1, [Q2, #76]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #68]
-vldrw.u32 Q4, [Q2, #124]
-strh r12, [r0,#+182]
-vmladavx.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #84]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #100]
-vldrw.u32 Q0, [Q2, #92]
-vmladavax.s16 r12, Q7, Q0
-vldrh.u16 Q1, [r1, #116]
-vldrw.u32 Q3, [Q2, #76]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #70]
-vldrw.u32 Q5, [Q2, #124]
-strh r10, [r0,#+184]
-vmladavx.s16 r10, Q4, Q5
-vldrh.u16 Q6, [r1, #86]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r10, Q6, Q7
-vldrh.u16 Q0, [r1, #102]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r10, Q0, Q1
-vldrh.u16 Q3, [r1, #118]
-vldrw.u32 Q4, [Q2, #76]
-vmladavax.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #72]
-vldrw.u32 Q6, [Q2, #124]
-strh r8, [r0,#+186]
-vmladavx.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #88]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #104]
-vldrw.u32 Q3, [Q2, #92]
-vmladavax.s16 r8, Q1, Q3
-vldrh.u16 Q4, [r1, #120]
-vldrw.u32 Q5, [Q2, #76]
-vmladavax.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #74]
-vldrw.u32 Q7, [Q2, #124]
-strh r6, [r0,#+188]
-vmladavx.s16 r6, Q6, Q7
-vldrh.u16 Q0, [r1, #90]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r6, Q0, Q1
-vldrh.u16 Q3, [r1, #106]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r6, Q3, Q4
-vldrh.u16 Q5, [r1, #122]
-vldrw.u32 Q6, [Q2, #76]
-vmladavax.s16 r6, Q5, Q6
-vldrh.u16 Q7, [r1, #76]
-vldrw.u32 Q0, [Q2, #124]
-strh r4, [r0,#+190]
-vmladavx.s16 r4, Q7, Q0
-vldrh.u16 Q1, [r1, #92]
-vldrw.u32 Q3, [Q2, #108]
-vmladavax.s16 r4, Q1, Q3
-vldrh.u16 Q4, [r1, #108]
-vldrw.u32 Q5, [Q2, #92]
-vmladavax.s16 r4, Q4, Q5
-vldrh.u16 Q6, [r1, #124]
-vldrw.u32 Q7, [Q2, #76]
-vmladavax.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #78]
-vldrw.u32 Q1, [Q2, #124]
-strh r14, [r0,#+192]
-vmladavx.s16 r14, Q0, Q1
-vldrh.u16 Q3, [r1, #94]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r14, Q3, Q4
-vldrh.u16 Q5, [r1, #110]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r14, Q5, Q6
-vldrh.u16 Q7, [r1, #126]
-vldrw.u32 Q0, [Q2, #76]
-vmladavax.s16 r14, Q7, Q0
-vldrh.u16 Q1, [r1, #80]
-vldrw.u32 Q3, [Q2, #124]
-strh r12, [r0,#+194]
-vmladavx.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #96]
-vldrw.u32 Q5, [Q2, #108]
-vmladavax.s16 r12, Q4, Q5
-vldrh.u16 Q6, [r1, #112]
-vldrw.u32 Q7, [Q2, #92]
-vmladavax.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #82]
-strh r10, [r0,#+196]
-vmladavx.s16 r10, Q0, Q3
-vldrh.u16 Q1, [r1, #98]
-vmladavax.s16 r10, Q1, Q5
-vldrh.u16 Q3, [r1, #114]
-vmladavax.s16 r10, Q3, Q7
-vldrh.u16 Q4, [r1, #84]
-vldrw.u32 Q5, [Q2, #124]
-strh r8, [r0,#+198]
-vmladavx.s16 r8, Q4, Q5
-vldrh.u16 Q6, [r1, #100]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r8, Q6, Q7
-vldrh.u16 Q0, [r1, #116]
-vldrw.u32 Q1, [Q2, #92]
-vmladavax.s16 r8, Q0, Q1
-vldrh.u16 Q3, [r1, #86]
-strh r6, [r0,#+200]
-vmladavx.s16 r6, Q3, Q5
-vldrh.u16 Q4, [r1, #102]
-vmladavax.s16 r6, Q4, Q7
-vldrh.u16 Q5, [r1, #118]
-vmladavax.s16 r6, Q5, Q1
-vldrh.u16 Q6, [r1, #88]
-vldrw.u32 Q7, [Q2, #124]
-strh r4, [r0,#+202]
-vmladavx.s16 r4, Q6, Q7
-vldrh.u16 Q0, [r1, #104]
-vldrw.u32 Q1, [Q2, #108]
-vmladavax.s16 r4, Q0, Q1
-vldrh.u16 Q3, [r1, #120]
-vldrw.u32 Q4, [Q2, #92]
-vmladavax.s16 r4, Q3, Q4
-vldrh.u16 Q5, [r1, #90]
-strh r14, [r0,#+204]
-vmladavx.s16 r14, Q5, Q7
-vldrh.u16 Q6, [r1, #106]
-vmladavax.s16 r14, Q6, Q1
-vldrh.u16 Q7, [r1, #122]
-vmladavax.s16 r14, Q7, Q4
-vldrh.u16 Q0, [r1, #92]
-vldrw.u32 Q1, [Q2, #124]
-strh r12, [r0,#+206]
-vmladavx.s16 r12, Q0, Q1
-vldrh.u16 Q3, [r1, #108]
-vldrw.u32 Q4, [Q2, #108]
-vmladavax.s16 r12, Q3, Q4
-vldrh.u16 Q5, [r1, #124]
-vldrw.u32 Q6, [Q2, #92]
-vmladavax.s16 r12, Q5, Q6
-vldrh.u16 Q7, [r1, #94]
-strh r10, [r0,#+208]
-vmladavx.s16 r10, Q7, Q1
-vldrh.u16 Q0, [r1, #110]
-vmladavax.s16 r10, Q0, Q4
-vldrh.u16 Q1, [r1, #126]
-vmladavax.s16 r10, Q1, Q6
-vldrh.u16 Q3, [r1, #96]
-vldrw.u32 Q4, [Q2, #124]
-strh r8, [r0,#+210]
-vmladavx.s16 r8, Q3, Q4
-vldrh.u16 Q5, [r1, #112]
-vldrw.u32 Q6, [Q2, #108]
-vmladavax.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #98]
-strh r6, [r0,#+212]
-vmladavx.s16 r6, Q7, Q4
-vldrh.u16 Q0, [r1, #114]
-vmladavax.s16 r6, Q0, Q6
-vldrh.u16 Q1, [r1, #100]
-strh r4, [r0,#+214]
-vmladavx.s16 r4, Q1, Q4
-vldrh.u16 Q3, [r1, #116]
-vmladavax.s16 r4, Q3, Q6
-vldrh.u16 Q4, [r1, #102]
-vldrw.u32 Q5, [Q2, #124]
-strh r14, [r0,#+216]
-vmladavx.s16 r14, Q4, Q5
-vldrh.u16 Q6, [r1, #118]
-vldrw.u32 Q7, [Q2, #108]
-vmladavax.s16 r14, Q6, Q7
-vldrh.u16 Q0, [r1, #104]
-strh r12, [r0,#+218]
-vmladavx.s16 r12, Q0, Q5
-vldrh.u16 Q1, [r1, #120]
-vmladavax.s16 r12, Q1, Q7
-vldrh.u16 Q3, [r1, #106]
-strh r10, [r0,#+220]
-vmladavx.s16 r10, Q3, Q5
-vldrh.u16 Q4, [r1, #122]
-vmladavax.s16 r10, Q4, Q7
-vldrh.u16 Q5, [r1, #108]
-vldrw.u32 Q6, [Q2, #124]
-strh r8, [r0,#+222]
-vmladavx.s16 r8, Q5, Q6
-vldrh.u16 Q7, [r1, #124]
-vldrw.u32 Q0, [Q2, #108]
-vmladavax.s16 r8, Q7, Q0
-vldrh.u16 Q1, [r1, #110]
-strh r6, [r0,#+224]
-vmladavx.s16 r6, Q1, Q6
-vldrh.u16 Q3, [r1, #126]
-vmladavax.s16 r6, Q3, Q0
-vldrh.u16 Q4, [r1, #112]
-strh r4, [r0,#+226]
-vmladavx.s16 r4, Q4, Q6
-vldrh.u16 Q5, [r1, #114]
-strh r14, [r0,#+228]
-vmladavx.s16 r14, Q5, Q6
-vldrh.u16 Q6, [r1, #116]
-vldrw.u32 Q7, [Q2, #124]
-strh r12, [r0,#+230]
-vmladavx.s16 r12, Q6, Q7
-vldrh.u16 Q0, [r1, #118]
-strh r10, [r0,#+232]
-vmladavx.s16 r10, Q0, Q7
-vldrh.u16 Q1, [r1, #120]
-strh r8, [r0,#+234]
-vmladavx.s16 r8, Q1, Q7
-vldrh.u16 Q3,
[r1, #122] -strh r6, [r0,#+236] -vmladavx.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #124] -strh r4, [r0,#+238] -vmladavx.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #126] -strh r14, [r0,#+240] -vmladavx.s16 r14, Q5, Q7 -strh r14, [r0,#+252] -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_64_mve_schoolbook.s b/tests/schoolbook/auto/poly_u16_mul_64_mve_schoolbook.s deleted file mode 100644 index 3b01bf3..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_64_mve_schoolbook.s +++ /dev/null @@ -1,2241 +0,0 @@ -.syntax unified -.type poly_u16_mul_64_schoolbook_mve, %function -.global poly_u16_mul_64_schoolbook_mve -poly_u16_mul_64_schoolbook_mve: -push {r4-r11,lr} -vldrh.u16 Q0, [r1, #-14] -vddup.u32 Q2,r2,#4 -vldrw.u32 Q1, [Q2, #12] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -vmladavx.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -vmladavx.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -vmladavx.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -vmladavx.s16 r6, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -vmladavx.s16 r4, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r14, [r0,#+0] -vmladavx.s16 r14, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #12] -strh r12, [r0,#+2] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -strh r10, [r0,#+4] -vmladavx.s16 r10, Q4, Q3 -vldrh.u16 Q5, [r1, #4] -strh r8, [r0,#+6] -vmladavx.s16 r8, Q5, Q3 -vldrh.u16 Q6, [r1, #6] -strh r6, [r0,#+8] -vmladavx.s16 r6, Q6, Q3 -vldrh.u16 Q7, [r1, #8] -strh r4, [r0,#+10] -vmladavx.s16 r4, Q7, Q3 -vldrh.u16 Q0, [r1, #10] -strh r14, [r0,#+12] -vmladavx.s16 r14, Q0, Q3 -vldrh.u16 Q1, [r1, #12] -strh r12, [r0,#+14] -vmladavx.s16 r12, Q1, Q3 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #12] -strh r10, [r0,#+16] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r8, [r0,#+18] -vmladavx.s16 r8, Q5, Q4 -vldrh.u16 Q6, [r1, #18] -strh r6, [r0,#+20] -vmladavx.s16 r6, Q6, Q4 -vldrh.u16 Q7, [r1, #20] -strh r4, [r0,#+22] -vmladavx.s16 r4, Q7, Q4 -vldrh.u16 Q0, [r1, #22] -strh r14, [r0,#+24] 
-vmladavx.s16 r14, Q0, Q4 -vldrh.u16 Q1, [r1, #24] -strh r12, [r0,#+26] -vmladavx.s16 r12, Q1, Q4 -vldrh.u16 Q3, [r1, #26] -strh r10, [r0,#+28] -vmladavx.s16 r10, Q3, Q4 -vldrh.u16 Q4, [r1, #28] -vldrw.u32 Q5, [Q2, #12] -strh r8, [r0,#+30] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -strh r6, [r0,#+32] -vmladavx.s16 r6, Q6, Q5 -vldrh.u16 Q7, [r1, #32] -strh r4, [r0,#+34] -vmladavx.s16 r4, Q7, Q5 -vldrh.u16 Q0, [r1, #34] -strh r14, [r0,#+36] -vmladavx.s16 r14, Q0, Q5 -vldrh.u16 Q1, [r1, #36] -strh r12, [r0,#+38] -vmladavx.s16 r12, Q1, Q5 -vldrh.u16 Q3, [r1, #38] -strh r10, [r0,#+40] -vmladavx.s16 r10, Q3, Q5 -vldrh.u16 Q4, [r1, #40] -strh r8, [r0,#+42] -vmladavx.s16 r8, Q4, Q5 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #12] -strh r6, [r0,#+44] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -strh r4, [r0,#+46] -vmladavx.s16 r4, Q7, Q6 -vldrh.u16 Q0, [r1, #46] -strh r14, [r0,#+48] -vmladavx.s16 r14, Q0, Q6 -vldrh.u16 Q1, [r1, #48] -strh r12, [r0,#+50] -vmladavx.s16 r12, Q1, Q6 -vldrh.u16 Q3, [r1, #50] -strh r10, [r0,#+52] -vmladavx.s16 r10, Q3, Q6 -vldrh.u16 Q4, [r1, #52] -strh r8, [r0,#+54] -vmladavx.s16 r8, Q4, Q6 -vldrh.u16 Q5, [r1, #54] -strh r6, [r0,#+56] -vmladavx.s16 r6, Q5, Q6 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #12] -strh r4, [r0,#+58] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -strh r14, [r0,#+60] -vmladavx.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #60] -strh r12, [r0,#+62] -vmladavx.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #62] -strh r10, [r0,#+64] -vmladavx.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #64] -strh r8, [r0,#+66] -vmladavx.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #66] -strh r6, [r0,#+68] -vmladavx.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #68] -strh r4, [r0,#+70] -vmladavx.s16 r4, Q6, Q7 -vldrh.u16 Q7, [r1, #70] -vldrw.u32 Q0, [Q2, #12] -strh r14, [r0,#+72] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #72] -strh r12, [r0,#+74] -vmladavx.s16 r12, Q1, Q0 -vldrh.u16 Q3, [r1, #74] -strh r10, [r0,#+76] -vmladavx.s16 r10, Q3, Q0 -vldrh.u16 Q4, [r1, #76] 
-strh r8, [r0,#+78] -vmladavx.s16 r8, Q4, Q0 -vldrh.u16 Q5, [r1, #78] -strh r6, [r0,#+80] -vmladavx.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #80] -strh r4, [r0,#+82] -vmladavx.s16 r4, Q6, Q0 -vldrh.u16 Q7, [r1, #82] -strh r14, [r0,#+84] -vmladavx.s16 r14, Q7, Q0 -vldrh.u16 Q0, [r1, #84] -vldrw.u32 Q1, [Q2, #12] -strh r12, [r0,#+86] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #86] -strh r10, [r0,#+88] -vmladavx.s16 r10, Q3, Q1 -vldrh.u16 Q4, [r1, #88] -strh r8, [r0,#+90] -vmladavx.s16 r8, Q4, Q1 -vldrh.u16 Q5, [r1, #90] -strh r6, [r0,#+92] -vmladavx.s16 r6, Q5, Q1 -vldrh.u16 Q6, [r1, #92] -strh r4, [r0,#+94] -vmladavx.s16 r4, Q6, Q1 -vldrh.u16 Q7, [r1, #94] -strh r14, [r0,#+96] -vmladavx.s16 r14, Q7, Q1 -vldrh.u16 Q0, [r1, #96] -strh r12, [r0,#+98] -vmladavx.s16 r12, Q0, Q1 -vldrh.u16 Q1, [r1, #98] -vldrw.u32 Q3, [Q2, #12] -strh r10, [r0,#+100] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #100] -strh r8, [r0,#+102] -vmladavx.s16 r8, Q4, Q3 -vldrh.u16 Q5, [r1, #102] -strh r6, [r0,#+104] -vmladavx.s16 r6, Q5, Q3 -vldrh.u16 Q6, [r1, #104] -strh r4, [r0,#+106] -vmladavx.s16 r4, Q6, Q3 -vldrh.u16 Q7, [r1, #106] -strh r14, [r0,#+108] -vmladavx.s16 r14, Q7, Q3 -vldrh.u16 Q0, [r1, #108] -strh r12, [r0,#+110] -vmladavx.s16 r12, Q0, Q3 -vldrh.u16 Q1, [r1, #110] -strh r10, [r0,#+112] -vmladavx.s16 r10, Q1, Q3 -vldrh.u16 Q3, [r1, #112] -vldrw.u32 Q4, [Q2, #12] -strh r8, [r0,#+114] -vmladavx.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #114] -strh r6, [r0,#+116] -vmladavx.s16 r6, Q5, Q4 -vldrh.u16 Q6, [r1, #116] -strh r4, [r0,#+118] -vmladavx.s16 r4, Q6, Q4 -vldrh.u16 Q7, [r1, #118] -strh r14, [r0,#+120] -vmladavx.s16 r14, Q7, Q4 -vldrh.u16 Q0, [r1, #120] -strh r12, [r0,#+122] -vmladavx.s16 r12, Q0, Q4 -vldrh.u16 Q1, [r1, #122] -strh r10, [r0,#+124] -vmladavx.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #124] -strh r8, [r0,#+126] -vmladavx.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #126] -vldrw.u32 Q5, [Q2, #12] -strh r6, [r0,#+128] -vmladavx.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #-14] -vldrw.u32 Q7, [Q2, #28] 
-strh r4, [r0,#+130] -ldrh r4, [r0,#+16] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q0, [r1, #-12] -strh r14, [r0,#+132] -ldrh r14, [r0,#+18] -vmladavax.s16 r14, Q0, Q7 -vldrh.u16 Q1, [r1, #-10] -strh r12, [r0,#+134] -ldrh r12, [r0,#+20] -vmladavax.s16 r12, Q1, Q7 -vldrh.u16 Q3, [r1, #-8] -strh r10, [r0,#+136] -ldrh r10, [r0,#+22] -vmladavax.s16 r10, Q3, Q7 -vldrh.u16 Q4, [r1, #-6] -strh r8, [r0,#+138] -ldrh r8, [r0,#+24] -vmladavax.s16 r8, Q4, Q7 -vldrh.u16 Q5, [r1, #-4] -strh r6, [r0,#+140] -ldrh r6, [r0,#+26] -vmladavax.s16 r6, Q5, Q7 -vldrh.u16 Q6, [r1, #-2] -strh r4, [r0,#+16] -ldrh r4, [r0,#+28] -vmladavax.s16 r4, Q6, Q7 -vldrh.u16 Q7, [r1, #0] -vldrw.u32 Q0, [Q2, #28] -strh r14, [r0,#+18] -ldrh r14, [r0,#+30] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q1, [r1, #2] -strh r12, [r0,#+20] -ldrh r12, [r0,#+32] -vmladavax.s16 r12, Q1, Q0 -vldrh.u16 Q3, [r1, #4] -strh r10, [r0,#+22] -ldrh r10, [r0,#+34] -vmladavax.s16 r10, Q3, Q0 -vldrh.u16 Q4, [r1, #6] -strh r8, [r0,#+24] -ldrh r8, [r0,#+36] -vmladavax.s16 r8, Q4, Q0 -vldrh.u16 Q5, [r1, #8] -strh r6, [r0,#+26] -ldrh r6, [r0,#+38] -vmladavax.s16 r6, Q5, Q0 -vldrh.u16 Q6, [r1, #10] -strh r4, [r0,#+28] -ldrh r4, [r0,#+40] -vmladavax.s16 r4, Q6, Q0 -vldrh.u16 Q7, [r1, #12] -strh r14, [r0,#+30] -ldrh r14, [r0,#+42] -vmladavax.s16 r14, Q7, Q0 -vldrh.u16 Q0, [r1, #14] -vldrw.u32 Q1, [Q2, #28] -strh r12, [r0,#+32] -ldrh r12, [r0,#+44] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q3, [r1, #16] -strh r10, [r0,#+34] -ldrh r10, [r0,#+46] -vmladavax.s16 r10, Q3, Q1 -vldrh.u16 Q4, [r1, #18] -strh r8, [r0,#+36] -ldrh r8, [r0,#+48] -vmladavax.s16 r8, Q4, Q1 -vldrh.u16 Q5, [r1, #20] -strh r6, [r0,#+38] -ldrh r6, [r0,#+50] -vmladavax.s16 r6, Q5, Q1 -vldrh.u16 Q6, [r1, #22] -strh r4, [r0,#+40] -ldrh r4, [r0,#+52] -vmladavax.s16 r4, Q6, Q1 -vldrh.u16 Q7, [r1, #24] -strh r14, [r0,#+42] -ldrh r14, [r0,#+54] -vmladavax.s16 r14, Q7, Q1 -vldrh.u16 Q0, [r1, #26] -strh r12, [r0,#+44] -ldrh r12, [r0,#+56] -vmladavax.s16 r12, Q0, Q1 -vldrh.u16 Q1, [r1, #28] 
-vldrw.u32 Q3, [Q2, #28] -strh r10, [r0,#+46] -ldrh r10, [r0,#+58] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q4, [r1, #30] -strh r8, [r0,#+48] -ldrh r8, [r0,#+60] -vmladavax.s16 r8, Q4, Q3 -vldrh.u16 Q5, [r1, #32] -strh r6, [r0,#+50] -ldrh r6, [r0,#+62] -vmladavax.s16 r6, Q5, Q3 -vldrh.u16 Q6, [r1, #34] -strh r4, [r0,#+52] -ldrh r4, [r0,#+64] -vmladavax.s16 r4, Q6, Q3 -vldrh.u16 Q7, [r1, #36] -strh r14, [r0,#+54] -ldrh r14, [r0,#+66] -vmladavax.s16 r14, Q7, Q3 -vldrh.u16 Q0, [r1, #38] -strh r12, [r0,#+56] -ldrh r12, [r0,#+68] -vmladavax.s16 r12, Q0, Q3 -vldrh.u16 Q1, [r1, #40] -strh r10, [r0,#+58] -ldrh r10, [r0,#+70] -vmladavax.s16 r10, Q1, Q3 -vldrh.u16 Q3, [r1, #42] -vldrw.u32 Q4, [Q2, #28] -strh r8, [r0,#+60] -ldrh r8, [r0,#+72] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q5, [r1, #44] -strh r6, [r0,#+62] -ldrh r6, [r0,#+74] -vmladavax.s16 r6, Q5, Q4 -vldrh.u16 Q6, [r1, #46] -strh r4, [r0,#+64] -ldrh r4, [r0,#+76] -vmladavax.s16 r4, Q6, Q4 -vldrh.u16 Q7, [r1, #48] -strh r14, [r0,#+66] -ldrh r14, [r0,#+78] -vmladavax.s16 r14, Q7, Q4 -vldrh.u16 Q0, [r1, #50] -strh r12, [r0,#+68] -ldrh r12, [r0,#+80] -vmladavax.s16 r12, Q0, Q4 -vldrh.u16 Q1, [r1, #52] -strh r10, [r0,#+70] -ldrh r10, [r0,#+82] -vmladavax.s16 r10, Q1, Q4 -vldrh.u16 Q3, [r1, #54] -strh r8, [r0,#+72] -ldrh r8, [r0,#+84] -vmladavax.s16 r8, Q3, Q4 -vldrh.u16 Q4, [r1, #56] -vldrw.u32 Q5, [Q2, #28] -strh r6, [r0,#+74] -ldrh r6, [r0,#+86] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #58] -strh r4, [r0,#+76] -ldrh r4, [r0,#+88] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #60] -strh r14, [r0,#+78] -ldrh r14, [r0,#+90] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #62] -strh r12, [r0,#+80] -ldrh r12, [r0,#+92] -vmladavax.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #64] -strh r10, [r0,#+82] -ldrh r10, [r0,#+94] -vmladavax.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #66] -strh r8, [r0,#+84] -ldrh r8, [r0,#+96] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #68] -strh r6, [r0,#+86] -ldrh r6, [r0,#+98] -vmladavax.s16 r6, Q4, Q5 
-vldrh.u16 Q5, [r1, #70] -vldrw.u32 Q6, [Q2, #28] -strh r4, [r0,#+88] -ldrh r4, [r0,#+100] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #72] -strh r14, [r0,#+90] -ldrh r14, [r0,#+102] -vmladavax.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #74] -strh r12, [r0,#+92] -ldrh r12, [r0,#+104] -vmladavax.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #76] -strh r10, [r0,#+94] -ldrh r10, [r0,#+106] -vmladavax.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #78] -strh r8, [r0,#+96] -ldrh r8, [r0,#+108] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #80] -strh r6, [r0,#+98] -ldrh r6, [r0,#+110] -vmladavax.s16 r6, Q4, Q6 -vldrh.u16 Q5, [r1, #82] -strh r4, [r0,#+100] -ldrh r4, [r0,#+112] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q6, [r1, #84] -vldrw.u32 Q7, [Q2, #28] -strh r14, [r0,#+102] -ldrh r14, [r0,#+114] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #86] -strh r12, [r0,#+104] -ldrh r12, [r0,#+116] -vmladavax.s16 r12, Q0, Q7 -vldrh.u16 Q1, [r1, #88] -strh r10, [r0,#+106] -ldrh r10, [r0,#+118] -vmladavax.s16 r10, Q1, Q7 -vldrh.u16 Q3, [r1, #90] -strh r8, [r0,#+108] -ldrh r8, [r0,#+120] -vmladavax.s16 r8, Q3, Q7 -vldrh.u16 Q4, [r1, #92] -strh r6, [r0,#+110] -ldrh r6, [r0,#+122] -vmladavax.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #94] -strh r4, [r0,#+112] -ldrh r4, [r0,#+124] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #96] -strh r14, [r0,#+114] -ldrh r14, [r0,#+126] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #98] -vldrw.u32 Q0, [Q2, #28] -strh r12, [r0,#+116] -ldrh r12, [r0,#+128] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #100] -strh r10, [r0,#+118] -ldrh r10, [r0,#+130] -vmladavax.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #102] -strh r8, [r0,#+120] -ldrh r8, [r0,#+132] -vmladavax.s16 r8, Q3, Q0 -vldrh.u16 Q4, [r1, #104] -strh r6, [r0,#+122] -ldrh r6, [r0,#+134] -vmladavax.s16 r6, Q4, Q0 -vldrh.u16 Q5, [r1, #106] -strh r4, [r0,#+124] -ldrh r4, [r0,#+136] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #108] -strh r14, [r0,#+126] -ldrh r14, [r0,#+138] -vmladavax.s16 r14, Q6, Q0 -vldrh.u16 Q7, [r1, #110] -strh 
r12, [r0,#+128] -ldrh r12, [r0,#+140] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #112] -vldrw.u32 Q1, [Q2, #28] -strh r10, [r0,#+130] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #114] -strh r8, [r0,#+132] -vmladavx.s16 r8, Q3, Q1 -vldrh.u16 Q4, [r1, #116] -strh r6, [r0,#+134] -vmladavx.s16 r6, Q4, Q1 -vldrh.u16 Q5, [r1, #118] -strh r4, [r0,#+136] -vmladavx.s16 r4, Q5, Q1 -vldrh.u16 Q6, [r1, #120] -strh r14, [r0,#+138] -vmladavx.s16 r14, Q6, Q1 -vldrh.u16 Q7, [r1, #122] -strh r12, [r0,#+140] -vmladavx.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #124] -strh r10, [r0,#+142] -vmladavx.s16 r10, Q0, Q1 -vldrh.u16 Q1, [r1, #126] -vldrw.u32 Q3, [Q2, #28] -strh r8, [r0,#+144] -vmladavx.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #-14] -vldrw.u32 Q5, [Q2, #44] -strh r6, [r0,#+146] -ldrh r6, [r0,#+32] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q6, [r1, #-12] -strh r4, [r0,#+148] -ldrh r4, [r0,#+34] -vmladavax.s16 r4, Q6, Q5 -vldrh.u16 Q7, [r1, #-10] -strh r14, [r0,#+150] -ldrh r14, [r0,#+36] -vmladavax.s16 r14, Q7, Q5 -vldrh.u16 Q0, [r1, #-8] -strh r12, [r0,#+152] -ldrh r12, [r0,#+38] -vmladavax.s16 r12, Q0, Q5 -vldrh.u16 Q1, [r1, #-6] -strh r10, [r0,#+154] -ldrh r10, [r0,#+40] -vmladavax.s16 r10, Q1, Q5 -vldrh.u16 Q3, [r1, #-4] -strh r8, [r0,#+156] -ldrh r8, [r0,#+42] -vmladavax.s16 r8, Q3, Q5 -vldrh.u16 Q4, [r1, #-2] -strh r6, [r0,#+32] -ldrh r6, [r0,#+44] -vmladavax.s16 r6, Q4, Q5 -vldrh.u16 Q5, [r1, #0] -vldrw.u32 Q6, [Q2, #44] -strh r4, [r0,#+34] -ldrh r4, [r0,#+46] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q7, [r1, #2] -strh r14, [r0,#+36] -ldrh r14, [r0,#+48] -vmladavax.s16 r14, Q7, Q6 -vldrh.u16 Q0, [r1, #4] -strh r12, [r0,#+38] -ldrh r12, [r0,#+50] -vmladavax.s16 r12, Q0, Q6 -vldrh.u16 Q1, [r1, #6] -strh r10, [r0,#+40] -ldrh r10, [r0,#+52] -vmladavax.s16 r10, Q1, Q6 -vldrh.u16 Q3, [r1, #8] -strh r8, [r0,#+42] -ldrh r8, [r0,#+54] -vmladavax.s16 r8, Q3, Q6 -vldrh.u16 Q4, [r1, #10] -strh r6, [r0,#+44] -ldrh r6, [r0,#+56] -vmladavax.s16 r6, Q4, Q6 -vldrh.u16 Q5, [r1, #12] -strh r4, [r0,#+46] 
-ldrh r4, [r0,#+58] -vmladavax.s16 r4, Q5, Q6 -vldrh.u16 Q6, [r1, #14] -vldrw.u32 Q7, [Q2, #44] -strh r14, [r0,#+48] -ldrh r14, [r0,#+60] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q0, [r1, #16] -strh r12, [r0,#+50] -ldrh r12, [r0,#+62] -vmladavax.s16 r12, Q0, Q7 -vldrh.u16 Q1, [r1, #18] -strh r10, [r0,#+52] -ldrh r10, [r0,#+64] -vmladavax.s16 r10, Q1, Q7 -vldrh.u16 Q3, [r1, #20] -strh r8, [r0,#+54] -ldrh r8, [r0,#+66] -vmladavax.s16 r8, Q3, Q7 -vldrh.u16 Q4, [r1, #22] -strh r6, [r0,#+56] -ldrh r6, [r0,#+68] -vmladavax.s16 r6, Q4, Q7 -vldrh.u16 Q5, [r1, #24] -strh r4, [r0,#+58] -ldrh r4, [r0,#+70] -vmladavax.s16 r4, Q5, Q7 -vldrh.u16 Q6, [r1, #26] -strh r14, [r0,#+60] -ldrh r14, [r0,#+72] -vmladavax.s16 r14, Q6, Q7 -vldrh.u16 Q7, [r1, #28] -vldrw.u32 Q0, [Q2, #44] -strh r12, [r0,#+62] -ldrh r12, [r0,#+74] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q1, [r1, #30] -strh r10, [r0,#+64] -ldrh r10, [r0,#+76] -vmladavax.s16 r10, Q1, Q0 -vldrh.u16 Q3, [r1, #32] -strh r8, [r0,#+66] -ldrh r8, [r0,#+78] -vmladavax.s16 r8, Q3, Q0 -vldrh.u16 Q4, [r1, #34] -strh r6, [r0,#+68] -ldrh r6, [r0,#+80] -vmladavax.s16 r6, Q4, Q0 -vldrh.u16 Q5, [r1, #36] -strh r4, [r0,#+70] -ldrh r4, [r0,#+82] -vmladavax.s16 r4, Q5, Q0 -vldrh.u16 Q6, [r1, #38] -strh r14, [r0,#+72] -ldrh r14, [r0,#+84] -vmladavax.s16 r14, Q6, Q0 -vldrh.u16 Q7, [r1, #40] -strh r12, [r0,#+74] -ldrh r12, [r0,#+86] -vmladavax.s16 r12, Q7, Q0 -vldrh.u16 Q0, [r1, #42] -vldrw.u32 Q1, [Q2, #44] -strh r10, [r0,#+76] -ldrh r10, [r0,#+88] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q3, [r1, #44] -strh r8, [r0,#+78] -ldrh r8, [r0,#+90] -vmladavax.s16 r8, Q3, Q1 -vldrh.u16 Q4, [r1, #46] -strh r6, [r0,#+80] -ldrh r6, [r0,#+92] -vmladavax.s16 r6, Q4, Q1 -vldrh.u16 Q5, [r1, #48] -strh r4, [r0,#+82] -ldrh r4, [r0,#+94] -vmladavax.s16 r4, Q5, Q1 -vldrh.u16 Q6, [r1, #50] -strh r14, [r0,#+84] -ldrh r14, [r0,#+96] -vmladavax.s16 r14, Q6, Q1 -vldrh.u16 Q7, [r1, #52] -strh r12, [r0,#+86] -ldrh r12, [r0,#+98] -vmladavax.s16 r12, Q7, Q1 -vldrh.u16 Q0, [r1, #54] 
-strh r10, [r0,#+88] -ldrh r10, [r0,#+100] -vmladavax.s16 r10, Q0, Q1 -vldrh.u16 Q1, [r1, #56] -vldrw.u32 Q3, [Q2, #44] -strh r8, [r0,#+90] -ldrh r8, [r0,#+102] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #58] -strh r6, [r0,#+92] -ldrh r6, [r0,#+104] -vmladavax.s16 r6, Q4, Q3 -vldrh.u16 Q5, [r1, #60] -strh r4, [r0,#+94] -ldrh r4, [r0,#+106] -vmladavax.s16 r4, Q5, Q3 -vldrh.u16 Q6, [r1, #62] -strh r14, [r0,#+96] -ldrh r14, [r0,#+108] -vmladavax.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #64] -strh r12, [r0,#+98] -ldrh r12, [r0,#+110] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #66] -strh r10, [r0,#+100] -ldrh r10, [r0,#+112] -vmladavax.s16 r10, Q0, Q3 -vldrh.u16 Q1, [r1, #68] -strh r8, [r0,#+102] -ldrh r8, [r0,#+114] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q3, [r1, #70] -vldrw.u32 Q4, [Q2, #44] -strh r6, [r0,#+104] -ldrh r6, [r0,#+116] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #72] -strh r4, [r0,#+106] -ldrh r4, [r0,#+118] -vmladavax.s16 r4, Q5, Q4 -vldrh.u16 Q6, [r1, #74] -strh r14, [r0,#+108] -ldrh r14, [r0,#+120] -vmladavax.s16 r14, Q6, Q4 -vldrh.u16 Q7, [r1, #76] -strh r12, [r0,#+110] -ldrh r12, [r0,#+122] -vmladavax.s16 r12, Q7, Q4 -vldrh.u16 Q0, [r1, #78] -strh r10, [r0,#+112] -ldrh r10, [r0,#+124] -vmladavax.s16 r10, Q0, Q4 -vldrh.u16 Q1, [r1, #80] -strh r8, [r0,#+114] -ldrh r8, [r0,#+126] -vmladavax.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #82] -strh r6, [r0,#+116] -ldrh r6, [r0,#+128] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q4, [r1, #84] -vldrw.u32 Q5, [Q2, #44] -strh r4, [r0,#+118] -ldrh r4, [r0,#+130] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #86] -strh r14, [r0,#+120] -ldrh r14, [r0,#+132] -vmladavax.s16 r14, Q6, Q5 -vldrh.u16 Q7, [r1, #88] -strh r12, [r0,#+122] -ldrh r12, [r0,#+134] -vmladavax.s16 r12, Q7, Q5 -vldrh.u16 Q0, [r1, #90] -strh r10, [r0,#+124] -ldrh r10, [r0,#+136] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #92] -strh r8, [r0,#+126] -ldrh r8, [r0,#+138] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #94] -strh r6, [r0,#+128] -ldrh r6, 
[r0,#+140] -vmladavax.s16 r6, Q3, Q5 -vldrh.u16 Q4, [r1, #96] -strh r4, [r0,#+130] -ldrh r4, [r0,#+142] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q5, [r1, #98] -vldrw.u32 Q6, [Q2, #44] -strh r14, [r0,#+132] -ldrh r14, [r0,#+144] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #100] -strh r12, [r0,#+134] -ldrh r12, [r0,#+146] -vmladavax.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #102] -strh r10, [r0,#+136] -ldrh r10, [r0,#+148] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #104] -strh r8, [r0,#+138] -ldrh r8, [r0,#+150] -vmladavax.s16 r8, Q1, Q6 -vldrh.u16 Q3, [r1, #106] -strh r6, [r0,#+140] -ldrh r6, [r0,#+152] -vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, #108] -strh r4, [r0,#+142] -ldrh r4, [r0,#+154] -vmladavax.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #110] -strh r14, [r0,#+144] -ldrh r14, [r0,#+156] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #112] -vldrw.u32 Q7, [Q2, #44] -strh r12, [r0,#+146] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #114] -strh r10, [r0,#+148] -vmladavx.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #116] -strh r8, [r0,#+150] -vmladavx.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #118] -strh r6, [r0,#+152] -vmladavx.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #120] -strh r4, [r0,#+154] -vmladavx.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #122] -strh r14, [r0,#+156] -vmladavx.s16 r14, Q5, Q7 -vldrh.u16 Q6, [r1, #124] -strh r12, [r0,#+158] -vmladavx.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #126] -vldrw.u32 Q0, [Q2, #44] -strh r10, [r0,#+160] -vmladavx.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #-14] -vldrw.u32 Q3, [Q2, #60] -strh r8, [r0,#+162] -ldrh r8, [r0,#+48] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q4, [r1, #-12] -strh r6, [r0,#+164] -ldrh r6, [r0,#+50] -vmladavax.s16 r6, Q4, Q3 -vldrh.u16 Q5, [r1, #-10] -strh r4, [r0,#+166] -ldrh r4, [r0,#+52] -vmladavax.s16 r4, Q5, Q3 -vldrh.u16 Q6, [r1, #-8] -strh r14, [r0,#+168] -ldrh r14, [r0,#+54] -vmladavax.s16 r14, Q6, Q3 -vldrh.u16 Q7, [r1, #-6] -strh r12, [r0,#+170] -ldrh r12, [r0,#+56] -vmladavax.s16 r12, Q7, Q3 -vldrh.u16 Q0, [r1, #-4] -strh r10, 
[r0,#+172] -ldrh r10, [r0,#+58] -vmladavax.s16 r10, Q0, Q3 -vldrh.u16 Q1, [r1, #-2] -strh r8, [r0,#+48] -ldrh r8, [r0,#+60] -vmladavax.s16 r8, Q1, Q3 -vldrh.u16 Q3, [r1, #0] -vldrw.u32 Q4, [Q2, #60] -strh r6, [r0,#+50] -ldrh r6, [r0,#+62] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q5, [r1, #2] -strh r4, [r0,#+52] -ldrh r4, [r0,#+64] -vmladavax.s16 r4, Q5, Q4 -vldrh.u16 Q6, [r1, #4] -strh r14, [r0,#+54] -ldrh r14, [r0,#+66] -vmladavax.s16 r14, Q6, Q4 -vldrh.u16 Q7, [r1, #6] -strh r12, [r0,#+56] -ldrh r12, [r0,#+68] -vmladavax.s16 r12, Q7, Q4 -vldrh.u16 Q0, [r1, #8] -strh r10, [r0,#+58] -ldrh r10, [r0,#+70] -vmladavax.s16 r10, Q0, Q4 -vldrh.u16 Q1, [r1, #10] -strh r8, [r0,#+60] -ldrh r8, [r0,#+72] -vmladavax.s16 r8, Q1, Q4 -vldrh.u16 Q3, [r1, #12] -strh r6, [r0,#+62] -ldrh r6, [r0,#+74] -vmladavax.s16 r6, Q3, Q4 -vldrh.u16 Q4, [r1, #14] -vldrw.u32 Q5, [Q2, #60] -strh r4, [r0,#+64] -ldrh r4, [r0,#+76] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q6, [r1, #16] -strh r14, [r0,#+66] -ldrh r14, [r0,#+78] -vmladavax.s16 r14, Q6, Q5 -vldrh.u16 Q7, [r1, #18] -strh r12, [r0,#+68] -ldrh r12, [r0,#+80] -vmladavax.s16 r12, Q7, Q5 -vldrh.u16 Q0, [r1, #20] -strh r10, [r0,#+70] -ldrh r10, [r0,#+82] -vmladavax.s16 r10, Q0, Q5 -vldrh.u16 Q1, [r1, #22] -strh r8, [r0,#+72] -ldrh r8, [r0,#+84] -vmladavax.s16 r8, Q1, Q5 -vldrh.u16 Q3, [r1, #24] -strh r6, [r0,#+74] -ldrh r6, [r0,#+86] -vmladavax.s16 r6, Q3, Q5 -vldrh.u16 Q4, [r1, #26] -strh r4, [r0,#+76] -ldrh r4, [r0,#+88] -vmladavax.s16 r4, Q4, Q5 -vldrh.u16 Q5, [r1, #28] -vldrw.u32 Q6, [Q2, #60] -strh r14, [r0,#+78] -ldrh r14, [r0,#+90] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q7, [r1, #30] -strh r12, [r0,#+80] -ldrh r12, [r0,#+92] -vmladavax.s16 r12, Q7, Q6 -vldrh.u16 Q0, [r1, #32] -strh r10, [r0,#+82] -ldrh r10, [r0,#+94] -vmladavax.s16 r10, Q0, Q6 -vldrh.u16 Q1, [r1, #34] -strh r8, [r0,#+84] -ldrh r8, [r0,#+96] -vmladavax.s16 r8, Q1, Q6 -vldrh.u16 Q3, [r1, #36] -strh r6, [r0,#+86] -ldrh r6, [r0,#+98] -vmladavax.s16 r6, Q3, Q6 -vldrh.u16 Q4, [r1, 
#38] -strh r4, [r0,#+88] -ldrh r4, [r0,#+100] -vmladavax.s16 r4, Q4, Q6 -vldrh.u16 Q5, [r1, #40] -strh r14, [r0,#+90] -ldrh r14, [r0,#+102] -vmladavax.s16 r14, Q5, Q6 -vldrh.u16 Q6, [r1, #42] -vldrw.u32 Q7, [Q2, #60] -strh r12, [r0,#+92] -ldrh r12, [r0,#+104] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q0, [r1, #44] -strh r10, [r0,#+94] -ldrh r10, [r0,#+106] -vmladavax.s16 r10, Q0, Q7 -vldrh.u16 Q1, [r1, #46] -strh r8, [r0,#+96] -ldrh r8, [r0,#+108] -vmladavax.s16 r8, Q1, Q7 -vldrh.u16 Q3, [r1, #48] -strh r6, [r0,#+98] -ldrh r6, [r0,#+110] -vmladavax.s16 r6, Q3, Q7 -vldrh.u16 Q4, [r1, #50] -strh r4, [r0,#+100] -ldrh r4, [r0,#+112] -vmladavax.s16 r4, Q4, Q7 -vldrh.u16 Q5, [r1, #52] -strh r14, [r0,#+102] -ldrh r14, [r0,#+114] -vmladavax.s16 r14, Q5, Q7 -vldrh.u16 Q6, [r1, #54] -strh r12, [r0,#+104] -ldrh r12, [r0,#+116] -vmladavax.s16 r12, Q6, Q7 -vldrh.u16 Q7, [r1, #56] -vldrw.u32 Q0, [Q2, #60] -strh r10, [r0,#+106] -ldrh r10, [r0,#+118] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #58] -strh r8, [r0,#+108] -ldrh r8, [r0,#+120] -vmladavax.s16 r8, Q1, Q0 -vldrh.u16 Q3, [r1, #60] -strh r6, [r0,#+110] -ldrh r6, [r0,#+122] -vmladavax.s16 r6, Q3, Q0 -vldrh.u16 Q4, [r1, #62] -strh r4, [r0,#+112] -ldrh r4, [r0,#+124] -vmladavax.s16 r4, Q4, Q0 -vldrh.u16 Q5, [r1, #64] -strh r14, [r0,#+114] -ldrh r14, [r0,#+126] -vmladavax.s16 r14, Q5, Q0 -vldrh.u16 Q6, [r1, #66] -strh r12, [r0,#+116] -ldrh r12, [r0,#+128] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #68] -strh r10, [r0,#+118] -ldrh r10, [r0,#+130] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #70] -vldrw.u32 Q1, [Q2, #60] -strh r8, [r0,#+120] -ldrh r8, [r0,#+132] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #72] -strh r6, [r0,#+122] -ldrh r6, [r0,#+134] -vmladavax.s16 r6, Q3, Q1 -vldrh.u16 Q4, [r1, #74] -strh r4, [r0,#+124] -ldrh r4, [r0,#+136] -vmladavax.s16 r4, Q4, Q1 -vldrh.u16 Q5, [r1, #76] -strh r14, [r0,#+126] -ldrh r14, [r0,#+138] -vmladavax.s16 r14, Q5, Q1 -vldrh.u16 Q6, [r1, #78] -strh r12, [r0,#+128] -ldrh r12, 
[r0,#+140] -vmladavax.s16 r12, Q6, Q1 -vldrh.u16 Q7, [r1, #80] -strh r10, [r0,#+130] -ldrh r10, [r0,#+142] -vmladavax.s16 r10, Q7, Q1 -vldrh.u16 Q0, [r1, #82] -strh r8, [r0,#+132] -ldrh r8, [r0,#+144] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q1, [r1, #84] -vldrw.u32 Q3, [Q2, #60] -strh r6, [r0,#+134] -ldrh r6, [r0,#+146] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #86] -strh r4, [r0,#+136] -ldrh r4, [r0,#+148] -vmladavax.s16 r4, Q4, Q3 -vldrh.u16 Q5, [r1, #88] -strh r14, [r0,#+138] -ldrh r14, [r0,#+150] -vmladavax.s16 r14, Q5, Q3 -vldrh.u16 Q6, [r1, #90] -strh r12, [r0,#+140] -ldrh r12, [r0,#+152] -vmladavax.s16 r12, Q6, Q3 -vldrh.u16 Q7, [r1, #92] -strh r10, [r0,#+142] -ldrh r10, [r0,#+154] -vmladavax.s16 r10, Q7, Q3 -vldrh.u16 Q0, [r1, #94] -strh r8, [r0,#+144] -ldrh r8, [r0,#+156] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #96] -strh r6, [r0,#+146] -ldrh r6, [r0,#+158] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q3, [r1, #98] -vldrw.u32 Q4, [Q2, #60] -strh r4, [r0,#+148] -ldrh r4, [r0,#+160] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #100] -strh r14, [r0,#+150] -ldrh r14, [r0,#+162] -vmladavax.s16 r14, Q5, Q4 -vldrh.u16 Q6, [r1, #102] -strh r12, [r0,#+152] -ldrh r12, [r0,#+164] -vmladavax.s16 r12, Q6, Q4 -vldrh.u16 Q7, [r1, #104] -strh r10, [r0,#+154] -ldrh r10, [r0,#+166] -vmladavax.s16 r10, Q7, Q4 -vldrh.u16 Q0, [r1, #106] -strh r8, [r0,#+156] -ldrh r8, [r0,#+168] -vmladavax.s16 r8, Q0, Q4 -vldrh.u16 Q1, [r1, #108] -strh r6, [r0,#+158] -ldrh r6, [r0,#+170] -vmladavax.s16 r6, Q1, Q4 -vldrh.u16 Q3, [r1, #110] -strh r4, [r0,#+160] -ldrh r4, [r0,#+172] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q4, [r1, #112] -vldrw.u32 Q5, [Q2, #60] -strh r14, [r0,#+162] -vmladavx.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #114] -strh r12, [r0,#+164] -vmladavx.s16 r12, Q6, Q5 -vldrh.u16 Q7, [r1, #116] -strh r10, [r0,#+166] -vmladavx.s16 r10, Q7, Q5 -vldrh.u16 Q0, [r1, #118] -strh r8, [r0,#+168] -vmladavx.s16 r8, Q0, Q5 -vldrh.u16 Q1, [r1, #120] -strh r6, [r0,#+170] -vmladavx.s16 r6, Q1, Q5 
-vldrh.u16 Q3, [r1, #122] -strh r4, [r0,#+172] -vmladavx.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #124] -strh r14, [r0,#+174] -vmladavx.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #126] -vldrw.u32 Q6, [Q2, #60] -strh r12, [r0,#+176] -vmladavx.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #-14] -vldrw.u32 Q0, [Q2, #76] -strh r10, [r0,#+178] -ldrh r10, [r0,#+64] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q1, [r1, #-12] -strh r8, [r0,#+180] -ldrh r8, [r0,#+66] -vmladavax.s16 r8, Q1, Q0 -vldrh.u16 Q3, [r1, #-10] -strh r6, [r0,#+182] -ldrh r6, [r0,#+68] -vmladavax.s16 r6, Q3, Q0 -vldrh.u16 Q4, [r1, #-8] -strh r4, [r0,#+184] -ldrh r4, [r0,#+70] -vmladavax.s16 r4, Q4, Q0 -vldrh.u16 Q5, [r1, #-6] -strh r14, [r0,#+186] -ldrh r14, [r0,#+72] -vmladavax.s16 r14, Q5, Q0 -vldrh.u16 Q6, [r1, #-4] -strh r12, [r0,#+188] -ldrh r12, [r0,#+74] -vmladavax.s16 r12, Q6, Q0 -vldrh.u16 Q7, [r1, #-2] -strh r10, [r0,#+64] -ldrh r10, [r0,#+76] -vmladavax.s16 r10, Q7, Q0 -vldrh.u16 Q0, [r1, #0] -vldrw.u32 Q1, [Q2, #76] -strh r8, [r0,#+66] -ldrh r8, [r0,#+78] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q3, [r1, #2] -strh r6, [r0,#+68] -ldrh r6, [r0,#+80] -vmladavax.s16 r6, Q3, Q1 -vldrh.u16 Q4, [r1, #4] -strh r4, [r0,#+70] -ldrh r4, [r0,#+82] -vmladavax.s16 r4, Q4, Q1 -vldrh.u16 Q5, [r1, #6] -strh r14, [r0,#+72] -ldrh r14, [r0,#+84] -vmladavax.s16 r14, Q5, Q1 -vldrh.u16 Q6, [r1, #8] -strh r12, [r0,#+74] -ldrh r12, [r0,#+86] -vmladavax.s16 r12, Q6, Q1 -vldrh.u16 Q7, [r1, #10] -strh r10, [r0,#+76] -ldrh r10, [r0,#+88] -vmladavax.s16 r10, Q7, Q1 -vldrh.u16 Q0, [r1, #12] -strh r8, [r0,#+78] -ldrh r8, [r0,#+90] -vmladavax.s16 r8, Q0, Q1 -vldrh.u16 Q1, [r1, #14] -vldrw.u32 Q3, [Q2, #76] -strh r6, [r0,#+80] -ldrh r6, [r0,#+92] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q4, [r1, #16] -strh r4, [r0,#+82] -ldrh r4, [r0,#+94] -vmladavax.s16 r4, Q4, Q3 -vldrh.u16 Q5, [r1, #18] -strh r14, [r0,#+84] -ldrh r14, [r0,#+96] -vmladavax.s16 r14, Q5, Q3 -vldrh.u16 Q6, [r1, #20] -strh r12, [r0,#+86] -ldrh r12, [r0,#+98] -vmladavax.s16 r12, Q6, Q3 -vldrh.u16 
Q7, [r1, #22] -strh r10, [r0,#+88] -ldrh r10, [r0,#+100] -vmladavax.s16 r10, Q7, Q3 -vldrh.u16 Q0, [r1, #24] -strh r8, [r0,#+90] -ldrh r8, [r0,#+102] -vmladavax.s16 r8, Q0, Q3 -vldrh.u16 Q1, [r1, #26] -strh r6, [r0,#+92] -ldrh r6, [r0,#+104] -vmladavax.s16 r6, Q1, Q3 -vldrh.u16 Q3, [r1, #28] -vldrw.u32 Q4, [Q2, #76] -strh r4, [r0,#+94] -ldrh r4, [r0,#+106] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q5, [r1, #30] -strh r14, [r0,#+96] -ldrh r14, [r0,#+108] -vmladavax.s16 r14, Q5, Q4 -vldrh.u16 Q6, [r1, #32] -strh r12, [r0,#+98] -ldrh r12, [r0,#+110] -vmladavax.s16 r12, Q6, Q4 -vldrh.u16 Q7, [r1, #34] -strh r10, [r0,#+100] -ldrh r10, [r0,#+112] -vmladavax.s16 r10, Q7, Q4 -vldrh.u16 Q0, [r1, #36] -strh r8, [r0,#+102] -ldrh r8, [r0,#+114] -vmladavax.s16 r8, Q0, Q4 -vldrh.u16 Q1, [r1, #38] -strh r6, [r0,#+104] -ldrh r6, [r0,#+116] -vmladavax.s16 r6, Q1, Q4 -vldrh.u16 Q3, [r1, #40] -strh r4, [r0,#+106] -ldrh r4, [r0,#+118] -vmladavax.s16 r4, Q3, Q4 -vldrh.u16 Q4, [r1, #42] -vldrw.u32 Q5, [Q2, #76] -strh r14, [r0,#+108] -ldrh r14, [r0,#+120] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q6, [r1, #44] -strh r12, [r0,#+110] -ldrh r12, [r0,#+122] -vmladavax.s16 r12, Q6, Q5 -vldrh.u16 Q7, [r1, #46] -strh r10, [r0,#+112] -ldrh r10, [r0,#+124] -vmladavax.s16 r10, Q7, Q5 -vldrh.u16 Q0, [r1, #48] -strh r8, [r0,#+114] -ldrh r8, [r0,#+126] -vmladavax.s16 r8, Q0, Q5 -vldrh.u16 Q1, [r1, #50] -strh r6, [r0,#+116] -ldrh r6, [r0,#+128] -vmladavax.s16 r6, Q1, Q5 -vldrh.u16 Q3, [r1, #52] -strh r4, [r0,#+118] -ldrh r4, [r0,#+130] -vmladavax.s16 r4, Q3, Q5 -vldrh.u16 Q4, [r1, #54] -strh r14, [r0,#+120] -ldrh r14, [r0,#+132] -vmladavax.s16 r14, Q4, Q5 -vldrh.u16 Q5, [r1, #56] -vldrw.u32 Q6, [Q2, #76] -strh r12, [r0,#+122] -ldrh r12, [r0,#+134] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #58] -strh r10, [r0,#+124] -ldrh r10, [r0,#+136] -vmladavax.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #60] -strh r8, [r0,#+126] -ldrh r8, [r0,#+138] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #62] -strh r6, [r0,#+128] -ldrh 
r6, [r0,#+140] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #64] -strh r4, [r0,#+130] -ldrh r4, [r0,#+142] -vmladavax.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #66] -strh r14, [r0,#+132] -ldrh r14, [r0,#+144] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #68] -strh r12, [r0,#+134] -ldrh r12, [r0,#+146] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #70] -vldrw.u32 Q7, [Q2, #76] -strh r10, [r0,#+136] -ldrh r10, [r0,#+148] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #72] -strh r8, [r0,#+138] -ldrh r8, [r0,#+150] -vmladavax.s16 r8, Q0, Q7 -vldrh.u16 Q1, [r1, #74] -strh r6, [r0,#+140] -ldrh r6, [r0,#+152] -vmladavax.s16 r6, Q1, Q7 -vldrh.u16 Q3, [r1, #76] -strh r4, [r0,#+142] -ldrh r4, [r0,#+154] -vmladavax.s16 r4, Q3, Q7 -vldrh.u16 Q4, [r1, #78] -strh r14, [r0,#+144] -ldrh r14, [r0,#+156] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #80] -strh r12, [r0,#+146] -ldrh r12, [r0,#+158] -vmladavax.s16 r12, Q5, Q7 -vldrh.u16 Q6, [r1, #82] -strh r10, [r0,#+148] -ldrh r10, [r0,#+160] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q7, [r1, #84] -vldrw.u32 Q0, [Q2, #76] -strh r8, [r0,#+150] -ldrh r8, [r0,#+162] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #86] -strh r6, [r0,#+152] -ldrh r6, [r0,#+164] -vmladavax.s16 r6, Q1, Q0 -vldrh.u16 Q3, [r1, #88] -strh r4, [r0,#+154] -ldrh r4, [r0,#+166] -vmladavax.s16 r4, Q3, Q0 -vldrh.u16 Q4, [r1, #90] -strh r14, [r0,#+156] -ldrh r14, [r0,#+168] -vmladavax.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #92] -strh r12, [r0,#+158] -ldrh r12, [r0,#+170] -vmladavax.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #94] -strh r10, [r0,#+160] -ldrh r10, [r0,#+172] -vmladavax.s16 r10, Q6, Q0 -vldrh.u16 Q7, [r1, #96] -strh r8, [r0,#+162] -ldrh r8, [r0,#+174] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q0, [r1, #98] -vldrw.u32 Q1, [Q2, #76] -strh r6, [r0,#+164] -ldrh r6, [r0,#+176] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #100] -strh r4, [r0,#+166] -ldrh r4, [r0,#+178] -vmladavax.s16 r4, Q3, Q1 -vldrh.u16 Q4, [r1, #102] -strh r14, [r0,#+168] -ldrh r14, [r0,#+180] -vmladavax.s16 
r14, Q4, Q1 -vldrh.u16 Q5, [r1, #104] -strh r12, [r0,#+170] -ldrh r12, [r0,#+182] -vmladavax.s16 r12, Q5, Q1 -vldrh.u16 Q6, [r1, #106] -strh r10, [r0,#+172] -ldrh r10, [r0,#+184] -vmladavax.s16 r10, Q6, Q1 -vldrh.u16 Q7, [r1, #108] -strh r8, [r0,#+174] -ldrh r8, [r0,#+186] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #110] -strh r6, [r0,#+176] -ldrh r6, [r0,#+188] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q1, [r1, #112] -vldrw.u32 Q3, [Q2, #76] -strh r4, [r0,#+178] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #114] -strh r14, [r0,#+180] -vmladavx.s16 r14, Q4, Q3 -vldrh.u16 Q5, [r1, #116] -strh r12, [r0,#+182] -vmladavx.s16 r12, Q5, Q3 -vldrh.u16 Q6, [r1, #118] -strh r10, [r0,#+184] -vmladavx.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #120] -strh r8, [r0,#+186] -vmladavx.s16 r8, Q7, Q3 -vldrh.u16 Q0, [r1, #122] -strh r6, [r0,#+188] -vmladavx.s16 r6, Q0, Q3 -vldrh.u16 Q1, [r1, #124] -strh r4, [r0,#+190] -vmladavx.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #126] -vldrw.u32 Q4, [Q2, #76] -strh r14, [r0,#+192] -vmladavx.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-14] -vldrw.u32 Q6, [Q2, #92] -strh r12, [r0,#+194] -ldrh r12, [r0,#+80] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q7, [r1, #-12] -strh r10, [r0,#+196] -ldrh r10, [r0,#+82] -vmladavax.s16 r10, Q7, Q6 -vldrh.u16 Q0, [r1, #-10] -strh r8, [r0,#+198] -ldrh r8, [r0,#+84] -vmladavax.s16 r8, Q0, Q6 -vldrh.u16 Q1, [r1, #-8] -strh r6, [r0,#+200] -ldrh r6, [r0,#+86] -vmladavax.s16 r6, Q1, Q6 -vldrh.u16 Q3, [r1, #-6] -strh r4, [r0,#+202] -ldrh r4, [r0,#+88] -vmladavax.s16 r4, Q3, Q6 -vldrh.u16 Q4, [r1, #-4] -strh r14, [r0,#+204] -ldrh r14, [r0,#+90] -vmladavax.s16 r14, Q4, Q6 -vldrh.u16 Q5, [r1, #-2] -strh r12, [r0,#+80] -ldrh r12, [r0,#+92] -vmladavax.s16 r12, Q5, Q6 -vldrh.u16 Q6, [r1, #0] -vldrw.u32 Q7, [Q2, #92] -strh r10, [r0,#+82] -ldrh r10, [r0,#+94] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q0, [r1, #2] -strh r8, [r0,#+84] -ldrh r8, [r0,#+96] -vmladavax.s16 r8, Q0, Q7 -vldrh.u16 Q1, [r1, #4] -strh r6, [r0,#+86] -ldrh r6, [r0,#+98] -vmladavax.s16 
r6, Q1, Q7 -vldrh.u16 Q3, [r1, #6] -strh r4, [r0,#+88] -ldrh r4, [r0,#+100] -vmladavax.s16 r4, Q3, Q7 -vldrh.u16 Q4, [r1, #8] -strh r14, [r0,#+90] -ldrh r14, [r0,#+102] -vmladavax.s16 r14, Q4, Q7 -vldrh.u16 Q5, [r1, #10] -strh r12, [r0,#+92] -ldrh r12, [r0,#+104] -vmladavax.s16 r12, Q5, Q7 -vldrh.u16 Q6, [r1, #12] -strh r10, [r0,#+94] -ldrh r10, [r0,#+106] -vmladavax.s16 r10, Q6, Q7 -vldrh.u16 Q7, [r1, #14] -vldrw.u32 Q0, [Q2, #92] -strh r8, [r0,#+96] -ldrh r8, [r0,#+108] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q1, [r1, #16] -strh r6, [r0,#+98] -ldrh r6, [r0,#+110] -vmladavax.s16 r6, Q1, Q0 -vldrh.u16 Q3, [r1, #18] -strh r4, [r0,#+100] -ldrh r4, [r0,#+112] -vmladavax.s16 r4, Q3, Q0 -vldrh.u16 Q4, [r1, #20] -strh r14, [r0,#+102] -ldrh r14, [r0,#+114] -vmladavax.s16 r14, Q4, Q0 -vldrh.u16 Q5, [r1, #22] -strh r12, [r0,#+104] -ldrh r12, [r0,#+116] -vmladavax.s16 r12, Q5, Q0 -vldrh.u16 Q6, [r1, #24] -strh r10, [r0,#+106] -ldrh r10, [r0,#+118] -vmladavax.s16 r10, Q6, Q0 -vldrh.u16 Q7, [r1, #26] -strh r8, [r0,#+108] -ldrh r8, [r0,#+120] -vmladavax.s16 r8, Q7, Q0 -vldrh.u16 Q0, [r1, #28] -vldrw.u32 Q1, [Q2, #92] -strh r6, [r0,#+110] -ldrh r6, [r0,#+122] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q3, [r1, #30] -strh r4, [r0,#+112] -ldrh r4, [r0,#+124] -vmladavax.s16 r4, Q3, Q1 -vldrh.u16 Q4, [r1, #32] -strh r14, [r0,#+114] -ldrh r14, [r0,#+126] -vmladavax.s16 r14, Q4, Q1 -vldrh.u16 Q5, [r1, #34] -strh r12, [r0,#+116] -ldrh r12, [r0,#+128] -vmladavax.s16 r12, Q5, Q1 -vldrh.u16 Q6, [r1, #36] -strh r10, [r0,#+118] -ldrh r10, [r0,#+130] -vmladavax.s16 r10, Q6, Q1 -vldrh.u16 Q7, [r1, #38] -strh r8, [r0,#+120] -ldrh r8, [r0,#+132] -vmladavax.s16 r8, Q7, Q1 -vldrh.u16 Q0, [r1, #40] -strh r6, [r0,#+122] -ldrh r6, [r0,#+134] -vmladavax.s16 r6, Q0, Q1 -vldrh.u16 Q1, [r1, #42] -vldrw.u32 Q3, [Q2, #92] -strh r4, [r0,#+124] -ldrh r4, [r0,#+136] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q4, [r1, #44] -strh r14, [r0,#+126] -ldrh r14, [r0,#+138] -vmladavax.s16 r14, Q4, Q3 -vldrh.u16 Q5, [r1, #46] -strh 
r12, [r0,#+128] -ldrh r12, [r0,#+140] -vmladavax.s16 r12, Q5, Q3 -vldrh.u16 Q6, [r1, #48] -strh r10, [r0,#+130] -ldrh r10, [r0,#+142] -vmladavax.s16 r10, Q6, Q3 -vldrh.u16 Q7, [r1, #50] -strh r8, [r0,#+132] -ldrh r8, [r0,#+144] -vmladavax.s16 r8, Q7, Q3 -vldrh.u16 Q0, [r1, #52] -strh r6, [r0,#+134] -ldrh r6, [r0,#+146] -vmladavax.s16 r6, Q0, Q3 -vldrh.u16 Q1, [r1, #54] -strh r4, [r0,#+136] -ldrh r4, [r0,#+148] -vmladavax.s16 r4, Q1, Q3 -vldrh.u16 Q3, [r1, #56] -vldrw.u32 Q4, [Q2, #92] -strh r14, [r0,#+138] -ldrh r14, [r0,#+150] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #58] -strh r12, [r0,#+140] -ldrh r12, [r0,#+152] -vmladavax.s16 r12, Q5, Q4 -vldrh.u16 Q6, [r1, #60] -strh r10, [r0,#+142] -ldrh r10, [r0,#+154] -vmladavax.s16 r10, Q6, Q4 -vldrh.u16 Q7, [r1, #62] -strh r8, [r0,#+144] -ldrh r8, [r0,#+156] -vmladavax.s16 r8, Q7, Q4 -vldrh.u16 Q0, [r1, #64] -strh r6, [r0,#+146] -ldrh r6, [r0,#+158] -vmladavax.s16 r6, Q0, Q4 -vldrh.u16 Q1, [r1, #66] -strh r4, [r0,#+148] -ldrh r4, [r0,#+160] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #68] -strh r14, [r0,#+150] -ldrh r14, [r0,#+162] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #70] -vldrw.u32 Q5, [Q2, #92] -strh r12, [r0,#+152] -ldrh r12, [r0,#+164] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #72] -strh r10, [r0,#+154] -ldrh r10, [r0,#+166] -vmladavax.s16 r10, Q6, Q5 -vldrh.u16 Q7, [r1, #74] -strh r8, [r0,#+156] -ldrh r8, [r0,#+168] -vmladavax.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #76] -strh r6, [r0,#+158] -ldrh r6, [r0,#+170] -vmladavax.s16 r6, Q0, Q5 -vldrh.u16 Q1, [r1, #78] -strh r4, [r0,#+160] -ldrh r4, [r0,#+172] -vmladavax.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #80] -strh r14, [r0,#+162] -ldrh r14, [r0,#+174] -vmladavax.s16 r14, Q3, Q5 -vldrh.u16 Q4, [r1, #82] -strh r12, [r0,#+164] -ldrh r12, [r0,#+176] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q5, [r1, #84] -vldrw.u32 Q6, [Q2, #92] -strh r10, [r0,#+166] -ldrh r10, [r0,#+178] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #86] -strh r8, [r0,#+168] -ldrh r8, 
[r0,#+180] -vmladavax.s16 r8, Q7, Q6 -vldrh.u16 Q0, [r1, #88] -strh r6, [r0,#+170] -ldrh r6, [r0,#+182] -vmladavax.s16 r6, Q0, Q6 -vldrh.u16 Q1, [r1, #90] -strh r4, [r0,#+172] -ldrh r4, [r0,#+184] -vmladavax.s16 r4, Q1, Q6 -vldrh.u16 Q3, [r1, #92] -strh r14, [r0,#+174] -ldrh r14, [r0,#+186] -vmladavax.s16 r14, Q3, Q6 -vldrh.u16 Q4, [r1, #94] -strh r12, [r0,#+176] -ldrh r12, [r0,#+188] -vmladavax.s16 r12, Q4, Q6 -vldrh.u16 Q5, [r1, #96] -strh r10, [r0,#+178] -ldrh r10, [r0,#+190] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q6, [r1, #98] -vldrw.u32 Q7, [Q2, #92] -strh r8, [r0,#+180] -ldrh r8, [r0,#+192] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #100] -strh r6, [r0,#+182] -ldrh r6, [r0,#+194] -vmladavax.s16 r6, Q0, Q7 -vldrh.u16 Q1, [r1, #102] -strh r4, [r0,#+184] -ldrh r4, [r0,#+196] -vmladavax.s16 r4, Q1, Q7 -vldrh.u16 Q3, [r1, #104] -strh r14, [r0,#+186] -ldrh r14, [r0,#+198] -vmladavax.s16 r14, Q3, Q7 -vldrh.u16 Q4, [r1, #106] -strh r12, [r0,#+188] -ldrh r12, [r0,#+200] -vmladavax.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #108] -strh r10, [r0,#+190] -ldrh r10, [r0,#+202] -vmladavax.s16 r10, Q5, Q7 -vldrh.u16 Q6, [r1, #110] -strh r8, [r0,#+192] -ldrh r8, [r0,#+204] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q7, [r1, #112] -vldrw.u32 Q0, [Q2, #92] -strh r6, [r0,#+194] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #114] -strh r4, [r0,#+196] -vmladavx.s16 r4, Q1, Q0 -vldrh.u16 Q3, [r1, #116] -strh r14, [r0,#+198] -vmladavx.s16 r14, Q3, Q0 -vldrh.u16 Q4, [r1, #118] -strh r12, [r0,#+200] -vmladavx.s16 r12, Q4, Q0 -vldrh.u16 Q5, [r1, #120] -strh r10, [r0,#+202] -vmladavx.s16 r10, Q5, Q0 -vldrh.u16 Q6, [r1, #122] -strh r8, [r0,#+204] -vmladavx.s16 r8, Q6, Q0 -vldrh.u16 Q7, [r1, #124] -strh r6, [r0,#+206] -vmladavx.s16 r6, Q7, Q0 -vldrh.u16 Q0, [r1, #126] -vldrw.u32 Q1, [Q2, #92] -strh r4, [r0,#+208] -vmladavx.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #-14] -vldrw.u32 Q4, [Q2, #108] -strh r14, [r0,#+210] -ldrh r14, [r0,#+96] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q5, [r1, #-12] -strh r12, 
[r0,#+212] -ldrh r12, [r0,#+98] -vmladavax.s16 r12, Q5, Q4 -vldrh.u16 Q6, [r1, #-10] -strh r10, [r0,#+214] -ldrh r10, [r0,#+100] -vmladavax.s16 r10, Q6, Q4 -vldrh.u16 Q7, [r1, #-8] -strh r8, [r0,#+216] -ldrh r8, [r0,#+102] -vmladavax.s16 r8, Q7, Q4 -vldrh.u16 Q0, [r1, #-6] -strh r6, [r0,#+218] -ldrh r6, [r0,#+104] -vmladavax.s16 r6, Q0, Q4 -vldrh.u16 Q1, [r1, #-4] -strh r4, [r0,#+220] -ldrh r4, [r0,#+106] -vmladavax.s16 r4, Q1, Q4 -vldrh.u16 Q3, [r1, #-2] -strh r14, [r0,#+96] -ldrh r14, [r0,#+108] -vmladavax.s16 r14, Q3, Q4 -vldrh.u16 Q4, [r1, #0] -vldrw.u32 Q5, [Q2, #108] -strh r12, [r0,#+98] -ldrh r12, [r0,#+110] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q6, [r1, #2] -strh r10, [r0,#+100] -ldrh r10, [r0,#+112] -vmladavax.s16 r10, Q6, Q5 -vldrh.u16 Q7, [r1, #4] -strh r8, [r0,#+102] -ldrh r8, [r0,#+114] -vmladavax.s16 r8, Q7, Q5 -vldrh.u16 Q0, [r1, #6] -strh r6, [r0,#+104] -ldrh r6, [r0,#+116] -vmladavax.s16 r6, Q0, Q5 -vldrh.u16 Q1, [r1, #8] -strh r4, [r0,#+106] -ldrh r4, [r0,#+118] -vmladavax.s16 r4, Q1, Q5 -vldrh.u16 Q3, [r1, #10] -strh r14, [r0,#+108] -ldrh r14, [r0,#+120] -vmladavax.s16 r14, Q3, Q5 -vldrh.u16 Q4, [r1, #12] -strh r12, [r0,#+110] -ldrh r12, [r0,#+122] -vmladavax.s16 r12, Q4, Q5 -vldrh.u16 Q5, [r1, #14] -vldrw.u32 Q6, [Q2, #108] -strh r10, [r0,#+112] -ldrh r10, [r0,#+124] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q7, [r1, #16] -strh r8, [r0,#+114] -ldrh r8, [r0,#+126] -vmladavax.s16 r8, Q7, Q6 -vldrh.u16 Q0, [r1, #18] -strh r6, [r0,#+116] -ldrh r6, [r0,#+128] -vmladavax.s16 r6, Q0, Q6 -vldrh.u16 Q1, [r1, #20] -strh r4, [r0,#+118] -ldrh r4, [r0,#+130] -vmladavax.s16 r4, Q1, Q6 -vldrh.u16 Q3, [r1, #22] -strh r14, [r0,#+120] -ldrh r14, [r0,#+132] -vmladavax.s16 r14, Q3, Q6 -vldrh.u16 Q4, [r1, #24] -strh r12, [r0,#+122] -ldrh r12, [r0,#+134] -vmladavax.s16 r12, Q4, Q6 -vldrh.u16 Q5, [r1, #26] -strh r10, [r0,#+124] -ldrh r10, [r0,#+136] -vmladavax.s16 r10, Q5, Q6 -vldrh.u16 Q6, [r1, #28] -vldrw.u32 Q7, [Q2, #108] -strh r8, [r0,#+126] -ldrh r8, [r0,#+138] 
-vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q0, [r1, #30] -strh r6, [r0,#+128] -ldrh r6, [r0,#+140] -vmladavax.s16 r6, Q0, Q7 -vldrh.u16 Q1, [r1, #32] -strh r4, [r0,#+130] -ldrh r4, [r0,#+142] -vmladavax.s16 r4, Q1, Q7 -vldrh.u16 Q3, [r1, #34] -strh r14, [r0,#+132] -ldrh r14, [r0,#+144] -vmladavax.s16 r14, Q3, Q7 -vldrh.u16 Q4, [r1, #36] -strh r12, [r0,#+134] -ldrh r12, [r0,#+146] -vmladavax.s16 r12, Q4, Q7 -vldrh.u16 Q5, [r1, #38] -strh r10, [r0,#+136] -ldrh r10, [r0,#+148] -vmladavax.s16 r10, Q5, Q7 -vldrh.u16 Q6, [r1, #40] -strh r8, [r0,#+138] -ldrh r8, [r0,#+150] -vmladavax.s16 r8, Q6, Q7 -vldrh.u16 Q7, [r1, #42] -vldrw.u32 Q0, [Q2, #108] -strh r6, [r0,#+140] -ldrh r6, [r0,#+152] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q1, [r1, #44] -strh r4, [r0,#+142] -ldrh r4, [r0,#+154] -vmladavax.s16 r4, Q1, Q0 -vldrh.u16 Q3, [r1, #46] -strh r14, [r0,#+144] -ldrh r14, [r0,#+156] -vmladavax.s16 r14, Q3, Q0 -vldrh.u16 Q4, [r1, #48] -strh r12, [r0,#+146] -ldrh r12, [r0,#+158] -vmladavax.s16 r12, Q4, Q0 -vldrh.u16 Q5, [r1, #50] -strh r10, [r0,#+148] -ldrh r10, [r0,#+160] -vmladavax.s16 r10, Q5, Q0 -vldrh.u16 Q6, [r1, #52] -strh r8, [r0,#+150] -ldrh r8, [r0,#+162] -vmladavax.s16 r8, Q6, Q0 -vldrh.u16 Q7, [r1, #54] -strh r6, [r0,#+152] -ldrh r6, [r0,#+164] -vmladavax.s16 r6, Q7, Q0 -vldrh.u16 Q0, [r1, #56] -vldrw.u32 Q1, [Q2, #108] -strh r4, [r0,#+154] -ldrh r4, [r0,#+166] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #58] -strh r14, [r0,#+156] -ldrh r14, [r0,#+168] -vmladavax.s16 r14, Q3, Q1 -vldrh.u16 Q4, [r1, #60] -strh r12, [r0,#+158] -ldrh r12, [r0,#+170] -vmladavax.s16 r12, Q4, Q1 -vldrh.u16 Q5, [r1, #62] -strh r10, [r0,#+160] -ldrh r10, [r0,#+172] -vmladavax.s16 r10, Q5, Q1 -vldrh.u16 Q6, [r1, #64] -strh r8, [r0,#+162] -ldrh r8, [r0,#+174] -vmladavax.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #66] -strh r6, [r0,#+164] -ldrh r6, [r0,#+176] -vmladavax.s16 r6, Q7, Q1 -vldrh.u16 Q0, [r1, #68] -strh r4, [r0,#+166] -ldrh r4, [r0,#+178] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #70] 
-vldrw.u32 Q3, [Q2, #108] -strh r14, [r0,#+168] -ldrh r14, [r0,#+180] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #72] -strh r12, [r0,#+170] -ldrh r12, [r0,#+182] -vmladavax.s16 r12, Q4, Q3 -vldrh.u16 Q5, [r1, #74] -strh r10, [r0,#+172] -ldrh r10, [r0,#+184] -vmladavax.s16 r10, Q5, Q3 -vldrh.u16 Q6, [r1, #76] -strh r8, [r0,#+174] -ldrh r8, [r0,#+186] -vmladavax.s16 r8, Q6, Q3 -vldrh.u16 Q7, [r1, #78] -strh r6, [r0,#+176] -ldrh r6, [r0,#+188] -vmladavax.s16 r6, Q7, Q3 -vldrh.u16 Q0, [r1, #80] -strh r4, [r0,#+178] -ldrh r4, [r0,#+190] -vmladavax.s16 r4, Q0, Q3 -vldrh.u16 Q1, [r1, #82] -strh r14, [r0,#+180] -ldrh r14, [r0,#+192] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #84] -vldrw.u32 Q4, [Q2, #108] -strh r12, [r0,#+182] -ldrh r12, [r0,#+194] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #86] -strh r10, [r0,#+184] -ldrh r10, [r0,#+196] -vmladavax.s16 r10, Q5, Q4 -vldrh.u16 Q6, [r1, #88] -strh r8, [r0,#+186] -ldrh r8, [r0,#+198] -vmladavax.s16 r8, Q6, Q4 -vldrh.u16 Q7, [r1, #90] -strh r6, [r0,#+188] -ldrh r6, [r0,#+200] -vmladavax.s16 r6, Q7, Q4 -vldrh.u16 Q0, [r1, #92] -strh r4, [r0,#+190] -ldrh r4, [r0,#+202] -vmladavax.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #94] -strh r14, [r0,#+192] -ldrh r14, [r0,#+204] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #96] -strh r12, [r0,#+194] -ldrh r12, [r0,#+206] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q4, [r1, #98] -vldrw.u32 Q5, [Q2, #108] -strh r10, [r0,#+196] -ldrh r10, [r0,#+208] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #100] -strh r8, [r0,#+198] -ldrh r8, [r0,#+210] -vmladavax.s16 r8, Q6, Q5 -vldrh.u16 Q7, [r1, #102] -strh r6, [r0,#+200] -ldrh r6, [r0,#+212] -vmladavax.s16 r6, Q7, Q5 -vldrh.u16 Q0, [r1, #104] -strh r4, [r0,#+202] -ldrh r4, [r0,#+214] -vmladavax.s16 r4, Q0, Q5 -vldrh.u16 Q1, [r1, #106] -strh r14, [r0,#+204] -ldrh r14, [r0,#+216] -vmladavax.s16 r14, Q1, Q5 -vldrh.u16 Q3, [r1, #108] -strh r12, [r0,#+206] -ldrh r12, [r0,#+218] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #110] -strh r10, [r0,#+208] 
-ldrh r10, [r0,#+220] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q5, [r1, #112] -vldrw.u32 Q6, [Q2, #108] -strh r8, [r0,#+210] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #114] -strh r6, [r0,#+212] -vmladavx.s16 r6, Q7, Q6 -vldrh.u16 Q0, [r1, #116] -strh r4, [r0,#+214] -vmladavx.s16 r4, Q0, Q6 -vldrh.u16 Q1, [r1, #118] -strh r14, [r0,#+216] -vmladavx.s16 r14, Q1, Q6 -vldrh.u16 Q3, [r1, #120] -strh r12, [r0,#+218] -vmladavx.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #122] -strh r10, [r0,#+220] -vmladavx.s16 r10, Q4, Q6 -vldrh.u16 Q5, [r1, #124] -strh r8, [r0,#+222] -vmladavx.s16 r8, Q5, Q6 -vldrh.u16 Q6, [r1, #126] -vldrw.u32 Q7, [Q2, #108] -strh r6, [r0,#+224] -vmladavx.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #-14] -vldrw.u32 Q1, [Q2, #124] -strh r4, [r0,#+226] -ldrh r4, [r0,#+112] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q3, [r1, #-12] -strh r14, [r0,#+228] -ldrh r14, [r0,#+114] -vmladavax.s16 r14, Q3, Q1 -vldrh.u16 Q4, [r1, #-10] -strh r12, [r0,#+230] -ldrh r12, [r0,#+116] -vmladavax.s16 r12, Q4, Q1 -vldrh.u16 Q5, [r1, #-8] -strh r10, [r0,#+232] -ldrh r10, [r0,#+118] -vmladavax.s16 r10, Q5, Q1 -vldrh.u16 Q6, [r1, #-6] -strh r8, [r0,#+234] -ldrh r8, [r0,#+120] -vmladavax.s16 r8, Q6, Q1 -vldrh.u16 Q7, [r1, #-4] -strh r6, [r0,#+236] -ldrh r6, [r0,#+122] -vmladavax.s16 r6, Q7, Q1 -vldrh.u16 Q0, [r1, #-2] -strh r4, [r0,#+112] -ldrh r4, [r0,#+124] -vmladavax.s16 r4, Q0, Q1 -vldrh.u16 Q1, [r1, #0] -vldrw.u32 Q3, [Q2, #124] -strh r14, [r0,#+114] -ldrh r14, [r0,#+126] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q4, [r1, #2] -strh r12, [r0,#+116] -ldrh r12, [r0,#+128] -vmladavax.s16 r12, Q4, Q3 -vldrh.u16 Q5, [r1, #4] -strh r10, [r0,#+118] -ldrh r10, [r0,#+130] -vmladavax.s16 r10, Q5, Q3 -vldrh.u16 Q6, [r1, #6] -strh r8, [r0,#+120] -ldrh r8, [r0,#+132] -vmladavax.s16 r8, Q6, Q3 -vldrh.u16 Q7, [r1, #8] -strh r6, [r0,#+122] -ldrh r6, [r0,#+134] -vmladavax.s16 r6, Q7, Q3 -vldrh.u16 Q0, [r1, #10] -strh r4, [r0,#+124] -ldrh r4, [r0,#+136] -vmladavax.s16 r4, Q0, Q3 -vldrh.u16 Q1, [r1, #12] -strh r14, 
[r0,#+126] -ldrh r14, [r0,#+138] -vmladavax.s16 r14, Q1, Q3 -vldrh.u16 Q3, [r1, #14] -vldrw.u32 Q4, [Q2, #124] -strh r12, [r0,#+128] -ldrh r12, [r0,#+140] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q5, [r1, #16] -strh r10, [r0,#+130] -ldrh r10, [r0,#+142] -vmladavax.s16 r10, Q5, Q4 -vldrh.u16 Q6, [r1, #18] -strh r8, [r0,#+132] -ldrh r8, [r0,#+144] -vmladavax.s16 r8, Q6, Q4 -vldrh.u16 Q7, [r1, #20] -strh r6, [r0,#+134] -ldrh r6, [r0,#+146] -vmladavax.s16 r6, Q7, Q4 -vldrh.u16 Q0, [r1, #22] -strh r4, [r0,#+136] -ldrh r4, [r0,#+148] -vmladavax.s16 r4, Q0, Q4 -vldrh.u16 Q1, [r1, #24] -strh r14, [r0,#+138] -ldrh r14, [r0,#+150] -vmladavax.s16 r14, Q1, Q4 -vldrh.u16 Q3, [r1, #26] -strh r12, [r0,#+140] -ldrh r12, [r0,#+152] -vmladavax.s16 r12, Q3, Q4 -vldrh.u16 Q4, [r1, #28] -vldrw.u32 Q5, [Q2, #124] -strh r10, [r0,#+142] -ldrh r10, [r0,#+154] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q6, [r1, #30] -strh r8, [r0,#+144] -ldrh r8, [r0,#+156] -vmladavax.s16 r8, Q6, Q5 -vldrh.u16 Q7, [r1, #32] -strh r6, [r0,#+146] -ldrh r6, [r0,#+158] -vmladavax.s16 r6, Q7, Q5 -vldrh.u16 Q0, [r1, #34] -strh r4, [r0,#+148] -ldrh r4, [r0,#+160] -vmladavax.s16 r4, Q0, Q5 -vldrh.u16 Q1, [r1, #36] -strh r14, [r0,#+150] -ldrh r14, [r0,#+162] -vmladavax.s16 r14, Q1, Q5 -vldrh.u16 Q3, [r1, #38] -strh r12, [r0,#+152] -ldrh r12, [r0,#+164] -vmladavax.s16 r12, Q3, Q5 -vldrh.u16 Q4, [r1, #40] -strh r10, [r0,#+154] -ldrh r10, [r0,#+166] -vmladavax.s16 r10, Q4, Q5 -vldrh.u16 Q5, [r1, #42] -vldrw.u32 Q6, [Q2, #124] -strh r8, [r0,#+156] -ldrh r8, [r0,#+168] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q7, [r1, #44] -strh r6, [r0,#+158] -ldrh r6, [r0,#+170] -vmladavax.s16 r6, Q7, Q6 -vldrh.u16 Q0, [r1, #46] -strh r4, [r0,#+160] -ldrh r4, [r0,#+172] -vmladavax.s16 r4, Q0, Q6 -vldrh.u16 Q1, [r1, #48] -strh r14, [r0,#+162] -ldrh r14, [r0,#+174] -vmladavax.s16 r14, Q1, Q6 -vldrh.u16 Q3, [r1, #50] -strh r12, [r0,#+164] -ldrh r12, [r0,#+176] -vmladavax.s16 r12, Q3, Q6 -vldrh.u16 Q4, [r1, #52] -strh r10, [r0,#+166] -ldrh r10, 
[r0,#+178] -vmladavax.s16 r10, Q4, Q6 -vldrh.u16 Q5, [r1, #54] -strh r8, [r0,#+168] -ldrh r8, [r0,#+180] -vmladavax.s16 r8, Q5, Q6 -vldrh.u16 Q6, [r1, #56] -vldrw.u32 Q7, [Q2, #124] -strh r6, [r0,#+170] -ldrh r6, [r0,#+182] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q0, [r1, #58] -strh r4, [r0,#+172] -ldrh r4, [r0,#+184] -vmladavax.s16 r4, Q0, Q7 -vldrh.u16 Q1, [r1, #60] -strh r14, [r0,#+174] -ldrh r14, [r0,#+186] -vmladavax.s16 r14, Q1, Q7 -vldrh.u16 Q3, [r1, #62] -strh r12, [r0,#+176] -ldrh r12, [r0,#+188] -vmladavax.s16 r12, Q3, Q7 -vldrh.u16 Q4, [r1, #64] -strh r10, [r0,#+178] -ldrh r10, [r0,#+190] -vmladavax.s16 r10, Q4, Q7 -vldrh.u16 Q5, [r1, #66] -strh r8, [r0,#+180] -ldrh r8, [r0,#+192] -vmladavax.s16 r8, Q5, Q7 -vldrh.u16 Q6, [r1, #68] -strh r6, [r0,#+182] -ldrh r6, [r0,#+194] -vmladavax.s16 r6, Q6, Q7 -vldrh.u16 Q7, [r1, #70] -vldrw.u32 Q0, [Q2, #124] -strh r4, [r0,#+184] -ldrh r4, [r0,#+196] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q1, [r1, #72] -strh r14, [r0,#+186] -ldrh r14, [r0,#+198] -vmladavax.s16 r14, Q1, Q0 -vldrh.u16 Q3, [r1, #74] -strh r12, [r0,#+188] -ldrh r12, [r0,#+200] -vmladavax.s16 r12, Q3, Q0 -vldrh.u16 Q4, [r1, #76] -strh r10, [r0,#+190] -ldrh r10, [r0,#+202] -vmladavax.s16 r10, Q4, Q0 -vldrh.u16 Q5, [r1, #78] -strh r8, [r0,#+192] -ldrh r8, [r0,#+204] -vmladavax.s16 r8, Q5, Q0 -vldrh.u16 Q6, [r1, #80] -strh r6, [r0,#+194] -ldrh r6, [r0,#+206] -vmladavax.s16 r6, Q6, Q0 -vldrh.u16 Q7, [r1, #82] -strh r4, [r0,#+196] -ldrh r4, [r0,#+208] -vmladavax.s16 r4, Q7, Q0 -vldrh.u16 Q0, [r1, #84] -vldrw.u32 Q1, [Q2, #124] -strh r14, [r0,#+198] -ldrh r14, [r0,#+210] -vmladavax.s16 r14, Q0, Q1 -vldrh.u16 Q3, [r1, #86] -strh r12, [r0,#+200] -ldrh r12, [r0,#+212] -vmladavax.s16 r12, Q3, Q1 -vldrh.u16 Q4, [r1, #88] -strh r10, [r0,#+202] -ldrh r10, [r0,#+214] -vmladavax.s16 r10, Q4, Q1 -vldrh.u16 Q5, [r1, #90] -strh r8, [r0,#+204] -ldrh r8, [r0,#+216] -vmladavax.s16 r8, Q5, Q1 -vldrh.u16 Q6, [r1, #92] -strh r6, [r0,#+206] -ldrh r6, [r0,#+218] -vmladavax.s16 r6, Q6, 
Q1
-vldrh.u16 Q7, [r1, #94]
-strh r4, [r0,#+208]
-ldrh r4, [r0,#+220]
-vmladavax.s16 r4, Q7, Q1
-vldrh.u16 Q0, [r1, #96]
-strh r14, [r0,#+210]
-ldrh r14, [r0,#+222]
-vmladavax.s16 r14, Q0, Q1
-vldrh.u16 Q1, [r1, #98]
-vldrw.u32 Q3, [Q2, #124]
-strh r12, [r0,#+212]
-ldrh r12, [r0,#+224]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q4, [r1, #100]
-strh r10, [r0,#+214]
-ldrh r10, [r0,#+226]
-vmladavax.s16 r10, Q4, Q3
-vldrh.u16 Q5, [r1, #102]
-strh r8, [r0,#+216]
-ldrh r8, [r0,#+228]
-vmladavax.s16 r8, Q5, Q3
-vldrh.u16 Q6, [r1, #104]
-strh r6, [r0,#+218]
-ldrh r6, [r0,#+230]
-vmladavax.s16 r6, Q6, Q3
-vldrh.u16 Q7, [r1, #106]
-strh r4, [r0,#+220]
-ldrh r4, [r0,#+232]
-vmladavax.s16 r4, Q7, Q3
-vldrh.u16 Q0, [r1, #108]
-strh r14, [r0,#+222]
-ldrh r14, [r0,#+234]
-vmladavax.s16 r14, Q0, Q3
-vldrh.u16 Q1, [r1, #110]
-strh r12, [r0,#+224]
-ldrh r12, [r0,#+236]
-vmladavax.s16 r12, Q1, Q3
-vldrh.u16 Q3, [r1, #112]
-vldrw.u32 Q4, [Q2, #124]
-strh r10, [r0,#+226]
-vmladavx.s16 r10, Q3, Q4
-vldrh.u16 Q5, [r1, #114]
-strh r8, [r0,#+228]
-vmladavx.s16 r8, Q5, Q4
-vldrh.u16 Q6, [r1, #116]
-strh r6, [r0,#+230]
-vmladavx.s16 r6, Q6, Q4
-vldrh.u16 Q7, [r1, #118]
-strh r4, [r0,#+232]
-vmladavx.s16 r4, Q7, Q4
-vldrh.u16 Q0, [r1, #120]
-strh r14, [r0,#+234]
-vmladavx.s16 r14, Q0, Q4
-vldrh.u16 Q1, [r1, #122]
-strh r12, [r0,#+236]
-vmladavx.s16 r12, Q1, Q4
-vldrh.u16 Q3, [r1, #124]
-strh r10, [r0,#+238]
-vmladavx.s16 r10, Q3, Q4
-vldrh.u16 Q4, [r1, #126]
-vldrw.u32 Q5, [Q2, #124]
-strh r8, [r0,#+240]
-vmladavx.s16 r8, Q4, Q5
-strh r6, [r0,#+242]
-strh r4, [r0,#+244]
-strh r14, [r0,#+246]
-strh r12, [r0,#+248]
-strh r10, [r0,#+250]
-strh r8, [r0,#+252]
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/schoolbook/main.c b/tests/schoolbook/main.c
index 423b094..811716e 100644
--- a/tests/schoolbook/main.c
+++ b/tests/schoolbook/main.c
@@ -44,9 +44,9 @@
 #define DIMENSION_DIV2 16
 #endif
 
-//#define TEST_ORDINARY
-//#define TEST_ANTICYCLIC
-//#define TEST_KARATSUBA_FWD
+#define TEST_ORDINARY
+#define TEST_ANTICYCLIC
+#define TEST_KARATSUBA_FWD
 #define TEST_KARATSUBA_FWD_MUL_INV
 
 #define NUM_POLY_MULS 21
diff --git a/tests/schoolbook/misc.c b/tests/schoolbook/misc.c
new file mode 120000
index 0000000..9326b99
--- /dev/null
+++ b/tests/schoolbook/misc.c
@@ -0,0 +1 @@
+../common/misc.c
\ No newline at end of file
diff --git a/tests/schoolbook/misc.h b/tests/schoolbook/misc.h
new file mode 120000
index 0000000..81b08e0
--- /dev/null
+++ b/tests/schoolbook/misc.h
@@ -0,0 +1 @@
+../common/misc.h
\ No newline at end of file
diff --git a/tests/schoolbook/poly.c b/tests/schoolbook/poly.c
new file mode 120000
index 0000000..57b8f97
--- /dev/null
+++ b/tests/schoolbook/poly.c
@@ -0,0 +1 @@
+../common/poly.c
\ No newline at end of file
diff --git a/tests/schoolbook/poly.h b/tests/schoolbook/poly.h
new file mode 120000
index 0000000..3a14842
--- /dev/null
+++ b/tests/schoolbook/poly.h
@@ -0,0 +1 @@
+../common/poly.h
\ No newline at end of file
diff --git a/tests/schoolbook/poly_u16_32.s b/tests/schoolbook/poly_u16_32.s
deleted file mode 100644
index 99cb1a6..0000000
--- a/tests/schoolbook/poly_u16_32.s
+++ /dev/null
@@ -1,1050 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd_handshuffle, %function
-.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd_handshuffle
-poly_u16_mul_32_anticyclic_karatsuba_mve_simd_handshuffle:
-push {r4-r11,lr}
-vpush {d0-d15}
-vld20.u16 {Q4, Q5}, [r2]
-sub sp, sp, #224
-vld21.u16 {Q4, Q5}, [r2]!
-mov r11, sp
-vld20.u16 {Q6, Q7}, [r2]
-ldrd r10, r9, [r1, #24]
-vld21.u16 {Q6, Q7}, [r2]!
-vmul.u16 Q2, Q4, r9
-vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)]
-ldrd r8, r7, [r1, #56]
-vmov.u16 Q5, #0
-vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)]
-vmul.u16 Q3, Q4, r7
-vneg.s16 Q7, Q6
-vmla.s16 Q2, Q7, r7
-ldrd r6, r5, [r1, #16]
-vmla.s16 Q3, Q6, r9
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r10
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r8
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r8
-vld20.u16 {Q0, Q1}, [r1]
-ldrd r9, r7, [r1, #48]
-vmla.s16 Q3, Q6, r10
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r5
-vshlc Q3, r12, #16
-vmla.s16 Q3, Q4, r7
-vshlc Q5, r12, #16
-vmla.s16 Q2, Q7, r7
-vld21.u16 {Q0, Q1}, [r1]!
-ldrd r10, r8, [r1, #(-32 + 8)]
-vmla.s16 Q3, Q6, r5
-vshlc Q2, r12, #16
-vmla.s16 Q2, Q4, r6
-vshlc Q3, r12, #16
-vst20.u16 {Q1, Q2}, [r11]
-vadd.u16 Q0, Q0, Q1
-vmla.s16 Q3, Q4, r9
-vshlc Q5, r12, #16
-vst21.u16 {Q1, Q2}, [r11]!
-vmla.s16 Q2, Q7, r9
-vst20.u16 {Q0, Q1}, [r11]
-ldrd r7, r5, [r1, #(-32 + 40)]
-vmla.s16 Q3, Q6, r6
-vst21.u16 {Q0, Q1}, [r11]!
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! 
-//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, 
#16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -mov r14, #19 -wls r14, r14, loop_end -loop_start: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! 
-vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! 
-vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 
-vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! 
-vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -le r14, loop_start -loop_end: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! -vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! 
-vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 
-vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, 
r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -// vld20.u16 {Q4, Q5}, [r2] -// vld21.u16 {Q4, Q5}, [r2]! -// vld20.u16 {Q6, Q7}, [r2] -// vld21.u16 {Q6, Q7}, [r2]! -// vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -// vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -// mov r12, #0 -// mov r11, sp -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r1, #24] -// vmul.u16 Q2, Q4, r9 -// ldrd r8, r7, [r1, #56] -// vmul.u16 Q3, Q4, r7 -// vneg.s16 Q7, Q6 -// vmla.s16 Q2, Q7, r7 -// ldrd r6, r5, [r1, #16] -// vmla.s16 Q3, Q6, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r8 -// ldrd r9, r7, [r1, #48] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r10, r8, [r1, #8] -// vmla.s16 Q3, Q6, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r9 -// ldrd r7, r5, [r1, #40] -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r9, r6, [r1, #0] -// vmla.s16 Q3, Q6, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r8, r5, [r1, #32] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q6, r9 -// vstrh.u16 Q3, [r11,#(144)] -// vsub.u16 Q2, Q2, Q5 -// vmla.s16 Q2, Q7, r8 -// vstrh.u16 Q2, [r11,#(128)] -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 
{Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 {Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -// vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r11, #-104] -// vmul.u16 Q2, Q0, r10 -// ldrd r8, r7, [r11, #-40] -// vmul.u16 Q3, Q0, r8 -// vneg.s16 Q7, Q1 -// vmla.s16 Q2, Q7, r8 -// ldrd r6, r5, [r11, #-112] -// ldrd r4, r3, [r11, #-48] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r10, r8, [r11, #-120] -// vmla.s16 Q3, Q1, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// ldrd r5, r3, [r11, #-56] -// vmla.s16 Q3, Q1, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r6, r4, [r11, #-64] -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r8, r3, [r11, #-128] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r3 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// vmla.s16 Q3, Q1, r3 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r6 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r6 -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// neg r7, r7 -// vmla.s16 Q2, Q0, r7 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q1, r7 -// vsub.u16 Q2, Q2, 
Q5 -// vmla.s16 Q2, Q7, r9 -// vadd.u16 Q4, Q4, Q0 -// vldrh.u16 Q5, [r11,#0] -// vadd.u16 Q5, Q5, Q2 -// vldrh.u16 Q7, [r11,#16] -// vadd.u16 Q7, Q7, Q3 -// vstrh.u16 Q5, [r0, #0] -// vstrh.u16 Q7, [r0, #16] -// vadd.u16 Q6, Q6, Q1 -// vneg.s16 Q3, Q3 -// vmov.u16 Q0, #0 -// mov r12, #0 -// ldrd r10, r9, [r11, #-72] -// vmla.s16 Q3, Q4, r9 -// ldrd r8, r7, [r11, #-8] -// vmla.s16 Q2, Q4, r7 -// vneg.s16 Q1, Q6 -// vmla.s16 Q3, Q1, r7 -// ldrd r6, r5, [r11, #-80] -// vmla.s16 Q2, Q6, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r8 -// ldrd r9, r7, [r11, #-16] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r10, r8, [r11, #-88] -// vmla.s16 Q2, Q6, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r9 -// ldrd r7, r5, [r11, #-24] -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// ldrd r9, r6, [r11, #-96] -// vmla.s16 Q2, Q6, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r8, r5, [r11, #-32] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q2, Q6, r9 -// vsub.u16 Q3, Q3, Q0 -// vmla.s16 Q3, Q1, r8 -// vldrh.u16 Q0, [r11,#0] -// vldrh.u16 Q1, [r11,#16] -// vsub.u16 Q0, Q3, Q0 -// vsub.u16 Q1, Q2, Q1 -// vstrh.u16 Q0, [r0, #32] -// vstrh.u16 Q1, [r0, #48] - -add sp, sp, #224 
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
diff --git a/tests/schoolbook/poly_u16_mul_16_anticyclic_opt_mve_simd.s b/tests/schoolbook/poly_u16_mul_16_anticyclic_opt_mve_simd.s
deleted file mode 100644
index f244006..0000000
--- a/tests/schoolbook/poly_u16_mul_16_anticyclic_opt_mve_simd.s
+++ /dev/null
@@ -1,423 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// -.syntax unified -.type poly_u16_mul_16_anticyclic_opt_mve_simd_manual, %function -.global poly_u16_mul_16_anticyclic_opt_mve_simd_manual -poly_u16_mul_16_anticyclic_opt_mve_simd_manual: -push {r4-r11,lr} -vpush {d0-d15} -ldrd r10, r11, [r1], #32 -ldrd r8, r9, [r1, #-16] -ldrd r6, r7, [r1, #-8] -vldrh.u16 Q2, [r0, #0] -vldrh.u16 Q3, [r0, #16] -vldrh.u16 Q0, [r2], #16 -vmla.s16 Q2, Q0, r6 -vldrh.u16 Q1, [r2], #16 -vmla.s16 Q3, Q1, r6 -ldrd r4, r5, [r1, #-24] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r9 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r11 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r7 -asrl r4, r5, #16 -asrl r6, r7, #16 -asrl r10, r11, #16 -asrl r8, r9, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r5 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vldrh.u16 Q7, [r2], #16 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r11 -vldrh.u16 Q6, [r2], #16 -vmla.s16 Q3, Q0, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vldrh.u16 Q5, [r0, #32] -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0], #16 -vmla.s16 Q3, Q0, r8 -//vstrh.u16 Q3, [r0], #16 -mov r11, #30 -wls r14, r11, loop_end -loop_start: -ldrd r10, r11, [r1], #32 -ldrd r8, r9, [r1, #-16] -ldrd r6, r7, [r1, #-8] -//vstrh.u16 Q3, 
[r0], #16 -vldrh.u16 Q4, [r0, #16] -//vldrh.u16 Q6, [r2], #16 -//vldrh.u16 Q5, [r0, #0] -//vldrh.u16 Q7, [r2], #16 -vmla.s16 Q5, Q7, r6 -//vldrh.u16 Q4, [r0, #16] -vstrh.u16 Q3, [r0], #16 -vmla.s16 Q4, Q6, r6 -ldrd r4, r5, [r1, #-24] -mov r12, #0 -vmla.s16 Q5, Q6, r4 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r4 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r11 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r9 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r9 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r11 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r8 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r10 -vneg.s16 Q2, Q6 -vmla.s16 Q5, Q7, r5 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r5 -vsub.u16 Q5, Q5, Q3 -vmla.s16 Q4, Q7, r7 -asrl r4, r5, #16 -asrl r6, r7, #16 -asrl r10, r11, #16 -asrl r8, r9, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r5 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r4 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r6 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r4 -vldrh.u16 Q0, [r2], #16 -vmla.s16 Q4, Q7, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r11 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r9 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r11 -vldrh.u16 Q1, [r2], #16 -vmla.s16 Q4, Q7, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vldrh.u16 Q2, [r0, #32] -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0], #16 -vmla.s16 Q4, Q7, r8 -ldrd r10, r11, [r1], #32 -ldrd r8, r9, [r1, #-16] -ldrd r6, r7, [r1, #-8] -//vldrh.u16 Q2, [r0, #0] -vldrh.u16 Q3, [r0, #16] -//vldrh.u16 Q0, [r2], #16 -vmla.s16 Q2, Q0, r6 -vstrh.u16 Q4, [r0], #16 -//vldrh.u16 Q1, [r2], #16 -vmla.s16 Q3, Q1, r6 -ldrd r4, r5, [r1, #-24] -mov r12, #0 -vmla.s16 Q2, Q1, r4 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r4 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r11 -vshlc Q3, r12, 
#32 -vmla.s16 Q2, Q0, r9 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r11 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r8 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r10 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r7 -asrl r4, r5, #16 -asrl r6, r7, #16 -asrl r10, r11, #16 -asrl r8, r9, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r5 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r6 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r4 -vldrh.u16 Q7, [r2], #16 -vmla.s16 Q3, Q0, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r11 -vldrh.u16 Q6, [r2], #16 -vmla.s16 Q3, Q0, r9 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r8 -vldrh.u16 Q5, [r0, #32] -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vsub.u16 Q2, Q2, Q4 -vstrh.u16 Q2, [r0], #16 -vmla.s16 Q3, Q0, r8 -//vstrh.u16 Q3, [r0], #16 -le r14, loop_start -loop_end: -ldrd r14, r11, [r1], #32 -ldrd r10, r9, [r1, #-16] -ldrd r8, r7, [r1, #-8] -vstrh.u16 Q3, [r0], #16 -//vldrh.u16 Q6, [r2], #16 -//vldrh.u16 Q5, [r0, #0] -//vldrh.u16 Q7, [r2], #16 -vmla.s16 Q5, Q7, r8 -vldrh.u16 Q4, [r0, #16] -vmla.s16 Q4, Q6, r8 -ldrd r6, r5, [r1, #-24] -mov r12, #0 -vmla.s16 Q5, Q6, r6 -vneg.s16 Q2, Q7 -vmla.s16 Q4, Q2, r6 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r11 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r9 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r9 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q2, r11 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q6, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q7, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r10 -vshlc Q5, r12, #32 -vmla.s16 Q4, Q2, r14 -vneg.s16 Q2, Q6 -vmla.s16 
Q5, Q7, r5 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r7 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r5 -vsub.u16 Q5, Q5, Q3 -vmla.s16 Q4, Q7, r7 -asrl r6, r5, #16 -asrl r8, r7, #16 -asrl r14, r11, #16 -asrl r10, r9, #16 -vshlc Q5, r12, #16 -vmla.s16 Q5, Q7, r5 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r7 -veor.u16 Q3, Q3, Q3 -vmla.s16 Q4, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q4, Q7, r7 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r6 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r8 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r6 -vmla.s16 Q4, Q7, r8 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r11 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r9 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r11 -vmla.s16 Q4, Q7, r9 -vshlc Q5, r12, #32 -vmla.s16 Q5, Q7, r14 -vshlc Q4, r12, #32 -vmla.s16 Q5, Q2, r10 -vshlc Q3, r12, #32 -vmla.s16 Q4, Q6, r14 -vsub.u16 Q5, Q5, Q3 -vstrh.u16 Q5, [r0], #16 -vmla.s16 Q4, Q7, r10 -vstrh.u16 Q4, [r0], #16 -ldrd r14, r11, [r1], #32 -ldrd r10, r9, [r1, #-16] -ldrd r8, r7, [r1, #-8] -vldrh.u16 Q2, [r0, #0] -vldrh.u16 Q3, [r0, #16] -vldrh.u16 Q0, [r2, #0] -vmla.s16 Q2, Q0, r8 -vldrh.u16 Q1, [r2, #16] -vmla.s16 Q3, Q1, r8 -ldrd r6, r5, [r1, #-24] -mov r12, #0 -vmla.s16 Q2, Q1, r6 -vneg.s16 Q5, Q0 -vmla.s16 Q3, Q5, r6 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r11 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r9 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r9 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q5, r11 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q1, r14 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q0, r10 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #32 -vmla.s16 Q3, Q5, r14 -vneg.s16 Q5, Q1 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #32 -vmla.s16 Q2, Q5, r7 -vshlc Q4, r12, #32 -vmla.s16 Q3, Q1, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q3, Q0, r7 -asrl r6, r5, #16 -asrl r8, r7, #16 -asrl r14, r11, #16 -asrl r10, r9, #16 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q2, Q5, r7 -veor.u16 Q4, Q4, Q4 -vmla.s16 Q3, Q1, r5 -vshlc Q4, r12, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q2, r12, #32 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #32 
-vmla.s16 Q2, Q5, r8
-vshlc Q4, r12, #32
-vmla.s16 Q3, Q1, r6
-vmla.s16 Q3, Q0, r8
-vshlc Q2, r12, #32
-vmla.s16 Q2, Q0, r11
-vshlc Q3, r12, #32
-vmla.s16 Q2, Q5, r9
-vshlc Q4, r12, #32
-vmla.s16 Q3, Q1, r11
-vmla.s16 Q3, Q0, r9
-vshlc Q2, r12, #32
-vmla.s16 Q2, Q0, r14
-vshlc Q3, r12, #32
-vmla.s16 Q2, Q5, r10
-vshlc Q4, r12, #32
-vmla.s16 Q3, Q1, r14
-vsub.u16 Q2, Q2, Q4
-vstrh.u16 Q2, [r0,#(0)]
-vmla.s16 Q3, Q0, r10
-vstrh.u16 Q3, [r0,#(16)]
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
diff --git a/tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual.s b/tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual.s
deleted file mode 100644
index 33e0422..0000000
--- a/tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual.s
+++ /dev/null
@@ -1,276 +0,0 @@
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-.syntax unified
-.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual, %function
-.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual
-poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual:
-push {r4-r11,lr}
-vld20.u16 {Q4, Q5}, [r2]
-sub r12, sp, #(196)
-//sub sp, sp, #32
-vld21.u16 {Q4, Q5}, [r2]!
-mov r14, #0
-vld20.u16 {Q6, Q7}, [r2]
-ldrd r10, r11, [r1, #24]
-vld21.u16 {Q6, Q7}, [r2]!
-vmul.u16 Q2, Q4, r11
-vstrh.u16 q5, [sp, #(+16-32)]
-vmov.u16 Q5, #0
-ldrd r8, r9, [r1, #56]
-vstrh.u16 q7, [sp, #(+0-32)]
-
-
-
-vneg.s16 Q7, Q6
-vmul.u16 Q3, Q4, r9
-vld20.u16 {Q0, Q1}, [r1]
-vmla.s16 Q2, Q7, r9
-ldrd r6, r7, [r1, #16]
-vmla.s16 Q3, Q6, r11
-vshlc Q2, r14, #16
-vmla.s16 Q2, Q4, r10
-vshlc Q3, r14, #16
-vmla.s16 Q3, Q4, r8
-vshlc Q5, r14, #16
-vld21.u16 {Q0, Q1}, [r1]!
-vmla.s16 Q2, Q7, r8 -ldrd r4, r11, [r1, #(48-32)] - - - -vmla.s16 Q3, Q6, r10 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r11 -vshlc Q5, r14, #16 -vst20.u16 {Q1, Q2}, [r12] -vmla.s16 Q2, Q7, r11 -ldrd r10, r9, [r1, #(8-32)] -vadd.u16 Q0, Q0, Q1 -//ldrd r10, r9, [r1, #8] -vmla.s16 Q3, Q6, r7 -vst21.u16 {Q1, Q2}, [r12]! -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r4 -vshlc Q5, r14, #16 -vst20.u16 {Q0, Q1}, [r12] -vmla.s16 Q2, Q7, r4 -ldrd r8, r11, [r1, #(40-32)] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r14, #16 -vst21.u16 {Q0, Q1}, [r12]! -vmla.s16 Q2, Q4, r9 -vshlc Q3, r14, #16 -vld20.u16 {Q0, Q1}, [r1] -vmla.s16 Q3, Q4, r11 -vshlc Q5, r14, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r11 -ldrd r6, r7, [r1, #(0-32-32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r9 -vst20.u16 {Q1, Q2}, [r12] -vshlc Q2, r14, #16 -vst21.u16 {Q1, Q2}, [r12]! -vmla.s16 Q2, Q4, r10 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r14, #16 -vst21.u16 {Q0, Q1}, [r12] -vmla.s16 Q2, Q7, r8 -ldrd r4, r11, [r1, #(32-32-32)] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r11 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r11 -vst20.u16 {Q0, Q1}, [r12]! 
-vmla.s16 Q3, Q6, r7 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r4 -vshlc Q5, r14, #16 -vldrh.u16 q1, [sp,#(-32)] -vmla.s16 Q3, Q6, r6 -vstrh.u16 Q3, [r12,#(144-32-32-32-32)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r4 -//add sp, sp, #32 - -vstrh.u16 Q2, [r12,#(128-32-32-32-32)] -vmov.u16 Q5, #0 -ldrd r10, r11, [r12, #-104] -vldrh.u16 q0, [sp,#(-16)] -vmul.u16 Q2, Q0, r10 -ldrd r8, r9, [r12, #-40] -vneg.s16 Q7, Q1 -vmul.u16 Q3, Q0, r8 -ldrd r6, r7, [r12, #-112] -vmla.s16 Q2, Q7, r8 - -ldrd r4, r5, [r12, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r5 -ldrd r10, r3, [r12, #-120] -vmla.s16 Q3, Q1, r7 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r4 -ldrd r8, r7, [r12, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r7 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r12, #-64] -vmla.s16 Q3, Q1, r3 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r8 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r8 -vadd.u16 Q4, Q4, Q0 -ldrd r4, r7, [r12, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q1, r7 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q0, r4 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r14, #16 -vmla.s16 Q2, Q7, r6 -vadd.u16 Q6, Q6, Q1 -vmla.s16 Q3, Q1, r4 -vshlc Q2, r14, #16 -neg r9, r9 -vmla.s16 Q2, Q0, r9 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q0, r11 -vshlc Q5, r14, #16 -vmla.s16 Q3, Q1, r9 -//vstrh.u16 Q3, [r12,#(48)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r11 -vneg.s16 Q3, Q3 -vldrh.u16 Q5, [r12,#0] -vadd.u16 Q5, Q5, Q2 -//vstrh.u16 Q5, [r0, #0] - -//vstrh.u16 Q2, [r12,#(32)] -//vadd.u16 Q5, Q5, Q2 -//vstrh.u16 Q5, [r0, #0] - - - -ldrd r10, r11, [r12, #-72] 
-vldrh.u16 Q7, [r12,#16] -vsub.u16 Q7, Q7, Q3 -ldrd r8, r9, [r12, #-8] -vmla.s16 Q3, Q4, r11 -vmov.u16 Q0, #0 -vmla.s16 Q2, Q4, r9 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r9 -ldrd r6, r7, [r12, #-80] -vmla.s16 Q2, Q6, r11 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r14, #16 -vstrh.u16 Q7, [r0, #16] -vmla.s16 Q3, Q1, r8 -ldrd r4, r11, [r12, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r11 -vshlc Q0, r14, #16 -vstrh.u16 Q5, [r0, #0] -vmla.s16 Q3, Q1, r11 -ldrd r10, r9, [r12, #-88] -vmla.s16 Q2, Q6, r7 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r4 -vshlc Q0, r14, #16 -vmla.s16 Q3, Q1, r4 -ldrd r8, r11, [r12, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r11 -vshlc Q0, r14, #16 -vmla.s16 Q3, Q1, r11 -ldrd r6, r7, [r12, #-96] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r14, #16 -vmla.s16 Q3, Q1, r8 -ldrd r4, r11, [r12, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r11 -vshlc Q0, r14, #16 -vmla.s16 Q3, Q1, r11 -vmla.s16 Q2, Q6, r7 -vshlc Q3, r14, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r14, #16 -vmla.s16 Q2, Q4, r4 -vshlc Q0, r14, #16 -vmla.s16 Q2, Q6, r6 -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r12,#0] - -//vstrh.u16 Q2, [r12,#(-80)] - -vmla.s16 Q3, Q1, r4 -//vstrh.u16 Q3, [r12,#(-96)] - -vldrh.u16 Q1, [r12,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] -nop -nop -nop -nop -nop -nop -pop {r4-r11,lr} -bx lr diff --git a/tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop.s b/tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop.s deleted file mode 100644 index 7cf7568..0000000 --- a/tests/schoolbook/poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop.s 
+++ /dev/null @@ -1,747 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop, %function -.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop -poly_u16_mul_32_anticyclic_karatsuba_mve_simd_manual_loop: -push {r4-r11,lr} -vpush {d0-d15} -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(-16)] -vstrh.u16 Q7, [sp, #(-32)] -mov r12, #0 -sub r11, sp, #224 -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(-16)] -vldrh.u16 Q1, [sp, #(-32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q4, Q4, Q0 -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] 
-vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -mov r12, #19 -wls r14, r12, loop_end -loop_start: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(-16)] -vstrh.u16 Q7, [sp, #(-32)] -mov r12, #0 -sub r11, sp, #224 -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(-16)] -vldrh.u16 Q1, [sp, #(-32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q4, Q4, Q0 -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] 
-vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -add r0, r0, #64 -le r14, loop_start -loop_end: -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! 
-vstrh.u16 Q5, [sp, #(-16)] -vstrh.u16 Q7, [sp, #(-32)] -mov r12, #0 -sub r11, sp, #224 -vmov.u16 Q5, #0 -ldrd r10, r9, [r1, #24] -vmul.u16 Q2, Q4, r9 -ldrd r8, r7, [r1, #56] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r10, r8, [r1, #8] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r9 -ldrd r7, r5, [r1, #40] -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r9, r6, [r1, #0] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #32] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vmla.s16 Q3, Q6, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(128)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vadd.u16 Q0, Q0, Q1 -vst20.u16 {Q1, Q2}, [r11] -vst21.u16 {Q1, Q2}, [r11]! -vst20.u16 {Q0, Q1}, [r11] -vst21.u16 {Q0, Q1}, [r11]! 
-vldrh.u16 Q0, [sp, #(-16)] -vldrh.u16 Q1, [sp, #(-32)] -vmov.u16 Q5, #0 -ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vstrh.u16 Q5, [r0, #0] -vstrh.u16 Q7, [r0, #16] -vadd.u16 Q4, Q4, Q0 -vadd.u16 Q6, Q6, Q1 -vneg.s16 Q3, Q3 -vmov.u16 Q0, #0 -mov r12, #0 -ldrd r10, r9, [r11, #-72] -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -ldrd r9, r7, [r11, #-16] 
-vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q0, [r11,#0] -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vstrh.u16 Q1, [r0, #48] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr diff --git a/tests/schoolbook/schoolbook.mk b/tests/schoolbook/schoolbook.mk new file mode 100644 index 0000000..153953e --- /dev/null +++ b/tests/schoolbook/schoolbook.mk @@ -0,0 +1,19 @@ +# Test name - needs to match the directory name +TESTS += schoolbook + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +SCHOOLBOOK_PLATFORMS += m55-an547 +SCHOOLBOOK_PLATFORMS += m85-an555 + +# C sources required for this test +SCHOOLBOOK_SOURCES += main.c +SCHOOLBOOK_SOURCES += misc.c +SCHOOLBOOK_SOURCES += poly.c + +# Assembly sources required for this test +SCHOOLBOOK_ASMS += poly_u16_32_acc.s +SCHOOLBOOK_ASMS += auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s +SCHOOLBOOK_ASMS += auto/ 
+SCHOOLBOOK_ASMS += auto/poly_u16_mul_32_anticyclic_mve_simd.s \ No newline at end of file diff --git a/tests/sqmag/main.c b/tests/sqmag/main.c index b35cf52..270113f 100644 --- a/tests/sqmag/main.c +++ b/tests/sqmag/main.c @@ -106,7 +106,8 @@ int main(void) hal_pmu_enable(); debug_printf( "Squared magnitude test!\n" ); bench_sqmag(); - debug_printf( "Done!\n:" ); + debug_printf( "Done!\n" ); hal_pmu_disable(); + debug_printf( "ALL GOOD!\n" ); return( 0 ); } diff --git a/tests/sqmag/misc.c b/tests/sqmag/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/sqmag/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/sqmag/misc.h b/tests/sqmag/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/sqmag/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/sqmag/sqmag.mk b/tests/sqmag/sqmag.mk new file mode 100644 index 0000000..03423a8 --- /dev/null +++ b/tests/sqmag/sqmag.mk @@ -0,0 +1,21 @@ +# Test name - needs to match the directory name +TESTS += sqmag + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +SQMAG_PLATFORMS += m55-an547 +SQMAG_PLATFORMS += m85-an555 + +# C sources required for this test +SQMAG_SOURCES += main.c +SQMAG_SOURCES += misc.c + +# Assembly sources required for this test +SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll1.s +SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll2.s +SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll4.s +SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M85_unroll1.s +SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M85_unroll2.s +SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M85_unroll4.s +SQMAG_ASMS += cmplx_mag_sqr_fx.s \ No newline at end of file diff --git a/tests/toom/main.c b/tests/toom/main.c index fa32c66..7686ff6 100644 --- a/tests/toom/main.c +++ b/tests/toom/main.c @@ -54,27 +54,27 @@ /* #define TEST_TOOM3_FWD_MUL_INV */ /* #define 
TEST_TOOM3_STANDALONE */ -//#define TEST_TOOM4_FWD -//#define TEST_TOOM4_FWD_OOP -//#define TEST_TOOM4_FWD_KARATSUBA_X1_OOP -//#define TEST_TOOM4_FWD_KARATSUBA_X2_OOP -//#define TEST_TOOM4_FWD_DUAL_TOP -//#define TEST_TOOM4_FWD_DUAL_TOP_OOP -//#define TEST_TOOM4_FWD_DUAL_PACKED_LIMBS_OOP -//#define TEST_TOOM4_FWD_DUAL_PACKED_LIMBS_KARATSUBA_X1_OOP -//#define TEST_TOOM4_FWD_DUAL_PACKED_LIMBS_KARATSUBA_X2_OOP -//#define TEST_TOOM4_FWD_DUAL_BOTTOM -//#define TEST_TOOM4_FWD_INV -//#define TEST_TOOM4_FWD_INV_DUAL_TOP -//#define TEST_TOOM4_FWD_INV_DUAL_TOP_OOP -//#define TEST_TOOM4_FWD_INV_DUAL_BOTTOM -//#define TEST_TOOM4_FWD_INV_DUAL_PACKED_LIMBS_OOP -//#define TEST_TOOM4_FWD_INV_DUAL_BOTTOM_OOP -//#define TEST_TOOM4_FWD_MUL_INV -//#define TEST_TOOM4_FWD_MUL_INV_DUAL_BOTTOM -//#define TEST_TOOM4_FWD_MUL_INV_DUAL_TOP +#define TEST_TOOM4_FWD +#define TEST_TOOM4_FWD_OOP +#define TEST_TOOM4_FWD_KARATSUBA_X1_OOP +#define TEST_TOOM4_FWD_KARATSUBA_X2_OOP +#define TEST_TOOM4_FWD_DUAL_TOP +#define TEST_TOOM4_FWD_DUAL_TOP_OOP +#define TEST_TOOM4_FWD_DUAL_PACKED_LIMBS_OOP +#define TEST_TOOM4_FWD_DUAL_PACKED_LIMBS_KARATSUBA_X1_OOP +#define TEST_TOOM4_FWD_DUAL_PACKED_LIMBS_KARATSUBA_X2_OOP +#define TEST_TOOM4_FWD_DUAL_BOTTOM +#define TEST_TOOM4_FWD_INV +#define TEST_TOOM4_FWD_INV_DUAL_TOP +#define TEST_TOOM4_FWD_INV_DUAL_TOP_OOP +#define TEST_TOOM4_FWD_INV_DUAL_BOTTOM +#define TEST_TOOM4_FWD_INV_DUAL_PACKED_LIMBS_OOP +#define TEST_TOOM4_FWD_INV_DUAL_BOTTOM_OOP +#define TEST_TOOM4_FWD_MUL_INV +#define TEST_TOOM4_FWD_MUL_INV_DUAL_BOTTOM +#define TEST_TOOM4_FWD_MUL_INV_DUAL_TOP #define TEST_TOOM4_FWD_MUL_INV_DUAL_PACKED_LIMBS_OOP -//#define TEST_TOOM4_STANDALONE +#define TEST_TOOM4_STANDALONE //#define TEST_ALL @@ -2477,6 +2477,8 @@ int unfold(test_toom_all) (void) return( 1 ); #endif + debug_printf( "ALL GOOD!\n" ); + return( 0 ); } diff --git a/tests/toom/misc.c b/tests/toom/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/toom/misc.c @@ -0,0 +1 @@ 
+../common/misc.c \ No newline at end of file diff --git a/tests/toom/misc.h b/tests/toom/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/toom/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/toom/poly.c b/tests/toom/poly.c new file mode 120000 index 0000000..57b8f97 --- /dev/null +++ b/tests/toom/poly.c @@ -0,0 +1 @@ +../common/poly.c \ No newline at end of file diff --git a/tests/toom/poly.h b/tests/toom/poly.h new file mode 120000 index 0000000..3a14842 --- /dev/null +++ b/tests/toom/poly.h @@ -0,0 +1 @@ +../common/poly.h \ No newline at end of file diff --git a/tests/toom/toom.mk b/tests/toom/toom.mk new file mode 100644 index 0000000..244cfc3 --- /dev/null +++ b/tests/toom/toom.mk @@ -0,0 +1,58 @@ +# Test name - needs to match the directory name +TESTS += toom + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +TOOM_PLATFORMS += m55-an547 +TOOM_PLATFORMS += m85-an555 + +# C sources required for this test +TOOM_SOURCES += main.c +TOOM_SOURCES += misc.c +TOOM_SOURCES += poly.c + + +# Assembly sources required for this test +# TODO: not all these are required; delete the other ones? 
+# TODO: should move those to the asm dir +TOOM_ASMS += auto/poly_u16_mul_64_toom4_mve.s +TOOM_ASMS += auto/poly_u16_mul_192_toom3_mve.s +TOOM_ASMS += auto/poly_u16_mul_256_toom4_mve.s +TOOM_ASMS += auto/poly_u16_mul_512_toom4_mve.s +TOOM_ASMS += auto/poly_u16_mul_768_toom3_mve.s +TOOM_ASMS += auto/poly_u16_mul_768_toom4_mve.s +TOOM_ASMS += auto/poly_u16_mul_832_toom4_mve.s +TOOM_ASMS += auto/poly_u16_toom3_fwd_192.s +TOOM_ASMS += auto/poly_u16_toom3_fwd_768.s +TOOM_ASMS += auto/poly_u16_toom3_inv_full_192.s +TOOM_ASMS += auto/poly_u16_toom3_inv_full_768.s +TOOM_ASMS += auto/poly_u16_toom3_inv_half_192.s +TOOM_ASMS += auto/poly_u16_toom3_inv_half_768.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_bottom.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_oop.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_top_oop.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_top.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_256.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_512.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_768.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_832.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s +TOOM_ASMS += auto/poly_u16_toom4_fwd_oop_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_dual_bottom_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_dual_bottom_oop_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_dual_top_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_dual_top_oop_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_full_256.s +TOOM_ASMS += auto/poly_u16_toom4_inv_full_512.s +TOOM_ASMS += auto/poly_u16_toom4_inv_full_768.s +TOOM_ASMS += auto/poly_u16_toom4_inv_full_832.s +TOOM_ASMS += auto/poly_u16_toom4_inv_half_256.s 
+TOOM_ASMS += auto/poly_u16_toom4_inv_half_512.s +TOOM_ASMS += auto/poly_u16_toom4_inv_half_768.s +TOOM_ASMS += auto/poly_u16_toom4_inv_half_832.s \ No newline at end of file diff --git a/tests/transpose/main.c b/tests/transpose/main.c index 2e647f1..0a60d87 100644 --- a/tests/transpose/main.c +++ b/tests/transpose/main.c @@ -104,6 +104,6 @@ int main (void) ret = test_transpose(); if( ret != 0 ) return( 1 ); - + debug_printf( "ALL GOOD!\n" ); return( 0 ); } diff --git a/tests/transpose/misc.c b/tests/transpose/misc.c new file mode 120000 index 0000000..9326b99 --- /dev/null +++ b/tests/transpose/misc.c @@ -0,0 +1 @@ +../common/misc.c \ No newline at end of file diff --git a/tests/transpose/misc.h b/tests/transpose/misc.h new file mode 120000 index 0000000..81b08e0 --- /dev/null +++ b/tests/transpose/misc.h @@ -0,0 +1 @@ +../common/misc.h \ No newline at end of file diff --git a/tests/transpose/transpose.mk b/tests/transpose/transpose.mk new file mode 100644 index 0000000..4861c6a --- /dev/null +++ b/tests/transpose/transpose.mk @@ -0,0 +1,15 @@ +# Test name - needs to match the directory name +TESTS += transpose + +# All further variables must be prefixed with the capitalized test name + +# Platforms this test should run on (matching the directory name in envs/) +TRANSPOSE_PLATFORMS += m55-an547 +TRANSPOSE_PLATFORMS += m85-an555 + +# C sources required for this test +TRANSPOSE_SOURCES += main.c +TRANSPOSE_SOURCES += misc.c + +# Assembly sources required for this test +TRANSPOSE_ASMS += ../../asm/auto/gather/transpose_u16x8_4x4.s diff --git a/tools/README.md b/tools/README.md deleted file mode 100644 index 20b7868..0000000 --- a/tools/README.md +++ /dev/null @@ -1,3 +0,0 @@ -# Tools - -Convenience location to put various external tools required for the test environments. From 7d4ee1016a6f74c0360556527b3fd8cb5d3b5ac4 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Thu, 18 Jul 2024 13:49:02 +0800 Subject: [PATCH 21/32] disable build that overflows flash --- tests/ntt-768/ntt-768.mk | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tests/ntt-768/ntt-768.mk b/tests/ntt-768/ntt-768.mk index d38fd5b..50d0de9 100644 --- a/tests/ntt-768/ntt-768.mk +++ b/tests/ntt-768/ntt-768.mk @@ -5,7 +5,9 @@ TESTS += ntt-768 # Platforms this test should run on (matching the directory name in envs/) NTT_768_PLATFORMS += m55-an547 -NTT_768_PLATFORMS += m85-an555 + +# TODO: Currently overflows flash, but can probably tweak the linker script to make it work; need to be tested on the board +#NTT_768_PLATFORMS += m85-an555 # C sources required for this test NTT_768_SOURCES += main.c From fc76c5cdb076af2da0e82b3c9faef7ff6a9634b4 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Thu, 18 Jul 2024 14:34:30 +0800 Subject: [PATCH 22/32] move {poly,misc}.{c,h} back to common --- envs/m55-an547/Makefile | 6 +++++- envs/m85-an555/Makefile | 6 +++++- tests/chunk/chunk.mk | 1 - tests/chunk/misc.c | 1 - tests/chunk/misc.h | 1 - tests/crt/crt.mk | 1 - tests/crt/misc.c | 1 - tests/crt/misc.h | 1 - tests/flt-fft/flt-fft.mk | 1 - tests/flt-fft/misc.c | 1 - tests/flt-fft/misc.h | 1 - tests/fx-fft/fx-fft.mk | 2 -- tests/fx-fft/misc.c | 1 - tests/fx-fft/misc.h | 1 - tests/helloworld/helloworld.mk | 1 - tests/helloworld/misc.c | 1 - tests/helloworld/misc.h | 1 - tests/intmulntt/intmulntt.mk | 2 -- tests/intmulntt/misc.c | 1 - tests/intmulntt/misc.h | 1 - tests/intmulntt/poly.c | 1 - tests/intmulntt/poly.h | 1 - tests/karatsuba/karatsuba.mk | 1 - tests/karatsuba/misc.c | 1 - tests/karatsuba/misc.h | 1 - tests/karatsuba/poly.h | 1 - tests/montgomery/misc.c | 1 - tests/montgomery/misc.h | 1 - tests/montgomery/montgomery.mk | 2 -- tests/montgomery/poly.c | 1 - tests/montgomery/poly.h | 1 - tests/ntt-1024/misc.c | 1 - tests/ntt-1024/misc.h | 1 - tests/ntt-1024/ntt-1024.mk | 2 -- tests/ntt-1024/poly.c | 1 - 
 tests/ntt-1024/poly.h | 1 -
 tests/ntt-192/misc.c | 1 -
 tests/ntt-192/misc.h | 1 -
 tests/ntt-192/ntt-192.mk | 2 --
 tests/ntt-192/poly.c | 1 -
 tests/ntt-192/poly.h | 1 -
 tests/ntt-256/misc.c | 1 -
 tests/ntt-256/misc.h | 1 -
 tests/ntt-256/ntt-256.mk | 2 --
 tests/ntt-256/poly.c | 1 -
 tests/ntt-256/poly.h | 1 -
 tests/ntt-384/misc.c | 1 -
 tests/ntt-384/misc.h | 1 -
 tests/ntt-384/ntt-384.mk | 2 --
 tests/ntt-384/poly.c | 1 -
 tests/ntt-384/poly.h | 1 -
 tests/ntt-512/misc.c | 1 -
 tests/ntt-512/misc.h | 1 -
 tests/ntt-512/ntt-512.mk | 2 --
 tests/ntt-512/poly.c | 1 -
 tests/ntt-512/poly.h | 1 -
 tests/ntt-768/misc.c | 1 -
 tests/ntt-768/misc.h | 1 -
 tests/ntt-768/ntt-768.mk | 2 --
 tests/ntt-768/poly.c | 1 -
 tests/ntt-768/poly.h | 1 -
 tests/ntt-dilithium/misc.c | 1 -
 tests/ntt-dilithium/misc.h | 1 -
 tests/ntt-dilithium/ntt-dilithium.mk | 2 --
 tests/ntt-dilithium/poly.c | 1 -
 tests/ntt-dilithium/poly.h | 1 -
 tests/ntt-kyber/misc.c | 1 -
 tests/ntt-kyber/misc.h | 1 -
 tests/ntt-kyber/ntt-kyber.mk | 2 --
 tests/ntt-kyber/poly.c | 1 -
 tests/ntt-kyber/poly.h | 1 -
 tests/ntt-n256/misc.c | 1 -
 tests/ntt-n256/misc.h | 1 -
 tests/ntt-n256/ntt-n256.mk | 2 --
 tests/ntt-n256/poly.c | 1 -
 tests/ntt-n256/poly.h | 1 -
 tests/permute/misc.c | 1 -
 tests/permute/misc.h | 1 -
 tests/permute/permute.mk | 1 -
 tests/poly/misc.c | 1 -
 tests/poly/misc.h | 1 -
 tests/poly/poly.c | 1 -
 tests/poly/poly.h | 1 -
 tests/poly/poly.mk | 2 --
 tests/saber/misc.c | 1 -
 tests/saber/misc.h | 1 -
 tests/saber/poly.c | 1 -
 tests/saber/poly.h | 1 -
 tests/saber/saber.mk | 1 -
 tests/schoolbook/misc.c | 1 -
 tests/schoolbook/misc.h | 1 -
 tests/schoolbook/poly.c | 1 -
 tests/schoolbook/poly.h | 1 -
 tests/schoolbook/schoolbook.mk | 2 --
 tests/sqmag/misc.c | 1 -
 tests/sqmag/misc.h | 1 -
 tests/sqmag/sqmag.mk | 1 -
 tests/toom/misc.c | 1 -
 tests/toom/misc.h | 1 -
 tests/toom/poly.c | 1 -
 tests/toom/poly.h | 1 -
 tests/toom/toom.mk | 3 ---
 tests/transpose/misc.c | 1 -
 tests/transpose/misc.h | 1 -
 tests/transpose/transpose.mk | 1 -
 105 files changed, 10 insertions(+), 121 deletions(-)
 delete mode 120000 tests/chunk/misc.c
 delete mode 120000 tests/chunk/misc.h
 delete mode 120000 tests/crt/misc.c
 delete mode 120000 tests/crt/misc.h
 delete mode 120000 tests/flt-fft/misc.c
 delete mode 120000 tests/flt-fft/misc.h
 delete mode 120000 tests/fx-fft/misc.c
 delete mode 120000 tests/fx-fft/misc.h
 delete mode 120000 tests/helloworld/misc.c
 delete mode 120000 tests/helloworld/misc.h
 delete mode 120000 tests/intmulntt/misc.c
 delete mode 120000 tests/intmulntt/misc.h
 delete mode 120000 tests/intmulntt/poly.c
 delete mode 120000 tests/intmulntt/poly.h
 delete mode 120000 tests/karatsuba/misc.c
 delete mode 120000 tests/karatsuba/misc.h
 delete mode 120000 tests/karatsuba/poly.h
 delete mode 120000 tests/montgomery/misc.c
 delete mode 120000 tests/montgomery/misc.h
 delete mode 120000 tests/montgomery/poly.c
 delete mode 120000 tests/montgomery/poly.h
 delete mode 120000 tests/ntt-1024/misc.c
 delete mode 120000 tests/ntt-1024/misc.h
 delete mode 120000 tests/ntt-1024/poly.c
 delete mode 120000 tests/ntt-1024/poly.h
 delete mode 120000 tests/ntt-192/misc.c
 delete mode 120000 tests/ntt-192/misc.h
 delete mode 120000 tests/ntt-192/poly.c
 delete mode 120000 tests/ntt-192/poly.h
 delete mode 120000 tests/ntt-256/misc.c
 delete mode 120000 tests/ntt-256/misc.h
 delete mode 120000 tests/ntt-256/poly.c
 delete mode 120000 tests/ntt-256/poly.h
 delete mode 120000 tests/ntt-384/misc.c
 delete mode 120000 tests/ntt-384/misc.h
 delete mode 120000 tests/ntt-384/poly.c
 delete mode 120000 tests/ntt-384/poly.h
 delete mode 120000 tests/ntt-512/misc.c
 delete mode 120000 tests/ntt-512/misc.h
 delete mode 120000 tests/ntt-512/poly.c
 delete mode 120000 tests/ntt-512/poly.h
 delete mode 120000 tests/ntt-768/misc.c
 delete mode 120000 tests/ntt-768/misc.h
 delete mode 120000 tests/ntt-768/poly.c
 delete mode 120000 tests/ntt-768/poly.h
 delete mode 120000 tests/ntt-dilithium/misc.c
 delete mode 120000 tests/ntt-dilithium/misc.h
 delete mode 120000 tests/ntt-dilithium/poly.c
 delete mode 120000 tests/ntt-dilithium/poly.h
 delete mode 120000 tests/ntt-kyber/misc.c
 delete mode 120000 tests/ntt-kyber/misc.h
 delete mode 120000 tests/ntt-kyber/poly.c
 delete mode 120000 tests/ntt-kyber/poly.h
 delete mode 120000 tests/ntt-n256/misc.c
 delete mode 120000 tests/ntt-n256/misc.h
 delete mode 120000 tests/ntt-n256/poly.c
 delete mode 120000 tests/ntt-n256/poly.h
 delete mode 120000 tests/permute/misc.c
 delete mode 120000 tests/permute/misc.h
 delete mode 120000 tests/poly/misc.c
 delete mode 120000 tests/poly/misc.h
 delete mode 120000 tests/poly/poly.c
 delete mode 120000 tests/poly/poly.h
 delete mode 120000 tests/saber/misc.c
 delete mode 120000 tests/saber/misc.h
 delete mode 120000 tests/saber/poly.c
 delete mode 120000 tests/saber/poly.h
 delete mode 120000 tests/schoolbook/misc.c
 delete mode 120000 tests/schoolbook/misc.h
 delete mode 120000 tests/schoolbook/poly.c
 delete mode 120000 tests/schoolbook/poly.h
 delete mode 120000 tests/sqmag/misc.c
 delete mode 120000 tests/sqmag/misc.h
 delete mode 120000 tests/toom/misc.c
 delete mode 120000 tests/toom/misc.h
 delete mode 120000 tests/toom/poly.c
 delete mode 120000 tests/toom/poly.h
 delete mode 120000 tests/transpose/misc.c
 delete mode 120000 tests/transpose/misc.h

diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile
index e27efa1..38fb522 100644
--- a/envs/m55-an547/Makefile
+++ b/envs/m55-an547/Makefile
@@ -8,6 +8,7 @@ BUILD_DIR=./build
 
 COMMON_INC=../common/inc/
 ENV_INC=./inc/
+TEST_COMMON=../../tests/common/
 
 SYSROOT := $(shell $(CC) --print-sysroot)
 
 CFLAGS += \
@@ -22,6 +23,7 @@ CFLAGS += \
 	-I$(ENV_INC) \
 	-I$(SRC_DIR) \
 	-I$(TESTDIR) \
+	-I$(TEST_COMMON) \
 	-I$(SRC_DIR)/platform
 
 ARCH_FLAGS += \
@@ -52,8 +54,10 @@ all: $(TARGET)
 
 HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c)
 OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES)))
+TEST_COMMON_SOURCES = $(wildcard $(TEST_COMMON)/*.c)
+OBJECTS_TEST_COMMON = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(TEST_COMMON_SOURCES)))
 OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES)))
-OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL)
+OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) $(OBJECTS_TEST_COMMON)
 OBJECTS_ASM = $(patsubst %.s, $(BUILD_DIR)/%.s.o, $(abspath $(ASMS)))
 OBJECTS = $(OBJECTS_C) $(OBJECTS_ASM)
diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile
index 13e0851..838eed4 100644
--- a/envs/m85-an555/Makefile
+++ b/envs/m85-an555/Makefile
@@ -8,6 +8,7 @@ BUILD_DIR=./build
 
 COMMON_INC=../common/inc/
 ENV_INC=./inc/
+TEST_COMMON=../../tests/common/
 
 SYSROOT := $(shell $(CC) --print-sysroot)
 
 CFLAGS += \
@@ -22,6 +23,7 @@ CFLAGS += \
 	-I$(ENV_INC) \
 	-I$(SRC_DIR) \
 	-I$(TESTDIR) \
+	-I$(TEST_COMMON) \
 	-I$(SRC_DIR)/platform
 
 ARCH_FLAGS += \
@@ -52,8 +54,10 @@ all: $(TARGET)
 
 HAL_SOURCES = $(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard ../common/src/*.c)
 OBJECTS_HAL = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(HAL_SOURCES)))
+TEST_COMMON_SOURCES = $(wildcard $(TEST_COMMON)/*.c)
+OBJECTS_TEST_COMMON = $(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(TEST_COMMON_SOURCES)))
 OBJECTS_SOURCES=$(patsubst %.c, $(BUILD_DIR)/%.c.o, $(abspath $(SOURCES)))
-OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL)
+OBJECTS_C = $(OBJECTS_SOURCES) $(OBJECTS_HAL) $(OBJECTS_TEST_COMMON)
 OBJECTS_ASM = $(patsubst %.s, $(BUILD_DIR)/%.s.o, $(abspath $(ASMS)))
 OBJECTS = $(OBJECTS_C) $(OBJECTS_ASM)
diff --git a/tests/chunk/chunk.mk b/tests/chunk/chunk.mk
index e3bca21..7d3c441 100644
--- a/tests/chunk/chunk.mk
+++ b/tests/chunk/chunk.mk
@@ -9,7 +9,6 @@ CHUNK_PLATFORMS += m85-an555
 
 # C sources required for this test
 CHUNK_SOURCES += main.c
-CHUNK_SOURCES += misc.c
 
 # Assembly sources required for this test
 CHUNK_ASMS += chunk.s
diff --git a/tests/chunk/misc.c b/tests/chunk/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/chunk/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/chunk/misc.h b/tests/chunk/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/chunk/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/crt/crt.mk b/tests/crt/crt.mk
index b99df4b..91d35e1 100644
--- a/tests/crt/crt.mk
+++ b/tests/crt/crt.mk
@@ -9,7 +9,6 @@ CRT_PLATFORMS += m85-an555
 
 # C sources required for this test
 CRT_SOURCES += main.c
-CRT_SOURCES += misc.c
 
 # Assembly sources required for this test
 CRT_ASMS += crt.s
diff --git a/tests/crt/misc.c b/tests/crt/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/crt/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/crt/misc.h b/tests/crt/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/crt/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/flt-fft/flt-fft.mk b/tests/flt-fft/flt-fft.mk
index e4a80dc..838868d 100644
--- a/tests/flt-fft/flt-fft.mk
+++ b/tests/flt-fft/flt-fft.mk
@@ -9,7 +9,6 @@ FLT_FFT_PLATFORMS += m85-an555
 
 # C sources required for this test
 FLT_FFT_SOURCES += main.c
-FLT_FFT_SOURCES += misc.c
 
 # Assembly sources required for this test
 FLT_FFT_ASMS += base_ref.s
diff --git a/tests/flt-fft/misc.c b/tests/flt-fft/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/flt-fft/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/flt-fft/misc.h b/tests/flt-fft/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/flt-fft/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/fx-fft/fx-fft.mk b/tests/fx-fft/fx-fft.mk
index 793a1b2..11eb300 100644
--- a/tests/fx-fft/fx-fft.mk
+++ b/tests/fx-fft/fx-fft.mk
@@ -9,8 +9,6 @@ FX_FFT_PLATFORMS += m85-an555
 
 # C sources required for this test
 FX_FFT_SOURCES += main.c
-FX_FFT_SOURCES += misc.c
-
 # Assembly sources required for this test
 FX_FFT_ASMS += base_concrete.s
diff --git a/tests/fx-fft/misc.c b/tests/fx-fft/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/fx-fft/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/fx-fft/misc.h b/tests/fx-fft/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/fx-fft/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/helloworld/helloworld.mk b/tests/helloworld/helloworld.mk
index f41021c..1454763 100644
--- a/tests/helloworld/helloworld.mk
+++ b/tests/helloworld/helloworld.mk
@@ -9,7 +9,6 @@ HELLOWORLD_PLATFORMS += m85-an555
 
 # C sources required for this test
 HELLOWORLD_SOURCES += main.c
-HELLOWORLD_SOURCES += misc.c
 
 # Assembly sources required for this test
 HELLOWORLD_ASMS += mve_test.s
diff --git a/tests/helloworld/misc.c b/tests/helloworld/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/helloworld/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/helloworld/misc.h b/tests/helloworld/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/helloworld/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/intmulntt/intmulntt.mk b/tests/intmulntt/intmulntt.mk
index c978e10..a7250a0 100644
--- a/tests/intmulntt/intmulntt.mk
+++ b/tests/intmulntt/intmulntt.mk
@@ -9,8 +9,6 @@ INTMULNTT_PLATFORMS += m85-an555
 
 # C sources required for this test
 INTMULNTT_SOURCES += main.c
-INTMULNTT_SOURCES += misc.c
-INTMULNTT_SOURCES += poly.c
 
 # Assembly sources required for this test
 INTMULNTT_ASMS += crt.s
diff --git a/tests/intmulntt/misc.c b/tests/intmulntt/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/intmulntt/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/intmulntt/misc.h b/tests/intmulntt/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/intmulntt/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/intmulntt/poly.c b/tests/intmulntt/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/intmulntt/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/intmulntt/poly.h b/tests/intmulntt/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/intmulntt/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/karatsuba/karatsuba.mk b/tests/karatsuba/karatsuba.mk
index 2abccb0..eca2226 100644
--- a/tests/karatsuba/karatsuba.mk
+++ b/tests/karatsuba/karatsuba.mk
@@ -9,7 +9,6 @@ KARATSUBA_PLATFORMS += m85-an555
 
 # C sources required for this test
 KARATSUBA_SOURCES += main.c
-KARATSUBA_SOURCES += misc.c
 
 # Assembly sources required for this test
 KARATSUBA_ASMS += karatsuba.s
diff --git a/tests/karatsuba/misc.c b/tests/karatsuba/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/karatsuba/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/karatsuba/misc.h b/tests/karatsuba/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/karatsuba/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/karatsuba/poly.h b/tests/karatsuba/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/karatsuba/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/montgomery/misc.c b/tests/montgomery/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/montgomery/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/montgomery/misc.h b/tests/montgomery/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/montgomery/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/montgomery/montgomery.mk b/tests/montgomery/montgomery.mk
index 39dd4b4..7c8c893 100644
--- a/tests/montgomery/montgomery.mk
+++ b/tests/montgomery/montgomery.mk
@@ -9,8 +9,6 @@ MONTGOMERY_PLATFORMS += m85-an555
 
 # C sources required for this test
 MONTGOMERY_SOURCES += main.c
-MONTGOMERY_SOURCES += misc.c
-MONTGOMERY_SOURCES += poly.c
 
 # Assembly sources required for this test
 MONTGOMERY_ASMS += montgomery.s
diff --git a/tests/montgomery/poly.c b/tests/montgomery/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/montgomery/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/montgomery/poly.h b/tests/montgomery/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/montgomery/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-1024/misc.c b/tests/ntt-1024/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-1024/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-1024/misc.h b/tests/ntt-1024/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-1024/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-1024/ntt-1024.mk b/tests/ntt-1024/ntt-1024.mk
index 9b6ad8b..f83721c 100644
--- a/tests/ntt-1024/ntt-1024.mk
+++ b/tests/ntt-1024/ntt-1024.mk
@@ -9,8 +9,6 @@ NTT_1024_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_1024_SOURCES += main.c
-NTT_1024_SOURCES += misc.c
-NTT_1024_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_1024_ASM_DIR = ../../asm/auto/ntt_1024
diff --git a/tests/ntt-1024/poly.c b/tests/ntt-1024/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-1024/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-1024/poly.h b/tests/ntt-1024/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-1024/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-192/misc.c b/tests/ntt-192/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-192/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-192/misc.h b/tests/ntt-192/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-192/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-192/ntt-192.mk b/tests/ntt-192/ntt-192.mk
index 72561bb..ac4c5d2 100644
--- a/tests/ntt-192/ntt-192.mk
+++ b/tests/ntt-192/ntt-192.mk
@@ -9,8 +9,6 @@ NTT_192_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_192_SOURCES += main.c
-NTT_192_SOURCES += misc.c
-NTT_192_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_192_ASM_DIR = ../../asm/auto/ntt_192
diff --git a/tests/ntt-192/poly.c b/tests/ntt-192/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-192/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-192/poly.h b/tests/ntt-192/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-192/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-256/misc.c b/tests/ntt-256/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-256/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-256/misc.h b/tests/ntt-256/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-256/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-256/ntt-256.mk b/tests/ntt-256/ntt-256.mk
index 946d81f..a8374f3 100644
--- a/tests/ntt-256/ntt-256.mk
+++ b/tests/ntt-256/ntt-256.mk
@@ -9,8 +9,6 @@ NTT_256_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_256_SOURCES += main.c
-NTT_256_SOURCES += misc.c
-NTT_256_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_256_ASM_DIR = ../../asm/auto/ntt_256
diff --git a/tests/ntt-256/poly.c b/tests/ntt-256/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-256/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-256/poly.h b/tests/ntt-256/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-256/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-384/misc.c b/tests/ntt-384/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-384/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-384/misc.h b/tests/ntt-384/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-384/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-384/ntt-384.mk b/tests/ntt-384/ntt-384.mk
index 1b43ef1..44d9767 100644
--- a/tests/ntt-384/ntt-384.mk
+++ b/tests/ntt-384/ntt-384.mk
@@ -9,8 +9,6 @@ NTT_384_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_384_SOURCES += main.c
-NTT_384_SOURCES += misc.c
-NTT_384_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_384_ASM_DIR = ../../asm/auto/ntt_384
diff --git a/tests/ntt-384/poly.c b/tests/ntt-384/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-384/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-384/poly.h b/tests/ntt-384/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-384/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-512/misc.c b/tests/ntt-512/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-512/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-512/misc.h b/tests/ntt-512/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-512/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-512/ntt-512.mk b/tests/ntt-512/ntt-512.mk
index d240852..9762902 100644
--- a/tests/ntt-512/ntt-512.mk
+++ b/tests/ntt-512/ntt-512.mk
@@ -9,8 +9,6 @@ NTT_512_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_512_SOURCES += main.c
-NTT_512_SOURCES += misc.c
-NTT_512_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_512_ASM_DIR = ../../asm/auto/ntt_512
diff --git a/tests/ntt-512/poly.c b/tests/ntt-512/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-512/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-512/poly.h b/tests/ntt-512/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-512/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-768/misc.c b/tests/ntt-768/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-768/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-768/misc.h b/tests/ntt-768/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-768/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-768/ntt-768.mk b/tests/ntt-768/ntt-768.mk
index 50d0de9..5e934f4 100644
--- a/tests/ntt-768/ntt-768.mk
+++ b/tests/ntt-768/ntt-768.mk
@@ -11,8 +11,6 @@ NTT_768_PLATFORMS += m55-an547
 
 # C sources required for this test
 NTT_768_SOURCES += main.c
-NTT_768_SOURCES += misc.c
-NTT_768_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_768_ASM_DIR = ../../asm/auto/ntt_768
diff --git a/tests/ntt-768/poly.c b/tests/ntt-768/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-768/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-768/poly.h b/tests/ntt-768/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-768/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-dilithium/misc.c b/tests/ntt-dilithium/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-dilithium/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-dilithium/misc.h b/tests/ntt-dilithium/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-dilithium/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-dilithium/ntt-dilithium.mk b/tests/ntt-dilithium/ntt-dilithium.mk
index 560b09f..9c81b78 100644
--- a/tests/ntt-dilithium/ntt-dilithium.mk
+++ b/tests/ntt-dilithium/ntt-dilithium.mk
@@ -9,8 +9,6 @@ NTT_DILITHIUM_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_DILITHIUM_SOURCES += main.c
-NTT_DILITHIUM_SOURCES += poly.c
-NTT_DILITHIUM_SOURCES += misc.c
 
 # Assembly sources required for this test
 NTT_DILITHIUM_ASM_DIR = ../../asm/manual/ntt_dilithium
diff --git a/tests/ntt-dilithium/poly.c b/tests/ntt-dilithium/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-dilithium/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-dilithium/poly.h b/tests/ntt-dilithium/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-dilithium/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-kyber/misc.c b/tests/ntt-kyber/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-kyber/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-kyber/misc.h b/tests/ntt-kyber/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-kyber/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-kyber/ntt-kyber.mk b/tests/ntt-kyber/ntt-kyber.mk
index 1bf86d0..bc1579f 100644
--- a/tests/ntt-kyber/ntt-kyber.mk
+++ b/tests/ntt-kyber/ntt-kyber.mk
@@ -9,8 +9,6 @@ NTT_KYBER_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_KYBER_SOURCES += main.c
-NTT_KYBER_SOURCES += poly.c
-NTT_KYBER_SOURCES += misc.c
 
 # Assembly sources required for this test
 NTT_KYBER_ASM_DIR=../../asm/manual/ntt_kyber
diff --git a/tests/ntt-kyber/poly.c b/tests/ntt-kyber/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-kyber/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-kyber/poly.h b/tests/ntt-kyber/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-kyber/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/ntt-n256/misc.c b/tests/ntt-n256/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/ntt-n256/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/ntt-n256/misc.h b/tests/ntt-n256/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/ntt-n256/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/ntt-n256/ntt-n256.mk b/tests/ntt-n256/ntt-n256.mk
index 924a601..52a0d32 100644
--- a/tests/ntt-n256/ntt-n256.mk
+++ b/tests/ntt-n256/ntt-n256.mk
@@ -9,8 +9,6 @@ NTT_N256_PLATFORMS += m85-an555
 
 # C sources required for this test
 NTT_N256_SOURCES += main.c
-NTT_N256_SOURCES += misc.c
-NTT_N256_SOURCES += poly.c
 
 # Assembly sources required for this test
 NTT_N256_ASM_DIR = ../../asm/auto/ntt_n256
diff --git a/tests/ntt-n256/poly.c b/tests/ntt-n256/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/ntt-n256/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/ntt-n256/poly.h b/tests/ntt-n256/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/ntt-n256/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/permute/misc.c b/tests/permute/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/permute/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/permute/misc.h b/tests/permute/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/permute/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/permute/permute.mk b/tests/permute/permute.mk
index 8277946..4c9e000 100644
--- a/tests/permute/permute.mk
+++ b/tests/permute/permute.mk
@@ -9,7 +9,6 @@ PERMUTE_PLATFORMS += m85-an555
 
 # C sources required for this test
 PERMUTE_SOURCES += main.c
-PERMUTE_SOURCES += misc.c
 
 # Assembly sources required for this test
 PERMUTE_ASM_DIR = ../../asm/auto/permute
diff --git a/tests/poly/misc.c b/tests/poly/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/poly/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/poly/misc.h b/tests/poly/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/poly/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/poly/poly.c b/tests/poly/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/poly/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/poly/poly.h b/tests/poly/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/poly/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/poly/poly.mk b/tests/poly/poly.mk
index 720cae7..2fe77a7 100644
--- a/tests/poly/poly.mk
+++ b/tests/poly/poly.mk
@@ -9,8 +9,6 @@ POLY_PLATFORMS += m85-an555
 
 # C sources required for this test
 POLY_SOURCES += main.c
-POLY_SOURCES += misc.c
-POLY_SOURCES += poly.c
 
 # Assembly sources required for this test
 POLY_ASM_DIR = ./auto
diff --git a/tests/saber/misc.c b/tests/saber/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/saber/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/saber/misc.h b/tests/saber/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/saber/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/saber/poly.c b/tests/saber/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/saber/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/saber/poly.h b/tests/saber/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/saber/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/saber/saber.mk b/tests/saber/saber.mk
index 08f9329..a2d1a4a 100644
--- a/tests/saber/saber.mk
+++ b/tests/saber/saber.mk
@@ -9,7 +9,6 @@ SABER_PLATFORMS += m85-an555
 
 # C sources required for this test
 SABER_SOURCES += main.c
-SABER_SOURCES += misc.c
 SABER_SOURCES += kem.c
 SABER_SOURCES += fips202.c
 SABER_SOURCES += verify.c
diff --git a/tests/schoolbook/misc.c b/tests/schoolbook/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/schoolbook/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/schoolbook/misc.h b/tests/schoolbook/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/schoolbook/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/schoolbook/poly.c b/tests/schoolbook/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/schoolbook/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/schoolbook/poly.h b/tests/schoolbook/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/schoolbook/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/schoolbook/schoolbook.mk b/tests/schoolbook/schoolbook.mk
index 153953e..997ad85 100644
--- a/tests/schoolbook/schoolbook.mk
+++ b/tests/schoolbook/schoolbook.mk
@@ -9,8 +9,6 @@ SCHOOLBOOK_PLATFORMS += m85-an555
 
 # C sources required for this test
 SCHOOLBOOK_SOURCES += main.c
-SCHOOLBOOK_SOURCES += misc.c
-SCHOOLBOOK_SOURCES += poly.c
 
 # Assembly sources required for this test
 SCHOOLBOOK_ASMS += poly_u16_32_acc.s
diff --git a/tests/sqmag/misc.c b/tests/sqmag/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/sqmag/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/sqmag/misc.h b/tests/sqmag/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/sqmag/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/sqmag/sqmag.mk b/tests/sqmag/sqmag.mk
index 03423a8..42f81f8 100644
--- a/tests/sqmag/sqmag.mk
+++ b/tests/sqmag/sqmag.mk
@@ -9,7 +9,6 @@ SQMAG_PLATFORMS += m85-an555
 
 # C sources required for this test
 SQMAG_SOURCES += main.c
-SQMAG_SOURCES += misc.c
 
 # Assembly sources required for this test
 SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll1.s
diff --git a/tests/toom/misc.c b/tests/toom/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/toom/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/toom/misc.h b/tests/toom/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/toom/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/toom/poly.c b/tests/toom/poly.c
deleted file mode 120000
index 57b8f97..0000000
--- a/tests/toom/poly.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.c
\ No newline at end of file
diff --git a/tests/toom/poly.h b/tests/toom/poly.h
deleted file mode 120000
index 3a14842..0000000
--- a/tests/toom/poly.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/poly.h
\ No newline at end of file
diff --git a/tests/toom/toom.mk b/tests/toom/toom.mk
index 244cfc3..776d6b1 100644
--- a/tests/toom/toom.mk
+++ b/tests/toom/toom.mk
@@ -9,9 +9,6 @@ TOOM_PLATFORMS += m85-an555
 
 # C sources required for this test
 TOOM_SOURCES += main.c
-TOOM_SOURCES += misc.c
-TOOM_SOURCES += poly.c
-
 # Assembly sources required for this test
 
 # TODO: not all these are required; delete the other ones?
diff --git a/tests/transpose/misc.c b/tests/transpose/misc.c
deleted file mode 120000
index 9326b99..0000000
--- a/tests/transpose/misc.c
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.c
\ No newline at end of file
diff --git a/tests/transpose/misc.h b/tests/transpose/misc.h
deleted file mode 120000
index 81b08e0..0000000
--- a/tests/transpose/misc.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/misc.h
\ No newline at end of file
diff --git a/tests/transpose/transpose.mk b/tests/transpose/transpose.mk
index 4861c6a..2d0edac 100644
--- a/tests/transpose/transpose.mk
+++ b/tests/transpose/transpose.mk
@@ -9,7 +9,6 @@ TRANSPOSE_PLATFORMS += m85-an555
 
 # C sources required for this test
 TRANSPOSE_SOURCES += main.c
-TRANSPOSE_SOURCES += misc.c
 
 # Assembly sources required for this test
 TRANSPOSE_ASMS += ../../asm/auto/gather/transpose_u16x8_4x4.s

From 2cf68e8706a745486c2ad4c6729e3ca38c6c1ce4 Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Thu, 18 Jul 2024 14:46:06 +0800
Subject: [PATCH 23/32] remove more duplicated assembly

---
 ...inv_ntt_u32_33556993_28678040_incomplete.s | 0
 .../ntt_u32_33556993_28678040_incomplete.s | 0
 ..._u32_33556993_28678040_incomplete_double.s | 0
 tests/chunk/chunk.mk | 2 +-
 tests/chunk/chunk.s | 1112 --------
 tests/karatsuba/karatsuba.mk | 2 +-
 tests/karatsuba/karatsuba.s | 873 ------
 tests/poly/karatsuba.s | 873 ------
 tests/poly/poly.mk | 13 +-
 tests/poly/poly_u16_32.s | 1051 -------
 tests/poly/poly_u16_32_acc.s | 1075 -------
 ...inv_ntt_u32_33556993_28678040_incomplete.s | 2535 -----------------
 .../ntt_u32_33556993_28678040_incomplete.s | 2035 -------------
 ..._u32_33556993_28678040_incomplete_double.s | 2342 ---------------
 tests/saber/karatsuba.s | 873 ------
 tests/saber/saber.mk | 7 +-
 ...mul_32_anticyclic_karatsuba_fwd_mve_simd.s | 268 --
 .../poly_u16_mul_32_anticyclic_mve_simd.s | 274 --
 .../auto/poly_u16_mul_32_mve_simd.s | 386 ---
 tests/schoolbook/poly_u16_32_acc.s | 1062 -------
 tests/schoolbook/schoolbook.mk | 9 +-
 21 files changed, 18 insertions(+), 14774 deletions(-)
 rename {tests/poly/auto => asm/auto/saber}/inv_ntt_u32_33556993_28678040_incomplete.s (100%)
 rename {tests/poly/auto => asm/auto/saber}/ntt_u32_33556993_28678040_incomplete.s (100%)
 rename {tests/poly/auto => asm/auto/saber}/ntt_u32_33556993_28678040_incomplete_double.s (100%)
 delete mode 100644 tests/chunk/chunk.s
 delete mode 100644 tests/karatsuba/karatsuba.s
 delete mode 100644 tests/poly/karatsuba.s
 delete mode 100644 tests/poly/poly_u16_32.s
 delete mode 100644 tests/poly/poly_u16_32_acc.s
 delete mode 100644 tests/saber/auto/inv_ntt_u32_33556993_28678040_incomplete.s
 delete mode 100644 tests/saber/auto/ntt_u32_33556993_28678040_incomplete.s
 delete mode 100644 tests/saber/auto/ntt_u32_33556993_28678040_incomplete_double.s
 delete mode 100644 tests/saber/karatsuba.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_anticyclic_mve_simd.s
 delete mode 100644 tests/schoolbook/auto/poly_u16_mul_32_mve_simd.s
 delete mode 100644 tests/schoolbook/poly_u16_32_acc.s

diff --git a/tests/poly/auto/inv_ntt_u32_33556993_28678040_incomplete.s b/asm/auto/saber/inv_ntt_u32_33556993_28678040_incomplete.s
similarity index 100%
rename from tests/poly/auto/inv_ntt_u32_33556993_28678040_incomplete.s
rename to asm/auto/saber/inv_ntt_u32_33556993_28678040_incomplete.s
diff --git a/tests/poly/auto/ntt_u32_33556993_28678040_incomplete.s b/asm/auto/saber/ntt_u32_33556993_28678040_incomplete.s
similarity index 100%
rename from tests/poly/auto/ntt_u32_33556993_28678040_incomplete.s
rename to asm/auto/saber/ntt_u32_33556993_28678040_incomplete.s
diff --git a/tests/poly/auto/ntt_u32_33556993_28678040_incomplete_double.s b/asm/auto/saber/ntt_u32_33556993_28678040_incomplete_double.s
similarity index 100%
rename from tests/poly/auto/ntt_u32_33556993_28678040_incomplete_double.s
rename to asm/auto/saber/ntt_u32_33556993_28678040_incomplete_double.s
diff --git a/tests/chunk/chunk.mk b/tests/chunk/chunk.mk
index 7d3c441..a1cec22 100644
--- a/tests/chunk/chunk.mk
+++ b/tests/chunk/chunk.mk
@@ -11,5 +11,5 @@ CHUNK_PLATFORMS += m85-an555
 CHUNK_SOURCES += main.c
 
 # Assembly sources required for this test
-CHUNK_ASMS += chunk.s
+CHUNK_ASMS += ../../asm/manual/chunk/chunk.s
diff --git a/tests/chunk/chunk.s b/tests/chunk/chunk.s
deleted file mode 100644
index 9d8f2f2..0000000
--- a/tests/chunk/chunk.s
+++ /dev/null
@@ -1,1112 +0,0 @@
-/*
- * Copyright (c) 2021 Arm Limited
- * SPDX-License-Identifier: MIT
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- */
-
-#include "chunk_const.h"
-
-.syntax unified
-
-/*======================== Basic M4 version ============================ */
-
-.type radix11_reduce_x4_asm_m4, %function
-.global radix11_reduce_x4_asm_m4
-
-    src .req r0
-    cur .req r1
-    carry .req r2
-    mask .req r3
-    inner .req r4
-    outer .req r5
-
-/*-------------------------- Code ---------------------------------------*/
-
-radix11_reduce_x4_asm_m4:
-    push {r4,r5,lr}
-    mov outer, #4
-    mov mask, #0x7FF
-radix11_reduce_x4_asm_m4__outer_loop_start:
-    mov carry, #0
-    mov inner, #(SIZE/4)
-
-radix11_reduce_x4_asm_m4__inner_loop_start:
-    ldr cur, [src]
-    add carry, cur, carry, LSR #11
-    and cur, carry, mask
-    str cur, [src], #+4
-    subs inner, inner, #1
-    bne radix11_reduce_x4_asm_m4__inner_loop_start
-
-    add src, src, #4
-    subs outer, outer, #1
-    bne radix11_reduce_x4_asm_m4__outer_loop_start
-    pop {r4,r5,pc}
-
-/* ------------------------------------------------------------------------*/
-
-    .unreq src
-    .unreq carry
-    .unreq cur
-    .unreq mask
-    .unreq inner
-    .unreq outer
-
-/*======================== Basic M4_V2 version ============================ */
-
-.type radix11_reduce_x4_asm_m4_v2, %function
-.global radix11_reduce_x4_asm_m4_v2
-
-    src .req r0
-    cur .req r1
-    carry .req r2
-    mask .req r3
-    inner .req r4
- outer .req r5 - -/*-------------------------- Code ---------------------------------------*/ - -radix11_reduce_x4_asm_m4_v2: - push {r4,r5,lr} - mov outer, #4 - mov mask, #0x7FF -radix11_reduce_x4_asm_m4_v2__outer_loop_start: - mov carry, #0 - mov inner, #(SIZE/4/4) - -radix11_reduce_x4_asm_m4_v2__inner_loop_start: - ldr cur, [src] - add carry, cur, carry, LSR #11 - and cur, carry, mask - str cur, [src], #+4 - - ldr cur, [src] - add carry, cur, carry, LSR #11 - and cur, carry, mask - str cur, [src], #+4 - - ldr cur, [src] - add carry, cur, carry, LSR #11 - and cur, carry, mask - str cur, [src], #+4 - - ldr cur, [src] - add carry, cur, carry, LSR #11 - and cur, carry, mask - str cur, [src], #+4 - - subs inner, inner, #1 - bne radix11_reduce_x4_asm_m4_v2__inner_loop_start - - add src, src, #4 - subs outer, outer, #1 - bne radix11_reduce_x4_asm_m4_v2__outer_loop_start - pop {r4,r5,pc} - -/* ------------------------------------------------------------------------*/ - - .unreq src - .unreq carry - .unreq cur - .unreq mask - .unreq inner - .unreq outer - -/*======================== Basic M4_V3 version ============================ */ - -.type radix11_reduce_x4_asm_m4_v3, %function -.global radix11_reduce_x4_asm_m4_v3 - - src .req r0 - curA .req r1 - carry .req r2 - mask .req r3 - inner .req r4 - outer .req r5 - - curB .req r6 - curC .req r7 - curD .req r8 - -/*-------------------------- Code ---------------------------------------*/ - -radix11_reduce_x4_asm_m4_v3: - push {r4-r8, lr} - mov outer, #4 - mov mask, #0x7FF -radix11_reduce_x4_asm_m4_v3__outer_loop_start: - mov carry, #0 - mov inner, #(SIZE/4/4) - -radix11_reduce_x4_asm_m4_v3__inner_loop_start: - - ldr curA, [src, #0*4] - ldr curB, [src, #1*4] - ldr curC, [src, #2*4] - ldr curD, [src, #3*4] - - add carry, curA, carry, LSR #11 - and curA, carry, mask - add carry, curB, carry, LSR #11 - and curB, carry, mask - add carry, curC, carry, LSR #11 - and curC, carry, mask - add carry, curD, carry, LSR #11 - and curD, 
carry, mask - - str curB, [src, #1*4] - str curC, [src, #2*4] - str curD, [src, #3*4] - str curA, [src], #+16 - - subs inner, inner, #1 - bne radix11_reduce_x4_asm_m4_v3__inner_loop_start - - add src, src, #4 - subs outer, outer, #1 - bne radix11_reduce_x4_asm_m4_v3__outer_loop_start - pop {r4-r8,pc} - -/* ------------------------------------------------------------------------*/ - - .unreq src - .unreq carry - .unreq curA - .unreq curB - .unreq curC - .unreq curD - .unreq mask - .unreq inner - .unreq outer - -/*====================== Using low overhead loops======================== */ - -.type radix11_reduce_x4_asm_lob, %function -.global radix11_reduce_x4_asm_lob - - src .req r0 - cur .req r1 - carry .req r2 - mask .req r3 - outer .req r4 - inner .req r14 - -/* -------------------------- Code ----------------------------------------*/ - -radix11_reduce_x4_asm_lob: - push {r4,lr} - mov mask, #0x7FF - mov outer, #4 -radix11_reduce_x4_asm_lob__outer_loop_start: - mov inner, #(SIZE/4) - mov carry, #0 - wls inner, inner, radix11_reduce_x4_asm_lob__inner_loop_end - - .align 2 -radix11_reduce_x4_asm_lob__inner_loop_start: - ldr cur, [src] - add carry, cur, carry, LSR #11 - and cur, carry, mask - str cur, [src], #+4 - le inner, radix11_reduce_x4_asm_lob__inner_loop_start - -radix11_reduce_x4_asm_lob__inner_loop_end: - add src, src, #4 - subs outer, outer, #1 - bne radix11_reduce_x4_asm_lob__outer_loop_start - pop {r4,pc} - -/* ------------------------------------------------------------------------*/ - - .unreq src - .unreq carry - .unreq cur - .unreq mask - .unreq inner - .unreq outer - -/* ================== Leveraging 64-bit data path ======================- */ - -.type radix11_reduce_x4_asm_lob_64bit, %function -.global radix11_reduce_x4_asm_lob_64bit - - src .req r0 - curA .req r1 - curB .req r2 - carry .req r3 - mask .req r4 - outer .req r5 - inner .req r14 - -/* ------------------------- Code -----------------------------------------*/ - 
-radix11_reduce_x4_asm_lob_64bit: - push {r4,r5,lr} - mov mask, #0x7FF - mov outer, #4 -radix11_reduce_x4_asm_lob_64bit__outer_loop_start: - mov inner, #(SIZE/8) - mov carry, #0 - wls inner, inner, radix11_reduce_x4_lob_64bit__inner_loop_end - - .align 2 -radix11_reduce_x4_lob_64bit__inner_loop_start: - ldm src, {curA, curB} - add carry, curA, carry, LSR #11 - and curA, carry, mask - add carry, curB, carry, LSR #11 - and curB, carry, mask - stm src!, {curA, curB} - le inner, radix11_reduce_x4_lob_64bit__inner_loop_start - -radix11_reduce_x4_lob_64bit__inner_loop_end: - add src, src, #4 - subs outer, outer, #1 - bne radix11_reduce_x4_asm_lob_64bit__outer_loop_start - pop {r4,r5,pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq carry - .unreq curA - .unreq curB - .unreq mask - .unreq inner - .unreq outer - -/* ====================== Basic MVE version ============================- */ - -.type radix11_reduce_x4_asm_mve_basic, %function -.global radix11_reduce_x4_asm_mve_basic - - .align 4 -radix11_reduce_x4_asm_mve_basic_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - mask .req r1 - addr .req r2 - inner .req r14 - - offsets .req Q0 - cur .req Q1 - carry .req Q2 - qmask .req Q3 - -/* ------------------------- Code ---------------------------------------- */ - -radix11_reduce_x4_asm_mve_basic: - push {lr} - vpush {d0-d7} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_basic_offsets - mov mask, #0x7FF - vldrh.u32 offsets, [addr] - vdup.u32 qmask, mask - - /* Initialize loop */ - vmov.u32 carry, #0 - mov inner, #(SIZE/4) - wls inner, inner, radix11_reduce_x4_asm_mve_basic__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_basic__loop_start: - vldrw.u32 cur, [src, offsets] - vshr.u32 carry, carry, #11 - vadd.u32 carry, cur, carry - vand cur, carry, qmask - vstrw.32 cur, [src, offsets] - - add src, src, #4 - le inner, 
radix11_reduce_x4_asm_mve_basic__loop_start -radix11_reduce_x4_asm_mve_basic__loop_end: - - vpop {d0-d7} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq mask - .unreq addr - .unreq inner - - .unreq offsets - .unreq qmask - .unreq cur - .unreq carry - -/* ====================== Trading VADD for VMLA ============================ */ - -.type radix11_reduce_x4_asm_mve_vmla, %function -.global radix11_reduce_x4_asm_mve_vmla - - .align 4 -radix11_reduce_x4_asm_mve_vmla_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - mask .req r1 - addr .req r2 - const1 .req r3 - inner .req r14 - - offsets .req Q0 - cur .req Q1 - carry .req Q2 - qmask .req Q3 - -/* ------------------------- Code ------------------------------------------- */ - -radix11_reduce_x4_asm_mve_vmla: - push {lr} - vpush {d0-d7} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vmla_offsets - mov mask, #0x7FF - mov const1, #1 - vldrh.u32 offsets, [addr] - vdup.u32 qmask, mask - - /* Initialize loop */ - vmov.u32 carry, #0 - mov inner, #(SIZE/4) - wls inner, inner, radix11_reduce_x4_asm_mve_vmla__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vmla__loop_start: - vldrw.u32 cur, [src, offsets] - vshr.u32 carry, carry, #11 - vmla.s32 carry, cur, const1 - vand cur, carry, qmask - vstrw.32 cur, [src, offsets] - - add src, src, #4 - le inner, radix11_reduce_x4_asm_mve_vmla__loop_start -radix11_reduce_x4_asm_mve_vmla__loop_end: - - vpop {d0-d7} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq mask - .unreq const1 - .unreq addr - .unreq inner - - .unreq offsets - .unreq qmask - .unreq cur - .unreq carry - -/* ====================== Trading VADD for VMLA, version 2 ================= */ - -.type radix11_reduce_x4_asm_mve_vmla_v2, %function -.global radix11_reduce_x4_asm_mve_vmla_v2 - - .align 4 
-radix11_reduce_x4_asm_mve_vmla_v2_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - mask .req r1 - addr .req r2 - const1 .req r3 - const4 .req r4 - inner .req r14 - - qsrc .req Q0 - cur .req Q1 - carry .req Q2 - qmask .req Q3 - store .req Q4 - -/* ------------------------- Code ------------------------------------------- */ - -radix11_reduce_x4_asm_mve_vmla_v2: - push {r4,lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vmla_v2_offsets - mov mask, #0x7FF - mov const1, #1 - mov const4, #4 - vldrh.u32 qsrc, [addr] - vadd.u32 qsrc, qsrc, src - vdup.u32 qmask, mask - - /* Initialize loop */ - vmov.u32 carry, #0 - mov inner, #(SIZE/4) - wls inner, inner, radix11_reduce_x4_asm_mve_vmla_v2__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vmla_v2__loop_start: - vldrw.u32 cur, [qsrc] - vshr.u32 carry, carry, #11 - vmla.s32 carry, cur, const1 - vand cur, carry, qmask - vstrw.32 cur, [qsrc] - - vadd.u32 qsrc, qsrc, const4 - - le inner, radix11_reduce_x4_asm_mve_vmla_v2__loop_start -radix11_reduce_x4_asm_mve_vmla_v2__loop_end: - - vpop {d0-d9} - pop {r4,pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq mask - .unreq const1 - .unreq const4 - .unreq addr - .unreq inner - - .unreq qsrc - .unreq qmask - .unreq cur - .unreq store - .unreq carry - -/* ====================== Trading VADD for VMLA, version 3 ================= */ - -.type radix11_reduce_x4_asm_mve_vmla_v3, %function -.global radix11_reduce_x4_asm_mve_vmla_v3 - - .align 4 -radix11_reduce_x4_asm_mve_vmla_v3_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - mask .req r1 - addr .req r2 - const1 .req r3 - const8 .req r4 - inner .req r14 - - qsrc .req Q0 - cur .req Q1 - carry .req Q2 - qmask .req Q3 - store .req Q4 - -/* ------------------------- Code 
------------------------------------------- */ - -radix11_reduce_x4_asm_mve_vmla_v3: - push {r4-r5,lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vmla_v3_offsets - mov mask, #0x7FF - mov const1, #1 - mov const8, #8 - vldrh.u32 qsrc, [addr] - vadd.u32 qsrc, qsrc, src - vdup.u32 qmask, mask - - /* Initialize loop */ - mov inner, #(SIZE/8) - vmov.u32 carry, #0 - vldrw.u32 cur, [qsrc] - - wls inner, inner, radix11_reduce_x4_asm_mve_vmla_v3__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vmla_v3__loop_start: - vshr.u32 carry, carry, #11 - vmla.s32 carry, cur, const1 - vldrw.u32 cur, [qsrc, #4] - vand store, carry, qmask - vstrw.32 store, [qsrc] - - vshr.u32 carry, carry, #11 - vmla.s32 carry, cur, const1 - vldrw.u32 cur, [qsrc, #8] - vand store, carry, qmask - vstrw.32 store, [qsrc, #4] - - vadd.u32 qsrc, qsrc, const8 - - le inner, radix11_reduce_x4_asm_mve_vmla_v3__loop_start -radix11_reduce_x4_asm_mve_vmla_v3__loop_end: - - vpop {d0-d9} - pop {r4-r5,pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq mask - .unreq const1 - .unreq const8 - .unreq addr - .unreq inner - - .unreq qsrc - .unreq qmask - .unreq cur - .unreq store - .unreq carry - -/* ====================== Trading VADD for VMLA, version 4 ================= */ - -.type radix11_reduce_x4_asm_mve_vmla_v4, %function -.global radix11_reduce_x4_asm_mve_vmla_v4 - - .align 4 -radix11_reduce_x4_asm_mve_vmla_v4_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - mask .req r1 - addr .req r2 - const1 .req r3 - const8 .req r4 - const4 .req r5 - inner .req r14 - - qsrc .req Q0 - cur .req Q1 - carry .req Q2 - qmask .req Q3 - store .req Q4 - -/* ------------------------- Code ------------------------------------------- */ - -radix11_reduce_x4_asm_mve_vmla_v4: - push {r4-r5,lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, 
radix11_reduce_x4_asm_mve_vmla_v4_offsets - mov mask, #0x7FF - mov const1, #1 - mov const4, #4 - mov const8, #8 - vldrh.u32 qsrc, [addr] - vadd.u32 qsrc, qsrc, src - vdup.u32 qmask, mask - - /* Initialize loop */ - mov inner, #(SIZE/8) - vmov.u32 carry, #0 - vldrw.u32 cur, [qsrc] - - wls inner, inner, radix11_reduce_x4_asm_mve_vmla_v4__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vmla_v4__loop_start: - vshr.u32 carry, carry, #11 - vmla.s32 carry, cur, const1 - vldrw.u32 cur, [const4, qsrc] - vand store, carry, qmask - vstrw.32 store, [qsrc] - - vshr.u32 carry, carry, #11 - vmla.s32 carry, cur, const1 - vldrw.u32 cur, [const8, qsrc] - vand store, carry, qmask - vstrw.32 store, [const4, qsrc] - - vadd.u32 qsrc, qsrc, const8 - - le inner, radix11_reduce_x4_asm_mve_vmla_v4__loop_start -radix11_reduce_x4_asm_mve_vmla_v4__loop_end: - - vpop {d0-d9} - pop {r4-r5,pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq mask - .unreq const1 - .unreq const8 - .unreq addr - .unreq inner - - .unreq qsrc - .unreq qmask - .unreq cur - .unreq store - .unreq carry - -/* ================== Using VQDMLAH for shift+accumulate, v1 ==================== */ - -.type radix11_reduce_x4_asm_mve_vqdmlah, %function -.global radix11_reduce_x4_asm_mve_vqdmlah - - .align 4 -radix11_reduce_x4_asm_mve_vqdmlah_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - fpconst .req r1 - mask .req r2 - addr .req r3 - inner .req r14 - - offsets .req Q0 - curA .req Q1 - curB .req Q2 - store .req Q3 - qmask .req Q4 - -/* ----------------------------- Code --------------------------------------*/ - -radix11_reduce_x4_asm_mve_vqdmlah: - push {lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vqdmlah_offsets - mov fpconst, #(1 << (31 - 11)) - mov mask, #0x7FF - vldrh.u32 offsets, [addr] - vdup.u32 qmask, mask - - /* Iniitalize loop */ - vmov.u32 
curB, #0 - mov inner, #(SIZE/8) - wls inner, inner, radix11_reduce_x4_asm_mve_vqdmlah__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah__loop_start: - /* Need to unroll for carry handling */ - vldrw.u32 curA, [src, offsets] - vqdmlah.s32 curA, curB, fpconst - vand store, curA, qmask - vstrw.32 store, [src, offsets] - add src, src, #4 - - vldrw.u32 curB, [src, offsets] - vqdmlah.s32 curB, curA, fpconst - vand store, curB, qmask - vstrw.32 store, [src, offsets] - add src, src, #4 - - le inner, radix11_reduce_x4_asm_mve_vqdmlah__loop_start -radix11_reduce_x4_asm_mve_vqdmlah__loop_end: - - vpop {d0-d9} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq fpconst - .unreq mask - .unreq addr - .unreq inner - - .unreq offsets - .unreq curB - .unreq curA - .unreq store - .unreq qmask - - -/* ================== Using VQDMLAH for shift+accumulate, v3 ================= */ - -.type radix11_reduce_x4_asm_mve_vqdmlah_v3, %function -.global radix11_reduce_x4_asm_mve_vqdmlah_v3 - - .align 4 -radix11_reduce_x4_asm_mve_vqdmlah_v3_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - .short 0*4*(SIZE/4) + 4 - .short 1*4*(SIZE/4) + 1*4 + 4 - .short 2*4*(SIZE/4) + 2*4 + 4 - .short 3*4*(SIZE/4) + 3*4 + 4 - - src .req r0 - fpconst .req r1 - mask .req r2 - addr .req r3 - inner .req r14 - - offsetsA .req Q0 - offsetsB .req Q1 - curA .req Q2 - curB .req Q3 - store .req Q4 - qmask .req Q5 - -/* ----------------------------- Code --------------------------------------*/ - -radix11_reduce_x4_asm_mve_vqdmlah_v3: - push {lr} - vpush {d0-d11} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vqdmlah_v3_offsets - mov fpconst, #(1 << (31 - 11)) - mov mask, #0x7FF - vldrh.u32 offsetsA, [addr] - vldrh.u32 offsetsB, [addr, #8] - vdup.u32 qmask, mask - - /* Iniitalize loop */ - vmov.u32 curB, #0 - mov inner, #(SIZE/8) - wls inner, inner, 
radix11_reduce_x4_asm_mve_vqdmlah_v3__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah_v3__loop_start: - /* Need to unroll for carry handling */ - vldrw.u32 curA, [src, offsetsA] - vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [src, offsetsB] - vand store, curA, qmask - vqdmlah.s32 curB, curA, fpconst - vstrw.32 store, [src, offsetsA] - vand store, curB, qmask - vstrw.32 store, [src, offsetsB] - add.w src, src, #8 - - le inner, radix11_reduce_x4_asm_mve_vqdmlah_v3__loop_start -radix11_reduce_x4_asm_mve_vqdmlah_v3__loop_end: - - vpop {d0-d11} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq fpconst - .unreq mask - .unreq addr - .unreq inner - - .unreq offsetsA - .unreq offsetsB - .unreq curB - .unreq curA - .unreq store - .unreq qmask - -/* ================== Using VQDMLAH for shift+accumulate, v4 ================= */ - -.type radix11_reduce_x4_asm_mve_vqdmlah_v4, %function -.global radix11_reduce_x4_asm_mve_vqdmlah_v4 - - .align 4 -radix11_reduce_x4_asm_mve_vqdmlah_v4_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - fpconst .req r1 - mask .req r2 - addr .req r3 - inner .req r14 - - offsets .req Q0 - curA .req Q1 - curB .req Q2 - store .req Q3 - qmask .req Q4 - -/* ----------------------------- Code --------------------------------------*/ - -radix11_reduce_x4_asm_mve_vqdmlah_v4: - push {lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vqdmlah_v4_offsets - mov fpconst, #(1 << (31 - 11)) - mov mask, #0x7FF - vdup.u32 qmask, mask - vldrh.u32 offsets, [addr] - sub src, src, #8 - vadd.u32 offsets, offsets, src - - /* Iniitalize loop */ - vmov.u32 curB, #0 - mov inner, #(SIZE/8) - wls inner, inner, radix11_reduce_x4_asm_mve_vqdmlah_v4__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah_v4__loop_start: - vldrw.u32 curA, [offsets, #8]! 
- vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [offsets, #4] - vand store, curA, qmask - vqdmlah.s32 curB, curA, fpconst - vstrw.32 store, [offsets] - vand store, curB, qmask - vstrw.32 store, [offsets, #4] - - le inner, radix11_reduce_x4_asm_mve_vqdmlah_v4__loop_start -radix11_reduce_x4_asm_mve_vqdmlah_v4__loop_end: - - vpop {d0-d9} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq fpconst - .unreq mask - .unreq addr - .unreq const4 - .unreq inner - - .unreq offsets - .unreq store - .unreq curB - .unreq curA - .unreq qmask - -/* ================== Using VQDMLAH for shift+accumulate, v5 ================= */ - -.type radix11_reduce_x4_asm_mve_vqdmlah_v5, %function -.global radix11_reduce_x4_asm_mve_vqdmlah_v5 - - .align 4 -radix11_reduce_x4_asm_mve_vqdmlah_v5_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - fpconst .req r1 - mask .req r2 - addr .req r3 - inner .req r14 - - offsets .req Q0 - curA .req Q1 - curB .req Q2 - storeA .req Q3 - storeB .req Q4 - qmask .req Q5 - -/* ----------------------------- Code --------------------------------------*/ - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah_v5: - push {lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vqdmlah_v5_offsets - vldrh.u32 offsets, [addr] - vadd.u32 offsets, offsets, src - - mov mask, #0x7FF - vdup.u32 qmask, mask - mov fpconst, #(1 << (31 - 11)) - - /* Pull out first iteration */ - vldrw.u32 curA, [offsets] - vand storeA, curA, qmask - vldrw.u32 curB, [offsets, #4] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeA, [offsets] - vand storeB, curB, qmask - - mov inner, #((SIZE/8)-1) - wls inner, inner, radix11_reduce_x4_asm_mve_vqdmlah_v5__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah_v5__loop_start: - vldrw.u32 curA, [offsets, #8]! 
- vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [offsets, #4] - vand storeA, curA, qmask - vstrw.32 storeA, [offsets] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeB, [offsets, #-4] - vand storeB, curB, qmask - - le inner, radix11_reduce_x4_asm_mve_vqdmlah_v5__loop_start -radix11_reduce_x4_asm_mve_vqdmlah_v5__loop_end: - - vstrw.32 storeB, [offsets, #4] - - vpop {d0-d9} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq fpconst - .unreq mask - .unreq addr - .unreq inner - - .unreq offsets - .unreq storeA - .unreq storeB - .unreq curB - .unreq curA - .unreq qmask - -/* ================== Using VQDMLAH for shift+accumulate, v6 ================= */ - -.type radix11_reduce_x4_asm_mve_vqdmlah_v6, %function -.global radix11_reduce_x4_asm_mve_vqdmlah_v6 - - .align 4 -radix11_reduce_x4_asm_mve_vqdmlah_v6_offsets: - .short 0*4*(SIZE/4) - .short 1*4*(SIZE/4) + 1*4 - .short 2*4*(SIZE/4) + 2*4 - .short 3*4*(SIZE/4) + 3*4 - - src .req r0 - fpconst .req r1 - mask .req r2 - addr .req r3 - inner .req r14 - - offsets .req Q0 - curA .req Q1 - curB .req Q2 - storeA .req Q3 - storeB .req Q4 - qmask .req Q5 - -/* ----------------------------- Code --------------------------------------*/ - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah_v6: - push {lr} - vpush {d0-d9} - - /* Load constants */ - adr addr, radix11_reduce_x4_asm_mve_vqdmlah_v6_offsets - vldrh.u32 offsets, [addr] - vadd.u32 offsets, offsets, src - - mov mask, #0x7FF - vdup.u32 qmask, mask - mov fpconst, #(1 << (31 - 11)) - - /* Pull out first iteration */ - vldrw.u32 curA, [offsets] - vand storeA, curA, qmask - vldrw.u32 curB, [offsets, #4] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeA, [offsets] - vand storeB, curB, qmask - - vldrw.u32 curA, [offsets, #8] - vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [offsets, #12] - vand storeA, curA, qmask - vstrw.32 storeA, [offsets,#8] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeB, [offsets, #4] - 
vand storeB, curB, qmask - - mov inner, #((SIZE/16)-1) - wls inner, inner, radix11_reduce_x4_asm_mve_vqdmlah_v6__loop_end - - .align 2 -radix11_reduce_x4_asm_mve_vqdmlah_v6__loop_start: - vldrw.u32 curA, [offsets, #16]! - vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [offsets, #4] - vand storeA, curA, qmask - vstrw.32 storeA, [offsets] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeB, [offsets, #-4] - vand storeB, curB, qmask - - vldrw.u32 curA, [offsets, #8] - vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [offsets, #12] - vand storeA, curA, qmask - vstrw.32 storeA, [offsets, #8] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeB, [offsets, #4] - vand storeB, curB, qmask - - le inner, radix11_reduce_x4_asm_mve_vqdmlah_v6__loop_start -radix11_reduce_x4_asm_mve_vqdmlah_v6__loop_end: - - vldrw.u32 curA, [offsets, #16]! - vqdmlah.s32 curA, curB, fpconst - vldrw.u32 curB, [offsets, #4] - vand storeA, curA, qmask - vstrw.32 storeA, [offsets] - vqdmlah.s32 curB, curA, fpconst - vstrw.32 storeB, [offsets, #-4] - vand storeB, curB, qmask - - vstrw.32 storeB, [offsets, #4] - - vpop {d0-d9} - pop {pc} - -/* ---------------------------------------------------------------------- */ - - .unreq src - .unreq fpconst - .unreq mask - .unreq addr - .unreq inner - - .unreq offsets - .unreq storeA - .unreq storeB - .unreq curB - .unreq curA - .unreq qmask - -/* ====================================================================== */ diff --git a/tests/karatsuba/karatsuba.mk b/tests/karatsuba/karatsuba.mk index eca2226..8cb9321 100644 --- a/tests/karatsuba/karatsuba.mk +++ b/tests/karatsuba/karatsuba.mk @@ -11,5 +11,5 @@ KARATSUBA_PLATFORMS += m85-an555 KARATSUBA_SOURCES += main.c # Assembly sources required for this test -KARATSUBA_ASMS += karatsuba.s +KARATSUBA_ASMS += ../../asm/manual/karatsuba/karatsuba.s diff --git a/tests/karatsuba/karatsuba.s b/tests/karatsuba/karatsuba.s deleted file mode 100644 index 98785db..0000000 --- a/tests/karatsuba/karatsuba.s +++ /dev/null 
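All of the removed `radix11_reduce_x4_asm_*` variants above compute the same carry propagation: each 32-bit limb absorbs the top bits of the previous limb's accumulator and keeps its low 11 bits, over four chunks with one skipped word between them (`add src, src, #4`). A plain-C sketch of that reference behavior, with a hypothetical function name and a stand-in `SIZE` (the real value comes from `chunk_const.h`):

```c
#include <stdint.h>

#define SIZE 352            /* stand-in; chunk_const.h defines the real value */
#define CHUNK (SIZE / 4)    /* limbs per chunk, as in the inner-loop counts   */

/* Reference model of radix11_reduce_x4_asm_*: bring every limb into
 * [0, 2^11) by propagating carries. Consecutive chunks are CHUNK + 1
 * words apart, matching the "add src, src, #4" between chunk passes. */
static void radix11_reduce_x4_ref(uint32_t *src)
{
    for (int chunk = 0; chunk < 4; chunk++) {
        uint32_t *p = src + chunk * (CHUNK + 1);
        uint32_t carry = 0;
        for (int i = 0; i < CHUNK; i++) {
            carry = p[i] + (carry >> 11);   /* add incoming carry      */
            p[i] = carry & 0x7FF;           /* keep the low 11 bits    */
        }
    }
}
```

The MVE versions vectorize this across the four chunks at once (the `.short i*4*(SIZE/4) + i*4` offset tables address one limb of each chunk per vector), which is why the carry never crosses a chunk boundary.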
@@ -1,873 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - */ - -#include "karatsuba_const.h" - -.syntax unified - -/* Template: - * - * .type FUNCNAME, %function - * .global FUNCNAME - * FUNCNAME: - * push {r4-r12,lr} - * vpush {d0-d15} - * - * foo .req r0 - * .unreq bar - * - * vpop {d0-d15} - * pop {r4-r12,lr} - * bx lr - */ - -/* - * Karatsuba evaluation - */ - -.type karatsuba_fwd_dual_32_loop, %function -.global karatsuba_fwd_dual_32_loop -karatsuba_fwd_dual_32_loop: - push {r4-r12,lr} - vpush {d0-d15} - - src .req r0 - dst .req r1 - carry .req r12 - - #define VECTOR_SIZE 16 - #define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - #define NUM_LIMBS 3 - - even_a .req q0 - odd_a .req q1 - sum_a .req q2 - - even_b .req q3 - odd_b .req q4 - sum_b .req q5 - - loop_cnt .req r14 - mov loop_cnt, #(KARATSUBA_FWD_ITERATIONS-2) - - /* First iteration */ - #define OFFSET 0 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0 - #error Unexpected offset - #endif - vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE) - #define SHIFT (-NUM_LIMBS*LIMB_BYTE_SIZE) - - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 1 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! 
-        vadd.u16 sum_b, even_b, odd_b
-        vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 2
-        vld21.s16 {even_b,odd_b}, [src]
-        vld20.s16 {even_b,odd_b}, [src]!
-        vadd.u16 sum_a, even_a, odd_a
-        vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 3
-        vld21.s16 {even_a,odd_a}, [src]
-        vld20.s16 {even_a,odd_a}, [src]!
-        vadd.u16 sum_b, even_b, odd_b
-        vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        wls loop_cnt, loop_cnt, karatsuba_fwd_dual_32_loop_end
-karatsuba_fwd_dual_32_loop_start:
-
-        #define OFFSET 0
-        vld21.s16 {even_b,odd_b}, [src]
-        vld20.s16 {even_b,odd_b}, [src]!
-        vadd.u16 sum_a, even_a, odd_a
-        #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-        #error Unexpected offset
-        #endif
-        vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE)
-        vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 1
-        vld21.s16 {even_a,odd_a}, [src]
-        vld20.s16 {even_a,odd_a}, [src]!
-        vadd.u16 sum_b, even_b, odd_b
-        vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 2
-        vld21.s16 {even_b,odd_b}, [src]
-        vld20.s16 {even_b,odd_b}, [src]!
-        vadd.u16 sum_a, even_a, odd_a
-        vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 3
-        vld21.s16 {even_a,odd_a}, [src]
-        vld20.s16 {even_a,odd_a}, [src]!
-        vadd.u16 sum_b, even_b, odd_b
-        vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        le loop_cnt, karatsuba_fwd_dual_32_loop_start
-karatsuba_fwd_dual_32_loop_end:
-
-        /* Last iteration */
-
-        #define OFFSET 0
-        vld21.s16 {even_b,odd_b}, [src]
-        vld20.s16 {even_b,odd_b}, [src]!
-        vadd.u16 sum_a, even_a, odd_a
-        #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-        #error Unexpected offset
-        #endif
-        vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE)
-        vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 1
-        vld21.s16 {even_a,odd_a}, [src]
-        vld20.s16 {even_a,odd_a}, [src]!
-        vadd.u16 sum_b, even_b, odd_b
-        vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 2
-        vld21.s16 {even_b,odd_b}, [src]
-        vld20.s16 {even_b,odd_b}, [src]!
-        vadd.u16 sum_a, even_a, odd_a
-        vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        #define OFFSET 3
-        vadd.u16 sum_b, even_b, odd_b
-        vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-        #undef OFFSET
-
-        .unreq even_a
-        .unreq odd_a
-        .unreq sum_a
-        .unreq even_b
-        .unreq odd_b
-        .unreq sum_b
-
-        .unreq src
-        .unreq dst
-        .unreq loop_cnt
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-/*
- * Karatsuba interpolation
- */
-
-.type karatsuba_naive_inv_dual_32, %function
-.global karatsuba_naive_inv_dual_32
-karatsuba_naive_inv_dual_32:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        src .req r0
-        dst .req r1
-        carry .req r12
-
-        #define LIMB_BYTE_SIZE 64
-        #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2)
-
-        #define EVEN_INDEX 0
-        #define ODD_INDEX 1
-        #define SUM_INDEX 2
-
-        mov carry, #0
-
-        even_even .req q0
-        sum_even .req q1
-        even_odd .req q2
-        sum_odd .req q3
-        odd_even .req q4
-        odd_odd .req q5
-
-        vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)]
-        vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)]
-        vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 sum_even, sum_even, odd_even
-        vsub.u16 sum_odd, sum_odd, odd_odd
-        vsub.u16 sum_even, sum_even, even_even
-        vsub.u16 sum_odd, sum_odd, even_odd
-
-        vadd.u16 even_odd, even_odd, odd_even
-        vshlc odd_odd, carry, #16
-        vadd.u16 even_even, even_even, odd_odd
-
-        vst40.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]
-        vst41.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]
-        vst42.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]
-        vst43.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]!
-
-        .unreq even_even
-        .unreq even_odd
-        .unreq odd_even
-        .unreq odd_odd
-        .unreq sum_even
-        .unreq sum_odd
-
-        add src, src, #16
-
-        even_even .req q0
-        sum_even .req q1
-        even_odd .req q2
-        sum_odd .req q3
-        odd_even .req q4
-        odd_odd .req q5
-
-        vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)]
-        vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)]
-        vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 sum_even, sum_even, odd_even
-        vsub.u16 sum_odd, sum_odd, odd_odd
-        vsub.u16 sum_even, sum_even, even_even
-        vsub.u16 sum_odd, sum_odd, even_odd
-
-        vadd.u16 even_odd, even_odd, odd_even
-        vshlc odd_odd, carry, #16
-        vadd.u16 even_even, even_even, odd_odd
-
-        vst40.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]
-        vst41.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]
-        vst42.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]
-        vst43.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]!
-
-        carry_correct .req r11
-        ldrh carry_correct, [dst, #(-128)]
-        sub carry_correct, carry_correct, carry
-        strh carry_correct, [dst, #(-128)]
-        .unreq carry_correct
-
-        .unreq even_even
-        .unreq even_odd
-        .unreq odd_even
-        .unreq odd_odd
-        .unreq sum_even
-        .unreq sum_odd
-
-        .unreq src
-        .unreq dst
-        .unreq carry
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-/* Slightly pipeline optimized version of Karatsuba interpolation. */
-
-.type karatsuba_inv_dual_32, %function
-.global karatsuba_inv_dual_32
-karatsuba_inv_dual_32:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        src .req r0
-        dst .req r1
-        carry .req r12
-
-        #define LIMB_BYTE_SIZE 64
-        #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2)
-
-        #define EVEN_INDEX 0
-        #define ODD_INDEX 1
-        #define SUM_INDEX 2
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        sum_even .req q4 // alloc q4
-        vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        mov carry, #0
-
-        odd_even .req q6 // alloc q4, q6
-        vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q4, q5, q6
-        vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        add src, src, #16
-        odd_even .req q6 // alloc q4, q5, q6
-        vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)]
-
-        vshlc odd_odd, carry, #16
-
-        sum_even .req q7 // alloc q4, q5, q6, q7
-        vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5, q6, q7
-        .unreq even_even // alloc q6, q7
-
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        // sum_even already preloaded
-        // odd_even already preloaded
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q5, q6, q7
-        vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        vshlc odd_odd, carry, #16
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5
-        .unreq even_even // alloc --
-
-        // Correction of initial coefficient after we know the wraparound
-        carry_correct .req r11
-        ldrh carry_correct, [dst, #(-64)]
-
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        sub carry_correct, carry_correct, carry
-        strh carry_correct, [dst, #(-64)]
-        .unreq carry_correct
-
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-/* Slightly pipelined and looping version of Karatsuba interpolation. */
-
-.type karatsuba_inv_dual_32_loop, %function
-.global karatsuba_inv_dual_32_loop
-karatsuba_inv_dual_32_loop:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        src .req r0
-        dst .req r1
-        carry .req r12
-
-        carry_correct .req r11
-
-        #define LIMB_BYTE_SIZE 64
-        #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2)
-        #define NUM_LIMBS 3
-        #define TOTAL_SIZE_BYTES (NUM_LIMBS*LIMB_BYTE_SIZE)
-
-        #define EVEN_INDEX 0
-        #define ODD_INDEX 1
-        #define SUM_INDEX 2
-
-        /* INITIAL ITERATION */
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        #define SHIFT 0
-
-        sum_even .req q7 // alloc q7
-        vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        mov carry, #0
-
-        odd_even .req q6 // alloc q6, q7
-        vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q5, q6, q7
-
-        #if SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-        #error Unexepected offset
-        #endif
-
-        vldrh.u16 even_even, [src], #(TOTAL_SIZE_BYTES) //[src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-
-        #undef SHIFT
-        #define SHIFT (-TOTAL_SIZE_BYTES)
-
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        #undef SHIFT
-        #define SHIFT (16 - TOTAL_SIZE_BYTES)
-
-        odd_even .req q6 // alloc q4, q5, q6
-        vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-
-        vshlc odd_odd, carry, #16
-
-        sum_even .req q7 // alloc q4, q5, q6, q7
-        vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5, q6, q7
-        .unreq even_even // alloc q6, q7
-
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vmov.u16 carry_correct, f_even_even[0]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        // sum_even already preloaded
-        // odd_even already preloaded
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q5, q6, q7
-        vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        #undef SHIFT
-        #define SHIFT 0
-
-        // Preload for next iteration
-        odd_even .req q6 // alloc q4, q5, q6
-        vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-
-        vshlc odd_odd, carry, #16
-
-        // Preload for next iteration
-        sum_even .req q7 // alloc q4, q5, q6, q7
-        vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5, q6, q7
-        .unreq even_even // alloc q6, q7
-
-        // Correction of initial coefficient after we know the wraparound
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        sub carry_correct, carry_correct, carry
-        strh carry_correct, [dst, #(-64)]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        mov carry, #0
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        /* LOOP */
-
-        loop_cnt .req r14
-        mov loop_cnt, #(KARATSUBA_INV_ITERATIONS-2)
-
-        wls loop_cnt, loop_cnt, karatsuba_inv_dual_32_loop_end
-
-karatsuba_inv_dual_32_loop_start:
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        // sum_even and odd_even preloaded
-        vsub.u16 sum_even, sum_even, odd_even
-
-        #if SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-        #error Unexepected offset
-        #endif
-
-        even_even .req q5 // alloc q7, q5, q6
-        vldrh.u16 even_even, [src], #TOTAL_SIZE_BYTES // [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-
-        #undef SHIFT
-        #define SHIFT (-TOTAL_SIZE_BYTES)
-
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        #undef SHIFT
-        #define SHIFT (16-TOTAL_SIZE_BYTES)
-
-        odd_even .req q6 // alloc q4, q5, q6
-        vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-
-        vshlc odd_odd, carry, #16
-
-        sum_even .req q7 // alloc q4, q5, q6, q7
-        vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5, q6, q7
-        .unreq even_even // alloc q6, q7
-
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vmov.u16 carry_correct, f_even_even[0]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        // sum_even already preloaded
-        // odd_even already preloaded
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q5, q6, q7
-        vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        // Preload for next iteration
-        odd_even .req q6 // alloc q4, q5, q6
-        vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)]
-
-        vshlc odd_odd, carry, #16
-
-        // Preload for next iteration
-        sum_even .req q7 // alloc q4, q5, q6, q7
-        vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5
-        .unreq even_even // alloc --
-
-        // Correction of initial coefficient after we know the wraparound
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        sub carry_correct, carry_correct, carry
-        strh carry_correct, [dst, #(-64)]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        mov carry, #0
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        le loop_cnt, karatsuba_inv_dual_32_loop_start
-
-karatsuba_inv_dual_32_loop_end:
-
-        /* LAST ITERATION */
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        // sum_even and odd_even preloaded
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q4, q5, q6
-        vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        #undef SHIFT
-        #define SHIFT 16
-
-        odd_even .req q6 // alloc q4, q5, q6
-        vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-
-        vshlc odd_odd, carry, #16
-
-        sum_even .req q7 // alloc q4, q5, q6, q7
-        vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5, q6, q7
-        .unreq even_even // alloc q6, q7
-
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vmov.u16 carry_correct, f_even_even[0]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        f_even_even .req q0
-        f_sum_even .req q1
-        f_even_odd .req q2
-        f_sum_odd .req q3
-
-        // sum_even already preloaded
-        // odd_even already preloaded
-        vsub.u16 sum_even, sum_even, odd_even
-
-        even_even .req q5 // alloc q5, q6, q7
-        vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-        vsub.u16 f_sum_even, sum_even, even_even
-        .unreq sum_even // alloc q5, q6
-
-        even_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vadd.u16 f_even_odd, even_odd, odd_even
-        .unreq odd_even // alloc q4, q5
-
-        sum_odd .req q6 // alloc q4, q5, q6
-        vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-        vsub.u16 sum_odd, sum_odd, even_odd
-        .unreq even_odd // alloc q5, q6
-
-        odd_odd .req q4 // alloc q4, q5, q6
-        vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-        vsub.u16 f_sum_odd, sum_odd, odd_odd
-        .unreq sum_odd // alloc q4, q5
-
-        vshlc odd_odd, carry, #16
-
-        vadd.u16 f_even_even, even_even, odd_odd
-        .unreq odd_odd // alloc q5
-        .unreq even_even // alloc --
-
-        vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        sub carry_correct, carry_correct, carry
-        strh carry_correct, [dst, #(-64)]
-        vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-        vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-        .unreq f_even_even
-        .unreq f_even_odd
-        .unreq f_sum_even
-        .unreq f_sum_odd
-
-        .unreq src
-        .unreq dst
-        .unreq carry
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
diff --git a/tests/poly/karatsuba.s b/tests/poly/karatsuba.s
deleted file mode 100644
index 98785db..0000000
--- a/tests/poly/karatsuba.s
+++ /dev/null
@@ -1,873 +0,0 @@
-/*
- * Copyright (c) 2021 Arm Limited
- * SPDX-License-Identifier: MIT
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- * - */ - -#include "karatsuba_const.h" - -.syntax unified - -/* Template: - * - * .type FUNCNAME, %function - * .global FUNCNAME - * FUNCNAME: - * push {r4-r12,lr} - * vpush {d0-d15} - * - * foo .req r0 - * .unreq bar - * - * vpop {d0-d15} - * pop {r4-r12,lr} - * bx lr - */ - -/* - * Karatsuba evaluation - */ - -.type karatsuba_fwd_dual_32_loop, %function -.global karatsuba_fwd_dual_32_loop -karatsuba_fwd_dual_32_loop: - push {r4-r12,lr} - vpush {d0-d15} - - src .req r0 - dst .req r1 - carry .req r12 - - #define VECTOR_SIZE 16 - #define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - #define NUM_LIMBS 3 - - even_a .req q0 - odd_a .req q1 - sum_a .req q2 - - even_b .req q3 - odd_b .req q4 - sum_b .req q5 - - loop_cnt .req r14 - mov loop_cnt, #(KARATSUBA_FWD_ITERATIONS-2) - - /* First iteration */ - #define OFFSET 0 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0 - #error Unexpected offset - #endif - vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE) - #define SHIFT (-NUM_LIMBS*LIMB_BYTE_SIZE) - - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 1 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! 
- vadd.u16 sum_b, even_b, odd_b - vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 2 - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 3 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! - vadd.u16 sum_b, even_b, odd_b - vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - wls loop_cnt, loop_cnt, karatsuba_fwd_dual_32_loop_end -karatsuba_fwd_dual_32_loop_start: - - #define OFFSET 0 - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0 - #error Unexpected offset - #endif - vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE) - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 1 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! 
- vadd.u16 sum_b, even_b, odd_b - vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 2 - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 3 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! - vadd.u16 sum_b, even_b, odd_b - vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - le loop_cnt, karatsuba_fwd_dual_32_loop_start -karatsuba_fwd_dual_32_loop_end: - - /* Last iteration */ - - #define OFFSET 0 - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0 - #error Unexpected offset - #endif - vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE) - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 1 - vld21.s16 {even_a,odd_a}, [src] - vld20.s16 {even_a,odd_a}, [src]! 
- vadd.u16 sum_b, even_b, odd_b - vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 2 - vld21.s16 {even_b,odd_b}, [src] - vld20.s16 {even_b,odd_b}, [src]! - vadd.u16 sum_a, even_a, odd_a - vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - #define OFFSET 3 - vadd.u16 sum_b, even_b, odd_b - vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)] - vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)] - #undef OFFSET - - .unreq even_a - .unreq odd_a - .unreq sum_a - .unreq even_b - .unreq odd_b - .unreq sum_b - - .unreq src - .unreq dst - .unreq loop_cnt - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -/* - * Karatsuba interpolation - */ - -.type karatsuba_naive_inv_dual_32, %function -.global karatsuba_naive_inv_dual_32 -karatsuba_naive_inv_dual_32: - push {r4-r12,lr} - vpush {d0-d15} - - src .req r0 - dst .req r1 - carry .req r12 - - #define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - - mov carry, #0 - - even_even .req q0 - sum_even .req q1 - even_odd .req q2 - sum_odd .req q3 - odd_even .req q4 - odd_odd .req q5 - - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 odd_odd, [src, #(ODD_INDEX * 
LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_even, sum_even, odd_even - vsub.u16 sum_odd, sum_odd, odd_odd - vsub.u16 sum_even, sum_even, even_even - vsub.u16 sum_odd, sum_odd, even_odd - - vadd.u16 even_odd, even_odd, odd_even - vshlc odd_odd, carry, #16 - vadd.u16 even_even, even_even, odd_odd - - vst40.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst41.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst42.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst43.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]! - - .unreq even_even - .unreq even_odd - .unreq odd_even - .unreq odd_odd - .unreq sum_even - .unreq sum_odd - - add src, src, #16 - - even_even .req q0 - sum_even .req q1 - even_odd .req q2 - sum_odd .req q3 - odd_even .req q4 - odd_odd .req q5 - - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_even, sum_even, odd_even - vsub.u16 sum_odd, sum_odd, odd_odd - vsub.u16 sum_even, sum_even, even_even - vsub.u16 sum_odd, sum_odd, even_odd - - vadd.u16 even_odd, even_odd, odd_even - vshlc odd_odd, carry, #16 - vadd.u16 even_even, even_even, odd_odd - - vst40.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst41.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst42.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst43.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]! 
- - carry_correct .req r11 - ldrh carry_correct, [dst, #(-128)] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-128)] - .unreq carry_correct - - .unreq even_even - .unreq even_odd - .unreq odd_even - .unreq odd_odd - .unreq sum_even - .unreq sum_odd - - .unreq src - .unreq dst - .unreq carry - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -/* Slightly pipeline optimized version of Karatsuba interpolation. */ - -.type karatsuba_inv_dual_32, %function -.global karatsuba_inv_dual_32 -karatsuba_inv_dual_32: - push {r4-r12,lr} - vpush {d0-d15} - - src .req r0 - dst .req r1 - carry .req r12 - - #define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - sum_even .req q4 // alloc q4 - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - - mov carry, #0 - - odd_even .req q6 // alloc q4, q6 - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q4, q5, q6 - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - add src, src, #16 - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc 
odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - vshlc odd_odd, carry, #16 - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5 - .unreq even_even // alloc -- - - // Correction of initial coefficient after we know the wraparound - carry_correct .req r11 - ldrh carry_correct, [dst, #(-64)] - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - .unreq carry_correct - 
vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-    .unreq f_even_even
-    .unreq f_even_odd
-    .unreq f_sum_even
-    .unreq f_sum_odd
-
-    vpop {d0-d15}
-    pop {r4-r12,lr}
-    bx lr
-
-/* Slightly pipelined and looping version of Karatsuba interpolation. */
-
-.type karatsuba_inv_dual_32_loop, %function
-.global karatsuba_inv_dual_32_loop
-karatsuba_inv_dual_32_loop:
-    push {r4-r12,lr}
-    vpush {d0-d15}
-
-    src .req r0
-    dst .req r1
-    carry .req r12
-
-    carry_correct .req r11
-
-    #define LIMB_BYTE_SIZE 64
-    #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2)
-    #define NUM_LIMBS 3
-    #define TOTAL_SIZE_BYTES (NUM_LIMBS*LIMB_BYTE_SIZE)
-
-    #define EVEN_INDEX 0
-    #define ODD_INDEX 1
-    #define SUM_INDEX 2
-
-    /* INITIAL ITERATION */
-
-    f_even_even .req q0
-    f_sum_even .req q1
-    f_even_odd .req q2
-    f_sum_odd .req q3
-
-    #define SHIFT 0
-
-    sum_even .req q7 // alloc q7
-    vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-    mov carry, #0
-
-    odd_even .req q6 // alloc q6, q7
-    vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vsub.u16 sum_even, sum_even, odd_even
-
-    even_even .req q5 // alloc q5, q6, q7
-
-    #if SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-    #error Unexpected offset
-    #endif
-
-    vldrh.u16 even_even, [src], #(TOTAL_SIZE_BYTES) //[src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vsub.u16 f_sum_even, sum_even, even_even
-
-    #undef SHIFT
-    #define SHIFT (-TOTAL_SIZE_BYTES)
-
-    .unreq sum_even // alloc q5, q6
-
-    even_odd .req q4 // alloc q4, q5, q6
-    vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-    vadd.u16 f_even_odd, even_odd, odd_even
-    .unreq odd_even // alloc q4, q5
-
-    sum_odd .req q6 // alloc q4, q5, q6
-    vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-    vsub.u16 sum_odd, sum_odd, even_odd
-    .unreq even_odd // alloc q5, q6
-
-    odd_odd .req q4 // alloc q4, q5, q6
-    vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-    vsub.u16 f_sum_odd, sum_odd, odd_odd
-    .unreq sum_odd // alloc q4, q5
-
-    #undef SHIFT
-    #define SHIFT (16 - TOTAL_SIZE_BYTES)
-
-    odd_even .req q6 // alloc q4, q5, q6
-    vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-
-    vshlc odd_odd, carry, #16
-
-    sum_even .req q7 // alloc q4, q5, q6, q7
-    vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-    vadd.u16 f_even_even, even_even, odd_odd
-    .unreq odd_odd // alloc q5, q6, q7
-    .unreq even_even // alloc q6, q7
-
-    vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    vmov.u16 carry_correct, f_even_even[0]
-    vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-    .unreq f_even_even
-    .unreq f_even_odd
-    .unreq f_sum_even
-    .unreq f_sum_odd
-
-    f_even_even .req q0
-    f_sum_even .req q1
-    f_even_odd .req q2
-    f_sum_odd .req q3
-
-    // sum_even already preloaded
-    // odd_even already preloaded
-    vsub.u16 sum_even, sum_even, odd_even
-
-    even_even .req q5 // alloc q5, q6, q7
-    vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vsub.u16 f_sum_even, sum_even, even_even
-    .unreq sum_even // alloc q5, q6
-
-    even_odd .req q4 // alloc q4, q5, q6
-    vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-    vadd.u16 f_even_odd, even_odd, odd_even
-    .unreq odd_even // alloc q4, q5
-
-    sum_odd .req q6 // alloc q4, q5, q6
-    vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-    vsub.u16 sum_odd, sum_odd, even_odd
-    .unreq even_odd // alloc q5, q6
-
-    odd_odd .req q4 // alloc q4, q5, q6
-    vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-    vsub.u16 f_sum_odd, sum_odd, odd_odd
-    .unreq sum_odd // alloc q4, q5
-
-    #undef SHIFT
-    #define SHIFT 0
-
-    // Preload for next iteration
-    odd_even .req q6 // alloc q4, q5, q6
-    vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)]
-
-    vshlc odd_odd, carry, #16
-
-    // Preload for next iteration
-    sum_even .req q7 // alloc q4, q5, q6, q7
-    vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)]
-
-    vadd.u16 f_even_even, even_even, odd_odd
-    .unreq odd_odd // alloc q5, q6, q7
-    .unreq even_even // alloc q6, q7
-
-    // Correction of initial coefficient after we know the wraparound
-    vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    sub carry_correct, carry_correct, carry
-    strh carry_correct, [dst, #(-64)]
-    vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    mov carry, #0
-    vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]
-    vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]!
-    .unreq f_even_even
-    .unreq f_even_odd
-    .unreq f_sum_even
-    .unreq f_sum_odd
-
-    /* LOOP */
-
-    loop_cnt .req r14
-    mov loop_cnt, #(KARATSUBA_INV_ITERATIONS-2)
-
-    wls loop_cnt, loop_cnt, karatsuba_inv_dual_32_loop_end
-
-karatsuba_inv_dual_32_loop_start:
-
-    f_even_even .req q0
-    f_sum_even .req q1
-    f_even_odd .req q2
-    f_sum_odd .req q3
-
-    // sum_even and odd_even preloaded
-    vsub.u16 sum_even, sum_even, odd_even
-
-    #if SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-    #error Unexpected offset
-    #endif
-
-    even_even .req q5 // alloc q7, q5, q6
-    vldrh.u16 even_even, [src], #TOTAL_SIZE_BYTES // [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vsub.u16 f_sum_even, sum_even, even_even
-
-    #undef SHIFT
-    #define SHIFT (-TOTAL_SIZE_BYTES)
-
-    .unreq sum_even // alloc q5, q6
-
-    even_odd .req q4 // alloc q4, q5, q6
-    vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)]
-
-    vadd.u16 f_even_odd, even_odd, odd_even
-    .unreq odd_even // alloc q4, q5
-
-    sum_odd .req q6 // alloc q4, q5, q6
-    vldrh.u16 sum_odd, [src, #(SHIFT
+ SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - #undef SHIFT - #define SHIFT (16-TOTAL_SIZE_BYTES) - - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vmov.u16 carry_correct, f_even_even[0] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! 
- .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - // Preload for next iteration - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - // Preload for next iteration - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5 - .unreq even_even // alloc -- - - // Correction of initial coefficient after we know the wraparound - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - mov carry, #0 - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! 
- .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - le loop_cnt, karatsuba_inv_dual_32_loop_start - -karatsuba_inv_dual_32_loop_end: - - /* LAST ITERATION */ - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even and odd_even preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q4, q5, q6 - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - #undef SHIFT - #define SHIFT 16 - - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vmov.u16 carry_correct, f_even_even[0] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! 
- .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - vshlc odd_odd, carry, #16 - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5 - .unreq even_even // alloc -- - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! 
-    .unreq f_even_even
-    .unreq f_even_odd
-    .unreq f_sum_even
-    .unreq f_sum_odd
-
-    .unreq src
-    .unreq dst
-    .unreq carry
-
-    vpop {d0-d15}
-    pop {r4-r12,lr}
-    bx lr
diff --git a/tests/poly/poly.mk b/tests/poly/poly.mk
index 2fe77a7..d7175ab 100644
--- a/tests/poly/poly.mk
+++ b/tests/poly/poly.mk
@@ -13,13 +13,14 @@ POLY_SOURCES += main.c
 # Assembly sources required for this test
 POLY_ASM_DIR = ./auto
 POLY_ASMS += montgomery.s
-POLY_ASMS += karatsuba.s
-POLY_ASMS += poly_u16_32.s
-POLY_ASMS += poly_u16_32_acc.s
-POLY_ASMS += $(POLY_ASM_DIR)/inv_ntt_u32_33556993_28678040_incomplete.s
-POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_incomplete_double.s
-POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_incomplete.s
+POLY_ASMS += ../../asm/manual/karatsuba/karatsuba.s
+POLY_ASMS += ../../asm/manual/schoolbook/poly_u16_32.s
+POLY_ASMS += ../../asm/manual/schoolbook/poly_u16_32_acc.s
 POLY_ASMS += $(POLY_ASM_DIR)/inv_ntt_u32_33556993_28678040_complete.s
 POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_complete.s
 POLY_ASMS += $(POLY_ASM_DIR)/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s
 POLY_ASMS += $(POLY_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s
+SABER_ASM_DIR = ../../asm/auto/saber
+POLY_ASMS += $(SABER_ASM_DIR)/inv_ntt_u32_33556993_28678040_incomplete.s
+POLY_ASMS += $(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete.s
+POLY_ASMS += $(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete_double.s
\ No newline at end of file
diff --git a/tests/poly/poly_u16_32.s b/tests/poly/poly_u16_32.s
deleted file mode 100644
index e239604..0000000
--- a/tests/poly/poly_u16_32.s
+++ /dev/null
@@ -1,1051 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_mve_simd_handshuffle, %function -.global poly_u16_mul_32_anticyclic_karatsuba_mve_simd_handshuffle -poly_u16_mul_32_anticyclic_karatsuba_mve_simd_handshuffle: -push {r4-r11,lr} -vpush {d0-d15} -vld20.u16 {Q4, Q5}, [r2] -sub sp, sp, #224 -vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! -vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -mov r12, #0 -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! 
-ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! 
-//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, 
#16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -mov r14, #19 -wls r14, r14, loop_end -loop_start: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! 
-vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! 
-vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 
-vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! 
-vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -le r14, loop_start -loop_end: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! -vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! 
-vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 
-vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, 
r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -// vld20.u16 {Q4, Q5}, [r2] -// vld21.u16 {Q4, Q5}, [r2]! -// vld20.u16 {Q6, Q7}, [r2] -// vld21.u16 {Q6, Q7}, [r2]! -// vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -// vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -// mov r12, #0 -// mov r11, sp -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r1, #24] -// vmul.u16 Q2, Q4, r9 -// ldrd r8, r7, [r1, #56] -// vmul.u16 Q3, Q4, r7 -// vneg.s16 Q7, Q6 -// vmla.s16 Q2, Q7, r7 -// ldrd r6, r5, [r1, #16] -// vmla.s16 Q3, Q6, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r8 -// ldrd r9, r7, [r1, #48] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r10, r8, [r1, #8] -// vmla.s16 Q3, Q6, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r9 -// ldrd r7, r5, [r1, #40] -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r9, r6, [r1, #0] -// vmla.s16 Q3, Q6, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r8, r5, [r1, #32] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q6, r9 -// vstrh.u16 Q3, [r11,#(144)] -// vsub.u16 Q2, Q2, Q5 -// vmla.s16 Q2, Q7, r8 -// vstrh.u16 Q2, [r11,#(128)] -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 
{Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 {Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -// vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r11, #-104] -// vmul.u16 Q2, Q0, r10 -// ldrd r8, r7, [r11, #-40] -// vmul.u16 Q3, Q0, r8 -// vneg.s16 Q7, Q1 -// vmla.s16 Q2, Q7, r8 -// ldrd r6, r5, [r11, #-112] -// ldrd r4, r3, [r11, #-48] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r10, r8, [r11, #-120] -// vmla.s16 Q3, Q1, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// ldrd r5, r3, [r11, #-56] -// vmla.s16 Q3, Q1, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r6, r4, [r11, #-64] -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r8, r3, [r11, #-128] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r3 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// vmla.s16 Q3, Q1, r3 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r6 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r6 -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// neg r7, r7 -// vmla.s16 Q2, Q0, r7 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q1, r7 -// vsub.u16 Q2, Q2, 
Q5 -// vmla.s16 Q2, Q7, r9 -// vadd.u16 Q4, Q4, Q0 -// vldrh.u16 Q5, [r11,#0] -// vadd.u16 Q5, Q5, Q2 -// vldrh.u16 Q7, [r11,#16] -// vadd.u16 Q7, Q7, Q3 -// vstrh.u16 Q5, [r0, #0] -// vstrh.u16 Q7, [r0, #16] -// vadd.u16 Q6, Q6, Q1 -// vneg.s16 Q3, Q3 -// vmov.u16 Q0, #0 -// mov r12, #0 -// ldrd r10, r9, [r11, #-72] -// vmla.s16 Q3, Q4, r9 -// ldrd r8, r7, [r11, #-8] -// vmla.s16 Q2, Q4, r7 -// vneg.s16 Q1, Q6 -// vmla.s16 Q3, Q1, r7 -// ldrd r6, r5, [r11, #-80] -// vmla.s16 Q2, Q6, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r8 -// ldrd r9, r7, [r11, #-16] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r10, r8, [r11, #-88] -// vmla.s16 Q2, Q6, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r9 -// ldrd r7, r5, [r11, #-24] -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// ldrd r9, r6, [r11, #-96] -// vmla.s16 Q2, Q6, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r8, r5, [r11, #-32] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q2, Q6, r9 -// vsub.u16 Q3, Q3, Q0 -// vmla.s16 Q3, Q1, r8 -// vldrh.u16 Q0, [r11,#0] -// vldrh.u16 Q1, [r11,#16] -// vsub.u16 Q0, Q3, Q0 -// vsub.u16 Q1, Q2, Q1 -// vstrh.u16 Q0, [r0, #32] -// vstrh.u16 Q1, [r0, #48] - -add sp, sp, #224 
-vpop {d0-d15} -pop {r4-r11,lr} -bx lr diff --git a/tests/poly/poly_u16_32_acc.s b/tests/poly/poly_u16_32_acc.s deleted file mode 100644 index da30566..0000000 --- a/tests/poly/poly_u16_32_acc.s +++ /dev/null @@ -1,1075 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd_handshuffle, %function -.global poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd_handshuffle -poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd_handshuffle: -push {r4-r11,lr} -vpush {d0-d15} -vld20.u16 {Q4, Q5}, [r2] -sub sp, sp, #224 -vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! 
-vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -mov r12, #0 -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! 
-vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 
-vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! 
-vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -mov r14, #19 -wls r14, r14, loop_end -loop_start: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! -vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! 
-vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! 
-//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, 
[r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -le r14, loop_start -loop_end: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! 
-vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! 
-vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 
-vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] 
-vsub.u16 Q0, Q3, Q0 -vldrh.u16 Q6, [r0, #32] -vadd.u16 Q0, Q6, Q0 -vldrh.u16 Q6, [r0, #48] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q0, [r0, #32] -vadd.u16 Q1, Q6, Q1 -vstrh.u16 Q1, [r0, #48] - -// vld20.u16 {Q4, Q5}, [r2] -// vld21.u16 {Q4, Q5}, [r2]! -// vld20.u16 {Q6, Q7}, [r2] -// vld21.u16 {Q6, Q7}, [r2]! -// vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -// vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -// mov r12, #0 -// mov r11, sp -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r1, #24] -// vmul.u16 Q2, Q4, r9 -// ldrd r8, r7, [r1, #56] -// vmul.u16 Q3, Q4, r7 -// vneg.s16 Q7, Q6 -// vmla.s16 Q2, Q7, r7 -// ldrd r6, r5, [r1, #16] -// vmla.s16 Q3, Q6, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r8 -// ldrd r9, r7, [r1, #48] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r10, r8, [r1, #8] -// vmla.s16 Q3, Q6, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r9 -// ldrd r7, r5, [r1, #40] -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r9, r6, [r1, #0] -// vmla.s16 Q3, Q6, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r8, r5, [r1, #32] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q6, r9 -// vstrh.u16 Q3, [r11,#(144)] -// vsub.u16 Q2, Q2, Q5 -// vmla.s16 Q2, Q7, r8 -// 
vstrh.u16 Q2, [r11,#(128)] -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 {Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 {Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -// vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r11, #-104] -// vmul.u16 Q2, Q0, r10 -// ldrd r8, r7, [r11, #-40] -// vmul.u16 Q3, Q0, r8 -// vneg.s16 Q7, Q1 -// vmla.s16 Q2, Q7, r8 -// ldrd r6, r5, [r11, #-112] -// ldrd r4, r3, [r11, #-48] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r10, r8, [r11, #-120] -// vmla.s16 Q3, Q1, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// ldrd r5, r3, [r11, #-56] -// vmla.s16 Q3, Q1, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r6, r4, [r11, #-64] -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r8, r3, [r11, #-128] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r3 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// vmla.s16 Q3, Q1, r3 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r6 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r6 -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// neg r7, r7 -// vmla.s16 Q2, Q0, r7 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r9 
-// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q1, r7 -// vsub.u16 Q2, Q2, Q5 -// vmla.s16 Q2, Q7, r9 -// vadd.u16 Q4, Q4, Q0 -// vldrh.u16 Q5, [r11,#0] -// vadd.u16 Q5, Q5, Q2 -// vldrh.u16 Q7, [r11,#16] -// vadd.u16 Q7, Q7, Q3 -// vstrh.u16 Q5, [r0, #0] -// vstrh.u16 Q7, [r0, #16] -// vadd.u16 Q6, Q6, Q1 -// vneg.s16 Q3, Q3 -// vmov.u16 Q0, #0 -// mov r12, #0 -// ldrd r10, r9, [r11, #-72] -// vmla.s16 Q3, Q4, r9 -// ldrd r8, r7, [r11, #-8] -// vmla.s16 Q2, Q4, r7 -// vneg.s16 Q1, Q6 -// vmla.s16 Q3, Q1, r7 -// ldrd r6, r5, [r11, #-80] -// vmla.s16 Q2, Q6, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r8 -// ldrd r9, r7, [r11, #-16] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r10, r8, [r11, #-88] -// vmla.s16 Q2, Q6, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r9 -// ldrd r7, r5, [r11, #-24] -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// ldrd r9, r6, [r11, #-96] -// vmla.s16 Q2, Q6, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r8, r5, [r11, #-32] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q2, Q6, r9 -// vsub.u16 Q3, Q3, Q0 -// vmla.s16 Q3, Q1, r8 -// vldrh.u16 Q0, [r11,#0] -// vldrh.u16 Q1, [r11,#16] -// vsub.u16 Q0, Q3, Q0 -// vsub.u16 Q1, Q2, Q1 -// 
vstrh.u16 Q0, [r0, #32]
-// vstrh.u16 Q1, [r0, #48]
-
-add sp, sp, #224
-vpop {d0-d15}
-pop {r4-r11,lr}
-bx lr
diff --git a/tests/saber/auto/inv_ntt_u32_33556993_28678040_incomplete.s b/tests/saber/auto/inv_ntt_u32_33556993_28678040_incomplete.s
deleted file mode 100644
index a65f906..0000000
--- a/tests/saber/auto/inv_ntt_u32_33556993_28678040_incomplete.s
+++ /dev/null
@@ -1,2535 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -roots_inv: -.word 57730785 // zeta^504 * 2^31 = 28678040^504 * 2^31 = 25085703 * 2^31 -.word 3752846111 // zeta^504 * f(q^(-1) mod 2^32) * 2^31 = 28678040^504 * 375649793 * 2^31 -.word 42601623 // zeta^508 * 2^31 = 28678040^508 * 2^31 = 32762154 * 2^31 -.word 2096617833 // zeta^508 * f(q^(-1) mod 2^32) * 2^31 = 28678040^508 * 375649793 * 2^31 -.word 43352521 // zeta^380 * 2^31 = 28678040^380 * 2^31 = 24111249 * 2^31 -.word 3690485815 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 28678040^380 * 375649793 * 2^31 -.word 59392861 // zeta^376 * 2^31 = 28678040^376 * 2^31 = 5443354 * 2^31 -.word 348348067 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 28678040^376 * 375649793 * 2^31 -.word 65052633 // zeta^444 * 2^31 = 28678040^444 * 2^31 = 11430609 * 2^31 -.word 2878986791 // zeta^444 * f(q^(-1) mod 2^32) * 2^31 = 28678040^444 * 375649793 * 2^31 -.word 58217677 // zeta^316 * 2^31 = 28678040^316 * 2^31 = 29824921 * 2^31 -.word 4056132915 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 28678040^316 * 375649793 * 2^31 -.word 57130935 // zeta^440 * 2^31 = 28678040^440 * 2^31 = 28470806 * 2^31 -.word 1821992521 // zeta^440 * f(q^(-1) mod 2^32) * 2^31 = 28678040^440 * 375649793 * 2^31 -.word 14439459 // zeta^476 * 2^31 = 28678040^476 * 2^31 = 15403199 * 2^31 -.word 3133213149 // zeta^476 * f(q^(-1) mod 2^32) * 2^31 = 28678040^476 * 375649793 * 2^31 -.word 30030779 // zeta^348 * 2^31 = 28678040^348 * 2^31 = 32900632 * 2^31 -.word 2105479749 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 28678040^348 * 375649793 * 2^31 -.word 3784291 // zeta^312 * 2^31 = 28678040^312 * 2^31 = 25309194 * 2^31 -.word 1619664797 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 28678040^312 * 375649793 * 2^31 -.word 48646815 // zeta^412 * 2^31 = 28678040^412 * 2^31 = 11510556 * 2^31 -.word 736619361 // zeta^412 * f(q^(-1) mod 2^32) * 2^31 = 28678040^412 * 375649793 * 2^31 -.word 15892551 // zeta^284 * 2^31 = 28678040^284 * 2^31 = 17389126 * 2^31 -.word 1112819129 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^284 * 375649793 * 2^31 -.word 50479773 // zeta^472 * 2^31 = 28678040^472 * 2^31 = 4264131 * 2^31 -.word 2420367203 // zeta^472 * f(q^(-1) mod 2^32) * 2^31 = 28678040^472 * 375649793 * 2^31 -.word 20532335 // zeta^492 * 2^31 = 28678040^492 * 2^31 = 22651623 * 2^31 -.word 3597076881 // zeta^492 * f(q^(-1) mod 2^32) * 2^31 = 28678040^492 * 375649793 * 2^31 -.word 46242673 // zeta^364 * 2^31 = 28678040^364 * 2^31 = 8172970 * 2^31 -.word 523030159 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 28678040^364 * 375649793 * 2^31 -.word 58797193 // zeta^344 * 2^31 = 28678040^344 * 2^31 = 24307701 * 2^31 -.word 3703057783 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 28678040^344 * 375649793 * 2^31 -.word 34903951 // zeta^428 * 2^31 = 28678040^428 * 2^31 = 20443666 * 2^31 -.word 1308294769 // zeta^428 * f(q^(-1) mod 2^32) * 2^31 = 28678040^428 * 375649793 * 2^31 -.word 48022295 // zeta^300 * 2^31 = 28678040^300 * 2^31 = 28778784 * 2^31 -.word 1841701609 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 28678040^300 * 375649793 * 2^31 -.word 62080381 // zeta^408 * 2^31 = 28678040^408 * 2^31 = 6865022 * 2^31 -.word 439327875 // zeta^408 * f(q^(-1) mod 2^32) * 2^31 = 28678040^408 * 375649793 * 2^31 -.word 55892463 // zeta^460 * 2^31 = 28678040^460 * 2^31 = 8866965 * 2^31 -.word 2714926097 // zeta^460 * f(q^(-1) mod 2^32) * 2^31 = 28678040^460 * 375649793 * 2^31 -.word 5286953 // zeta^332 * 2^31 = 28678040^332 * 2^31 = 25271104 * 2^31 -.word 1617227223 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 28678040^332 * 375649793 * 2^31 -.word 40872659 // zeta^280 * 2^31 = 28678040^280 * 2^31 = 32984098 * 2^31 -.word 2110821165 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 28678040^280 * 375649793 * 2^31 -.word 42133307 // zeta^396 * 2^31 = 28678040^396 * 2^31 = 14019017 * 2^31 -.word 3044632261 // zeta^396 * f(q^(-1) mod 2^32) * 2^31 = 28678040^396 * 375649793 * 2^31 -.word 54343827 // zeta^268 * 2^31 = 28678040^268 * 2^31 = 9843973 * 2^31 -.word 2777449837 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^268 * 375649793 * 2^31 -.word 6014597 // zeta^488 * 2^31 = 28678040^488 * 2^31 = 7194579 * 2^31 -.word 2607901563 // zeta^488 * f(q^(-1) mod 2^32) * 2^31 = 28678040^488 * 375649793 * 2^31 -.word 25291403 // zeta^500 * 2^31 = 28678040^500 * 2^31 = 355881 * 2^31 -.word 2170258293 // zeta^500 * f(q^(-1) mod 2^32) * 2^31 = 28678040^500 * 375649793 * 2^31 -.word 14166063 // zeta^372 * 2^31 = 28678040^372 * 2^31 = 13728463 * 2^31 -.word 3026038225 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 28678040^372 * 375649793 * 2^31 -.word 31380141 // zeta^360 * 2^31 = 28678040^360 * 2^31 = 2302061 * 2^31 -.word 2294804307 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 28678040^360 * 375649793 * 2^31 -.word 31709009 // zeta^436 * 2^31 = 28678040^436 * 2^31 = 21728197 * 2^31 -.word 3537982127 // zeta^436 * f(q^(-1) mod 2^32) * 2^31 = 28678040^436 * 375649793 * 2^31 -.word 12550399 // zeta^308 * 2^31 = 28678040^308 * 2^31 = 11713874 * 2^31 -.word 749630721 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 28678040^308 * 375649793 * 2^31 -.word 21111903 // zeta^424 * 2^31 = 28678040^424 * 2^31 = 13908588 * 2^31 -.word 890081697 // zeta^424 * f(q^(-1) mod 2^32) * 2^31 = 28678040^424 * 375649793 * 2^31 -.word 65984707 // zeta^468 * 2^31 = 28678040^468 * 2^31 = 25787077 * 2^31 -.word 3797730621 // zeta^468 * f(q^(-1) mod 2^32) * 2^31 = 28678040^468 * 375649793 * 2^31 -.word 52266271 // zeta^340 * 2^31 = 28678040^340 * 2^31 = 31977548 * 2^31 -.word 2046406881 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 28678040^340 * 375649793 * 2^31 -.word 12778219 // zeta^296 * 2^31 = 28678040^296 * 2^31 = 27066590 * 2^31 -.word 1732129557 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 28678040^296 * 375649793 * 2^31 -.word 39517177 // zeta^404 * 2^31 = 28678040^404 * 2^31 = 14739293 * 2^31 -.word 3090726407 // zeta^404 * f(q^(-1) mod 2^32) * 2^31 = 28678040^404 * 375649793 * 2^31 -.word 12656259 // zeta^276 * 2^31 = 28678040^276 * 2^31 = 24450888 * 2^31 -.word 1564737405 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^276 * 375649793 * 2^31 -.word 56722355 // zeta^456 * 2^31 = 28678040^456 * 2^31 = 31418183 * 2^31 -.word 4158093901 // zeta^456 * f(q^(-1) mod 2^32) * 2^31 = 28678040^456 * 375649793 * 2^31 -.word 27185869 // zeta^484 * 2^31 = 28678040^484 * 2^31 = 15870328 * 2^31 -.word 1015623475 // zeta^484 * f(q^(-1) mod 2^32) * 2^31 = 28678040^484 * 375649793 * 2^31 -.word 14750755 // zeta^356 * 2^31 = 28678040^356 * 2^31 = 27851125 * 2^31 -.word 3929819613 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 28678040^356 * 375649793 * 2^31 -.word 65797823 // zeta^328 * 2^31 = 28678040^328 * 2^31 = 18723698 * 2^31 -.word 1198225217 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 28678040^328 * 375649793 * 2^31 -.word 13164949 // zeta^420 * 2^31 = 28678040^420 * 2^31 = 28267567 * 2^31 -.word 3956469867 // zeta^420 * f(q^(-1) mod 2^32) * 2^31 = 28678040^420 * 375649793 * 2^31 -.word 1145583 // zeta^292 * 2^31 = 28678040^292 * 2^31 = 8225248 * 2^31 -.word 526375697 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 28678040^292 * 375649793 * 2^31 -.word 12271567 // zeta^392 * 2^31 = 28678040^392 * 2^31 = 6528331 * 2^31 -.word 2565264945 // zeta^392 * f(q^(-1) mod 2^32) * 2^31 = 28678040^392 * 375649793 * 2^31 -.word 22449375 // zeta^452 * 2^31 = 28678040^452 * 2^31 = 12336210 * 2^31 -.word 789457185 // zeta^452 * f(q^(-1) mod 2^32) * 2^31 = 28678040^452 * 375649793 * 2^31 -.word 31982975 // zeta^324 * 2^31 = 28678040^324 * 2^31 = 33215913 * 2^31 -.word 4273139841 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 28678040^324 * 375649793 * 2^31 -.word 35394733 // zeta^264 * 2^31 = 28678040^264 * 2^31 = 9731484 * 2^31 -.word 622767443 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 28678040^264 * 375649793 * 2^31 -.word 23998611 // zeta^388 * 2^31 = 28678040^388 * 2^31 = 12857867 * 2^31 -.word 2970324333 // zeta^388 * f(q^(-1) mod 2^32) * 2^31 = 28678040^388 * 375649793 * 2^31 -.word 62038423 // zeta^260 * 2^31 = 28678040^260 * 2^31 = 24546403 * 2^31 -.word 3718333545 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^260 * 375649793 * 2^31 -.word 32686385 // zeta^480 * 2^31 = 28678040^480 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^480 * f(q^(-1) mod 2^32) * 2^31 = 28678040^480 * 375649793 * 2^31 -.word 58757463 // zeta^496 * 2^31 = 28678040^496 * 2^31 = 17352831 * 2^31 -.word 3257980073 // zeta^496 * f(q^(-1) mod 2^32) * 2^31 = 28678040^496 * 375649793 * 2^31 -.word 41196349 // zeta^368 * 2^31 = 28678040^368 * 2^31 = 10953311 * 2^31 -.word 2848442051 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 28678040^368 * 375649793 * 2^31 -.word 2430825 // zeta^352 * 2^31 = 28678040^352 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 28678040^352 * 375649793 * 2^31 -.word 26613973 // zeta^432 * 2^31 = 28678040^432 * 2^31 = 23642097 * 2^31 -.word 3660462379 // zeta^432 * f(q^(-1) mod 2^32) * 2^31 = 28678040^432 * 375649793 * 2^31 -.word 7832335 // zeta^304 * 2^31 = 28678040^304 * 2^31 = 12267508 * 2^31 -.word 785060593 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 28678040^304 * 375649793 * 2^31 -.word 62228979 // zeta^416 * 2^31 = 28678040^416 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^416 * f(q^(-1) mod 2^32) * 2^31 = 28678040^416 * 375649793 * 2^31 -.word 12542317 // zeta^464 * 2^31 = 28678040^464 * 2^31 = 3271804 * 2^31 -.word 209379475 // zeta^464 * f(q^(-1) mod 2^32) * 2^31 = 28678040^464 * 375649793 * 2^31 -.word 18302687 // zeta^336 * 2^31 = 28678040^336 * 2^31 = 3819232 * 2^31 -.word 244412193 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 28678040^336 * 375649793 * 2^31 -.word 48515911 // zeta^288 * 2^31 = 28678040^288 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 28678040^288 * 375649793 * 2^31 -.word 21796399 // zeta^400 * 2^31 = 28678040^400 * 2^31 = 18930340 * 2^31 -.word 1211449297 // zeta^400 * f(q^(-1) mod 2^32) * 2^31 = 28678040^400 * 375649793 * 2^31 -.word 27114239 // zeta^272 * 2^31 = 28678040^272 * 2^31 = 13128918 * 2^31 -.word 840186625 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^272 * 375649793 * 2^31 -.word 36501331 // zeta^384 * 2^31 = 28678040^384 * 2^31 = 15854702 * 2^31 -.word 17843885 // zeta^384 * f(q^(-1) mod 2^32) * 2^31 = 28678040^384 * 375649793 * 2^31 -.word 23796181 // zeta^448 * 2^31 = 28678040^448 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^448 * f(q^(-1) mod 2^32) * 2^31 = 28678040^448 * 375649793 * 2^31 -.word 52637069 // zeta^320 * 2^31 = 28678040^320 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 28678040^320 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots_inv -.syntax unified -.type inv_ntt_u32_33556993_28678040_incomplete, %function -.global inv_ntt_u32_33556993_28678040_incomplete -inv_ntt_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #16] -vsub.s32 Q0, Q2, Q3 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #32] -vadd.s32 Q2, Q2, Q3 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #48] -vqrdmulh.s32 Q3, Q0, r8 -vsub.s32 Q1, Q4, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q4, Q4, Q5 -vqrdmlah.s32 Q3, Q0, r12 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q2, Q4 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q5, Q1, r12 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #64] -vqrdmulh.s32 Q4, Q0, r10 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #80] -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q4, Q0, r12 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vmul.u32 Q1, Q1, r9 
-vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[16]: Already loaded as Q6 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #96] -vadd.s32 Q6, Q6, Q7 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q6, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #128] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[36]: Load as Q5 -vldrw.u32 Q5, [r0, #144] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[32]: Already loaded as Q4 -// input[36]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #160] -vadd.s32 Q4, Q4, Q5 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #176] -vqrdmulh.s32 Q5, Q0, r8 -vsub.s32 Q1, Q2, Q6 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q6 -vqrdmlah.s32 Q5, Q0, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vqrdmulh.s32 Q6, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q6, Q1, r12 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #192] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q5, Q6 -vmul.u32 Q0, Q0, r9 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #208] -vadd.s32 Q5, Q5, Q6 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q6, Q1, r10 
-vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q5, [r0,#(144)] -// Release input[36] from Q5 -vqrdmlah.s32 Q6, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q3 -// input[52]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vadd.s32 Q3, Q3, Q7 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #240] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q6, [r0,#(176)] -// Release input[44] from Q6 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #256] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q6 -vldrw.u32 Q6, [r0, #272] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(208)] -// Release input[52] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q5 -// input[68]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #288] -vadd.s32 Q5, Q5, Q6 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #304] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(240)] -// Release input[60] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, 
r12 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(256)] -// Release input[64] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(272)] -// Release input[68] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[80]: Already loaded as Q4 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #352] -vadd.s32 Q4, Q4, Q7 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #368] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q6 -vldrw.u32 Q6, [r0, #400] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[96]: Already loaded as Q3 -// input[100]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #416] -vadd.s32 Q3, Q3, Q6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(368)] -// Release input[92] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #448] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// 
input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(400)] -// Release input[100] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vadd.s32 Q5, Q5, Q7 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #496] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q4 -vldrw.u32 Q4, [r14, #-496] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #-480] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(448)] -// Release input[112] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(464)] -// Release input[116] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q4 -// input[132]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vadd.s32 Q4, Q4, Q6 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #-448] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(496)] -// Release input[124] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// 
input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #-416] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-496)] -// Release input[128] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Already loaded as Q3 -// input[148]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #-400] -vadd.s32 Q3, Q3, Q7 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #-352] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-416)] -// Release input[148] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[160]: Already loaded as Q5 -// input[164]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #-336] -vadd.s32 Q5, Q5, Q6 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(-384)] -// Release 
input[156] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #-304] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[180]: Load as Q7 -vldrw.u32 Q7, [r14, #-288] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-352)] -// Release input[164] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q4 -// input[180]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #-272] -vadd.s32 Q4, Q4, Q7 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #-240] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[196]: Load as Q6 -vldrw.u32 Q6, [r14, #-224] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-304)] -// Release input[176] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-288)] -// Release input[180] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Already loaded as Q3 -// input[196]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #-208] -vadd.s32 Q3, Q3, Q6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #-192] -vqrdmulh.s32 Q6, 
Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #-176] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #-160] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-224)] -// Release input[196] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[208]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #-144] -vadd.s32 Q5, Q5, Q7 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #-112] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #-96] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-176)] -// Release input[208] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-160)] -// Release input[212] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[224]: Already loaded as Q4 -// input[228]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// 
input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vadd.s32 Q4, Q4, Q6 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #-64] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #-48] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[244]: Load as Q7 -vldrw.u32 Q7, [r14, #-32] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-96)] -// Release input[228] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q3 -// input[244]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #-16] -vadd.s32 Q3, Q3, Q7 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-64)] -// Release input[236] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[0]: Load as Q5 -vldrw.u32 Q5, [r0, #0] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #64] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-32)] -// Release input[244] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, 
r5, [r11], #+8 -// input[0]: Already loaded as Q5 -// input[16]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vadd.s32 Q5, Q5, Q6 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #192] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(0)] -// Release input[252] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #80] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(0)] -// Release input[0] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(64)] -// Release input[16] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[4]: Already loaded as Q4 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #144] -vadd.s32 Q4, Q4, Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #208] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #96] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(144)] -// Release input[36] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(80)] -// Release input[20] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[8]: Already loaded as Q3 -// input[24]: Already loaded as 
Q6 -vsub.s32 Q0, Q3, Q6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #160] -vadd.s32 Q3, Q3, Q6 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #224] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(208)] -// Release input[52] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #48] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #112] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(96)] -// Release input[24] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[12]: Already loaded as Q5 -// input[28]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vadd.s32 Q5, Q5, Q7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #240] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r0,#(224)] -// Release input[56] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q4 -vldrw.u32 Q4, [r0, #256] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #320] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(112)] -// Release input[28] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[64]: Already loaded as Q4 -// input[80]: Already loaded 
as Q6 -vsub.s32 Q0, Q4, Q6 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vadd.s32 Q4, Q4, Q6 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #448] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(256)] -// Release input[64] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(320)] -// Release input[80] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -// input[68]: Already loaded as Q3 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #400] -vadd.s32 Q3, Q3, Q7 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r0,#(448)] -// Release input[112] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[72]: Load as Q5 -vldrw.u32 Q5, [r0, #288] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #352] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(400)] -// Release input[100] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(336)] -// Release input[84] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -// input[72]: Already loaded as Q5 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[104]: Load as Q2 
-vldrw.u32 Q2, [r0, #416] -vadd.s32 Q5, Q5, Q6 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #368] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -// Release input[104] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r0,#(288)] -// Release input[72] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r0,#(352)] -// Release input[88] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[76]: Already loaded as Q4 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #432] -vadd.s32 Q4, Q4, Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #496] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #-496] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q6 -vldrw.u32 Q6, [r14, #-432] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r0,#(368)] -// Release input[92] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[128]: Already loaded as Q3 -// input[144]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 
-// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #-368] -vadd.s32 Q3, Q3, Q6 -// input[176]: Load as Q4 -vldrw.u32 Q4, [r14, #-304] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r0,#(496)] -// Release input[124] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[132]: Load as Q5 -vldrw.u32 Q5, [r14, #-480] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #-416] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-432)] -// Release input[144] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[132]: Already loaded as Q5 -// input[148]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #-352] -vadd.s32 Q5, Q5, Q7 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-304)] -// Release input[176] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #-464] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #-400] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-480)] -// Release input[132] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-416)] -// Release input[148] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -// input[136]: Already loaded as Q4 -// input[152]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -// input[168]: 
Load as Q2 -vldrw.u32 Q2, [r14, #-336] -vadd.s32 Q4, Q4, Q6 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q5 -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #-384] -vadd.s32 Q6, Q6, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -// Release input[168] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-400)] -// Release input[152] from Q6 -vqrdmlah.s32 Q5, Q1, r12 -// input[140]: Already loaded as Q3 -// input[156]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #-320] -vadd.s32 Q3, Q3, Q7 -// input[188]: Load as Q4 -vldrw.u32 Q4, [r14, #-256] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[192]: Load as Q5 -vldrw.u32 Q5, [r14, #-240] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q4 -vmul.u32 Q0, Q0, r9 -// input[208]: Load as Q6 -vldrw.u32 Q6, [r14, #-176] -vadd.s32 Q7, Q7, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-384)] -// Release input[156] from Q7 -vqrdmlah.s32 Q4, Q1, r12 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Already loaded as Q5 -// 
input[208]: Already loaded as Q6 -vsub.s32 Q0, Q5, Q6 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #-112] -vadd.s32 Q5, Q5, Q6 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q4, [r14,#(-256)] -// Release input[188] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q3 -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #-160] -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-240)] -// Release input[192] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-176)] -// Release input[208] from Q6 -vqrdmlah.s32 Q3, Q1, r12 -// input[196]: Already loaded as Q4 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q4, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vadd.s32 Q4, Q4, Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q5 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q5 -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q6 -vldrw.u32 Q6, [r14, #-144] -vadd.s32 Q7, Q7, Q5 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q5, Q1, r10 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-160)] -// Release input[212] from Q7 -vqrdmlah.s32 Q5, Q1, r12 -// input[200]: Already loaded as Q3 -// input[216]: Already 
loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vadd.s32 Q3, Q3, Q6 -// input[248]: Load as Q4 -vldrw.u32 Q4, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q1, Q2, Q4 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q4 -vqrdmlah.s32 Q6, Q0, r12 -vstrw.u32 Q5, [r14,#(-32)] -// Release input[244] from Q5 -vqrdmulh.s32 Q4, Q1, r6 -vsub.s32 Q0, Q3, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q3, Q3, Q2 -vqrdmlah.s32 Q4, Q1, r12 -// input[204]: Load as Q5 -vldrw.u32 Q5, [r14, #-192] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q6, Q4 -vmul.u32 Q0, Q0, r9 -// input[220]: Load as Q7 -vldrw.u32 Q7, [r14, #-128] -vadd.s32 Q6, Q6, Q4 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmulh.s32 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q6, [r14,#(-144)] -// Release input[216] from Q6 -vqrdmlah.s32 Q4, Q1, r12 -// input[204]: Already loaded as Q5 -// input[220]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vadd.s32 Q5, Q5, Q7 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q1, Q2, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q2, Q2, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vstrw.u32 Q4, [r14,#(-16)] -// Release input[248] from Q4 -vqrdmulh.s32 Q3, Q1, r6 -vsub.s32 Q0, Q5, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q3, Q1, r12 -// input[0]: Load as Q4 -vldrw.u32 Q4, [r0, #0] -vqrdmulh.s32 Q2, Q0, r10 -vsub.s32 Q1, Q7, Q3 -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q6 -vldrw.u32 Q6, [r0, #256] -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmulh.s32 Q3, Q1, r10 -vstrw.u32 Q5, [r14,#(-192)] -// Release input[204] from Q5 -vmul.u32 Q1, Q1, r9 -vstrw.u32 Q7, [r14,#(-128)] -// Release input[220] from Q7 -vqrdmlah.s32 Q3, Q1, r12 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -// Release input[0] from Q4 -// 
Release input[64] from Q6 -mov r10, #0 -.equ const_barrett, 63 -movw r9, #:lower16:const_barrett -movt r9, #:upper16:const_barrett -vidup.u32 Q0, r10, #1 -vshl.u32 Q0, Q0, #6 -vldrw.32 Q1, [r0, Q0, UXTW #2] -vqrdmulh.s32 Q2, Q1, r9 -neg r12, r12 -vmla.s32 Q1, Q2, r12 -neg r12, r12 -vstrw.32 Q1, [r0, Q0, UXTW #2] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -mov r11, #0 // XXXXX -.equ q_half, 16778496 -movw r4, #:lower16:q_half -movt r4, #:upper16:q_half -.equ pow_2_n_mod_q, 34739919 -movw r3, #:lower16:pow_2_n_mod_q -movt r3, #:upper16:pow_2_n_mod_q -.equ pow_2_n_mod_q_twisted, 4294311729 -movw r2, #:lower16:pow_2_n_mod_q_twisted -movt r2, #:upper16:pow_2_n_mod_q_twisted -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vsub.s32 Q2, Q0, Q1 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #-496] -vadd.s32 Q0, Q0, Q1 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #-240] -vqrdmulh.s32 Q1, Q2, r8 -vsub.s32 Q5, Q3, Q4 -vmul.u32 Q2, Q2, r7 -vadd.s32 Q3, Q3, Q4 -vqrdmlah.s32 Q1, Q2, r12 -vqrdmulh.s32 Q4, Q5, r6 -vsub.s32 Q2, Q0, Q3 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q4, Q5, r12 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #16] -vqrdmulh.s32 Q3, Q2, r10 -vsub.s32 Q6, Q1, Q4 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q3, Q2, r12 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #272] -vqrdmulh.s32 Q2, Q0, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q0, Q0, r2 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vqrdmlah.s32 Q2, Q0, r12 -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q1, Q1, r2 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #-480] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(0)] -vqrdmulh.s32 Q4, Q6, r10 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, 
Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r9 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q1 -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -// input[132]: Already loaded as Q3 -vqrdmlah.s32 Q4, Q6, r12 -vadd.s32 Q5, Q5, Q7 -// input[196]: Load as Q1 -vldrw.u32 Q1, [r14, #-224] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q2, Q3, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q4, Q4, #1 -vpt.s32 LT, Q4, r11 -vaddt.s32 Q4, Q4, r12 -vpt.s32 GE, Q4, r4 -vsubt.s32 Q4, Q4, r12 -vstrw.u32 Q4, [r14,#(-240)] -// Release input[192] from Q4 -vqrdmulh.s32 Q1, Q2, r6 -vsub.s32 Q0, Q5, Q3 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q1, Q2, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q3, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q3, Q0, r12 -// input[72]: Load as Q6 -vldrw.u32 Q6, [r0, #288] -vqrdmulh.s32 Q0, Q5, r3 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q5, Q5, r2 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vqrdmlah.s32 Q0, Q5, r12 -// Release input[4] from Q5 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #-464] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(16)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q7 -// input[8]: Already loaded as Q2 -// input[72]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[136]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vqrdmulh.s32 Q6, Q0, r8 
-vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-224)] -// Release input[196] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #304] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-464)] -// Release input[136] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #-448] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(32)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(288)] -// Release input[72] from Q6 -// input[12]: Already loaded as Q1 -// input[76]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[140]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #-192] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[16]: Load as Q3 -vldrw.u32 Q3, 
[r0, #64] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #320] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-448)] -// Release input[140] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #-432] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(48)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q7 -// input[16]: Already loaded as Q3 -// input[80]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[144]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-192)] -// Release input[204] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #336] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-432)] -// Release input[144] from Q5 -vqrdmlah.s32 Q0, 
Q3, r12 -// Release input[16] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #-416] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(64)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q6 -// input[20]: Already loaded as Q2 -// input[84]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[148]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #-160] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #352] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-416)] -// Release input[148] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #-400] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(80)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 
-vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(336)] -// Release input[84] from Q7 -// input[24]: Already loaded as Q1 -// input[88]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[152]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #-144] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #112] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #368] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-400)] -// Release input[152] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #-384] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(96)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(352)] -// Release input[88] from Q6 -// input[28]: Already loaded as Q3 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[156]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #-128] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 
Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-144)] -// Release input[216] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #384] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-384)] -// Release input[156] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[28] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #-368] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(112)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q7 -// input[32]: Already loaded as Q2 -// input[96]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[160]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 
Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[100]: Load as Q7 -vldrw.u32 Q7, [r0, #400] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-368)] -// Release input[160] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[164]: Load as Q5 -vldrw.u32 Q5, [r14, #-352] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(128)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q6 -// input[36]: Already loaded as Q1 -// input[100]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[164]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[104]: Load as Q6 -vldrw.u32 Q6, [r0, #416] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-352)] -// Release input[164] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[36] from Q1 
-vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #-336] -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(144)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(400)] -// Release input[100] from Q7 -// input[40]: Already loaded as Q3 -// input[104]: Already loaded as Q6 -vsub.s32 Q0, Q3, Q6 -// input[168]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q6 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-96)] -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q1 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q7 -vldrw.u32 Q7, [r0, #432] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-336)] -// Release input[168] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[40] from Q3 -vqrdmulh.s32 Q3, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #-320] -vqrdmlah.s32 Q3, Q6, r12 -vstrw.u32 Q0, [r0,#(160)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 
Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q6 -// input[44]: Already loaded as Q2 -// input[108]: Already loaded as Q7 -vsub.s32 Q0, Q2, Q7 -// input[172]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q7 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-80)] -// Release input[232] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[112]: Load as Q6 -vldrw.u32 Q6, [r0, #448] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-320)] -// Release input[172] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #-304] -vqrdmlah.s32 Q2, Q7, r12 -vstrw.u32 Q0, [r0,#(176)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(432)] -// Release input[108] from Q7 -// input[48]: Already loaded as Q1 -// input[112]: Already loaded as Q6 -vsub.s32 Q0, Q1, Q6 -// input[176]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q6 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q6, 
Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #208] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #464] -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-304)] -// Release input[176] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q6, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[180]: Load as Q5 -vldrw.u32 Q5, [r14, #-288] -vqrdmlah.s32 Q1, Q6, r12 -vstrw.u32 Q0, [r0,#(192)] -vqrdmulh.s32 Q2, Q4, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q6 -// input[52]: Already loaded as Q3 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q3, Q7 -// input[180]: Already loaded as Q5 -vqrdmlah.s32 Q2, Q4, r12 -vadd.s32 Q3, Q3, Q7 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q1 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q1 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmulh.s32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vqrdmlah.s32 Q1, Q4, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q7, Q1 -vmul.u32 Q0, Q0, 
r9 -vadd.s32 Q7, Q7, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q6 -vldrw.u32 Q6, [r0, #480] -vqrdmulh.s32 Q0, Q3, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q3, Q3, r2 -vstrw.u32 Q5, [r14,#(-288)] -// Release input[180] from Q5 -vqrdmlah.s32 Q0, Q3, r12 -// Release input[52] from Q3 -vqrdmulh.s32 Q3, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #-272] -vqrdmlah.s32 Q3, Q7, r12 -vstrw.u32 Q0, [r0,#(208)] -vqrdmulh.s32 Q1, Q4, r10 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q7 -// input[56]: Already loaded as Q2 -// input[120]: Already loaded as Q6 -vsub.s32 Q0, Q2, Q6 -// input[184]: Already loaded as Q5 -vqrdmlah.s32 Q1, Q4, r12 -vadd.s32 Q2, Q2, Q6 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vqrdmulh.s32 Q6, Q0, r8 -vsub.s32 Q4, Q5, Q3 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q3 -vqrdmlah.s32 Q6, Q0, r12 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmulh.s32 Q3, Q4, r6 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q2, Q2, Q5 -vqrdmlah.s32 Q3, Q4, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q4, Q6, Q3 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q6, Q6, Q3 -vqrdmlah.s32 Q5, Q0, r12 -// input[124]: Load as Q7 -vldrw.u32 Q7, [r0, #496] -vqrdmulh.s32 Q0, Q2, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q2, Q2, r2 -vstrw.u32 Q5, [r14,#(-272)] -// Release input[184] from Q5 -vqrdmlah.s32 Q0, Q2, r12 -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q6, r3 -vshr.s32 
Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q6, Q6, r2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #-256] -vqrdmlah.s32 Q2, Q6, r12 -vstrw.u32 Q0, [r0,#(224)] -vqrdmulh.s32 Q3, Q4, r10 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, r12 -vmul.u32 Q4, Q4, r9 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q6 -// input[60]: Already loaded as Q1 -// input[124]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -// input[188]: Already loaded as Q5 -vqrdmlah.s32 Q3, Q4, r12 -vadd.s32 Q1, Q1, Q7 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q7, Q0, r8 -vsub.s32 Q4, Q5, Q2 -vmul.u32 Q0, Q0, r7 -vadd.s32 Q5, Q5, Q2 -vqrdmlah.s32 Q7, Q0, r12 -vshr.s32 Q3, Q3, #1 -vpt.s32 LT, Q3, r11 -vaddt.s32 Q3, Q3, r12 -vpt.s32 GE, Q3, r4 -vsubt.s32 Q3, Q3, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vqrdmulh.s32 Q2, Q4, r6 -vsub.s32 Q0, Q1, Q5 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q1, Q1, Q5 -vqrdmlah.s32 Q2, Q4, r12 -vqrdmulh.s32 Q5, Q0, r10 -vsub.s32 Q3, Q7, Q2 -vmul.u32 Q0, Q0, r9 -vadd.s32 Q7, Q7, Q2 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q0, Q1, r3 -vshr.s32 Q5, Q5, #1 -vpt.s32 LT, Q5, r11 -vaddt.s32 Q5, Q5, r12 -vpt.s32 GE, Q5, r4 -vsubt.s32 Q5, Q5, r12 -vmul.u32 Q1, Q1, r2 -vstrw.u32 Q5, [r14,#(-256)] -// Release input[188] from Q5 -vqrdmlah.s32 Q0, Q1, r12 -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q7, r3 -vshr.s32 Q0, Q0, #1 -vpt.s32 LT, Q0, r11 -vaddt.s32 Q0, Q0, r12 -vpt.s32 GE, Q0, r4 -vsubt.s32 Q0, Q0, r12 -vmul.u32 Q7, Q7, r2 -vqrdmlah.s32 Q1, Q7, r12 -vstrw.u32 Q0, [r0,#(240)] -vqrdmulh.s32 Q2, Q3, r10 -vshr.s32 Q1, Q1, #1 -vpt.s32 LT, Q1, r11 -vaddt.s32 Q1, Q1, r12 -vpt.s32 GE, Q1, r4 -vsubt.s32 Q1, Q1, r12 -vmul.u32 Q3, Q3, r9 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q7 -vqrdmlah.s32 Q2, Q3, r12 -vshr.s32 Q2, Q2, #1 -vpt.s32 LT, Q2, r11 -vaddt.s32 Q2, Q2, r12 -vpt.s32 GE, Q2, r4 -vsubt.s32 Q2, Q2, 
r12 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -// Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 2311 -// Instruction count: 1810 \ No newline at end of file diff --git a/tests/saber/auto/ntt_u32_33556993_28678040_incomplete.s b/tests/saber/auto/ntt_u32_33556993_28678040_incomplete.s deleted file mode 100644 index e93e64f..0000000 --- a/tests/saber/auto/ntt_u32_33556993_28678040_incomplete.s +++ /dev/null @@ -1,2035 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_u32_33556993_28678040_incomplete, %function -.global ntt_u32_33556993_28678040_incomplete -ntt_u32_33556993_28678040_incomplete: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 
-vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[140]: Load as Q3 -vldrw.u32 
Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q2, Q1, r12 
-vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, 
Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #144] 
-vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 
-// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, 
Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from Q0 -vqrdmlah.s32 Q6, Q4, r12 
-vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release input[36] from Q3 -vsub.s32 Q4, 
Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, 
r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] 
-vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 -// input[144]: Load as Q4 
-vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-480)] -// 
Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-448)] -// Release 
input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 
Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 
-// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q2, Q2, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #144] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// 
input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #224] -vmul.u32 Q1, Q1, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #208] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(128)] -// Release input[32] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #288] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #272] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, 
Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #256] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #368] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #352] -vmul.u32 Q2, Q2, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #432] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(368)] -// Release input[92] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q1, Q1, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(320)] -// Release input[80] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 
-vqrdmlah.s32 Q0, Q3, r12 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #384] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #480] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #-448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(496)] -// Release input[124] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q2, Q2, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #-480] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 
-vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #-384] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-480)] -// Release input[132] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q1, Q1, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-496)] -// Release input[128] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #-432] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q0 -vldrw.u32 Q0, [r14, #-320] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #-352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-432)] -// Release 
input[144] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #-368] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-320)] -// Release input[172] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q2, Q2, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #-288] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[200]: Load as Q3 -vldrw.u32 Q3, [r14, #-208] -vmul.u32 Q1, Q1, r9 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] 
-vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #-240] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[220]: Load as Q0 -vldrw.u32 Q0, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-208)] -// Release input[200] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #-144] -vmul.u32 Q0, Q0, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #-176] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[236]: Load as Q2 -vldrw.u32 Q2, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-128)] -// Release input[220] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q2, 
Q2, r9 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #-96] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[252]: Load as Q1 -vldrw.u32 Q1, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-64)] -// Release input[236] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #-16] -vmul.u32 Q1, Q1, r9 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #-32] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-112)] -// Release input[224] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -vqrdmulh.s32 Q0, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(0)] -// Release input[252] from Q1 -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -// Modular inverse of 33556993 mod 2^32 = 375649793 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// 
Restore MVE vector registers -vpop {d0-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr -.align 4 -barrett_offsets_addr: .word barrett_offsets - -// Line count: 2003 -// Instruction count: 1565 \ No newline at end of file diff --git a/tests/saber/auto/ntt_u32_33556993_28678040_incomplete_double.s b/tests/saber/auto/ntt_u32_33556993_28678040_incomplete_double.s deleted file mode 100644 index ffb5c99..0000000 --- a/tests/saber/auto/ntt_u32_33556993_28678040_incomplete_double.s +++ /dev/null @@ -1,2342 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 29095681 // zeta^128 * 2^31 = 28678040^128 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 28678040^128 * 375649793 * 2^31 -.word 14476917 // zeta^ 64 * 2^31 = 28678040^ 64 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 64 * 375649793 * 2^31 -.word 43317805 // zeta^192 * 2^31 = 28678040^192 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 28678040^192 * 375649793 * 2^31 -.word 18598075 // zeta^ 32 * 2^31 = 28678040^ 32 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 32 * 375649793 * 2^31 -.word 39999747 // zeta^ 16 * 2^31 = 28678040^ 16 * 2^31 = 20428075 * 2^31 -.word 3454780669 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 16 * 375649793 * 2^31 -.word 45317587 // zeta^144 * 2^31 = 28678040^144 * 2^31 = 14626653 * 2^31 -.word 3083517997 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 28678040^144 * 375649793 * 2^31 -.word 4885007 // zeta^160 * 2^31 = 28678040^160 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 28678040^160 * 375649793 * 2^31 -.word 48811299 // zeta^ 80 * 2^31 = 28678040^ 80 * 2^31 = 29737761 * 2^31 -.word 4050555101 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 80 * 375649793 * 2^31 -.word 54571669 // zeta^208 * 2^31 = 28678040^208 * 2^31 = 30285189 * 2^31 -.word 4085587819 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 28678040^208 * 375649793 * 2^31 -.word 64683161 // zeta^ 96 * 2^31 = 28678040^ 96 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 96 * 375649793 * 2^31 -.word 59281651 // zeta^ 48 * 2^31 = 28678040^ 48 * 2^31 = 21289485 * 2^31 -.word 3509906701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 48 * 375649793 * 2^31 -.word 40500013 // zeta^176 * 2^31 = 28678040^176 * 2^31 = 9914896 * 2^31 -.word 634504915 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 
28678040^176 * 375649793 * 2^31 -.word 34427601 // zeta^224 * 2^31 = 28678040^224 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 28678040^224 * 375649793 * 2^31 -.word 25917637 // zeta^112 * 2^31 = 28678040^112 * 2^31 = 22603682 * 2^31 -.word 1446525243 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 28678040^112 * 375649793 * 2^31 -.word 8356523 // zeta^240 * 2^31 = 28678040^240 * 2^31 = 16204162 * 2^31 -.word 1036987221 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 28678040^240 * 375649793 * 2^31 -.word 31719253 // zeta^ 8 * 2^31 = 28678040^ 8 * 2^31 = 23825509 * 2^31 -.word 3672199851 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 8 * 375649793 * 2^31 -.word 5075563 // zeta^ 4 * 2^31 = 28678040^ 4 * 2^31 = 9010590 * 2^31 -.word 576633749 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 4 * 375649793 * 2^31 -.word 43115375 // zeta^132 * 2^31 = 28678040^132 * 2^31 = 20699126 * 2^31 -.word 1324642961 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 28678040^132 * 375649793 * 2^31 -.word 54842419 // zeta^136 * 2^31 = 28678040^136 * 2^31 = 27028662 * 2^31 -.word 1729702349 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 28678040^136 * 375649793 * 2^31 -.word 35131011 // zeta^ 68 * 2^31 = 28678040^ 68 * 2^31 = 341080 * 2^31 -.word 21827453 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 68 * 375649793 * 2^31 -.word 44664611 // zeta^196 * 2^31 = 28678040^196 * 2^31 = 21220783 * 2^31 -.word 3505510109 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 28678040^196 * 375649793 * 2^31 -.word 1316163 // zeta^ 72 * 2^31 = 28678040^ 72 * 2^31 = 14833295 * 2^31 -.word 3096742077 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 72 * 375649793 * 2^31 -.word 65968403 // zeta^ 36 * 2^31 = 28678040^ 36 * 2^31 = 25331745 * 2^31 -.word 3768591597 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 36 * 375649793 * 2^31 -.word 53949037 // zeta^164 * 2^31 = 28678040^164 * 2^31 = 5289426 * 2^31 -.word 338497427 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 28678040^164 * 
375649793 * 2^31 -.word 10391631 // zeta^200 * 2^31 = 28678040^200 * 2^31 = 2138810 * 2^31 -.word 136873393 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 28678040^200 * 375649793 * 2^31 -.word 52363231 // zeta^100 * 2^31 = 28678040^100 * 2^31 = 5705868 * 2^31 -.word 365147681 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 28678040^100 * 375649793 * 2^31 -.word 39928117 // zeta^228 * 2^31 = 28678040^228 * 2^31 = 17686665 * 2^31 -.word 3279343819 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 28678040^228 * 375649793 * 2^31 -.word 54335767 // zeta^ 40 * 2^31 = 28678040^ 40 * 2^31 = 6490403 * 2^31 -.word 2562837737 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 40 * 375649793 * 2^31 -.word 54457727 // zeta^ 20 * 2^31 = 28678040^ 20 * 2^31 = 9106105 * 2^31 -.word 2730229889 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 20 * 375649793 * 2^31 -.word 27596809 // zeta^148 * 2^31 = 28678040^148 * 2^31 = 18817700 * 2^31 -.word 1204240887 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 28678040^148 * 375649793 * 2^31 -.word 46002083 // zeta^168 * 2^31 = 28678040^168 * 2^31 = 19648405 * 2^31 -.word 3404885597 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 28678040^168 * 375649793 * 2^31 -.word 14847715 // zeta^ 84 * 2^31 = 28678040^ 84 * 2^31 = 1579445 * 2^31 -.word 2248560413 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 84 * 375649793 * 2^31 -.word 1129279 // zeta^212 * 2^31 = 28678040^212 * 2^31 = 7769916 * 2^31 -.word 497236673 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 28678040^212 * 375649793 * 2^31 -.word 35733845 // zeta^104 * 2^31 = 28678040^104 * 2^31 = 31254932 * 2^31 -.word 2000162987 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 28678040^104 * 375649793 * 2^31 -.word 54563587 // zeta^ 52 * 2^31 = 28678040^ 52 * 2^31 = 21843119 * 2^31 -.word 3545336573 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 52 * 375649793 * 2^31 -.word 35404977 // zeta^180 * 2^31 = 28678040^180 * 2^31 = 11828796 * 2^31 -.word 756985167 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 28678040^180 * 
375649793 * 2^31 -.word 61099389 // zeta^232 * 2^31 = 28678040^232 * 2^31 = 26362414 * 2^31 -.word 1687065731 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 28678040^232 * 375649793 * 2^31 -.word 52947923 // zeta^116 * 2^31 = 28678040^116 * 2^31 = 19828530 * 2^31 -.word 1268929069 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 28678040^116 * 375649793 * 2^31 -.word 41822583 // zeta^244 * 2^31 = 28678040^244 * 2^31 = 33201112 * 2^31 -.word 2124709001 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 28678040^244 * 375649793 * 2^31 -.word 26241327 // zeta^ 24 * 2^31 = 28678040^ 24 * 2^31 = 572895 * 2^31 -.word 2184146129 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 24 * 375649793 * 2^31 -.word 12770159 // zeta^ 12 * 2^31 = 28678040^ 12 * 2^31 = 23713020 * 2^31 -.word 1517517457 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 12 * 375649793 * 2^31 -.word 24980679 // zeta^140 * 2^31 = 28678040^140 * 2^31 = 19537976 * 2^31 -.word 1250335033 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 28678040^140 * 375649793 * 2^31 -.word 5033605 // zeta^152 * 2^31 = 28678040^152 * 2^31 = 26691971 * 2^31 -.word 3855639419 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 28678040^152 * 375649793 * 2^31 -.word 61827033 // zeta^ 76 * 2^31 = 28678040^ 76 * 2^31 = 8285889 * 2^31 -.word 2677740071 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 76 * 375649793 * 2^31 -.word 11221523 // zeta^204 * 2^31 = 28678040^204 * 2^31 = 24690028 * 2^31 -.word 1580041197 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 28678040^204 * 375649793 * 2^31 -.word 8316793 // zeta^ 88 * 2^31 = 28678040^ 88 * 2^31 = 9249292 * 2^31 -.word 591909511 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 88 * 375649793 * 2^31 -.word 19091691 // zeta^ 44 * 2^31 = 28678040^ 44 * 2^31 = 4778209 * 2^31 -.word 2453265685 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 44 * 375649793 * 2^31 -.word 32210035 // zeta^172 * 2^31 = 28678040^172 * 2^31 = 13113327 * 2^31 -.word 2986672525 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 28678040^172 * 
375649793 * 2^31 -.word 16634213 // zeta^216 * 2^31 = 28678040^216 * 2^31 = 29292862 * 2^31 -.word 1874600091 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 28678040^216 * 375649793 * 2^31 -.word 20871313 // zeta^108 * 2^31 = 28678040^108 * 2^31 = 25384023 * 2^31 -.word 3771937135 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 28678040^108 * 375649793 * 2^31 -.word 46581651 // zeta^236 * 2^31 = 28678040^236 * 2^31 = 10905370 * 2^31 -.word 697890413 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 28678040^236 * 375649793 * 2^31 -.word 63329695 // zeta^ 56 * 2^31 = 28678040^ 56 * 2^31 = 8247799 * 2^31 -.word 2675302497 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 56 * 375649793 * 2^31 -.word 51221435 // zeta^ 28 * 2^31 = 28678040^ 28 * 2^31 = 16167867 * 2^31 -.word 3182148165 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 28 * 375649793 * 2^31 -.word 18467171 // zeta^156 * 2^31 = 28678040^156 * 2^31 = 22046437 * 2^31 -.word 3558347933 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 28678040^156 * 375649793 * 2^31 -.word 9983051 // zeta^184 * 2^31 = 28678040^184 * 2^31 = 5086187 * 2^31 -.word 2472974773 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 28678040^184 * 375649793 * 2^31 -.word 37083207 // zeta^ 92 * 2^31 = 28678040^ 92 * 2^31 = 656361 * 2^31 -.word 2189487545 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 92 * 375649793 * 2^31 -.word 52674527 // zeta^220 * 2^31 = 28678040^220 * 2^31 = 18153794 * 2^31 -.word 1161754145 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 28678040^220 * 375649793 * 2^31 -.word 7721125 // zeta^120 * 2^31 = 28678040^120 * 2^31 = 28113639 * 2^31 -.word 3946619227 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 28678040^120 * 375649793 * 2^31 -.word 8896309 // zeta^ 60 * 2^31 = 28678040^ 60 * 2^31 = 3732072 * 2^31 -.word 238834379 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 28678040^ 60 * 375649793 * 2^31 -.word 2061353 // zeta^188 * 2^31 = 28678040^188 * 2^31 = 22126384 * 2^31 -.word 1415980503 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 28678040^188 * 
375649793 * 2^31 -.word 9383201 // zeta^248 * 2^31 = 28678040^248 * 2^31 = 8471290 * 2^31 -.word 542121183 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 28678040^248 * 375649793 * 2^31 -.word 23761465 // zeta^124 * 2^31 = 28678040^124 * 2^31 = 9445744 * 2^31 -.word 604481479 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 28678040^124 * 375649793 * 2^31 -.word 24512363 // zeta^252 * 2^31 = 28678040^252 * 2^31 = 794839 * 2^31 -.word 2198349461 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 28678040^252 * 375649793 * 2^31 -.align 4 -barrett_offsets: -.byte 0 -.byte 64 -.byte 128 -.byte 192 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_u32_33556993_28678040_incomplete_double, %function -.global ntt_u32_33556993_28678040_incomplete_double -ntt_u32_33556993_28678040_incomplete_double: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d0-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Using modulus 33556993 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q1, Q0, r10 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #-496] -vmul.u32 Q0, Q0, r9 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #256] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #0] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[196]: Load as Q4 -vldrw.u32 Q4, [r14, #-224] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q6 -vstrw.u32 Q3, [r0,#(256)] -// 
Release input[64] from Q3 -vadd.s32 Q1, Q1, Q6 -// input[196]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vqrdmulh.s32 Q1, Q2, r10 -vsub.s32 Q4, Q3, Q0 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q0 -vqrdmlah.s32 Q1, Q2, r12 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #16] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q4, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q4, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vadd.s32 Q2, Q2, Q5 -vstrw.u32 Q4, [r14,#(-224)] -// Release input[196] from Q4 -vqrdmlah.s32 Q6, Q3, r12 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q0, Q6 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vadd.s32 Q0, Q0, Q6 -// input[200]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #-464] -vmul.u32 Q1, Q1, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #288] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #32] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[204]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// 
input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #-448] -vmul.u32 Q0, Q0, r9 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #-176] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[208]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #-432] -vmul.u32 Q2, Q2, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(48)] -// Release input[12] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[16]: Load as Q0 -vldrw.u32 Q0, [r0, #64] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[212]: Load as Q1 -vldrw.u32 Q1, [r14, #-160] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-176)] -// Release input[208] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[212]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #-416] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, 
#336] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(64)] -// Release input[16] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #80] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[216]: Load as Q0 -vldrw.u32 Q0, [r14, #-144] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-160)] -// Release input[212] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[216]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #-400] -vmul.u32 Q0, Q0, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #96] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #-128] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-144)] -// Release input[216] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[220]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vmul.u32 Q2, Q2, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vqrdmulh.s32 Q1, Q3, r10 
-vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #-112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-128)] -// Release input[220] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[224]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q1, Q1, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #384] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(112)] -// Release input[28] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #128] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[228]: Load as Q0 -vldrw.u32 Q0, [r14, #-96] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[228]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q0, Q0, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #400] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[36]: Load as 
Q1 -vldrw.u32 Q1, [r0, #144] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #-80] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-96)] -// Release input[228] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[232]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q2, Q2, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #160] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[236]: Load as Q1 -vldrw.u32 Q1, [r14, #-64] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[236]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(160)] -// Release input[40] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #176] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, 
Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-64)] -// Release input[236] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[240]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #-304] -vmul.u32 Q0, Q0, r9 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #448] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(176)] -// Release input[44] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #192] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-48)] -// Release input[240] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[244]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #-288] -vmul.u32 Q2, Q2, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #464] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #208] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 
-vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-32)] -// Release input[244] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #-272] -vmul.u32 Q1, Q1, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #480] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[252]: Load as Q0 -vldrw.u32 Q0, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(480)] -// Release input[120] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[252]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vmul.u32 Q0, Q0, r9 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #496] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #240] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #192] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(0)] -// Release input[252] from 
Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #128] -vmul.u32 Q2, Q2, r9 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #64] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #0] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[52]: Load as Q1 -vldrw.u32 Q1, [r0, #208] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(64)] -// Release input[16] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #144] -vmul.u32 Q1, Q1, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #80] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #16] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #224] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(208)] -// Release input[52] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(144)] -// Release 
input[36] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #160] -vmul.u32 Q0, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #96] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #32] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #240] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(96)] -// Release input[24] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #176] -vmul.u32 Q2, Q2, r9 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #112] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #48] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #448] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd 
r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #384] -vmul.u32 Q1, Q1, r9 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #320] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #256] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #464] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #400] -vmul.u32 Q0, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #336] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[68]: Load as Q1 -vldrw.u32 Q1, [r0, #272] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q2 -vldrw.u32 Q2, [r0, #480] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[104]: Load as 
Q3 -vldrw.u32 Q3, [r0, #416] -vmul.u32 Q2, Q2, r9 -// input[88]: Load as Q4 -vldrw.u32 Q4, [r0, #352] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r0,#(272)] -// Release input[68] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #288] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #496] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(352)] -// Release input[88] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #432] -vmul.u32 Q1, Q1, r9 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #368] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #304] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #-368] -vmul.u32 Q0, Q0, r9 
-// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #-432] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #-496] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-432)] -// Release input[144] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[180]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[164]: Load as Q3 -vldrw.u32 Q3, [r14, #-352] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #-416] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #-480] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-288)] -// Release input[180] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-352)] -// Release input[164] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #-336] -vmul.u32 Q1, Q1, r9 -// input[152]: Load as Q4 -vldrw.u32 Q4, [r14, #-400] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 
Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #-464] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #-256] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-272)] -// Release input[184] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-400)] -// Release input[152] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[172]: Load as Q3 -vldrw.u32 Q3, [r14, #-320] -vmul.u32 Q0, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #-384] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #-448] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #-48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-320)] -// Release input[172] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-384)] -// Release input[156] from Q4 -vadd.s32 Q1, Q1, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #-112] -vmul.u32 Q2, Q2, r9 -// input[208]: Load as Q4 -vldrw.u32 Q4, [r14, #-176] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, 
[r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #-32] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-176)] -// Release input[208] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #-96] -vmul.u32 Q1, Q1, r9 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #-160] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #-16] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[232]: Load as Q3 -vldrw.u32 Q3, [r14, #-80] -vmul.u32 Q0, Q0, r9 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #-144] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-224)] -// Release input[196] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q0, Q4, Q1 
-vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q2, Q3, r12 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q1, Q2 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q2 -vqrdmlah.s32 Q5, Q0, r12 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #0] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-80)] -// Release input[232] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q0, Q2, r10 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #-64] -vmul.u32 Q2, Q2, r9 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #-128] -vqrdmlah.s32 Q0, Q2, r12 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q2, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q1, Q3, r12 -// input[204]: Load as Q0 -vldrw.u32 Q0, [r14, #-192] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q0, Q1 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q0, Q0, Q1 -vqrdmlah.s32 Q5, Q2, r12 -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #48] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q0, Q0, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q1 -vqrdmulh.s32 Q2, Q1, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #32] -vmul.u32 Q1, Q1, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #16] -vqrdmlah.s32 Q2, Q1, r12 -vstrw.u32 Q0, [r14,#(-192)] -// Release input[204] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q1, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, 
Q2 -vqrdmlah.s32 Q0, Q3, r12 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #0] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q0 -vldrw.u32 Q0, [r0, #112] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q1, [r1, #96] -vqrdmulh.s32 Q7, Q1, r6 -vadd.s32 Q3, Q3, Q5 -vmul.u32 Q1, Q1, r5 -/// Twist in[8] by r6 -vstrw.u32 Q3, [r1, #64] -vqrdmlah.s32 Q7, Q1, r12 -// Release input[12] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q3, r6 -vsub.s32 Q4, Q2, Q6 -vmul.u32 Q3, Q3, r5 -vstrw.u32 Q4, [r1,#32] -vqrdmlah.s32 Q7, Q3, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[8] from Q3 -vqrdmulh.s32 Q7, Q4, r8 -vadd.s32 Q2, Q2, Q6 -vmul.u32 Q4, Q4, r7 -vstrw.u32 Q2, [r1], #128 -vqrdmlah.s32 Q7, Q4, r12 -vneg.s32 Q7, Q7 -// Release input[4] from Q4 -vqrdmulh.s32 Q1, Q2, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q2, Q2, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[0] from Q2 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[28]: Already loaded as Q0 -vqrdmulh.s32 Q1, Q0, r10 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #96] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #80] -vqrdmlah.s32 Q1, Q0, r12 -vqrdmulh.s32 Q4, Q2, r10 -vsub.s32 Q0, Q3, Q1 -vmul.u32 Q2, Q2, r9 -vadd.s32 Q3, Q3, Q1 -vqrdmlah.s32 Q4, Q2, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #64] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q2, Q1, Q4 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q1, Q1, Q4 -vqrdmlah.s32 Q5, Q0, r12 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #176] -vqrdmulh.s32 Q6, Q3, r8 -vsub.s32 Q0, Q2, Q5 -vmul.u32 Q3, Q3, r7 -vstrw.u32 Q0, [r1, #96] -vqrdmulh.s32 Q7, Q0, r6 -vadd.s32 Q2, Q2, Q5 -vmul.u32 Q0, Q0, r5 -/// Twist in[24] by r6 -vstrw.u32 Q2, [r1, #64] -vqrdmlah.s32 Q7, Q0, r12 -// Release input[28] from Q0 -vqrdmlah.s32 Q6, Q3, r12 -vneg.s32 Q7, Q7 
-vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q2, r6 -vsub.s32 Q3, Q1, Q6 -vmul.u32 Q2, Q2, r5 -vstrw.u32 Q3, [r1,#32] -vqrdmlah.s32 Q7, Q2, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[24] from Q2 -vqrdmulh.s32 Q7, Q3, r8 -vadd.s32 Q1, Q1, Q6 -vmul.u32 Q3, Q3, r7 -vstrw.u32 Q1, [r1], #128 -vqrdmlah.s32 Q7, Q3, r12 -vneg.s32 Q7, Q7 -// Release input[20] from Q3 -vqrdmulh.s32 Q0, Q1, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q1, Q1, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q0, [r1,#-112] -// Release input[16] from Q1 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[44]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[40]: Load as Q1 -vldrw.u32 Q1, [r0, #160] -vmul.u32 Q4, Q4, r9 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #144] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[32]: Load as Q0 -vldrw.u32 Q0, [r0, #128] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #240] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[40] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[44] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[40] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[36] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[32] from Q0 -ldrd r8, 
r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #224] -vmul.u32 Q3, Q3, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #208] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #192] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #304] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[56] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[60] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[56] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[52] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[48] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #288] -vmul.u32 Q4, Q4, r9 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #272] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #256] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[92]: Load 
as Q3 -vldrw.u32 Q3, [r0, #368] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[72] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[76] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[72] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[68] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[64] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[88]: Load as Q1 -vldrw.u32 Q1, [r0, #352] -vmul.u32 Q3, Q3, r9 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #336] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[80]: Load as Q0 -vldrw.u32 Q0, [r0, #320] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #432] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[88] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[92] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[88] from Q1 -vqrdmulh.s32 
Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[84] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[80] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #416] -vmul.u32 Q4, Q4, r9 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #400] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #384] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[124]: Load as Q3 -vldrw.u32 Q3, [r0, #496] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[104] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[108] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[104] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[100] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[96] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[124]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #480] -vmul.u32 Q3, Q3, r9 -// input[116]: Load as Q2 
-vldrw.u32 Q2, [r0, #464] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #448] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #-448] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[120] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[124] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[120] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[116] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[112] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[140]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #-464] -vmul.u32 Q4, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #-480] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #-496] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #-384] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 
-vmul.u32 Q4, Q4, r5 -/// Twist in[136] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[140] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[136] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[132] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[128] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #-400] -vmul.u32 Q3, Q3, r9 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #-416] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #-432] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #-320] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[152] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[156] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[152] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[148] from 
Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[144] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #-336] -vmul.u32 Q4, Q4, r9 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #-352] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[160]: Load as Q0 -vldrw.u32 Q0, [r14, #-368] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #-256] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[168] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[172] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[168] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[164] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[160] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[184]: Load as Q1 -vldrw.u32 Q1, [r14, #-272] -vmul.u32 Q3, Q3, r9 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #-288] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, 
Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #-304] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #-192] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[184] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[188] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[184] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[180] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[176] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #-208] -vmul.u32 Q4, Q4, r9 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #-224] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #-240] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #-128] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[200] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[204] from Q4 
-vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[200] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[196] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[192] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #-144] -vmul.u32 Q3, Q3, r9 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #-160] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[208]: Load as Q0 -vldrw.u32 Q0, [r14, #-176] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 -vmul.u32 Q3, Q3, r5 -vadd.s32 Q0, Q0, Q4 -vqrdmlah.s32 Q5, Q3, r12 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #-64] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q3, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q3, [r1, #96] -vqrdmulh.s32 Q7, Q3, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q3, Q3, r5 -/// Twist in[216] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q3, r12 -// Release input[220] from Q3 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[216] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[212] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 
-vstrw.u32 Q1, [r1,#-112] -// Release input[208] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[236]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r10 -// input[232]: Load as Q1 -vldrw.u32 Q1, [r14, #-80] -vmul.u32 Q4, Q4, r9 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #-96] -vqrdmlah.s32 Q0, Q4, r12 -vqrdmulh.s32 Q3, Q1, r10 -vsub.s32 Q4, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q3, Q1, r12 -// input[224]: Load as Q0 -vldrw.u32 Q0, [r14, #-112] -vqrdmulh.s32 Q5, Q4, r6 -vsub.s32 Q1, Q0, Q3 -vmul.u32 Q4, Q4, r5 -vadd.s32 Q0, Q0, Q3 -vqrdmlah.s32 Q5, Q4, r12 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #0] -vqrdmulh.s32 Q6, Q2, r8 -vsub.s32 Q4, Q1, Q5 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q4, [r1, #96] -vqrdmulh.s32 Q7, Q4, r6 -vadd.s32 Q1, Q1, Q5 -vmul.u32 Q4, Q4, r5 -/// Twist in[232] by r6 -vstrw.u32 Q1, [r1, #64] -vqrdmlah.s32 Q7, Q4, r12 -// Release input[236] from Q4 -vqrdmlah.s32 Q6, Q2, r12 -vneg.s32 Q7, Q7 -vstrw.u32 Q7, [r1, #112] -vqrdmulh.s32 Q7, Q1, r6 -vsub.s32 Q2, Q0, Q6 -vmul.u32 Q1, Q1, r5 -vstrw.u32 Q2, [r1,#32] -vqrdmlah.s32 Q7, Q1, r12 -vstrw.u32 Q7, [r1, #80] -// Release input[232] from Q1 -vqrdmulh.s32 Q7, Q2, r8 -vadd.s32 Q0, Q0, Q6 -vmul.u32 Q2, Q2, r7 -vstrw.u32 Q0, [r1], #128 -vqrdmlah.s32 Q7, Q2, r12 -vneg.s32 Q7, Q7 -// Release input[228] from Q2 -vqrdmulh.s32 Q1, Q0, r8 -vstrw.u32 Q7, [r1,#-80] -vmul.u32 Q0, Q0, r7 -ldrd r10, r9, [r11], #+8 -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q1, [r1,#-112] -// Release input[224] from Q0 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r10 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #-16] -vmul.u32 Q3, Q3, r9 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #-32] -vqrdmlah.s32 Q0, Q3, r12 -vqrdmulh.s32 Q4, Q1, r10 -vsub.s32 Q3, Q2, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q2, Q2, Q0 -vqrdmlah.s32 Q4, Q1, r12 -// input[240]: Load as Q0 -vldrw.u32 Q0, [r14, #-48] -vqrdmulh.s32 Q5, Q3, r6 -vsub.s32 Q1, Q0, Q4 
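Each butterfly in the deleted NTT code above computes a modular product with the `vqrdmulh`/`vmul`/`vqrdmlah` triple, pairing each twiddle with a precomputed "twisted" copy. Below is a minimal scalar Python model of that pattern, assuming one common sign convention (twiddle `b`, twisted twiddle `-b*q^-1 mod 2^32`, and the modulus as the `vqrdmlah` scalar); the concrete register roles (`r10`/`r9`/`r12`) in this generated assembly may differ, so treat the convention as an assumption, not the file's exact layout.

```python
# Scalar sketch of the vqrdmulh/vmul/vqrdmlah Montgomery-style multiplication.
# Q is the modulus named in the file's closing comment; 2^31 is the implicit
# Montgomery factor. Saturation is not modeled (operands stay in range here).
Q = 33556993

def to_s32(x):
    """Reduce to a signed 32-bit representative (one vector lane)."""
    x &= 0xffffffff
    return x - (1 << 32) if x >= (1 << 31) else x

def rnd_mulh(a, b):
    """vqrdmulh.s32 without saturation: (2*a*b + 2^31) >> 32."""
    return (2 * a * b + (1 << 31)) >> 32

def montmul(a, b):
    """Return a representative of a * b * 2^-31 mod Q."""
    b_twisted = to_s32(-b * pow(Q, -1, 1 << 32))  # precomputed per twiddle
    h = rnd_mulh(a, b)         # vqrdmulh.s32  (high, rounded)
    l = to_s32(a * b_twisted)  # vmul.u32      (low 32 bits)
    return h + rnd_mulh(l, Q)  # vqrdmlah.s32  (rounding correction)
```

The point of the trick is that the two rounding terms cancel, so `h + rnd_mulh(l, Q)` equals `(a*b + l*Q) / 2^31` exactly for the operand ranges used here, i.e. a bounded representative of `a*b*2^-31 mod Q`. The file's comment `Modular inverse of 33556993 mod 2^32 = 375649793` (negated to `modulus_inv = 3919317503`) is the constant this twisting relies on.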
-vmul.u32 Q3, Q3, r5
-vadd.s32 Q0, Q0, Q4
-vqrdmlah.s32 Q5, Q3, r12
-vqrdmulh.s32 Q4, Q2, r8
-vsub.s32 Q3, Q1, Q5
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q3, [r1, #96]
-vqrdmulh.s32 Q6, Q3, r6
-vadd.s32 Q1, Q1, Q5
-vmul.u32 Q3, Q3, r5
-/// Twist in[248] by r6
-vstrw.u32 Q1, [r1, #64]
-vqrdmlah.s32 Q6, Q3, r12
-// Release input[252] from Q3
-vqrdmlah.s32 Q4, Q2, r12
-vneg.s32 Q6, Q6
-vstrw.u32 Q6, [r1, #112]
-vqrdmulh.s32 Q6, Q1, r6
-vsub.s32 Q2, Q0, Q4
-vmul.u32 Q1, Q1, r5
-vstrw.u32 Q2, [r1,#32]
-vqrdmlah.s32 Q6, Q1, r12
-vstrw.u32 Q6, [r1, #80]
-// Release input[248] from Q1
-vqrdmulh.s32 Q6, Q2, r8
-vadd.s32 Q0, Q0, Q4
-vmul.u32 Q2, Q2, r7
-vstrw.u32 Q0, [r1], #128
-vqrdmlah.s32 Q6, Q2, r12
-vneg.s32 Q6, Q6
-// Release input[244] from Q2
-vqrdmulh.s32 Q1, Q0, r8
-vstrw.u32 Q6, [r1,#-80]
-vmul.u32 Q0, Q0, r7
-ldrd r10, r9, [r11], #+8
-vqrdmlah.s32 Q1, Q0, r12
-vstrw.u32 Q1, [r1,#-112]
-// Release input[240] from Q0
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// Modular inverse of 33556993 mod 2^32 = 375649793
-.equ modulus_inv, 3919317503
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d0-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-.align 4
-barrett_offsets_addr: .word barrett_offsets
-
-// Line count: 2310
-// Instruction count: 1856
\ No newline at end of file
diff --git a/tests/saber/karatsuba.s b/tests/saber/karatsuba.s
deleted file mode 100644
index 98785db..0000000
--- a/tests/saber/karatsuba.s
+++ /dev/null
@@ -1,873 +0,0 @@
-/*
- * Copyright (c) 2021 Arm Limited
- * SPDX-License-Identifier: MIT
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- */
-
-#include "karatsuba_const.h"
-
-.syntax unified
-
-/* Template:
- *
- * .type FUNCNAME, %function
- * .global FUNCNAME
- * FUNCNAME:
- *     push {r4-r12,lr}
- *     vpush {d0-d15}
- *
- *     foo .req r0
- *     .unreq bar
- *
- *     vpop {d0-d15}
- *     pop {r4-r12,lr}
- *     bx lr
- */
-
-/*
- * Karatsuba evaluation
- */
-
-.type karatsuba_fwd_dual_32_loop, %function
-.global karatsuba_fwd_dual_32_loop
-karatsuba_fwd_dual_32_loop:
-    push {r4-r12,lr}
-    vpush {d0-d15}
-
-    src .req r0
-    dst .req r1
-    carry .req r12
-
-    #define VECTOR_SIZE 16
-    #define LIMB_BYTE_SIZE 64
-    #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2)
-
-    #define EVEN_INDEX 0
-    #define ODD_INDEX 1
-    #define SUM_INDEX 2
-    #define NUM_LIMBS 3
-
-    even_a .req q0
-    odd_a .req q1
-    sum_a .req q2
-
-    even_b .req q3
-    odd_b .req q4
-    sum_b .req q5
-
-    loop_cnt .req r14
-    mov loop_cnt, #(KARATSUBA_FWD_ITERATIONS-2)
-
-    /* First iteration */
-    #define OFFSET 0
-    vld21.s16 {even_a,odd_a}, [src]
-    vld20.s16 {even_a,odd_a}, [src]!
-    vld21.s16 {even_b,odd_b}, [src]
-    vld20.s16 {even_b,odd_b}, [src]!
-    vadd.u16 sum_a, even_a, odd_a
-    #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-    #error Unexpected offset
-    #endif
-    vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE)
-    #define SHIFT (-NUM_LIMBS*LIMB_BYTE_SIZE)
-
-    vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 1
-    vld21.s16 {even_a,odd_a}, [src]
-    vld20.s16 {even_a,odd_a}, [src]!
-    vadd.u16 sum_b, even_b, odd_b
-    vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 2
-    vld21.s16 {even_b,odd_b}, [src]
-    vld20.s16 {even_b,odd_b}, [src]!
-    vadd.u16 sum_a, even_a, odd_a
-    vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 3
-    vld21.s16 {even_a,odd_a}, [src]
-    vld20.s16 {even_a,odd_a}, [src]!
-    vadd.u16 sum_b, even_b, odd_b
-    vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    wls loop_cnt, loop_cnt, karatsuba_fwd_dual_32_loop_end
-karatsuba_fwd_dual_32_loop_start:
-
-    #define OFFSET 0
-    vld21.s16 {even_b,odd_b}, [src]
-    vld20.s16 {even_b,odd_b}, [src]!
-    vadd.u16 sum_a, even_a, odd_a
-    #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-    #error Unexpected offset
-    #endif
-    vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE)
-    vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 1
-    vld21.s16 {even_a,odd_a}, [src]
-    vld20.s16 {even_a,odd_a}, [src]!
-    vadd.u16 sum_b, even_b, odd_b
-    vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 2
-    vld21.s16 {even_b,odd_b}, [src]
-    vld20.s16 {even_b,odd_b}, [src]!
-    vadd.u16 sum_a, even_a, odd_a
-    vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 3
-    vld21.s16 {even_a,odd_a}, [src]
-    vld20.s16 {even_a,odd_a}, [src]!
-    vadd.u16 sum_b, even_b, odd_b
-    vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    le loop_cnt, karatsuba_fwd_dual_32_loop_start
-karatsuba_fwd_dual_32_loop_end:
-
-    /* Last iteration */
-
-    #define OFFSET 0
-    vld21.s16 {even_b,odd_b}, [src]
-    vld20.s16 {even_b,odd_b}, [src]!
-    vadd.u16 sum_a, even_a, odd_a
-    #if OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE != 0
-    #error Unexpected offset
-    #endif
-    vstrh.u16 even_a, [dst /*, #(OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE) */], #(NUM_LIMBS*LIMB_BYTE_SIZE)
-    vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 1
-    vld21.s16 {even_a,odd_a}, [src]
-    vld20.s16 {even_a,odd_a}, [src]!
-    vadd.u16 sum_b, even_b, odd_b
-    vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 2
-    vld21.s16 {even_b,odd_b}, [src]
-    vld20.s16 {even_b,odd_b}, [src]!
-    vadd.u16 sum_a, even_a, odd_a
-    vstrh.u16 even_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_a, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    #define OFFSET 3
-    vadd.u16 sum_b, even_b, odd_b
-    vstrh.u16 even_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + EVEN_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 odd_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + ODD_INDEX * LIMB_BYTE_SIZE)]
-    vstrh.u16 sum_b, [dst, #(SHIFT + OFFSET * VECTOR_SIZE + SUM_INDEX * LIMB_BYTE_SIZE)]
-    #undef OFFSET
-
-    .unreq even_a
-    .unreq odd_a
-    .unreq sum_a
-    .unreq even_b
-    .unreq odd_b
-    .unreq sum_b
-
-    .unreq src
-    .unreq dst
-    .unreq loop_cnt
-
-    vpop {d0-d15}
-    pop {r4-r12,lr}
-    bx lr
-
-/*
- * Karatsuba interpolation
- */
-
-.type karatsuba_naive_inv_dual_32, %function
-.global karatsuba_naive_inv_dual_32
-karatsuba_naive_inv_dual_32:
-    push {r4-r12,lr}
-    vpush {d0-d15}
-
-    src .req r0
-    dst .req r1
-    carry .req r12
-
#define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - - mov carry, #0 - - even_even .req q0 - sum_even .req q1 - even_odd .req q2 - sum_odd .req q3 - odd_even .req q4 - odd_odd .req q5 - - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_even, sum_even, odd_even - vsub.u16 sum_odd, sum_odd, odd_odd - vsub.u16 sum_even, sum_even, even_even - vsub.u16 sum_odd, sum_odd, even_odd - - vadd.u16 even_odd, even_odd, odd_even - vshlc odd_odd, carry, #16 - vadd.u16 even_even, even_even, odd_odd - - vst40.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst41.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst42.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst43.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]! 
- - .unreq even_even - .unreq even_odd - .unreq odd_even - .unreq odd_odd - .unreq sum_even - .unreq sum_odd - - add src, src, #16 - - even_even .req q0 - sum_even .req q1 - even_odd .req q2 - sum_odd .req q3 - odd_even .req q4 - odd_odd .req q5 - - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_even, sum_even, odd_even - vsub.u16 sum_odd, sum_odd, odd_odd - vsub.u16 sum_even, sum_even, even_even - vsub.u16 sum_odd, sum_odd, even_odd - - vadd.u16 even_odd, even_odd, odd_even - vshlc odd_odd, carry, #16 - vadd.u16 even_even, even_even, odd_odd - - vst40.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst41.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst42.s16 {even_even, sum_even, even_odd, sum_odd}, [dst] - vst43.s16 {even_even, sum_even, even_odd, sum_odd}, [dst]! - - carry_correct .req r11 - ldrh carry_correct, [dst, #(-128)] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-128)] - .unreq carry_correct - - .unreq even_even - .unreq even_odd - .unreq odd_even - .unreq odd_odd - .unreq sum_even - .unreq sum_odd - - .unreq src - .unreq dst - .unreq carry - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -/* Slightly pipeline optimized version of Karatsuba interpolation. 
*/ - -.type karatsuba_inv_dual_32, %function -.global karatsuba_inv_dual_32 -karatsuba_inv_dual_32: - push {r4-r12,lr} - vpush {d0-d15} - - src .req r0 - dst .req r1 - carry .req r12 - - #define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - sum_even .req q4 // alloc q4 - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - - mov carry, #0 - - odd_even .req q6 // alloc q4, q6 - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q4, q5, q6 - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - add src, src, #16 - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, 
f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - vshlc odd_odd, carry, #16 - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5 - .unreq even_even // alloc -- - - // Correction of initial coefficient after we know the wraparound - carry_correct .req r11 - ldrh carry_correct, [dst, #(-64)] - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - .unreq carry_correct - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -/* Slightly pipelined and looping version of Karatsuba interpolation. 
 */ - -.type karatsuba_inv_dual_32_loop, %function -.global karatsuba_inv_dual_32_loop -karatsuba_inv_dual_32_loop: - push {r4-r12,lr} - vpush {d0-d15} - - src .req r0 - dst .req r1 - carry .req r12 - - carry_correct .req r11 - - #define LIMB_BYTE_SIZE 64 - #define LIMB_BYTE_SIZE_HALF (LIMB_BYTE_SIZE/2) - #define NUM_LIMBS 3 - #define TOTAL_SIZE_BYTES (NUM_LIMBS*LIMB_BYTE_SIZE) - - #define EVEN_INDEX 0 - #define ODD_INDEX 1 - #define SUM_INDEX 2 - - /* INITIAL ITERATION */ - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - #define SHIFT 0 - - sum_even .req q7 // alloc q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)] - - mov carry, #0 - - odd_even .req q6 // alloc q6, q7 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - - #if SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE != 0 - #error Unexpected offset - #endif - - vldrh.u16 even_even, [src], #(TOTAL_SIZE_BYTES) //[src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - - #undef SHIFT - #define SHIFT (-TOTAL_SIZE_BYTES) - - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - #undef SHIFT - #define SHIFT (16 - TOTAL_SIZE_BYTES) - - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc 
odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vmov.u16 carry_correct, f_even_even[0] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - #undef SHIFT - #define SHIFT 0 - - // Preload for next iteration - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - // Preload for next iteration - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 
f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - // Correction of initial coefficient after we know the wraparound - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - mov carry, #0 - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - /* LOOP */ - - loop_cnt .req r14 - mov loop_cnt, #(KARATSUBA_INV_ITERATIONS-2) - - wls loop_cnt, loop_cnt, karatsuba_inv_dual_32_loop_end - -karatsuba_inv_dual_32_loop_start: - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even and odd_even preloaded - vsub.u16 sum_even, sum_even, odd_even - - #if SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE != 0 - #error Unexpected offset - #endif - - even_even .req q5 // alloc q7, q5, q6 - vldrh.u16 even_even, [src], #TOTAL_SIZE_BYTES // [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - - #undef SHIFT - #define SHIFT (-TOTAL_SIZE_BYTES) - - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - #undef SHIFT - #define SHIFT (16-TOTAL_SIZE_BYTES) - - 
odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vmov.u16 carry_correct, f_even_even[0] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - // Preload for next iteration - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - // Preload for next iteration - sum_even .req q7 // alloc q4, q5, q6, q7 - 
vldrh.u16 sum_even, [src, #(SUM_INDEX * LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5 - .unreq even_even // alloc -- - - // Correction of initial coefficient after we know the wraparound - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - mov carry, #0 - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - le loop_cnt, karatsuba_inv_dual_32_loop_start - -karatsuba_inv_dual_32_loop_end: - - /* LAST ITERATION */ - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even and odd_even preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q4, q5, q6 - vldrh.u16 even_even, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - #undef SHIFT - #define SHIFT 16 - - odd_even .req q6 // alloc q4, q5, q6 - vldrh.u16 odd_even, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE)] - - vshlc odd_odd, carry, #16 - - sum_even .req q7 // alloc q4, q5, q6, q7 - vldrh.u16 sum_even, [src, #(SHIFT + SUM_INDEX * 
LIMB_BYTE_SIZE)] - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5, q6, q7 - .unreq even_even // alloc q6, q7 - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vmov.u16 carry_correct, f_even_even[0] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - f_even_even .req q0 - f_sum_even .req q1 - f_even_odd .req q2 - f_sum_odd .req q3 - - // sum_even already preloaded - // odd_even already preloaded - vsub.u16 sum_even, sum_even, odd_even - - even_even .req q5 // alloc q5, q6, q7 - vldrh.u16 even_even, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE)] - vsub.u16 f_sum_even, sum_even, even_even - .unreq sum_even // alloc q5, q6 - - even_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 even_odd, [src, #(SHIFT + EVEN_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vadd.u16 f_even_odd, even_odd, odd_even - .unreq odd_even // alloc q4, q5 - - sum_odd .req q6 // alloc q4, q5, q6 - vldrh.u16 sum_odd, [src, #(SHIFT + SUM_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - vsub.u16 sum_odd, sum_odd, even_odd - .unreq even_odd // alloc q5, q6 - - odd_odd .req q4 // alloc q4, q5, q6 - vldrh.u16 odd_odd, [src, #(SHIFT + ODD_INDEX * LIMB_BYTE_SIZE + LIMB_BYTE_SIZE_HALF)] - - vsub.u16 f_sum_odd, sum_odd, odd_odd - .unreq sum_odd // alloc q4, q5 - - vshlc odd_odd, carry, #16 - - vadd.u16 f_even_even, even_even, odd_odd - .unreq odd_odd // alloc q5 - .unreq even_even // alloc -- - - vst40.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - sub carry_correct, carry_correct, carry - strh carry_correct, [dst, #(-64)] - vst41.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst42.s16 {f_even_even, f_sum_even, f_even_odd, f_sum_odd}, [dst] - vst43.s16 {f_even_even, f_sum_even, f_even_odd, 
f_sum_odd}, [dst]! - .unreq f_even_even - .unreq f_even_odd - .unreq f_sum_even - .unreq f_sum_odd - - .unreq src - .unreq dst - .unreq carry - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr diff --git a/tests/saber/saber.mk b/tests/saber/saber.mk index a2d1a4a..b897893 100644 --- a/tests/saber/saber.mk +++ b/tests/saber/saber.mk @@ -22,6 +22,7 @@ SABER_SOURCES += rng.c # Assembly sources required for this test SABER_ASMS += saber_round.s SABER_ASMS += montgomery.s -SABER_ASMS += auto/inv_ntt_u32_33556993_28678040_incomplete.s -SABER_ASMS += auto/ntt_u32_33556993_28678040_incomplete.s -SABER_ASMS += auto/ntt_u32_33556993_28678040_incomplete_double.s +SABER_ASM_DIR = ../../asm/auto/saber +SABER_ASMS += $(SABER_ASM_DIR)/inv_ntt_u32_33556993_28678040_incomplete.s +SABER_ASMS += $(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete.s +SABER_ASMS += $(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete_double.s diff --git a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s deleted file mode 100644 index 3f7bb39..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s +++ /dev/null @@ -1,268 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. 
-/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd, %function -.global poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd -poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -nop -nop -nop -nop -nop -nop -vld20.u16 {Q4, Q5}, [r2] -vld21.u16 {Q4, Q5}, [r2]! -vld20.u16 {Q6, Q7}, [r2] -vld21.u16 {Q6, Q7}, [r2]! -vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #24] -vmul.u16 Q0, Q4, r11 -ldrd r10, r9, [r1, #56] -vmul.u16 Q1, Q4, r9 -vneg.s16 Q3, Q6 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #16] -vmla.s16 Q1, Q6, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #48] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #8] -vmla.s16 Q1, Q6, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #40] -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #0] -vmla.s16 Q1, Q6, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, 
#32] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q6, r11 -vstrh.u16 Q1, [r0,#(16)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(0)] -vld20.u16 {Q0, Q1}, [r1] -vld21.u16 {Q0, Q1}, [r1]! -vld20.u16 {Q2, Q3}, [r1] -vld21.u16 {Q2, Q3}, [r1]! -vst20.u16 {Q1, Q2}, [r1] -vst21.u16 {Q1, Q2}, [r1]! -vst20.u16 {Q3, Q4}, [r1] -vst21.u16 {Q3, Q4}, [r1]! -vadd.u16 Q0, Q0, Q1 -vadd.u16 Q2, Q2, Q3 -vst20.u16 {Q0, Q1}, [r1] -vst21.u16 {Q0, Q1}, [r1]! -vst20.u16 {Q2, Q3}, [r1] -vst21.u16 {Q2, Q3}, [r1]! -vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #-104] -vmul.u16 Q0, Q5, r11 -ldrd r10, r9, [r1, #-72] -vmul.u16 Q1, Q5, r9 -vneg.s16 Q3, Q7 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #-112] -vmla.s16 Q1, Q7, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #-80] -vmla.s16 Q1, Q7, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #-120] -vmla.s16 Q1, Q7, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #-88] -vmla.s16 Q1, Q7, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #-128] -vmla.s16 Q1, Q7, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #-96] -vmla.s16 Q1, Q7, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q5, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q7, r8 -vshlc Q0, r12, #16 -vmla.s16 
Q0, Q5, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q5, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q7, r11 -vstrh.u16 Q1, [r0,#(48)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(32)] -vadd.u16 Q4, Q4, Q5 -vadd.u16 Q6, Q6, Q7 -vmov.u16 Q2, #0 -mov r12, #0 -ldrd r14, r11, [r1, #-40] -vmul.u16 Q0, Q4, r11 -ldrd r10, r9, [r1, #-8] -vmul.u16 Q1, Q4, r9 -vneg.s16 Q3, Q6 -vmla.s16 Q0, Q3, r9 -ldrd r8, r7, [r1, #-48] -vmla.s16 Q1, Q6, r11 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r10 -ldrd r11, r9, [r1, #-16] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r7 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r14, r10, [r1, #-56] -vmla.s16 Q1, Q6, r7 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r11 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r11 -ldrd r9, r7, [r1, #-24] -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r10 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -ldrd r11, r8, [r1, #-64] -vmla.s16 Q1, Q6, r10 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r14 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r9 -ldrd r10, r7, [r1, #-32] -vmla.s16 Q1, Q6, r14 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r8 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r7 -vshlc Q2, r12, #16 -vmla.s16 Q0, Q3, r7 -vmla.s16 Q1, Q6, r8 -vshlc Q0, r12, #16 -vmla.s16 Q0, Q4, r11 -vshlc Q1, r12, #16 -vmla.s16 Q1, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q1, Q6, r11 -vstrh.u16 Q1, [r0,#(80)] -vsub.u16 Q0, Q0, Q2 -vmla.s16 Q0, Q3, r10 -vstrh.u16 Q0, [r0,#(64)] -nop -nop -nop -nop -nop -nop -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_mve_simd.s deleted file mode 100644 index f26a204..0000000 --- 
a/tests/schoolbook/auto/poly_u16_mul_32_anticyclic_mve_simd.s +++ /dev/null @@ -1,274 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// -.syntax unified -.type poly_u16_mul_32_anticyclic_mve_simd, %function -.global poly_u16_mul_32_anticyclic_mve_simd -poly_u16_mul_32_anticyclic_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -nop // XXX -mov r14, #0x42 -mov r14, #0x3 -vmsr p0, r14 -vldrh.u16 Q0, [r2, #(2 * 0)] -vldrh.u16 Q1, [r2, #(2 * 8)] -vldrh.u16 Q2, [r2, #(2 * 16)] -vldrh.u16 Q3, [r2, #(2 * 24)] -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -ldrh r10, [r1, #46] -ldrh r9, [r1, #62] -vmul.u16 Q4, Q0, r14 -vmul.u16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmul.u16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmul.u16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #12] -ldrh r11, [r1, #28] -ldrh r10, [r1, #44] -ldrh r9, [r1, #60] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #10] -ldrh r11, [r1, #26] -ldrh r10, [r1, #42] -ldrh r9, [r1, #58] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 
-vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #8] -ldrh r11, [r1, #24] -ldrh r10, [r1, #40] -ldrh r9, [r1, #56] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #6] -ldrh r11, [r1, #22] -ldrh r10, [r1, #38] -ldrh r9, [r1, #54] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #4] -ldrh r11, [r1, #20] -ldrh r10, [r1, #36] -ldrh r9, [r1, #52] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #2] -ldrh r11, [r1, #18] -ldrh r10, [r1, #34] -ldrh r9, [r1, #50] -vmla.s16 Q4, Q0, r14 
-vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vshlc Q4, r12, #16 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vshlc Q5, r12, #16 -vmla.s16 Q6, Q3, r9 -vshlc Q6, r12, #16 -vshlc Q7, r12, #16 -neg r12, r12 -vmov.u16 Q4[0], r12 -ldrh r14, [r1, #0] -ldrh r11, [r1, #16] -ldrh r10, [r1, #32] -ldrh r9, [r1, #48] -vmla.s16 Q4, Q0, r14 -vmla.s16 Q5, Q0, r11 -vmla.s16 Q5, Q1, r14 -vmla.s16 Q6, Q0, r10 -vmla.s16 Q6, Q1, r11 -vmla.s16 Q6, Q2, r14 -vmla.s16 Q7, Q0, r9 -vmla.s16 Q7, Q1, r10 -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r14 -neg r11, r11 -neg r10, r10 -neg r9, r9 -vmla.s16 Q4, Q1, r9 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r11 -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r10 -vmla.s16 Q6, Q3, r9 -neg r12, r12 -vstrh.u16 Q4, [r0,#(0)] -vstrh.u16 Q5, [r0,#(16)] -vstrh.u16 Q6, [r0,#(32)] -vstrh.u16 Q7, [r0,#(48)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/auto/poly_u16_mul_32_mve_simd.s b/tests/schoolbook/auto/poly_u16_mul_32_mve_simd.s deleted file mode 100644 index 0d85860..0000000 --- a/tests/schoolbook/auto/poly_u16_mul_32_mve_simd.s +++ /dev/null @@ -1,386 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be 
included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_mve_simd, %function -.global poly_u16_mul_32_mve_simd -poly_u16_mul_32_mve_simd: -push {r4-r11,lr} -vpush {d8-d15} -mov r0, r0 -mov r0, r0 -mov r12, #0 -ldrh r14, [r1, #14] -ldrh r11, [r1, #30] -ldrh r10, [r1, #46] -ldrh r9, [r1, #62] -vldrh.u16 Q0, [r2, #(2 * 0)] -vmul.u16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vmul.u16 Q1, Q0, r11 -vldrh.u16 Q2, [r2, #(2 * 8)] -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vmul.u16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vldrh.u16 Q3, [r2, #(2 * 16)] -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vmul.u16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vldrh.u16 Q4, [r2, #(2 * 24)] -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vmul.u16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vmul.u16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vmul.u16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vmov.u16 Q1, #0 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #12] -ldrh r11, [r1, #28] -ldrh r10, [r1, #44] -ldrh r9, [r1, #60] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 
Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #10] -ldrh r11, [r1, #26] -ldrh r10, [r1, #42] -ldrh r9, [r1, #58] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #8] -ldrh r11, [r1, #24] -ldrh r10, [r1, #40] -ldrh r9, 
[r1, #56] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #6] -ldrh r11, [r1, #22] -ldrh r10, [r1, #38] -ldrh r9, [r1, #54] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vshlc Q1, r12, 
#16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #4] -ldrh r11, [r1, #20] -ldrh r10, [r1, #36] -ldrh r9, [r1, #52] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] -vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #2] -ldrh r11, [r1, #18] -ldrh r10, [r1, #34] -ldrh r9, [r1, #50] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] 
-vmla.s16 Q1, Q4, r9 -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vshlc Q1, r12, #16 -vstrh.u16 Q1, [r0,#(112)] -mov r12, #0 -ldrh r14, [r1, #0] -ldrh r11, [r1, #16] -ldrh r10, [r1, #32] -ldrh r9, [r1, #48] -vldrh.u16 Q1, [r0, #(2 * 0)] -vmla.s16 Q1, Q0, r14 -vstrh.u16 Q1, [r0,#(0)] -vldrh.u16 Q1, [r0, #(2 * 8)] -vmla.s16 Q1, Q0, r11 -vmla.s16 Q1, Q2, r14 -vstrh.u16 Q1, [r0,#(16)] -vldrh.u16 Q1, [r0, #(2 * 16)] -vmla.s16 Q1, Q0, r10 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q1, Q3, r14 -vstrh.u16 Q1, [r0,#(32)] -vldrh.u16 Q1, [r0, #(2 * 24)] -vmla.s16 Q1, Q0, r9 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q1, Q3, r11 -vmla.s16 Q1, Q4, r14 -vstrh.u16 Q1, [r0,#(48)] -vldrh.u16 Q1, [r0, #(2 * 32)] -vmla.s16 Q1, Q2, r9 -vmla.s16 Q1, Q3, r10 -vmla.s16 Q1, Q4, r11 -vstrh.u16 Q1, [r0,#(64)] -vldrh.u16 Q1, [r0, #(2 * 40)] -vmla.s16 Q1, Q3, r9 -vmla.s16 Q1, Q4, r10 -vstrh.u16 Q1, [r0,#(80)] -vldrh.u16 Q1, [r0, #(2 * 48)] -vmla.s16 Q1, Q4, r9 -vstrh.u16 Q1, [r0,#(96)] -vldrh.u16 Q1, [r0, #(2 * 56)] -vstrh.u16 Q1, [r0,#(112)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/schoolbook/poly_u16_32_acc.s b/tests/schoolbook/poly_u16_32_acc.s deleted file mode 100644 index a6e5821..0000000 --- a/tests/schoolbook/poly_u16_32_acc.s +++ /dev/null @@ -1,1062 +0,0 @@ -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the 
Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// -.syntax unified -.type poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd_handshuffle, %function -.global poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd_handshuffle -poly_u16_mul_32_anticyclic_acc_karatsuba_mve_simd_handshuffle: -push {r4-r11,lr} -vpush {d0-d15} -vld20.u16 {Q4, Q5}, [r2] -sub sp, sp, #224 -vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! -vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! 
-vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! 
-//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, 
[r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -mov r14, #19 -wls r14, r14, loop_end -loop_start: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! 
-vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! 
-vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 
-vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vld20.u16 {Q4, Q5}, [r2] -vshlc Q0, r12, #16 -vmla.s16 Q2, Q6, r9 -vld21.u16 {Q4, Q5}, [r2]! 
-vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -add r0, r0, #64 - -le r14, loop_start -loop_end: - -//vld20.u16 {Q4, Q5}, [r2] -//vld21.u16 {Q4, Q5}, [r2]! -mov r11, sp -vld20.u16 {Q6, Q7}, [r2] -ldrd r10, r9, [r1, #24] -vld21.u16 {Q6, Q7}, [r2]! -vmul.u16 Q2, Q4, r9 -vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -ldrd r8, r7, [r1, #56] -vmov.u16 Q5, #0 -vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -vmul.u16 Q3, Q4, r7 -vneg.s16 Q7, Q6 -vmla.s16 Q2, Q7, r7 -ldrd r6, r5, [r1, #16] -vmla.s16 Q3, Q6, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r8 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r7, [r1, #48] -vmla.s16 Q3, Q6, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r7 -vld21.u16 {Q0, Q1}, [r1]! -ldrd r10, r8, [r1, #(-32 + 8)] -vmla.s16 Q3, Q6, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vst20.u16 {Q1, Q2}, [r11] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q4, r9 -vshlc Q5, r12, #16 -vst21.u16 {Q1, Q2}, [r11]! -vmla.s16 Q2, Q7, r9 -vst20.u16 {Q0, Q1}, [r11] -ldrd r7, r5, [r1, #(-32 + 40)] -vmla.s16 Q3, Q6, r6 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vld20.u16 {Q0, Q1}, [r1] -ldrd r9, r6, [r1, #(-32 + 0)] -vmla.s16 Q3, Q6, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r7 -vshlc Q5, r12, #16 -vld21.u16 {Q0, Q1}, [r1]! -vmla.s16 Q2, Q7, r7 -ldrd r8, r5, [r1, #(-32 -32 + 32)] -vadd.u16 Q0, Q0, Q1 -vmla.s16 Q3, Q6, r10 -vst20.u16 {Q1, Q2}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -vst21.u16 {Q1, Q2}, [r11]! 
-vmla.s16 Q3, Q6, r6 -vst20.u16 {Q0, Q1}, [r11] -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vst21.u16 {Q0, Q1}, [r11]! -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -vshlc Q5, r12, #16 -vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -vmla.s16 Q3, Q6, r9 -vstrh.u16 Q3, [r11,#(-32-32-32-32 + 144)] -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r8 -vstrh.u16 Q2, [r11,#(-32-32-32-32 + 128)] -//UP: vld20.u16 {Q0, Q1}, [r1] -//UP: vld21.u16 {Q0, Q1}, [r1]! -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//UP:vst20.u16 {Q0, Q1}, [r11] -//UP:vst21.u16 {Q0, Q1}, [r11]! -//UP:vld20.u16 {Q0, Q1}, [r1] -//UP:vld21.u16 {Q0, Q1}, [r1]! -//UP:vadd.u16 Q0, Q0, Q1 -//UP:vst20.u16 {Q1, Q2}, [r11] -//UP:vst21.u16 {Q1, Q2}, [r11]! -//vst20.u16 {Q0, Q1}, [r11] -//vst21.u16 {Q0, Q1}, [r11]! -//vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -//vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -ldrd r10, r9, [r11, #-104] -vmov.u16 Q5, #0 -//ldrd r10, r9, [r11, #-104] -vmul.u16 Q2, Q0, r10 -ldrd r8, r7, [r11, #-40] -ldrd r6, r5, [r11, #-112] -vmul.u16 Q3, Q0, r8 -vneg.s16 Q7, Q1 -vmla.s16 Q2, Q7, r8 -//ldrd r6, r5, [r11, #-112] -ldrd r4, r3, [r11, #-48] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r10, r8, [r11, #-120] -vmla.s16 Q3, Q1, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 -ldrd r5, r3, [r11, #-56] -vmla.s16 Q3, Q1, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r3 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r3 -ldrd r6, r4, [r11, #-64] -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r5 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r5 -ldrd r8, r3, [r11, #-128] -vmla.s16 Q3, Q1, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r3 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r4 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r4 
-vmla.s16 Q3, Q1, r3 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q0, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r6 -vshlc Q5, r12, #16 -vmla.s16 Q2, Q7, r6 -vmla.s16 Q3, Q1, r8 -vshlc Q2, r12, #16 -neg r7, r7 -vmla.s16 Q2, Q0, r7 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q0, r9 -vshlc Q5, r12, #16 -vmla.s16 Q3, Q1, r7 -vsub.u16 Q2, Q2, Q5 -vmla.s16 Q2, Q7, r9 -vadd.u16 Q4, Q4, Q0 -vldrh.u16 Q5, [r11,#0] -vadd.u16 Q5, Q5, Q2 -vldrh.u16 Q7, [r11,#16] -vadd.u16 Q7, Q7, Q3 -vldrh.u16 Q0, [r0, #0] -vadd.u16 Q5, Q0, Q5 -vldrh.u16 Q0, [r0, #16] -vadd.u16 Q7, Q0, Q7 -//DOWN:vstrh.u16 Q5, [r0, #0] -//DOWN:vstrh.u16 Q7, [r0, #16] -vadd.u16 Q6, Q6, Q1 -mov r12, #0 -vneg.s16 Q3, Q3 -ldrd r10, r9, [r11, #-72] -vmov.u16 Q0, #0 -vmla.s16 Q3, Q4, r9 -ldrd r8, r7, [r11, #-8] -vmla.s16 Q2, Q4, r7 -vneg.s16 Q1, Q6 -vmla.s16 Q3, Q1, r7 -ldrd r6, r5, [r11, #-80] -vmla.s16 Q2, Q6, r9 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r8 -vstrh.u16 Q5, [r0, #0] -ldrd r9, r7, [r11, #-16] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r5 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -vstrh.u16 Q7, [r0, #16] -ldrd r10, r8, [r11, #-88] -vmla.s16 Q2, Q6, r5 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r9 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r9 -ldrd r7, r5, [r11, #-24] -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r8 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -ldrd r9, r6, [r11, #-96] -vmla.s16 Q2, Q6, r8 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r10 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r7 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r7 -ldrd r8, r5, [r11, #-32] -vmla.s16 Q2, Q6, r10 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r6 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r5 -vshlc Q0, r12, #16 -vmla.s16 Q3, Q1, r5 -vmla.s16 Q2, Q6, r6 -vshlc Q3, r12, #16 -vmla.s16 Q3, Q4, r9 -vshlc Q2, r12, #16 -vmla.s16 Q2, Q4, r8 -vshlc Q0, 
r12, #16 -vmla.s16 Q2, Q6, r9 -vsub.u16 Q3, Q3, Q0 -vldrh.u16 Q0, [r11,#0] -vmla.s16 Q3, Q1, r8 -vldrh.u16 Q1, [r11,#16] -vsub.u16 Q0, Q3, Q0 -vstrh.u16 Q0, [r0, #32] -vsub.u16 Q1, Q2, Q1 -vstrh.u16 Q1, [r0, #48] - -// vld20.u16 {Q4, Q5}, [r2] -// vld21.u16 {Q4, Q5}, [r2]! -// vld20.u16 {Q6, Q7}, [r2] -// vld21.u16 {Q6, Q7}, [r2]! -// vstrh.u16 Q5, [sp, #(128 + 3*32 - 16)] -// vstrh.u16 Q7, [sp, #(128 + 3*32 - 32)] -// mov r12, #0 -// mov r11, sp -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r1, #24] -// vmul.u16 Q2, Q4, r9 -// ldrd r8, r7, [r1, #56] -// vmul.u16 Q3, Q4, r7 -// vneg.s16 Q7, Q6 -// vmla.s16 Q2, Q7, r7 -// ldrd r6, r5, [r1, #16] -// vmla.s16 Q3, Q6, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r8 -// ldrd r9, r7, [r1, #48] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r10, r8, [r1, #8] -// vmla.s16 Q3, Q6, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r9 -// ldrd r7, r5, [r1, #40] -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r9, r6, [r1, #0] -// vmla.s16 Q3, Q6, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r7 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r7 -// ldrd r8, r5, [r1, #32] -// vmla.s16 Q3, Q6, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// vmla.s16 Q3, Q6, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q6, r9 -// vstrh.u16 Q3, [r11,#(144)] -// vsub.u16 Q2, Q2, Q5 -// 
vmla.s16 Q2, Q7, r8 -// vstrh.u16 Q2, [r11,#(128)] -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 {Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vld20.u16 {Q0, Q1}, [r1] -// vld21.u16 {Q0, Q1}, [r1]! -// vadd.u16 Q0, Q0, Q1 -// vst20.u16 {Q1, Q2}, [r11] -// vst21.u16 {Q1, Q2}, [r11]! -// vst20.u16 {Q0, Q1}, [r11] -// vst21.u16 {Q0, Q1}, [r11]! -// vldrh.u16 Q0, [sp, #(128 + 3*32 - 16)] -// vldrh.u16 Q1, [sp, #(128 + 3*32 - 32)] -// vmov.u16 Q5, #0 -// ldrd r10, r9, [r11, #-104] -// vmul.u16 Q2, Q0, r10 -// ldrd r8, r7, [r11, #-40] -// vmul.u16 Q3, Q0, r8 -// vneg.s16 Q7, Q1 -// vmla.s16 Q2, Q7, r8 -// ldrd r6, r5, [r11, #-112] -// ldrd r4, r3, [r11, #-48] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r10, r8, [r11, #-120] -// vmla.s16 Q3, Q1, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// ldrd r5, r3, [r11, #-56] -// vmla.s16 Q3, Q1, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r3 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r3 -// ldrd r6, r4, [r11, #-64] -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r5 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r5 -// ldrd r8, r3, [r11, #-128] -// vmla.s16 Q3, Q1, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r3 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r4 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r4 -// vmla.s16 Q3, Q1, r3 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q0, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q0, r6 -// vshlc Q5, r12, #16 -// vmla.s16 Q2, Q7, r6 -// vmla.s16 Q3, Q1, r8 -// vshlc Q2, r12, #16 -// neg r7, r7 -// vmla.s16 Q2, Q0, r7 -// vshlc Q3, r12, #16 
-// vmla.s16 Q3, Q0, r9 -// vshlc Q5, r12, #16 -// vmla.s16 Q3, Q1, r7 -// vsub.u16 Q2, Q2, Q5 -// vmla.s16 Q2, Q7, r9 -// vadd.u16 Q4, Q4, Q0 -// vldrh.u16 Q5, [r11,#0] -// vadd.u16 Q5, Q5, Q2 -// vldrh.u16 Q7, [r11,#16] -// vadd.u16 Q7, Q7, Q3 -// vstrh.u16 Q5, [r0, #0] -// vstrh.u16 Q7, [r0, #16] -// vadd.u16 Q6, Q6, Q1 -// vneg.s16 Q3, Q3 -// vmov.u16 Q0, #0 -// mov r12, #0 -// ldrd r10, r9, [r11, #-72] -// vmla.s16 Q3, Q4, r9 -// ldrd r8, r7, [r11, #-8] -// vmla.s16 Q2, Q4, r7 -// vneg.s16 Q1, Q6 -// vmla.s16 Q3, Q1, r7 -// ldrd r6, r5, [r11, #-80] -// vmla.s16 Q2, Q6, r9 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r8 -// ldrd r9, r7, [r11, #-16] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r5 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r10, r8, [r11, #-88] -// vmla.s16 Q2, Q6, r5 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r9 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r9 -// ldrd r7, r5, [r11, #-24] -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r8 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// ldrd r9, r6, [r11, #-96] -// vmla.s16 Q2, Q6, r8 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r10 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r7 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r7 -// ldrd r8, r5, [r11, #-32] -// vmla.s16 Q2, Q6, r10 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r6 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r5 -// vshlc Q0, r12, #16 -// vmla.s16 Q3, Q1, r5 -// vmla.s16 Q2, Q6, r6 -// vshlc Q3, r12, #16 -// vmla.s16 Q3, Q4, r9 -// vshlc Q2, r12, #16 -// vmla.s16 Q2, Q4, r8 -// vshlc Q0, r12, #16 -// vmla.s16 Q2, Q6, r9 -// vsub.u16 Q3, Q3, Q0 -// vmla.s16 Q3, Q1, r8 -// vldrh.u16 Q0, [r11,#0] -// vldrh.u16 Q1, [r11,#16] -// vsub.u16 Q0, Q3, Q0 -// 
vsub.u16 Q1, Q2, Q1 -// vstrh.u16 Q0, [r0, #32] -// vstrh.u16 Q1, [r0, #48] - -add sp, sp, #224 -vpop {d0-d15} -pop {r4-r11,lr} -bx lr diff --git a/tests/schoolbook/schoolbook.mk b/tests/schoolbook/schoolbook.mk index 997ad85..37f95da 100644 --- a/tests/schoolbook/schoolbook.mk +++ b/tests/schoolbook/schoolbook.mk @@ -11,7 +11,8 @@ SCHOOLBOOK_PLATFORMS += m85-an555 SCHOOLBOOK_SOURCES += main.c # Assembly sources required for this test -SCHOOLBOOK_ASMS += poly_u16_32_acc.s -SCHOOLBOOK_ASMS += auto/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s -SCHOOLBOOK_ASMS += auto/ -SCHOOLBOOK_ASMS += auto/poly_u16_mul_32_anticyclic_mve_simd.s \ No newline at end of file +SCHOOLBOOK_ASM_DIR = ../../asm/auto/poly/simd +SCHOOLBOOK_ASMS += ../../asm/manual/schoolbook/poly_u16_32_acc.s +SCHOOLBOOK_ASMS += $(SCHOOLBOOK_ASM_DIR)/poly_u16_mul_32_anticyclic_karatsuba_fwd_mve_simd.s +SCHOOLBOOK_ASMS += $(SCHOOLBOOK_ASM_DIR)/poly_u16_mul_32_mve_simd.s +SCHOOLBOOK_ASMS += $(SCHOOLBOOK_ASM_DIR)/poly_u16_mul_32_anticyclic_mve_simd.s \ No newline at end of file From 587f49c30ad353ee8b1c3b28d8b1770c6c687602 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Thu, 18 Jul 2024 15:06:03 +0800 Subject: [PATCH 24/32] remove more duplicate assembly files --- .../manual/fx_fft}/base_symbolic.s | 0 asm/manual/sqmag/cmplx_mag_sqr_fx.s | 2 +- .../sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s | 2 +- .../sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s | 2 +- .../sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s | 2 +- .../sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s | 2 +- .../sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s | 2 +- .../sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s | 2 +- tests/crt/crt.mk | 2 +- tests/crt/crt.s | 2840 ------------ tests/ct/ct.mk | 2 +- tests/ct/ct.s | 206 - tests/fx-fft/base_concrete.s | 73 - tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s | 179 - tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s | 177 - tests/fx-fft/fx-fft.mk | 13 +- tests/fx-fft/ref_handwritten_asm.s | 75 - tests/fx-fft/ref_intrinsics.s | 72 - tests/intmulntt/crt.s | 2840 ------------ tests/intmulntt/intmulntt.mk | 35 +- ...2_u32_106117153_62524596_incomplete_good.s | 1390 ------ ...06117153_62524596_incomplete_good_bitrev.s | 1285 ------ ...92_u32_108643009_1793055_incomplete_good.s | 1390 ------ ...108643009_1793055_incomplete_good_bitrev.s | 1285 ------ ...9_1793055_incomplete_good_oop_half_input.s | 1237 ----- ...92_u32_33556993_27792935_incomplete_good.s | 1390 ------ ...33556993_27792935_incomplete_good_bitrev.s | 1285 ------ ...92_u32_45387457_16877098_incomplete_good.s | 1390 ------ ...45387457_16877098_incomplete_good_bitrev.s | 1285 ------ ...192_u32_88299073_9670361_incomplete_good.s | 1390 ------ ..._88299073_9670361_incomplete_good_bitrev.s | 1285 ------ ...3_9670361_incomplete_good_oop_half_input.s | 1237 ----- ...84_u32_106117153_1392340_incomplete_good.s | 3383 -------------- ...106117153_1392340_incomplete_good_bitrev.s | 3182 ------------- ...384_u32_108643009_640922_incomplete_good.s | 3383 -------------- ..._108643009_640922_incomplete_good_bitrev.s | 3182 ------------- ...u32_108643009_640922_incomplete_good_oop.s | 3388 -------------- 
...09_640922_incomplete_good_oop_half_input.s | 3075 ------------- ...84_u32_33556993_15047299_incomplete_good.s | 3383 -------------- ...33556993_15047299_incomplete_good_bitrev.s | 3182 ------------- ..._384_u32_45387457_923104_incomplete_good.s | 3383 -------------- ...2_45387457_923104_incomplete_good_bitrev.s | 3182 ------------- ...384_u32_88299073_4883425_incomplete_good.s | 3383 -------------- ..._88299073_4883425_incomplete_good_bitrev.s | 3182 ------------- ...u32_88299073_4883425_incomplete_good_oop.s | 3388 -------------- ...3_4883425_incomplete_good_oop_half_input.s | 3075 ------------- ..._u16_toom4_fwd_256_dual_packed_limbs_oop.s | 198 - ..._u16_toom4_inv_dual_packed_limbs_oop_256.s | 380 -- tests/poly/poly.mk | 7 +- tests/sqmag/cmplx_mag_sqr_fx.s | 38 - .../sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s | 69 - .../sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s | 94 - .../sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s | 140 - .../sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s | 70 - .../sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s | 93 - .../sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s | 141 - tests/sqmag/sqmag.mk | 15 +- tests/toom/auto/poly_u16_mul_192_toom3_mve.s | 708 --- tests/toom/auto/poly_u16_mul_256_toom4_mve.s | 1287 ------ tests/toom/auto/poly_u16_mul_512_toom4_mve.s | 2501 ---------- tests/toom/auto/poly_u16_mul_64_toom4_mve.s | 379 -- tests/toom/auto/poly_u16_mul_768_toom3_mve.s | 2662 ----------- tests/toom/auto/poly_u16_mul_768_toom4_mve.s | 3759 --------------- tests/toom/auto/poly_u16_mul_832_toom4_mve.s | 4065 ----------------- tests/toom/auto/poly_u16_toom3_fwd_192.s | 100 - tests/toom/auto/poly_u16_toom3_fwd_768.s | 366 -- tests/toom/auto/poly_u16_toom3_inv_full_192.s | 390 -- tests/toom/auto/poly_u16_toom3_inv_full_768.s | 1522 ------ tests/toom/auto/poly_u16_toom3_inv_half_192.s | 165 - tests/toom/auto/poly_u16_toom3_inv_half_768.s | 623 --- tests/toom/auto/poly_u16_toom4_fwd_256.s | 182 - .../auto/poly_u16_toom4_fwd_256_dual_bottom.s | 198 - 
...d_256_dual_packed_limbs_karatsuba_x1_oop.s | 198 - ...d_256_dual_packed_limbs_karatsuba_x2_oop.s | 199 - ..._u16_toom4_fwd_256_dual_packed_limbs_oop.s | 198 - .../poly_u16_toom4_fwd_256_dual_packed_oop.s | 198 - .../auto/poly_u16_toom4_fwd_256_dual_top.s | 198 - .../poly_u16_toom4_fwd_256_dual_top_oop.s | 198 - tests/toom/auto/poly_u16_toom4_fwd_512.s | 351 -- tests/toom/auto/poly_u16_toom4_fwd_768.s | 520 --- tests/toom/auto/poly_u16_toom4_fwd_832.s | 562 --- .../poly_u16_toom4_fwd_karatsuba_x1_oop_256.s | 199 - .../poly_u16_toom4_fwd_karatsuba_x2_oop_256.s | 200 - tests/toom/auto/poly_u16_toom4_fwd_oop_256.s | 199 - .../auto/poly_u16_toom4_inv_dual_bottom_256.s | 381 -- .../poly_u16_toom4_inv_dual_bottom_oop_256.s | 380 -- ..._u16_toom4_inv_dual_packed_limbs_oop_256.s | 380 -- .../auto/poly_u16_toom4_inv_dual_top_256.s | 381 -- .../poly_u16_toom4_inv_dual_top_oop_256.s | 380 -- tests/toom/auto/poly_u16_toom4_inv_full_256.s | 765 ---- tests/toom/auto/poly_u16_toom4_inv_full_512.s | 1511 ------ tests/toom/auto/poly_u16_toom4_inv_full_768.s | 2303 ---------- tests/toom/auto/poly_u16_toom4_inv_full_832.s | 2493 ---------- tests/toom/auto/poly_u16_toom4_inv_half_256.s | 340 -- tests/toom/auto/poly_u16_toom4_inv_half_512.s | 661 --- tests/toom/auto/poly_u16_toom4_inv_half_768.s | 982 ---- tests/toom/auto/poly_u16_toom4_inv_half_832.s | 1062 ----- tests/toom/toom.mk | 61 +- 98 files changed, 55 insertions(+), 103525 deletions(-) rename {tests/fx-fft => asm/manual/fx_fft}/base_symbolic.s (100%) delete mode 100644 tests/crt/crt.s delete mode 100644 tests/ct/ct.s delete mode 100644 tests/fx-fft/base_concrete.s delete mode 100644 tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s delete mode 100644 tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s delete mode 100644 tests/fx-fft/ref_handwritten_asm.s delete mode 100644 tests/fx-fft/ref_intrinsics.s delete mode 100644 tests/intmulntt/crt.s delete mode 100644 tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good.s delete 
mode 100644 tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s delete mode 100644 tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s delete mode 100644 tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop.s delete mode 100644 tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s delete mode 100644 tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good.s delete mode 100644 tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good.s delete mode 100644 
tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s delete mode 100644 tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop.s delete mode 100644 tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s delete mode 100644 tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s delete mode 100644 tests/poly/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s delete mode 100644 tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s delete mode 100644 tests/toom/auto/poly_u16_mul_192_toom3_mve.s delete mode 100644 tests/toom/auto/poly_u16_mul_256_toom4_mve.s delete mode 100644 tests/toom/auto/poly_u16_mul_512_toom4_mve.s delete mode 100644 tests/toom/auto/poly_u16_mul_64_toom4_mve.s delete mode 100644 tests/toom/auto/poly_u16_mul_768_toom3_mve.s delete mode 100644 tests/toom/auto/poly_u16_mul_768_toom4_mve.s delete mode 100644 tests/toom/auto/poly_u16_mul_832_toom4_mve.s delete mode 100644 tests/toom/auto/poly_u16_toom3_fwd_192.s delete mode 100644 tests/toom/auto/poly_u16_toom3_fwd_768.s delete mode 100644 tests/toom/auto/poly_u16_toom3_inv_full_192.s delete mode 100644 tests/toom/auto/poly_u16_toom3_inv_full_768.s delete mode 100644 tests/toom/auto/poly_u16_toom3_inv_half_192.s delete mode 100644 tests/toom/auto/poly_u16_toom3_inv_half_768.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256_dual_bottom.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s delete mode 100644 
tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256_dual_top.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_256_dual_top_oop.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_512.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_768.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_832.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_fwd_oop_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_dual_bottom_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_dual_top_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_dual_top_oop_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_full_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_full_512.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_full_768.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_full_832.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_half_256.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_half_512.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_half_768.s delete mode 100644 tests/toom/auto/poly_u16_toom4_inv_half_832.s diff --git a/tests/fx-fft/base_symbolic.s b/asm/manual/fx_fft/base_symbolic.s similarity index 100% rename from tests/fx-fft/base_symbolic.s rename to asm/manual/fx_fft/base_symbolic.s diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx.s b/asm/manual/sqmag/cmplx_mag_sqr_fx.s index 181e65f..f5e1f8b 120000 --- 
a/asm/manual/sqmag/cmplx_mag_sqr_fx.s +++ b/asm/manual/sqmag/cmplx_mag_sqr_fx.s @@ -1 +1 @@ -../../../helight/examples/naive/cmplx_mag_sqr/cmplx_mag_sqr_fx.s \ No newline at end of file +../../../slothy/examples/naive/cmplx_mag_sqr/cmplx_mag_sqr_fx.s \ No newline at end of file diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s index c29a83e..45dedf0 120000 --- a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s +++ b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s @@ -1 +1 @@ -../../../helight/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M55_unroll1.s \ No newline at end of file +../../../slothy/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M55_unroll1.s \ No newline at end of file diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s index 003c960..2f694f2 120000 --- a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s +++ b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s @@ -1 +1 @@ -../../../helight/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M55_unroll2.s \ No newline at end of file +../../../slothy/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M55_unroll2.s \ No newline at end of file diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s index 08fa59d..d7fffc2 120000 --- a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s +++ b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s @@ -1 +1 @@ -../../../helight/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M55_unroll4.s \ No newline at end of file +../../../slothy/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M55_unroll4.s \ No newline at end of file diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s index d2d07d9..722c154 120000 --- a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s +++ 
b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s @@ -1 +1 @@ -../../../helight/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M85_unroll1.s \ No newline at end of file +../../../slothy/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M85_unroll1.s \ No newline at end of file diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s index e833eca..9e03374 120000 --- a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s +++ b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s @@ -1 +1 @@ -../../../helight/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M85_unroll2.s \ No newline at end of file +../../../slothy/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M85_unroll2.s \ No newline at end of file diff --git a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s index 961007d..745551f 120000 --- a/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s +++ b/asm/manual/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s @@ -1 +1 @@ -../../../helight/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M85_unroll4.s \ No newline at end of file +../../../slothy/examples/opt/cmplx_mag_sqr/cmplx_mag_sqr_fx_opt_M85_unroll4.s \ No newline at end of file diff --git a/tests/crt/crt.mk b/tests/crt/crt.mk index 91d35e1..2ca37cc 100644 --- a/tests/crt/crt.mk +++ b/tests/crt/crt.mk @@ -11,5 +11,5 @@ CRT_PLATFORMS += m85-an555 CRT_SOURCES += main.c # Assembly sources required for this test -CRT_ASMS += crt.s +CRT_ASMS += ../../asm/manual/crt/crt.s diff --git a/tests/crt/crt.s b/tests/crt/crt.s deleted file mode 100644 index 642c5d6..0000000 --- a/tests/crt/crt.s +++ /dev/null @@ -1,2840 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including 
without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#include "crt_const.h" - - .syntax unified - -.type crt_s32_dechunk_chunk_add_optim, %function -.global crt_s32_dechunk_chunk_add_optim - .data - .align 4 -crt_s32_dechunk_chunk_add_optim_data: - .word (1<<22) - 1 - .word (1<<(31-22)) - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED - .text - .align 4 -crt_s32_dechunk_chunk_add_optim_data_ptr: - .word crt_s32_dechunk_chunk_add_optim_data -crt_s32_dechunk_chunk_add_optim: - - loop_cnt .req r14 - init_tmp .req r10 // Temporary prior to main loop - init_tmp2 .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA0 .req r4 - curB0 .req r5 - mask .req r6 - rcarry .req r7 - curA1 .req r9 - curB1 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - 
mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - const_rshift22 .req r8 - cur0 .req q0 - cur1 .req q1 - masked0 .req q2 - masked1 .req q4 - - push {r4-r11,lr} - vpush {d8-d15} - - ldr addr, crt_s32_dechunk_chunk_add_optim_data_ptr - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - ldrd init_tmp, init_tmp2, [addr], #+8 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {dst, size, init_tmp, init_tmp2} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+(9+4)*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vadd.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 in0, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vmla.s32 in0p, in0, mod_p_neg - vadd.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 in0p, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 in0, in0p, mod_p_neg - vldrw.u32 in0p, [src0], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.i32 
quot_low, quot_low, tmpp - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmpp, [src0p], #+16 - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 in0, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 in0p, in0, mod_p_neg - vldrw.u32 in0, [src0], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.i32 quot_low, quot_low, tmpp - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmpp, [src0p], #+16 - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 in0p, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vmla.s32 in0, in0p, mod_p_neg - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - 
vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - /* Restore mask and original destination pointer */ - pop {dst, size, mask, const_rshift22} - mov rcarry, #0 - mov loop_cnt, size, LSR #3 - sub loop_cnt, loop_cnt, #1 - - ldrd curA0, curB0, [dst] - add rcarry, curA0, rcarry, ASR #22 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - add rcarry, curA0, rcarry, ASR #22 - strd curA1, curB1, [dst], #+8 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #+8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - le loop_cnt, 1b -2: - strd curA1, curB1, [dst], #+8 - - mov loop_cnt, size, LSR #4 - sub loop_cnt, loop_cnt, #1 - - vldrw.u32 cur0, [dst] - vand.u32 masked0, cur0, qmask - vshlc cur0, rcarry, #32 - vqdmlah.s32 masked0, cur0, const_rshift22 - vldrw.u32 cur1, [dst, #+16] - vand.u32 masked1, cur1, qmask - vstrw.u32 masked0, [dst], #+16 - vshlc cur1, rcarry, #32 - vqdmlah.s32 masked1, cur1, const_rshift22 - - wls loop_cnt, loop_cnt, 2 - .align 2 - 1: - vldrw.u32 cur0, [dst, #+16] - vand.u32 masked0, cur0, qmask - vstrw.u32 masked1, [dst], #+16 - vshlc cur0, rcarry, #32 - vqdmlah.s32 masked0, cur0, const_rshift22 - vldrw.u32 cur1, [dst, #+16] - vand.u32 masked1, cur1, qmask - vstrw.u32 masked0, [dst], #+16 - vshlc cur1, rcarry, #32 - vqdmlah.s32 masked1, cur1, const_rshift22 - le loop_cnt, 1b - 2: - vstrw.u32 masked1, 
[dst], #+16 - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA0 - .unreq curB0 - .unreq curA1 - .unreq curB1 - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq const_rshift22 - .unreq in0 - .unreq in0p - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - -.type crt_s32_dechunk_chunk_optim, %function - .global crt_s32_dechunk_chunk_optim - .data - .align 4 -crt_s32_dechunk_chunk_optim_data: - .word (1<<(31-(CRT_32_P_REFINED_BARRETT_SHIFT+1))) - .word (1<<22) - 1 - .word (1<<(31-22)) - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(9)) - .word -CRT_32_P_TWISTED - .text - .align 4 -crt_s32_dechunk_chunk_optim_data_ptr: - .word crt_s32_dechunk_chunk_optim_data -crt_s32_dechunk_chunk_optim: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_tw .req r4 - mod_q_neg .req r5 - const_prshift .req r6 - const_shift9 .req r7 - const_rshift22 .req r10 - p_inv_mod_q .req r9 - p_inv_mod_q_tw .req r8 - rcarry .req r11 - rcarry_red .req r12 - - in0p .req q7 // q0 - in0 .req q0 // q6 - in1 .req q5 // q2 - diff .req in1 - quot_low .req q2 // q5 - qmask .req q1 // q3 - tmpp .req q4 // q7 - tmp .req q6 // q1 - red_tmp .req q3 // q4 - - push {r4-r11,lr} - sub.w sp, sp, #(4*16) - - vstrw.32 q7, [sp, #(0*16)] - mov loop_cnt, size, LSR #3 - vstrw.32 q6, [sp, #(1*16)] - sub loop_cnt, loop_cnt, #1 - vstrw.32 q5, [sp, #(2*16)] - - ldr addr, crt_s32_dechunk_chunk_optim_data_ptr - ldr const_prshift, [addr], #+4 - ldrd init_tmp, const_rshift22, [addr], #+8 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, 
[addr], #+8 - - vldrw.u32 in0p, [src0], #+16 - vqdmulh.s32 diff, in0p, mod_p_tw - ldrd const_shift9, mod_p_tw, [addr], #+8 - .unreq addr - - vqrdmulh.s32 tmp, diff, const_prshift - vdup.u32 qmask, init_tmp - vmla.s32 in0p, tmp, mod_p - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vstrw.32 q4, [sp, #(3*16)] - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 in0, [src0], #+16 - vmla.s32 diff, tmp, mod_q_neg - movs.w rcarry, #0 - vmul.u32 quot_low, diff, mod_p - movs.w rcarry_red, #0 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vqdmulh.s32 diff, in0, mod_p_tw - vorr.u32 tmpp, tmpp, tmp - vqrdmulh.s32 tmp, diff, const_prshift - vshlc tmpp, rcarry, #32 - vmla.s32 in0, tmp, mod_p - vadd.i32 in0p, in0p, tmpp - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.u32 tmpp, quot_low, in0p - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 in0p, [src0], #+16 - vmla.s32 diff, tmp, mod_q_neg - vand.u32 red_tmp, tmpp, qmask - vmul.u32 quot_low, diff, mod_p - vshlc tmpp, rcarry_red, #32 - vqdmlah.s32 red_tmp, tmpp, const_rshift22 - vstrw.32 red_tmp, [dst], #+16 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - vqdmulh.s32 diff, in0p, mod_p_tw - vorr.u32 tmpp, tmpp, tmp - vqrdmulh.s32 tmp, diff, const_prshift - vshlc tmpp, rcarry, #32 - vmla.s32 in0p, tmp, mod_p - vadd.s32 in0, in0, tmpp - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.s32 tmpp, quot_low, in0 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 in0, [src0], #+16 - vmla.s32 diff, 
tmp, mod_q_neg - vand.u32 red_tmp, tmpp, qmask - vmul.u32 quot_low, diff, mod_p - vshlc tmpp, rcarry_red, #32 - vqdmlah.s32 red_tmp, tmpp, const_rshift22 - vstrw.32 red_tmp, [dst], #+16 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - le loop_cnt, 1b -2: - - vqdmulh.s32 diff, in0, mod_p_tw - vorr.u32 tmpp, tmpp, tmp - vqrdmulh.s32 tmp, diff, const_prshift - vshlc tmpp, rcarry, #32 - vmla.s32 in0, tmp, mod_p - vadd.i32 in0p, in0p, tmpp - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.u32 tmpp, quot_low, in0p - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vand.u32 red_tmp, tmpp, qmask - vmul.u32 quot_low, diff, mod_p - vshlc tmpp, rcarry_red, #32 - vqdmlah.s32 red_tmp, tmpp, const_rshift22 - vstrw.32 red_tmp, [dst], #+16 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - vldrw.u32 q7, [sp, #(0*16)] - vorr.u32 tmpp, tmpp, tmp - vldrw.u32 q6, [sp, #(1*16)] - vshlc tmpp, rcarry, #32 - vldrw.u32 q5, [sp, #(2*16)] - vadd.s32 in0, tmpp, in0 - vldrw.u32 q4, [sp, #(3*16)] - vadd.s32 quot_low, quot_low, in0 - ldrd r4, r5, [sp, #(4*16)] - vand.u32 red_tmp, quot_low, qmask - ldrd r6, r7, [sp, #(4*16 + 1*8)] - vshlc quot_low, rcarry_red, #32 - ldrd r8, r9, [sp, #(4*16 + 2*8)] - vqdmlah.s32 red_tmp, quot_low, const_rshift22 - ldrd r10, r11, [sp, #(4*16 + 3*8)] - vstrw.32 red_tmp, [dst], #+16 - adds.w sp, sp, #(4*16+4*8) - pop {pc} - - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const_prshift - .unreq const_shift9 - .unreq const_rshift22 - .unreq in0 - .unreq in0p - .unreq in1 - .unreq tmp - .unreq tmpp - .unreq diff - .unreq 
quot_low - .unreq qmask - -.type crt_s32_dechunk_chunk_add, %function -.global crt_s32_dechunk_chunk_add - .align 4 - .data -crt_s32_dechunk_chunk_add_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED - - .text - .align 4 -crt_s32_dechunk_chunk_add_data_ptr: - .word crt_s32_dechunk_chunk_add_data -crt_s32_dechunk_chunk_add: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA0 .req r3 - curB0 .req r4 - mask .req r5 - rcarry .req r7 - curA1 .req r9 - curB1 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - ldr addr, crt_s32_dechunk_chunk_add_data_ptr - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vadd.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - 
vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0p, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 
tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - /* Restore mask and original destination pointer */ - pop {dst, size, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #2 - sub loop_cnt, loop_cnt, #1 - - ldrd curA0, curB0, [dst] - add rcarry, curA0, rcarry, ASR #22 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - add rcarry, curA0, rcarry, ASR #22 - strd curA1, curB1, [dst], #+8 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #+8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - le loop_cnt, 1b -2: - strd curA1, curB1, [dst], #+8 - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA0 - .unreq curB0 - .unreq curA1 - .unreq curB1 - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in0p - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - -.type crt_s32_dechunk_chunk, %function -.global crt_s32_dechunk_chunk - .data - .align 4 -crt_s32_dechunk_chunk_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED - .text - .align 4 -crt_s32_dechunk_chunk_data_ptr: - .word crt_s32_dechunk_chunk_data -crt_s32_dechunk_chunk: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 
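The scalar `ldrd`/`add`/`and`/`strd` epilogues above bring the non-canonical 32-bit limbs into canonical 22-bit form by folding each limb's overflow into the next one: the recurrence is `acc = limb + (acc >> 22)`, with the low 22 bits kept per limb. A sketch of that carry chain (the 22-bit limb size comes from the `(1<<22) - 1` masks in the constant tables):

```python
def canonicalize_limbs(limbs, bits=22):
    # Model of the scalar carry loop: add the arithmetically shifted carry
    # into each limb, keep the low `bits` bits, and pass the rest on.
    mask = (1 << bits) - 1
    acc = 0
    out = []
    for v in limbs:
        acc = v + (acc >> bits)   # add rcarry, cur, rcarry, ASR #22
        out.append(acc & mask)    # and cur, rcarry, mask
    return out, acc >> bits       # canonical limbs, final carry-out

out, carry = canonicalize_limbs([(1 << 22) + 5, 7])
```

Because the shift is arithmetic, negative limbs borrow correctly from the next limb, which is why the asm uses `ASR #22` rather than a logical shift.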
- - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA0 .req r3 - curB0 .req r4 - mask .req r5 - rcarry .req r7 - curA1 .req r9 - curB1 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q5 - tmp .req q6 - - push {r4-r11,lr} - vpush {d8-d15} - - ldr addr, crt_s32_dechunk_chunk_data_ptr - mov loop_cnt, size, LSR #2 - - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0], #+16 - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vadd.s32 in0, tmpp, in0 - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst], #+16 - le loop_cnt, 1b -2: - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - - /* Restore mask and original destination pointer */ - pop {dst, size, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #2 - sub loop_cnt, loop_cnt, #1 - - ldrd curA0, curB0, [dst] - add rcarry, curA0, rcarry, ASR #22 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - add rcarry, curA0, rcarry, ASR #22 - strd curA1, curB1, [dst], #+8 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #+8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - le loop_cnt, 1b -2: - strd curA1, curB1, [dst], #+8 - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA0 - .unreq curB0 - .unreq curA1 - .unreq curB1 - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq tmp - .unreq tmpp - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - - -.type crt_s32_pure_reduce, %function -.global crt_s32_pure_reduce - .align 4 -crt_s32_pure_reduce_data: - .word CRT_32_P - .word CRT_32_P_TWISTED - .word CRT_32_Q - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED -crt_s32_pure_reduce: - - loop_cnt .req r14 - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_q .req r4 - mod_q_neg .req r5 - p_inv_mod_q .req r6 - p_inv_mod_q_tw .req r7 - mask .req r8 - 
-        const1 .req r9
-        const0 .req r10
-        mod_p_tw .req r11
-
-        in0 .req q0
-        in1 .req q1
-        diff .req in1
-        quot_low .req q3
-        quot_high .req q4
-        quot .req q5
-        mod_p_vect .req q6
-        tmp .req q7
-
-        push {r4-r11,lr}
-        vpush {d8-d15}
-
-        mov loop_cnt, size, LSR #2
-
-        addr .req r12
-        adr addr, crt_s32_pure_reduce_data
-        ldrd mod_p, mod_p_tw, [addr], #+8
-        ldrd mod_q, mod_q_neg, [addr], #+8
-        ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8
-        vdup.u32 mod_p_vect, mod_p
-        neg mod_p_neg, mod_p
-        .unreq addr
-
-        mov const1, #1
-        mov const0, #0
-
-        wls loop_cnt, loop_cnt, 2
-1:
-        // PRELIMINARY ASSUMPTION:
-        // x and y are already scaled and reduced
-
-        vldrw.s32 in0, [src0], #+16
-        vqdmulh.s32 tmp, in0, mod_p_tw
-        vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1)
-        vmla.s32 in0, tmp, mod_p_neg
-        vldrw.s32 in1, [src1], #+16
-
-        /* CRT interpolation of (x mod p) and (y mod q)
-         *
-         * x + ((y-x)*(p mod q)^{-1} mod q)*p
-         */
-        vsub.s32 diff, in1, in0
-
-        /* Signed refined Barrett multiplication */
-        vqdmulh.s32 quot, diff, p_inv_mod_q_tw
-        vrshr.s32 quot, quot, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1)
-        vmul.u32 diff, diff, p_inv_mod_q
-        vmla.s32 diff, quot, mod_q_neg
-
-        /* Compute high and low products separately */
-        vmul.i32 quot_low, diff, mod_p_vect
-        vmulh.s32 quot_high, diff, mod_p_vect
-
-        /* Need to do a 64-bit addition to quot_high and quot_low */
-        /* Add as u32, and manually add the carry to the upper lanes */
-        vadd.u32 quot_low, quot_low, in0
-        vpt.u32 HI, in0, quot_low
-        vaddt.i32 quot_high, quot_high, const1
-        /* Need to add the sign bit of in0 */
-        vqdmlah.s32 quot_high, in0, const1
-
-        vst20.32 {quot_low, quot_high}, [dst]
-        vst21.32 {quot_low, quot_high}, [dst]!
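The loop comment above states the interpolation identity these routines compute: given residues `x mod p` and `y mod q`, the combined value is `x + ((y-x) * p^{-1} mod q) * p`. A sketch with small stand-in moduli (the real `CRT_32_P`/`CRT_32_Q` and the precomputed inverse live in the data tables):

```python
def crt_interpolate(x, y, p, q):
    # x is a residue mod p, y a residue mod q; p and q must be coprime.
    p_inv_mod_q = pow(p, -1, q)          # precomputed as CRT_32_P_INV_MOD_Q in asm
    t = ((y - x) * p_inv_mod_q) % q      # the vsub + Barrett-reduced product
    return x + t * p                     # asm fuses this via high/low products

p, q = 101, 103                          # small stand-ins, not the real moduli
v = 5000
r = crt_interpolate(v % p, v % q, p, q)
```

The result lies in `[0, p*q)` and matches both input residues, which is exactly the uniqueness guarantee the CRT provides; the asm's extra work is keeping the `diff * p` product exact across 64 bits.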
- - le loop_cnt, 1b - 2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - - .unreq mod_p - .unreq mod_p_neg - .unreq mod_q - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq mask - .unreq const1 - .unreq mod_p_tw - - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq quot_high - .unreq quot - .unreq mod_p_vect - .unreq tmp - -.type crt_s32_chunk_dechunk, %function -.global crt_s32_chunk_dechunk - .align 4 -crt_s32_chunk_dechunk_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) -crt_s32_chunk_dechunk: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldr const_shift10, [addr], #+4 - vdup.u32 qmask, init_tmp - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0, q_off] - vldrw.u32 in1, [src1, q_off] - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n src0, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, 
#(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - adds.n dst, #4 - - le loop_cnt, 1b -2: - - /* Use dummy loop for the sake of tail predication */ - mov loop_cnt, #3 - dlstp.32 loop_cnt, loop_cnt -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vshr.s32 carry, in0, #22 - vand in0, in0, qmask - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - vldrw.u32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - letp loop_cnt, 1b - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - -.type crt_s32_chunk_dechunk_reduce, %function -.global crt_s32_chunk_dechunk_reduce - .align 4 -crt_s32_chunk_dechunk_reduce_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - 
const_shift10 .req r8 - const1 .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0, q_off] - vqdmulh.s32 tmp, in0, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 in1, [src1, q_off] - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n src0, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - adds.n dst, #4 - - le loop_cnt, 1b -2: - - /* Use dummy loop for the sake of tail predication */ - mov loop_cnt, #3 - dlstp.32 loop_cnt, loop_cnt -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vshr.s32 carry, in0, #22 - vand in0, in0, qmask - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - vldrw.u32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - letp loop_cnt, 1b - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq 
loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - -.type crt_s32_chunk_dechunk_reduce_v2, %function -.global crt_s32_chunk_dechunk_reduce_v2 - .align 4 -crt_s32_chunk_dechunk_reduce_v2_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce_v2: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_v2_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - sub loop_cnt, loop_cnt, #1 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - subs.n dst, #4 - - vldrw.u32 in0, [src0, q_off] - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vqdmulh.s32 tmp, in0, 
mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vldrw.u32 in0, [src0, q_off] - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.u32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - - /* Use dummy loop for the sake of tail predication */ - adds.n dst, #4 - movs.n const1, #3 - dlstp.32 loop_cnt, const1 -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vshr.s32 carry, in0, #22 - vand in0, in0, qmask - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - vldrw.u32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - letp loop_cnt, 1b - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - 
.unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - -.type crt_s32_chunk_dechunk_reduce_canonical, %function -.global crt_s32_chunk_dechunk_reduce_canonical - .align 4 -crt_s32_chunk_dechunk_reduce_canonical_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce_canonical: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - sub loop_cnt, loop_cnt, #1 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - /* Save original destination pointer and mask for later */ - push {dst, init_tmp} - .unreq addr - - veor carry, 
carry, carry - - movs.n const1, #1 - subs.n dst, #4 - - vldrw.u32 in0, [src0, q_off] - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vldrw.u32 in0, [src0, q_off] - vstrw.32 quot_low, [dst, q_off] - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst, q_off] - - /* Use dummy loop for the sake of tail predication */ - adds.n dst, #4 - movs.n const1, #3 - dlstp.32 loop_cnt, const1 -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - letp loop_cnt, 1b - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - - /* Restore mask and original destination pointer */ - pop {dst, mask} - mov rcarry, #0 - mov loop_cnt, #(CRT_32_SIZE/2) - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - .unreq tmp - -.type crt_s32_chunk_dechunk_reduce_canonical_v2, %function -.global crt_s32_chunk_dechunk_reduce_canonical_v2 - .align 4 -crt_s32_chunk_dechunk_reduce_canonical_v2_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce_canonical_v2: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q5 - tmp .req q6 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_v2_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask 
for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0], #+16 - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vadd.s32 in0, tmpp, in0 - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst], #+16 - le loop_cnt, 1b -2: - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq tmp - .unreq tmpp - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - -.type crt_s32_chunk_dechunk_sub_reduce_canonical_v2, %function -.global crt_s32_chunk_dechunk_sub_reduce_canonical_v2 - .align 4 -crt_s32_chunk_dechunk_sub_reduce_canonical_v2_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_sub_reduce_canonical_v2: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - subs loop_cnt, #1 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_v2_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, 
original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0, [src0], #+16 - vldrw.u32 in0p, [src0p], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vsub.s32 in0, in0, in0p - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vldrw.u32 in0p, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vldrw.u32 in0, [src0], #+16 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - le loop_cnt, 1b -2: - - vsub.s32 in0, in0, in0p - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vand.u32 quot_low, quot_low, qmask - 
vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - -.type crt_s32_chunk_dechunk_sub_reduce_canonical_v3, %function -.global crt_s32_chunk_dechunk_sub_reduce_canonical_v3 - .align 4 -crt_s32_chunk_dechunk_sub_reduce_canonical_v3_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_sub_reduce_canonical_v3: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} 
- - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_v2_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vsub.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0p, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - 
vmul.u32 quot_low, diff, mod_p_vect - vsub.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vsub.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - - - - -.type crt_s32_chunk_dechunk_add_reduce_canonical, %function -.global crt_s32_chunk_dechunk_add_reduce_canonical - .align 4 -crt_s32_chunk_dechunk_add_reduce_canonical_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_add_reduce_canonical: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - adr addr, crt_s32_chunk_dechunk_add_reduce_canonical_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original 
destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vadd.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0p, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, 
qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - - - - - - -.type crt_s32_chunk_dechunk_sub_reduce_canonical, %function -.global crt_s32_chunk_dechunk_sub_reduce_canonical - .align 4 -crt_s32_chunk_dechunk_sub_reduce_canonical_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_sub_reduce_canonical: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r12,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, 
crt_s32_chunk_dechunk_reduce_canonical_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - sub loop_cnt, loop_cnt, #1 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - - /* Save original destination pointer and mask for later */ - push {dst, init_tmp} - .unreq addr - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - veor carry, carry, carry - - movs.n const1, #1 - subs.n dst, #4 - - vldrw.u32 in0, [src0, q_off] - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 tmp, [src0p, q_off] - vsub.s32 in0, in0, tmp - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p, q_off] - vsub.s32 in1, in1, tmp - adds.n src0, #4 - adds src0p, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - adds src1p, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vldrw.u32 in0, [src0, q_off] - vstrw.32 quot_low, [dst, q_off] - le loop_cnt, 1b -2: - vldrw.u32 tmp, [src0p, q_off] - vsub.s32 in0, in0, tmp - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p, q_off] - vsub.s32 in1, in1, tmp - adds.n src0, #4 - adds src0p, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 
diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - adds src1p, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst, q_off] - - /* Use dummy loop for the sake of tail predication */ - adds.n dst, #4 - movs.n const1, #3 - dlstp.32 loop_cnt, const1 -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - letp loop_cnt, 1b - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - - /* Restore mask and original destination pointer */ - pop {dst, mask} - mov rcarry, #0 - mov loop_cnt, #(CRT_32_SIZE/2) - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r12,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - - -.type crt_s32_pure, %function -.global crt_s32_pure - .align 4 -crt_s32_pure_data: - .word CRT_32_P - .word CRT_32_Q - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED -crt_s32_pure: - - loop_cnt .req r14 - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_q .req r4 - mod_q_neg .req r5 - p_inv_mod_q 
.req r6 - p_inv_mod_q_tw .req r7 - mask .req r8 - const1 .req r9 - const0 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q3 - quot_high .req q4 - quot .req q5 - mod_p_vect .req q6 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - addr .req r11 - adr addr, crt_s32_pure_data - ldrd mod_p, mod_q, [addr], #+8 - ldr mod_q_neg, [addr], #+4 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - .unreq addr - - mov const1, #1 - mov const0, #0 - - wls loop_cnt, loop_cnt, 2 -1: - // PRELIMINARY ASSUMPTION: - // x and y are already scaled and reduced - - vldrw.u32 in0, [src0], #+16 - vldrw.u32 in1, [src1], #+16 - - /* CRT interpolation of (x mod p) and (y mod q) - * - * x + ((y-x)*(p mod q)^{-1} mod q)*p - */ - vsub.s32 diff, in1, in0 - - /* Unsigned (!) refined Barrett multiplication */ - vqdmulh.s32 quot, diff, p_inv_mod_q_tw - vrshr.s32 quot, quot, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmul.u32 diff, diff, p_inv_mod_q - vmla.s32 diff, quot, mod_q_neg - - /* Compute high and low products separately */ - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 quot_high, diff, mod_p_vect - - /* Need to do a 64-bit addition to quot_high and quot_low */ - /* Add as u32, and manually add the carry to the upperlanes */ - vadd.u32 quot_low, quot_low, in0 - vpt.u32 HI, in0, quot_low - vaddt.i32 quot_high, quot_high, const1 - /* Need to add the sign bit of in0 */ - vqdmlah.s32 quot_high, in0, const1 - - vst20.32 {quot_low, quot_high}, [dst] - vst21.32 {quot_low, quot_high}, [dst]! 
- - le loop_cnt, 1b - 2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - - .unreq mod_p - .unreq mod_q - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq mask - .unreq const1 - - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq quot_high - .unreq quot - .unreq mod_p_vect diff --git a/tests/ct/ct.mk b/tests/ct/ct.mk index 6b22e34..4519dbe 100644 --- a/tests/ct/ct.mk +++ b/tests/ct/ct.mk @@ -11,5 +11,5 @@ CT_PLATFORMS += m85-an555 CT_SOURCES += main.c # Assembly sources required for this test -CT_ASMS += ct.s +CT_ASMS += ../../asm/manual/ct/ct.s diff --git a/tests/ct/ct.s b/tests/ct/ct.s deleted file mode 100644 index cc578a0..0000000 --- a/tests/ct/ct.s +++ /dev/null @@ -1,206 +0,0 @@ -/// -/// Copyright (c) 2022 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. 
-/// - - #include "ct_const.h" - - .syntax unified - -.macro cmp_set_0_1 dst, idx, imm - cmp \idx, \imm - cset \dst, EQ -.endm - -.global ct_table_lookup -.type ct_table_lookup, %function -ct_table_lookup: - - dst .req r0 // Destination - tbl .req r1 // Table - idx .req r2 // Secret table index - - mask .req r3 // idx == cur_idx - base .req r4 - loop_init .req r12 - loop_cnt .req r14 - - dst0 .req q0 - dst1 .req q1 - dst2 .req q2 - dst3 .req q3 - dst4 .req q4 - dst5 .req q5 - dst6 .req q6 - cur .req q7 - - push {r4,lr} - sub.w sp, sp, #(4*16) - vstrw.32 q7, [sp, #(0*16)] - add tbl, tbl, #(CT_SZ_TABLE) - vstrw.32 q6, [sp, #(1*16)] - movs.n base, tbl - adds.n idx, idx, #1 - vstrw.32 q5, [sp, #(2*16)] - mov loop_init, #(CT_NUM_ENTRY-1) - vstrw.32 q4, [sp, #(3*16)] - cmp_set_0_1 mask, idx, #(CT_NUM_ENTRY) - - // Establish output in chunks of 7*128-bit first - #define CT_SZ_ENTRY_WORDS (CT_SZ_ENTRY/4) - #define CT_SZ_ENTRY_CHUNKS (CT_SZ_ENTRY_WORDS/4) - - #define CT_SZ_ENTRY_CHUNKS_7 (CT_SZ_ENTRY_CHUNKS/7) - #define CT_SZ_ENTRY_CHUNKS_7_REM (CT_SZ_ENTRY_CHUNKS - 7*CT_SZ_ENTRY_CHUNKS_7) - - #define CT_SZ_ENTRY_CHUNKS_4 (CT_SZ_ENTRY_CHUNKS_7_REM / 4) - #define CT_SZ_ENTRY_CHUNKS_4_REM (CT_SZ_ENTRY_CHUNKS_7_REM - 4*CT_SZ_ENTRY_CHUNKS_4) - - #define CT_SZ_ENTRY_CHUNKS_2 (CT_SZ_ENTRY_CHUNKS_4_REM / 2) - #define CT_SZ_ENTRY_CHUNKS_2_REM (CT_SZ_ENTRY_CHUNKS_4_REM - 2*CT_SZ_ENTRY_CHUNKS_2) - - #if CT_SZ_ENTRY_CHUNKS_2_REM != 0 - #error "Invalid configuration" - #endif - -#if CT_SZ_ENTRY_CHUNKS_7 > 0 - .rept CT_SZ_ENTRY_CHUNKS_7 - - vldrw.u32 cur, [base, #(-CT_SZ_ENTRY + 6*16)]! - vmul.s32 dst6, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst5, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst4, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst3, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst2, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst1, cur, mask - vldrw.u32 cur, [base, #-16]! 
- vmul.s32 dst0, cur, mask - - dls loop_cnt, loop_init -1: - cmp_set_0_1 mask, idx, loop_cnt - vldrw.u32 cur, [base, #(-CT_SZ_ENTRY + 6*16)]! - vmla.s32 dst6, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst5, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst4, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst3, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst2, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst1, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst0, cur, mask - le loop_cnt, 1b - - vstrw.32 dst0, [dst], #+16 - vstrw.32 dst1, [dst], #+16 - vstrw.32 dst2, [dst], #+16 - cmp idx, #(CT_NUM_ENTRY) - vstrw.32 dst3, [dst], #+16 - cset mask, EQ - vstrw.32 dst4, [dst], #+16 - add tbl, tbl, #(7*16) - vstrw.32 dst5, [dst], #+16 - mov base, tbl - vstrw.32 dst6, [dst], #+16 - - .endr - -#endif - -#if CT_SZ_ENTRY_CHUNKS_4 > 0 - .rept CT_SZ_ENTRY_CHUNKS_4 - - vldrw.u32 cur, [base, #(-CT_SZ_ENTRY + 3*16)]! - vmul.s32 dst3, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst2, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst1, cur, mask - vldrw.u32 cur, [base, #-16]! - vmul.s32 dst0, cur, mask - - dls loop_cnt, loop_init -1: - cmp_set_0_1 mask, idx, loop_cnt - vldrw.u32 cur, [base, #(-CT_SZ_ENTRY + 3*16)]! - vmla.s32 dst3, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst2, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst1, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst0, cur, mask - le loop_cnt, 1b - - vstrw.32 dst0, [dst], #+16 - cmp idx, #(CT_NUM_ENTRY) - vstrw.32 dst1, [dst], #+16 - cset mask, EQ - vstrw.32 dst2, [dst], #+16 - add tbl, tbl, #(4*16) - vstrw.32 dst3, [dst], #+16 - mov base, tbl - .endr - -#endif - -#if CT_SZ_ENTRY_CHUNKS_2 > 0 -.rept CT_SZ_ENTRY_CHUNKS_2 - - vldrw.u32 cur, [base, #(-CT_SZ_ENTRY + 1*16)]! - vmul.s32 dst1, cur, mask - vldrw.u32 cur, [base, #-16]! 
- vmul.s32 dst0, cur, mask - - dls loop_cnt, loop_init -1: - cmp_set_0_1 mask, idx, loop_cnt - vldrw.u32 cur, [base, #(-CT_SZ_ENTRY + 1*16)]! - vmla.s32 dst1, cur, mask - vldrw.u32 cur, [base, #-16]! - vmla.s32 dst0, cur, mask - le loop_cnt, 1b - - vstrw.32 dst0, [dst], #+16 - add tbl, tbl, #(2*16) - vstrw.32 dst1, [dst], #+16 - cmp_set_0_1 mask, idx, #(CT_NUM_ENTRY) - mov base, tbl - - .endr - -#endif - - vpop {d8-d15} - pop {r4,pc} - - .unreq dst - .unreq tbl - .unreq idx diff --git a/tests/fx-fft/base_concrete.s b/tests/fx-fft/base_concrete.s deleted file mode 100644 index 2a1195d..0000000 --- a/tests/fx-fft/base_concrete.s +++ /dev/null @@ -1,73 +0,0 @@ - .syntax unified - .type fixedpoint_radix4_fft_base, %function - .global fixedpoint_radix4_fft_base - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing ref registers, but not - // yet reordering. 
- - .text - .align 4 -fixedpoint_radix4_fft_base: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pw0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.p2align 2 -fixedpoint_radix4_fft_loop_start: - vldrw.s32 q6, [r0] - vldrw.s32 q3, [r4] - vldrw.s32 q2, [r3] - vldrw.s32 q7, [r5] - vhadd.s32 q5, q6, q3 - vhsub.s32 q1, q6, q3 - vhadd.s32 q3, q2, q7 - vhsub.s32 q7, q2, q7 - vhadd.s32 q2, q5, q3 - vstrw.u32 q2, [r0] , #16 - vhsub.s32 q5, q5, q3 - vldrw.s32 q3, [r6] , #16 - vqdmlsdh.s32 q2, q3, q5 - vqdmladhx.s32 q2, q3, q5 - vstrw.u32 q2, [r3] , #16 - vhcadd.s32 q3, q1, q7, #270 - vldrw.s32 q5, [r7] , #16 - vqdmlsdh.s32 q2, q5, q3 - vqdmladhx.s32 q2, q5, q3 - vstrw.u32 q2, [r4] , #16 - vhcadd.s32 q3, q1, q7, #90 - vldrw.s32 q2, [r8] , #16 - vqdmlsdh.s32 q7, q2, q3 - vqdmladhx.s32 q7, q2, q3 - vstrw.u32 q7, [r5] , #16 - le lr, fixedpoint_radix4_fft_loop_start - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s b/tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s deleted file mode 100644 index a3930a1..0000000 --- a/tests/fx-fft/fixedpoint_radix4_fft_opt_M55.s +++ /dev/null @@ -1,179 +0,0 @@ - .syntax unified - .type fixedpoint_radix4_fft_opt_M55, %function - .global fixedpoint_radix4_fft_opt_M55 - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - -.macro load_data - vldrw.s32 qA, [inA] - vldrw.s32 qB, [inB] - vldrw.s32 qC, [inC] - vldrw.s32 qD, [inD] -.endm - -.macro load_twiddles - vldrw.s32 qTw1, [pW1], #16 - vldrw.s32 qTw2, [pW2], #16 - vldrw.s32 qTw3, [pW3], #16 -.endm - -.macro store_data - vstrw.32 qA, [inA], #16 - vstrw.32 qB, [inB], #16 - vstrw.32 qC, [inC], #16 - vstrw.32 qD, [inD], #16 -.endm - -.macro cmul_fx out, in0, in1 - vqdmlsdh.s32 \out, \in0, \in1 - 
vqdmladhx.s32 \out, \in0, \in1 -.endm - - .text - .align 4 -fixedpoint_radix4_fft_opt_M55: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pW0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - - vldrw.s32 q6, [r4] // *...... - // gap // ....... - vldrw.s32 q3, [r0] // ..*.... - vhadd.s32 q5, q3, q6 // ....*.. - vldrw.s32 q7, [r3] // ...*... - // gap // ....... - vldrw.s32 q0, [r5] // .....*. - vhadd.s32 q1, q7, q0 // ......* - vldrw.s32 q2, [r6] , #16 // .*..... - - // original source code - // vldrw.s32 q6, [r4] // *...... - // vldrw.s32 q2, [r6] , #16 // ......* - // vldrw.s32 q3, [r0] // .*..... - // vldrw.s32 q7, [r3] // ...*... - // vhadd.s32 q5, q3, q6 // ..*.... - // vldrw.s32 q0, [r5] // ....*.. - // vhadd.s32 q1, q7, q0 // .....*. - - sub lr, lr, #1 -.p2align 2 -fixedpoint_radix4_fft_loop_start: - vhadd.s32 q4, q5, q1 // ...........*............. - vstrw.u32 q4, [r0] , #16 // .....................*... - vhsub.s32 q5, q5, q1 // ............*............ - vqdmlsdh.s32 q1, q2, q5 // ...............*......... - vldrw.s32 q4, [r7] , #16 // .....*................... - vhsub.s32 q3, q3, q6 // .........*............... - vldrw.s32 q6, [r4, #16] // ..e...................... - vhsub.s32 q7, q7, q0 // ..........*.............. - vldrw.s32 q0, [r8] , #16 // ......*.................. - vqdmladhx.s32 q1, q2, q5 // ................*........ - vstrw.u32 q1, [r3] , #16 // ......................*.. - vhcadd.s32 q1, q3, q7, #270 // .............*........... - vqdmlsdh.s32 q5, q4, q1 // .................*....... - vldrw.s32 q2, [r6] , #16 // ....e.................... - vqdmladhx.s32 q5, q4, q1 // ..................*...... - vstrw.u32 q5, [r4] , #16 // .......................*. - vhcadd.s32 q1, q3, q7, #90 // ..............*.......... - vqdmlsdh.s32 q4, q0, q1 // ...................*..... - vldrw.s32 q3, [r0] // e........................ 
- vqdmladhx.s32 q4, q0, q1 // ....................*.... - vldrw.s32 q7, [r3] // .e....................... - vhadd.s32 q5, q3, q6 // .......e................. - vldrw.s32 q0, [r5, #16] // ...e..................... - vhadd.s32 q1, q7, q0 // ........e................ - vstrw.u32 q4, [r5] , #16 // ........................* - - // original source code - // vldrw.s32 qA, [r0] // ............e......|.................e...... - // vldrw.s32 qB, [r3] // ..............e....|...................e.... - // vldrw.s32 qC, [r4] // e..................|.....e.................. - // vldrw.s32 qD, [r5] // ................e..|.....................e.. - // vldrw.s32 qTw1, [r6], #16 // .......e...........|............e........... - // vldrw.s32 qTw2, [r7], #16 // ...................|...*.................... - // vldrw.s32 qTw3, [r8], #16 // ..*................|.......*................ - // vhadd.s32 qSm0, qA, qC // ...............e...|....................e... - // vhadd.s32 qSm1, qB, qD // .................e.|......................e. - // vhsub.s32 qDf0, qA, qC // ...................|....*................... - // vhsub.s32 qDf1, qB, qD // .*.................|......*................. - // vhadd.s32 qA, qSm0, qSm1 // ...................*........................ - // vhsub.s32 qBp, qSm0, qSm1 // ...................|.*...................... - // vhcadd.s32 qCp, qDf0, qDf1, #270 // .....*.............|..........*............. - // vhcadd.s32 qDp, qDf0, qDf1, #90 // ..........*........|...............*........ - // vqdmlsdh.s32 qB, qTw1, qBp // ...................|..*..................... - // vqdmladhx.s32 qB, qTw1, qBp // ...*...............|........*............... - // vqdmlsdh.s32 qC, qTw2, qCp // ......*............|...........*............ - // vqdmladhx.s32 qC, qTw2, qCp // ........*..........|.............*.......... - // vqdmlsdh.s32 qD, qTw3, qDp // ...........*.......|................*....... - // vqdmladhx.s32 qD, qTw3, qDp // .............*.....|..................*..... 
- // vstrw.32 qA, [r0], #16 // ...................|*....................... - // vstrw.32 qB, [r3], #16 // ....*..............|.........*.............. - // vstrw.32 qC, [r4], #16 // .........*.........|..............*......... - // vstrw.32 qD, [r5], #16 // ..................*|.......................* - - le lr, fixedpoint_radix4_fft_loop_start - vhadd.s32 q4, q5, q1 // *................. - vstrw.u32 q4, [r0] , #16 // .*................ - vhsub.s32 q5, q5, q1 // ..*............... - vqdmlsdh.s32 q4, q2, q5 // ...*.............. - vhsub.s32 q1, q7, q0 // ......*........... - vqdmladhx.s32 q4, q2, q5 // ........*......... - vhsub.s32 q3, q3, q6 // .....*............ - vldrw.s32 q0, [r7] , #16 // ....*............. - vhcadd.s32 q6, q3, q1, #270 // ..........*....... - vstrw.u32 q4, [r3] , #16 // .........*........ - vqdmlsdh.s32 q4, q0, q6 // ...........*...... - vhcadd.s32 q5, q3, q1, #90 // ..............*... - vqdmladhx.s32 q4, q0, q6 // ............*..... - vldrw.s32 q2, [r8] , #16 // .......*.......... - vqdmlsdh.s32 q1, q2, q5 // ...............*.. - vstrw.u32 q4, [r4] , #16 // .............*.... - vqdmladhx.s32 q1, q2, q5 // ................*. - vstrw.u32 q1, [r5] , #16 // .................* - - // original source code - // vhadd.s32 q4, q5, q1 // *................. - // vstrw.u32 q4, [r0] , #16 // .*................ - // vhsub.s32 q5, q5, q1 // ..*............... - // vqdmlsdh.s32 q1, q2, q5 // ...*.............. - // vldrw.s32 q4, [r7] , #16 // .......*.......... - // vhsub.s32 q3, q3, q6 // ......*........... - // vhsub.s32 q7, q7, q0 // ....*............. - // vldrw.s32 q0, [r8] , #16 // .............*.... - // vqdmladhx.s32 q1, q2, q5 // .....*............ - // vstrw.u32 q1, [r3] , #16 // .........*........ - // vhcadd.s32 q1, q3, q7, #270 // ........*......... - // vqdmlsdh.s32 q5, q4, q1 // ..........*....... - // vqdmladhx.s32 q5, q4, q1 // ............*..... - // vstrw.u32 q5, [r4] , #16 // ...............*.. 
- // vhcadd.s32 q1, q3, q7, #90 // ...........*...... - // vqdmlsdh.s32 q4, q0, q1 // ..............*... - // vqdmladhx.s32 q4, q0, q1 // ................*. - // vstrw.u32 q4, [r5] , #16 // .................* - - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s b/tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s deleted file mode 100644 index 93f721d..0000000 --- a/tests/fx-fft/fixedpoint_radix4_fft_opt_M85.s +++ /dev/null @@ -1,177 +0,0 @@ - .syntax unified - .type fixedpoint_radix4_fft_opt_M85, %function - .global fixedpoint_radix4_fft_opt_M85 - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - -.macro load_data - vldrw.s32 qA, [inA] - vldrw.s32 qB, [inB] - vldrw.s32 qC, [inC] - vldrw.s32 qD, [inD] -.endm - -.macro load_twiddles - vldrw.s32 qTw1, [pW1], #16 - vldrw.s32 qTw2, [pW2], #16 - vldrw.s32 qTw3, [pW3], #16 -.endm - -.macro store_data - vstrw.32 qA, [inA], #16 - vstrw.32 qB, [inB], #16 - vstrw.32 qC, [inC], #16 - vstrw.32 qD, [inD], #16 -.endm - -.macro cmul_fx out, in0, in1 - vqdmlsdh.s32 \out, \in0, \in1 - vqdmladhx.s32 \out, \in0, \in1 -.endm - - .text - .align 4 -fixedpoint_radix4_fft_opt_M85: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pW0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - - vldrw.s32 q2, [r4] // * - - // original source code - // vldrw.s32 q2, [r4] // * - - sub lr, lr, #1 -.p2align 2 -fixedpoint_radix4_fft_loop_start: - vldrw.s32 q4, [r0] // *........................ - vhadd.s32 q6, q4, q2 // .......*................. - vldrw.s32 q3, [r3] // .*....................... - vhsub.s32 q4, q4, q2 // .........*............... - vldrw.s32 q1, [r5] // ...*..................... - vhadd.s32 q5, q3, q1 // ........*................ 
- vldrw.s32 q2, [r6] , #16 // ....*.................... - vhsub.s32 q0, q6, q5 // ............*............ - vqdmlsdh.s32 q7, q2, q0 // ...............*......... - vhadd.s32 q6, q6, q5 // ...........*............. - vqdmladhx.s32 q7, q2, q0 // ................*........ - vhsub.s32 q3, q3, q1 // ..........*.............. - vstrw.u32 q6, [r0] , #16 // .....................*... - vhcadd.s32 q6, q4, q3, #270 // .............*........... - vstrw.u32 q7, [r3] , #16 // ......................*.. - vhcadd.s32 q1, q4, q3, #90 // ..............*.......... - vldrw.s32 q4, [r7] , #16 // .....*................... - vqdmlsdh.s32 q3, q4, q6 // .................*....... - vldrw.s32 q5, [r8] , #16 // ......*.................. - vqdmladhx.s32 q3, q4, q6 // ..................*...... - vstrw.u32 q3, [r4] , #16 // .......................*. - vqdmlsdh.s32 q4, q5, q1 // ...................*..... - vldrw.s32 q2, [r4] // ..e...................... - vqdmladhx.s32 q4, q5, q1 // ....................*.... - vstrw.u32 q4, [r5] , #16 // ........................* - - // original source code - // vldrw.s32 qA, [r0] // ...*........................ - // vldrw.s32 qB, [r3] // ...|.*...................... - // vldrw.s32 qC, [r4] // e..|.....................e.. - // vldrw.s32 qD, [r5] // ...|...*.................... - // vldrw.s32 qTw1, [r6], #16 // ...|.....*.................. - // vldrw.s32 qTw2, [r7], #16 // ...|...............*........ - // vldrw.s32 qTw3, [r8], #16 // ...|.................*...... - // vhadd.s32 qSm0, qA, qC // ...|*....................... - // vhadd.s32 qSm1, qB, qD // ...|....*................... - // vhsub.s32 qDf0, qA, qC // ...|..*..................... - // vhsub.s32 qDf1, qB, qD // ...|..........*............. - // vhadd.s32 qA, qSm0, qSm1 // ...|........*............... - // vhsub.s32 qBp, qSm0, qSm1 // ...|......*................. - // vhcadd.s32 qCp, qDf0, qDf1, #270 // ...|............*........... - // vhcadd.s32 qDp, qDf0, qDf1, #90 // ...|..............*......... 
- // vqdmlsdh.s32 qB, qTw1, qBp // ...|.......*................ - // vqdmladhx.s32 qB, qTw1, qBp // ...|.........*.............. - // vqdmlsdh.s32 qC, qTw2, qCp // ...|................*....... - // vqdmladhx.s32 qC, qTw2, qCp // ...|..................*..... - // vqdmlsdh.s32 qD, qTw3, qDp // ...|....................*... - // vqdmladhx.s32 qD, qTw3, qDp // .*.|......................*. - // vstrw.32 qA, [r0], #16 // ...|...........*............ - // vstrw.32 qB, [r3], #16 // ...|.............*.......... - // vstrw.32 qC, [r4], #16 // ...|...................*.... - // vstrw.32 qD, [r5], #16 // ..*|.......................* - - le lr, fixedpoint_radix4_fft_loop_start - vldrw.s32 q4, [r0] // *....................... - vhadd.s32 q6, q4, q2 // .*...................... - vldrw.s32 q3, [r3] // ..*..................... - vhsub.s32 q4, q4, q2 // ...*.................... - vldrw.s32 q1, [r5] // ....*................... - vhadd.s32 q5, q3, q1 // .....*.................. - vldrw.s32 q2, [r6] , #16 // ......*................. - vhsub.s32 q0, q6, q5 // .......*................ - vqdmlsdh.s32 q7, q2, q0 // ........*............... - vhadd.s32 q6, q6, q5 // .........*.............. - vqdmladhx.s32 q7, q2, q0 // ..........*............. - vhsub.s32 q3, q3, q1 // ...........*............ - vstrw.u32 q6, [r0] , #16 // ............*........... - vhcadd.s32 q6, q4, q3, #270 // .............*.......... - vldrw.s32 q1, [r7] , #16 // ................*....... - vhcadd.s32 q5, q4, q3, #90 // ...............*........ - vqdmlsdh.s32 q4, q1, q6 // .................*...... - vstrw.u32 q7, [r3] , #16 // ..............*......... - vqdmladhx.s32 q4, q1, q6 // ...................*.... - vldrw.s32 q6, [r8] , #16 // ..................*..... - vqdmlsdh.s32 q3, q6, q5 // .....................*.. - vstrw.u32 q4, [r4] , #16 // ....................*... - vqdmladhx.s32 q3, q6, q5 // ......................*. 
- vstrw.u32 q3, [r5] , #16 // .......................* - - // original source code - // vldrw.s32 q4, [r0] // *....................... - // vhadd.s32 q6, q4, q2 // .*...................... - // vldrw.s32 q3, [r3] // ..*..................... - // vhsub.s32 q4, q4, q2 // ...*.................... - // vldrw.s32 q1, [r5] // ....*................... - // vhadd.s32 q5, q3, q1 // .....*.................. - // vldrw.s32 q2, [r6] , #16 // ......*................. - // vhsub.s32 q0, q6, q5 // .......*................ - // vqdmlsdh.s32 q7, q2, q0 // ........*............... - // vhadd.s32 q6, q6, q5 // .........*.............. - // vqdmladhx.s32 q7, q2, q0 // ..........*............. - // vhsub.s32 q3, q3, q1 // ...........*............ - // vstrw.u32 q6, [r0] , #16 // ............*........... - // vhcadd.s32 q6, q4, q3, #270 // .............*.......... - // vstrw.u32 q7, [r3] , #16 // .................*...... - // vhcadd.s32 q1, q4, q3, #90 // ...............*........ - // vldrw.s32 q4, [r7] , #16 // ..............*......... - // vqdmlsdh.s32 q3, q4, q6 // ................*....... - // vldrw.s32 q5, [r8] , #16 // ...................*.... - // vqdmladhx.s32 q3, q4, q6 // ..................*..... - // vstrw.u32 q3, [r4] , #16 // .....................*.. - // vqdmlsdh.s32 q4, q5, q1 // ....................*... - // vqdmladhx.s32 q4, q5, q1 // ......................*. 
- // vstrw.u32 q4, [r5] , #16 // .......................* - - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/fx-fft/fx-fft.mk b/tests/fx-fft/fx-fft.mk index 11eb300..366dbdc 100644 --- a/tests/fx-fft/fx-fft.mk +++ b/tests/fx-fft/fx-fft.mk @@ -11,10 +11,11 @@ FX_FFT_PLATFORMS += m85-an555 FX_FFT_SOURCES += main.c # Assembly sources required for this test -FX_FFT_ASMS += base_concrete.s -FX_FFT_ASMS += base_symbolic.s -FX_FFT_ASMS += fixedpoint_radix4_fft_opt_M55.s -FX_FFT_ASMS += fixedpoint_radix4_fft_opt_M85.s -FX_FFT_ASMS += ref_handwritten_asm.s -FX_FFT_ASMS += ref_intrinsics.s +FX_FFT_ASM_DIR = ../../asm/manual/fx_fft +FX_FFT_ASMS += $(FX_FFT_ASM_DIR)/base_concrete.s +FX_FFT_ASMS += $(FX_FFT_ASM_DIR)/base_symbolic.s +FX_FFT_ASMS += $(FX_FFT_ASM_DIR)/fixedpoint_radix4_fft_opt_M55.s +FX_FFT_ASMS += $(FX_FFT_ASM_DIR)/fixedpoint_radix4_fft_opt_M85.s +FX_FFT_ASMS += $(FX_FFT_ASM_DIR)/ref_handwritten_asm.s +FX_FFT_ASMS += $(FX_FFT_ASM_DIR)/ref_intrinsics.s diff --git a/tests/fx-fft/ref_handwritten_asm.s b/tests/fx-fft/ref_handwritten_asm.s deleted file mode 100644 index a64c6aa..0000000 --- a/tests/fx-fft/ref_handwritten_asm.s +++ /dev/null @@ -1,75 +0,0 @@ - .syntax unified - .type fixedpoint_radix4_fft_handwritten, %function - .global fixedpoint_radix4_fft_handwritten - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing symbolic registers, but not - // yet reordering. 
- - .text - .align 4 -fixedpoint_radix4_fft_handwritten: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pw0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - - vldrw.32 q1, [inA] - vldrw.32 q6, [inC] - -.p2align 2 -fixedpoint_radix4_fft_loop_start: - vhadd.s32 q0, q1, q6 - vldrw.32 q4, [inB] //? - vhsub.s32 q2, q1, q6 - vldrw.32 q5, [inD] - vhadd.s32 q1, q4, q5 - vhsub.s32 q3, q4, q5 //- - vldrw.32 q7, [pw2], #16 - vhadd.s32 q4, q0, q1 - vstrw.32 q4, [inA], #16 - vhsub.s32 q4, q0, q1 - vldrw.32 q5, [pw1], #16 //? - vqdmlsdh.s32 q0, q4, q5 - vhcadd.s32 q6, q2, q3, #270 - vqdmladhx.s32 q0, q4, q5 - vstrw.32 q0, [inB], #16 - vqdmlsdh.s32 q0, q6, q7 - vldrw.32 q1, [inA] //? - vqdmladhx.s32 q0, q6, q7 - vstrw.32 q0, [inC], #16 - vhcadd.s32 q4, q2, q3, #90 - vldrw.32 q5, [pw3], #16 - vqdmlsdh.s32 q0, q4, q5 - vldrw.32 q6, [inC] - vqdmladhx.s32 q0, q4, q5 - vstrw.32 q0, [inD], #16 - le lr, fixedpoint_radix4_fft_loop_start -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr diff --git a/tests/fx-fft/ref_intrinsics.s b/tests/fx-fft/ref_intrinsics.s deleted file mode 100644 index fa3f45f..0000000 --- a/tests/fx-fft/ref_intrinsics.s +++ /dev/null @@ -1,72 +0,0 @@ - .syntax unified - .type fixedpoint_radix4_fft_intrinsics, %function - .global fixedpoint_radix4_fft_intrinsics - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing symbolic registers, but not - // yet reordering. 
- - .text - .align 4 -fixedpoint_radix4_fft_intrinsics: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pw0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.p2align 2 -fixedpoint_radix4_fft_loop_start: - vldrw.u32 q0, [inC] //- - vldrw.u32 q1, [inA] //- - vhadd.s32 q3, q1, q0 - vldrw.u32 q2, [inB] - vhsub.s32 q0, q1, q0 - vldrw.u32 q4, [inD] - vhadd.s32 q5, q2, q4 - vhadd.s32 q6, q3, q5 //- - vstrb.8 q6, [inA], #16 - vhsub.s32 q3, q3, q5 - vldrw.u32 q5, [pW1], #16 //? - vqdmlsdh.s32 q6, q5, q3 - vhsub.s32 q1, q2, q4 - vqdmladhx.s32 q6, q5, q3 - vstrb.8 q6, [inB], #16 - vhcadd.s32 q2, q0, q1, #270 - vldrw.u32 q3, [pW2], #16 //? - vqdmlsdh.s32 q4, q3, q2 - vqdmladhx.s32 q4, q3, q2 //- - vstrb.8 q4, [inC], #16 - vhcadd.s32 q2, q0, q1, #90 - vldrw.u32 q0, [pW3], #16 //? - vqdmlsdh.s32 q1, q0, q2 - vqdmladhx.s32 q1, q0, q2 //- - vstrb.8 q1, [inD], #16 - le lr, fixedpoint_radix4_fft_loop_start -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr diff --git a/tests/intmulntt/crt.s b/tests/intmulntt/crt.s deleted file mode 100644 index 642c5d6..0000000 --- a/tests/intmulntt/crt.s +++ /dev/null @@ -1,2840 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#include "crt_const.h" - - .syntax unified - -.type crt_s32_dechunk_chunk_add_optim, %function -.global crt_s32_dechunk_chunk_add_optim - .data - .align 4 -crt_s32_dechunk_chunk_add_optim_data: - .word (1<<22) - 1 - .word (1<<(31-22)) - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED - .text - .align 4 -crt_s32_dechunk_chunk_add_optim_data_ptr: - .word crt_s32_dechunk_chunk_add_optim_data -crt_s32_dechunk_chunk_add_optim: - - loop_cnt .req r14 - init_tmp .req r10 // Temporary prior to main loop - init_tmp2 .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA0 .req r4 - curB0 .req r5 - mask .req r6 - rcarry .req r7 - curA1 .req r9 - curB1 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - const_rshift22 .req r8 - cur0 .req q0 - cur1 .req q1 - masked0 .req q2 - masked1 .req q4 - - push {r4-r11,lr} - vpush {d8-d15} - - ldr addr, crt_s32_dechunk_chunk_add_optim_data_ptr - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - ldrd init_tmp, init_tmp2, [addr], #+8 - vdup.u32 qmask, init_tmp - /* Save size, 
original destination pointer and mask for later */ - push {dst, size, init_tmp, init_tmp2} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+(9+4)*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vadd.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 in0, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vmla.s32 in0p, in0, mod_p_neg - vadd.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 in0p, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 in0, in0p, mod_p_neg - vldrw.u32 in0p, [src0], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.i32 quot_low, quot_low, tmpp - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmpp, [src0p], #+16 - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - 
vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 in0, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 in0p, in0, mod_p_neg - vldrw.u32 in0, [src0], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.i32 quot_low, quot_low, tmpp - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmpp, [src0p], #+16 - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 in0p, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vldrw.u32 tmp, [src1p], #+16 - vmla.s32 in0, in0p, mod_p_neg - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - /* Restore mask and original destination pointer */ - pop {dst, size, mask, const_rshift22} - mov rcarry, #0 - mov loop_cnt, size, LSR #3 - sub loop_cnt, loop_cnt, #1 - - ldrd curA0, curB0, [dst] - add rcarry, curA0, rcarry, ASR #22 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - add rcarry, curA0, rcarry, ASR #22 - strd curA1, curB1, [dst], #+8 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #+8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - le loop_cnt, 1b -2: - strd curA1, curB1, [dst], #+8 - - mov loop_cnt, size, LSR #4 - sub loop_cnt, loop_cnt, #1 - - vldrw.u32 cur0, [dst] - vand.u32 masked0, cur0, qmask - vshlc cur0, rcarry, #32 - vqdmlah.s32 masked0, cur0, const_rshift22 - vldrw.u32 cur1, [dst, #+16] - vand.u32 masked1, cur1, qmask - vstrw.u32 masked0, [dst], #+16 - vshlc cur1, rcarry, #32 - vqdmlah.s32 masked1, cur1, const_rshift22 - - wls loop_cnt, loop_cnt, 2 - .align 2 - 1: - vldrw.u32 cur0, [dst, #+16] - vand.u32 masked0, cur0, qmask - vstrw.u32 masked1, [dst], #+16 - vshlc cur0, rcarry, #32 - vqdmlah.s32 masked0, cur0, const_rshift22 - vldrw.u32 cur1, [dst, #+16] - vand.u32 masked1, cur1, qmask - vstrw.u32 masked0, [dst], #+16 - vshlc cur1, rcarry, #32 - vqdmlah.s32 masked1, cur1, const_rshift22 - le loop_cnt, 1b - 2: - vstrw.u32 masked1, [dst], #+16 - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA0 - .unreq curB0 - .unreq curA1 - .unreq curB1 - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - 
.unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq const_rshift22 - .unreq in0 - .unreq in0p - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - -.type crt_s32_dechunk_chunk_optim, %function - .global crt_s32_dechunk_chunk_optim - .data - .align 4 -crt_s32_dechunk_chunk_optim_data: - .word (1<<(31-(CRT_32_P_REFINED_BARRETT_SHIFT+1))) - .word (1<<22) - 1 - .word (1<<(31-22)) - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(9)) - .word -CRT_32_P_TWISTED - .text - .align 4 -crt_s32_dechunk_chunk_optim_data_ptr: - .word crt_s32_dechunk_chunk_optim_data -crt_s32_dechunk_chunk_optim: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_tw .req r4 - mod_q_neg .req r5 - const_prshift .req r6 - const_shift9 .req r7 - const_rshift22 .req r10 - p_inv_mod_q .req r9 - p_inv_mod_q_tw .req r8 - rcarry .req r11 - rcarry_red .req r12 - - in0p .req q7 // q0 - in0 .req q0 // q6 - in1 .req q5 // q2 - diff .req in1 - quot_low .req q2 // q5 - qmask .req q1 // q3 - tmpp .req q4 // q7 - tmp .req q6 // q1 - red_tmp .req q3 // q4 - - push {r4-r11,lr} - sub.w sp, sp, #(4*16) - - vstrw.32 q7, [sp, #(0*16)] - mov loop_cnt, size, LSR #3 - vstrw.32 q6, [sp, #(1*16)] - sub loop_cnt, loop_cnt, #1 - vstrw.32 q5, [sp, #(2*16)] - - ldr addr, crt_s32_dechunk_chunk_optim_data_ptr - ldr const_prshift, [addr], #+4 - ldrd init_tmp, const_rshift22, [addr], #+8 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - - vldrw.u32 in0p, [src0], #+16 - vqdmulh.s32 diff, in0p, mod_p_tw - ldrd const_shift9, mod_p_tw, [addr], #+8 - .unreq addr - - vqrdmulh.s32 tmp, diff, const_prshift - vdup.u32 qmask, 
init_tmp - vmla.s32 in0p, tmp, mod_p - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vstrw.32 q4, [sp, #(3*16)] - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 in0, [src0], #+16 - vmla.s32 diff, tmp, mod_q_neg - movs.w rcarry, #0 - vmul.u32 quot_low, diff, mod_p - movs.w rcarry_red, #0 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vqdmulh.s32 diff, in0, mod_p_tw - vorr.u32 tmpp, tmpp, tmp - vqrdmulh.s32 tmp, diff, const_prshift - vshlc tmpp, rcarry, #32 - vmla.s32 in0, tmp, mod_p - vadd.i32 in0p, in0p, tmpp - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.u32 tmpp, quot_low, in0p - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 in0p, [src0], #+16 - vmla.s32 diff, tmp, mod_q_neg - vand.u32 red_tmp, tmpp, qmask - vmul.u32 quot_low, diff, mod_p - vshlc tmpp, rcarry_red, #32 - vqdmlah.s32 red_tmp, tmpp, const_rshift22 - vstrw.32 red_tmp, [dst], #+16 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - vqdmulh.s32 diff, in0p, mod_p_tw - vorr.u32 tmpp, tmpp, tmp - vqrdmulh.s32 tmp, diff, const_prshift - vshlc tmpp, rcarry, #32 - vmla.s32 in0p, tmp, mod_p - vadd.s32 in0, in0, tmpp - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.s32 tmpp, quot_low, in0 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vldrw.u32 in0, [src0], #+16 - vmla.s32 diff, tmp, mod_q_neg - vand.u32 red_tmp, tmpp, qmask - vmul.u32 quot_low, diff, mod_p - vshlc tmpp, rcarry_red, #32 - vqdmlah.s32 red_tmp, tmpp, const_rshift22 - vstrw.32 red_tmp, [dst], #+16 - 
vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - le loop_cnt, 1b -2: - - vqdmulh.s32 diff, in0, mod_p_tw - vorr.u32 tmpp, tmpp, tmp - vqrdmulh.s32 tmp, diff, const_prshift - vshlc tmpp, rcarry, #32 - vmla.s32 in0, tmp, mod_p - vadd.i32 in0p, in0p, tmpp - vldrw.u32 in1, [src1], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vadd.u32 tmpp, quot_low, in0p - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vand.u32 red_tmp, tmpp, qmask - vmul.u32 quot_low, diff, mod_p - vshlc tmpp, rcarry_red, #32 - vqdmlah.s32 red_tmp, tmpp, const_rshift22 - vstrw.32 red_tmp, [dst], #+16 - vqdmulh.s32 tmp, diff, mod_p - vshr.u32 tmpp, quot_low, #22 - vmul.u32 tmp, tmp, const_shift9 - vand.u32 quot_low, quot_low, qmask - - vldrw.u32 q7, [sp, #(0*16)] - vorr.u32 tmpp, tmpp, tmp - vldrw.u32 q6, [sp, #(1*16)] - vshlc tmpp, rcarry, #32 - vldrw.u32 q5, [sp, #(2*16)] - vadd.s32 in0, tmpp, in0 - vldrw.u32 q4, [sp, #(3*16)] - vadd.s32 quot_low, quot_low, in0 - ldrd r4, r5, [sp, #(4*16)] - vand.u32 red_tmp, quot_low, qmask - ldrd r6, r7, [sp, #(4*16 + 1*8)] - vshlc quot_low, rcarry_red, #32 - ldrd r8, r9, [sp, #(4*16 + 2*8)] - vqdmlah.s32 red_tmp, quot_low, const_rshift22 - ldrd r10, r11, [sp, #(4*16 + 3*8)] - vstrw.32 red_tmp, [dst], #+16 - adds.w sp, sp, #(4*16+4*8) - pop {pc} - - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const_prshift - .unreq const_shift9 - .unreq const_rshift22 - .unreq in0 - .unreq in0p - .unreq in1 - .unreq tmp - .unreq tmpp - .unreq diff - .unreq quot_low - .unreq qmask - -.type crt_s32_dechunk_chunk_add, %function -.global crt_s32_dechunk_chunk_add - .align 4 - .data -crt_s32_dechunk_chunk_add_data: - .word (1<<22) - 1 - .word 
CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED - - .text - .align 4 -crt_s32_dechunk_chunk_add_data_ptr: - .word crt_s32_dechunk_chunk_add_data -crt_s32_dechunk_chunk_add: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA0 .req r3 - curB0 .req r4 - mask .req r5 - rcarry .req r7 - curA1 .req r9 - curB1 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - ldr addr, crt_s32_dechunk_chunk_add_data_ptr - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vadd.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, 
mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0p, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 
tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - /* Restore mask and original destination pointer */ - pop {dst, size, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #2 - sub loop_cnt, loop_cnt, #1 - - ldrd curA0, curB0, [dst] - add rcarry, curA0, rcarry, ASR #22 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - add rcarry, curA0, rcarry, ASR #22 - strd curA1, curB1, [dst], #+8 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #+8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - le loop_cnt, 1b -2: - strd curA1, curB1, [dst], #+8 - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA0 - .unreq curB0 - .unreq curA1 - .unreq curB1 - .unreq mask - .unreq rcarry 
- .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in0p - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - -.type crt_s32_dechunk_chunk, %function -.global crt_s32_dechunk_chunk - .data - .align 4 -crt_s32_dechunk_chunk_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED - .text - .align 4 -crt_s32_dechunk_chunk_data_ptr: - .word crt_s32_dechunk_chunk_data -crt_s32_dechunk_chunk: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA0 .req r3 - curB0 .req r4 - mask .req r5 - rcarry .req r7 - curA1 .req r9 - curB1 .req r10 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q5 - tmp .req q6 - - push {r4-r11,lr} - vpush {d8-d15} - - ldr addr, crt_s32_dechunk_chunk_data_ptr - mov loop_cnt, size, LSR #2 - - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0], #+16 - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, 
#(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vadd.s32 in0, tmpp, in0 - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst], #+16 - le loop_cnt, 1b -2: - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - - /* Restore mask and original destination pointer */ - pop {dst, size, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #2 - sub loop_cnt, loop_cnt, #1 - - ldrd curA0, curB0, [dst] - add rcarry, curA0, rcarry, ASR #22 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - add rcarry, curA0, rcarry, ASR #22 - strd curA1, curB1, [dst], #+8 - and curA0, rcarry, mask - add rcarry, curB0, rcarry, ASR #22 - ldrd curA1, curB1, [dst, #8] - and curB0, rcarry, mask - - add rcarry, curA1, rcarry, ASR #22 - strd curA0, curB0, [dst], #+8 - and curA1, rcarry, mask - add rcarry, curB1, rcarry, ASR #22 - ldrd curA0, curB0, [dst, #8] - and curB1, rcarry, mask - - le loop_cnt, 1b -2: - strd curA1, curB1, [dst], #+8 - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA0 - .unreq curB0 - .unreq curA1 - .unreq curB1 - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq 
p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq tmp - .unreq tmpp - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - - -.type crt_s32_pure_reduce, %function -.global crt_s32_pure_reduce - .align 4 -crt_s32_pure_reduce_data: - .word CRT_32_P - .word CRT_32_P_TWISTED - .word CRT_32_Q - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED -crt_s32_pure_reduce: - - loop_cnt .req r14 - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_q .req r4 - mod_q_neg .req r5 - p_inv_mod_q .req r6 - p_inv_mod_q_tw .req r7 - mask .req r8 - const1 .req r9 - const0 .req r10 - mod_p_tw .req r11 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q3 - quot_high .req q4 - quot .req q5 - mod_p_vect .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - addr .req r12 - adr addr, crt_s32_pure_reduce_data - ldrd mod_p, mod_p_tw, [addr], #+8 - ldrd mod_q, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - .unreq addr - - mov const1, #1 - mov const0, #0 - - wls loop_cnt, loop_cnt, 2 -1: - // PRELIMINARY ASSUMPTION: - // x and y are already scaled and reduced - - vldrw.s32 in0, [src0], #+16 - vqdmulh.s32 tmp, in0, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.s32 in1, [src1], #+16 - - /* CRT interpolation of (x mod p) and (y mod q) - * - * x + ((y-x)*(p mod q)^{-1} mod q)*p - */ - vsub.s32 diff, in1, in0 - - /* Signed refined Barrett multiplication */ - vqdmulh.s32 quot, diff, p_inv_mod_q_tw - vrshr.s32 quot, quot, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmul.u32 diff, diff, p_inv_mod_q - vmla.s32 diff, quot, mod_q_neg - - /* Compute high and low products separately */ - vmul.i32 quot_low, diff, mod_p_vect - vmulh.s32 quot_high, diff, 
mod_p_vect - - /* Need to do a 64-bit addition to quot_high and quot_low */ - /* Add as u32, and manually add the carry to the upperlanes */ - vadd.u32 quot_low, quot_low, in0 - vpt.u32 HI, in0, quot_low - vaddt.i32 quot_high, quot_high, const1 - /* Need to add the sign bit of in0 */ - vqdmlah.s32 quot_high, in0, const1 - - vst20.32 {quot_low, quot_high}, [dst] - vst21.32 {quot_low, quot_high}, [dst]! - - le loop_cnt, 1b - 2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - - .unreq mod_p - .unreq mod_p_neg - .unreq mod_q - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq mask - .unreq const1 - .unreq mod_p_tw - - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq quot_high - .unreq quot - .unreq mod_p_vect - .unreq tmp - -.type crt_s32_chunk_dechunk, %function -.global crt_s32_chunk_dechunk - .align 4 -crt_s32_chunk_dechunk_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) -crt_s32_chunk_dechunk: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd init_tmp, const_shift22, [addr], 
#+8 - ldr const_shift10, [addr], #+4 - vdup.u32 qmask, init_tmp - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0, q_off] - vldrw.u32 in1, [src1, q_off] - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n src0, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - adds.n dst, #4 - - le loop_cnt, 1b -2: - - /* Use dummy loop for the sake of tail predication */ - mov loop_cnt, #3 - dlstp.32 loop_cnt, loop_cnt -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vshr.s32 carry, in0, #22 - vand in0, in0, qmask - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - vldrw.u32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - letp loop_cnt, 1b - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - -.type crt_s32_chunk_dechunk_reduce, %function -.global crt_s32_chunk_dechunk_reduce - .align 4 -crt_s32_chunk_dechunk_reduce_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word 
(1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0, q_off] - vqdmulh.s32 tmp, in0, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 in1, [src1, q_off] - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n src0, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - adds.n dst, #4 - - le loop_cnt, 1b -2: - - /* Use dummy loop for the sake of tail predication */ - mov loop_cnt, 
#3 - dlstp.32 loop_cnt, loop_cnt -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vshr.s32 carry, in0, #22 - vand in0, in0, qmask - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - vldrw.u32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - letp loop_cnt, 1b - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - -.type crt_s32_chunk_dechunk_reduce_v2, %function -.global crt_s32_chunk_dechunk_reduce_v2 - .align 4 -crt_s32_chunk_dechunk_reduce_v2_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce_v2: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_v2_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - sub loop_cnt, loop_cnt, #1 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, 
[addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - subs.n dst, #4 - - vldrw.u32 in0, [src0, q_off] - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vldrw.u32 in0, [src0, q_off] - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vand.u32 tmp, quot_low, qmask - vqdmlah.s32 carry, quot_low, const_shift22 - vstrw.32 tmp, [dst, q_off] - - /* Use dummy loop for the sake of tail predication */ - adds.n dst, #4 - movs.n const1, #3 - dlstp.32 loop_cnt, 
const1 -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vshr.s32 carry, in0, #22 - vand in0, in0, qmask - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - vldrw.u32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - adds.n dst, #4 - letp loop_cnt, 1b - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - -.type crt_s32_chunk_dechunk_reduce_canonical, %function -.global crt_s32_chunk_dechunk_reduce_canonical - .align 4 -crt_s32_chunk_dechunk_reduce_canonical_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce_canonical: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - sub loop_cnt, loop_cnt, #1 - 
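All of these `crt_s32_*` variants implement the interpolation stated in the comments, x + ((y-x)*(p mod q)^-1 mod q)*p, which recombines residues mod p and mod q into a value mod p*q. A small Python reference of just that formula, with tiny hypothetical moduli standing in for `CRT_32_P` / `CRT_32_Q`:

```python
# Tiny stand-ins for CRT_32_P / CRT_32_Q; any coprime pair works the same way.
P, Q = 17, 23

def crt_pair(xp, yq, p=P, q=Q):
    """Recombine x mod p and x mod q into the unique x in [0, p*q):
    x = xp + ((yq - xp) * p^-1 mod q) * p  -- the formula from the comments."""
    t = ((yq - xp) * pow(p, -1, q)) % q    # pow(p, -1, q): modular inverse (Py 3.8+)
    return xp + t * p

x = 200
assert crt_pair(x % P, x % Q) == x
```

The assembly additionally splits the product `t * p` into low/high halves (`vmul.u32` / `vmulh.s32`) because each 32-bit lane can only hold part of the full-width result.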
ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - /* Save original destination pointer and mask for later */ - push {dst, init_tmp} - .unreq addr - - veor carry, carry, carry - - movs.n const1, #1 - subs.n dst, #4 - - vldrw.u32 in0, [src0, q_off] - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vldrw.u32 in0, [src0, q_off] - vstrw.32 quot_low, [dst, q_off] - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - adds.n src0, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst, q_off] - - /* Use dummy loop for the sake of tail predication */ - adds.n dst, #4 - movs.n const1, #3 - dlstp.32 
loop_cnt, const1 -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - letp loop_cnt, 1b - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - - /* Restore mask and original destination pointer */ - pop {dst, mask} - mov rcarry, #0 - mov loop_cnt, #(CRT_32_SIZE/2) - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - .unreq tmp - -.type crt_s32_chunk_dechunk_reduce_canonical_v2, %function -.global crt_s32_chunk_dechunk_reduce_canonical_v2 - .align 4 -crt_s32_chunk_dechunk_reduce_canonical_v2_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_reduce_canonical_v2: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q5 - 
tmp .req q6 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_v2_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 in0, [src0], #+16 - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vadd.s32 in0, tmpp, in0 - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst], #+16 - le loop_cnt, 1b -2: - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq tmp - .unreq tmpp - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - -.type crt_s32_chunk_dechunk_sub_reduce_canonical_v2, %function -.global crt_s32_chunk_dechunk_sub_reduce_canonical_v2 - .align 4 -crt_s32_chunk_dechunk_sub_reduce_canonical_v2_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_sub_reduce_canonical_v2: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - subs loop_cnt, #1 - - adr addr, crt_s32_chunk_dechunk_reduce_canonical_v2_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, 
original destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0, [src0], #+16 - vldrw.u32 in0p, [src0p], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vsub.s32 in0, in0, in0p - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vldrw.u32 in0p, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vldrw.u32 in0, [src0], #+16 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - le loop_cnt, 1b -2: - - vsub.s32 in0, in0, in0p - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1], #+16 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vand.u32 quot_low, quot_low, qmask - 
vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - -.type crt_s32_chunk_dechunk_sub_reduce_canonical_v3, %function -.global crt_s32_chunk_dechunk_sub_reduce_canonical_v3 - .align 4 -crt_s32_chunk_dechunk_sub_reduce_canonical_v3_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_sub_reduce_canonical_v3: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} 
-
- mov loop_cnt, size, LSR #3
- subs loop_cnt, #1
-
- adr addr, crt_s32_chunk_dechunk_reduce_canonical_v2_data
- ldr init_tmp, [addr], #+4
- vdup.u32 qmask, init_tmp
- /* Save size, original destination pointer and mask for later */
- push {size, dst, init_tmp}
- ldrd mod_p, mod_q_neg, [addr], #+8
- ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8
- vdup.u32 mod_p_vect, mod_p
- ldrd const_shift10, mod_p_tw, [addr], #+8
-
- .unreq addr
-
- movs.n const1, #1
- movs.n rcarry, #0
- neg mod_p_neg, mod_p
-
- /* Load address of additional inputs from stack */
- ldrd src0p, src1p, [sp, #(4*16+12*4)]
-
- vldrw.u32 in0p, [src0], #+16
- vldrw.u32 tmp, [src0p], #+16
- vsub.i32 in0p, in0p, tmp
- vldrw.u32 in1, [src1], #+16
-
- vqdmulh.s32 tmp, in0p, mod_p_tw
- vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1)
- vmla.s32 in0p, tmp, mod_p_neg
- vldrw.u32 tmp, [src1p], #+16
- vsub.s32 in1, in1, tmp
- vsub.s32 diff, in1, in0p
- vqdmulh.s32 tmp, diff, p_inv_mod_q_tw
- vldrw.u32 in0, [src0], #+16
- vmul.u32 diff, diff, p_inv_mod_q
- vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1)
- vmla.s32 diff, tmp, mod_q_neg
- vmul.u32 quot_low, diff, mod_p_vect
- vldrw.u32 tmpp, [src0p], #+16
- vsub.s32 in0, in0, tmpp
- vmulh.s32 tmp, diff, mod_p_vect
- vshr.u32 tmpp, quot_low, #22
- vmla.s32 tmpp, tmp, const_shift10
- vand.u32 quot_low, quot_low, qmask
- vldrw.u32 in1, [src1], #+16
-
- wls loop_cnt, loop_cnt, 2
- .align 2
-1:
-
- vqdmulh.s32 tmp, in0, mod_p_tw
- vshlc tmpp, rcarry, #32
- vmla.s32 tmpp, in0p, const1
- vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1)
- vmla.s32 in0, tmp, mod_p_neg
- vldrw.u32 tmp, [src1p], #+16
- vsub.s32 in1, in1, tmp
- vmla.s32 quot_low, tmpp, const1
- vldrw.u32 tmpp, [src0p], #+16
- vsub.s32 diff, in1, in0
- vqdmulh.s32 tmp, diff, p_inv_mod_q_tw
- vldrw.u32 in0p, [src0], #+16
- vmul.u32 diff, diff, p_inv_mod_q
- vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1)
- vmla.s32 diff, tmp, mod_q_neg
- vstrw.32 quot_low, [dst], #+16
-
vmul.u32 quot_low, diff, mod_p_vect - vsub.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vsub.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vsub.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - - - - -.type crt_s32_chunk_dechunk_add_reduce_canonical, %function -.global crt_s32_chunk_dechunk_add_reduce_canonical - .align 4 -crt_s32_chunk_dechunk_add_reduce_canonical_data: - .word (1<<22) - 1 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_add_reduce_canonical: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r7 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_low .req q2 - qmask .req q3 - mod_p_vect .req q4 - tmpp .req q6 - tmp .req q5 - - in0p .req q7 - - push {r4-r11,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #3 - subs loop_cnt, #1 - - adr addr, crt_s32_chunk_dechunk_add_reduce_canonical_data - ldr init_tmp, [addr], #+4 - vdup.u32 qmask, init_tmp - /* Save size, original 
destination pointer and mask for later */ - push {size, dst, init_tmp} - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - ldrd const_shift10, mod_p_tw, [addr], #+8 - - .unreq addr - - movs.n const1, #1 - movs.n rcarry, #0 - neg mod_p_neg, mod_p - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - vldrw.u32 in0p, [src0], #+16 - vldrw.u32 tmp, [src0p], #+16 - vadd.i32 in0p, in0p, tmp - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vmul.u32 quot_low, diff, mod_p_vect - vldrw.u32 tmpp, [src0p], #+16 - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0p, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0p, in0p, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, 
qmask - vldrw.u32 in1, [src1], #+16 - - vqdmulh.s32 tmp, in0p, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0p, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vldrw.u32 tmpp, [src0p], #+16 - vsub.s32 diff, in1, in0p - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vldrw.u32 in0, [src0], #+16 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vadd.s32 in0, in0, tmpp - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vldrw.u32 in1, [src1], #+16 - - le loop_cnt, 1b -2: - - vqdmulh.s32 tmp, in0, mod_p_tw - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0p, const1 - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p], #+16 - vadd.s32 in1, in1, tmp - vmla.s32 quot_low, tmpp, const1 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - vstrw.32 quot_low, [dst], #+16 - vmul.u32 quot_low, diff, mod_p_vect - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 tmpp, quot_low, #22 - vmla.s32 tmpp, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vshlc tmpp, rcarry, #32 - vmla.s32 tmpp, in0, const1 - vmla.s32 quot_low, tmpp, const1 - vstrw.32 quot_low, [dst], #+16 - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. 
*/ - /* Restore mask and original destination pointer */ - pop {size, dst, mask} - mov rcarry, #0 - mov loop_cnt, size, LSR #1 - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r11,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq quot_low - .unreq qmask - .unreq mod_p_vect - .unreq tmp - .unreq tmpp - - - - - - -.type crt_s32_chunk_dechunk_sub_reduce_canonical, %function -.global crt_s32_chunk_dechunk_sub_reduce_canonical - .align 4 -crt_s32_chunk_dechunk_sub_reduce_canonical_data: - // Scatter/Gather offsets - .word 4*0 - .word 4*1 - .word 4*2 - .word 4*3 - .word CRT_32_P - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED - .word (1<<22) - 1 - .word (1<<(31-22)) - .word (1<<(10)) - .word CRT_32_P_TWISTED -crt_s32_chunk_dechunk_sub_reduce_canonical: - - loop_cnt .req r14 - init_tmp .req r11 // Temporary prior to main loop - addr .req r12 - - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - src0p .req r11 - src1p .req r12 - - mod_p .req r3 - mod_p_neg .req mod_p - mod_p_tw .req r9 - mod_q_neg .req r4 - p_inv_mod_q .req r5 - p_inv_mod_q_tw .req r10 - const_shift22 .req r7 - const_shift10 .req r8 - const1 .req r6 - - curA .req r3 - curB .req r4 - mask .req r5 - rcarry .req r6 - - in0 .req q0 - in1 .req q1 - diff .req in1 - quot_high .req in1 - quot_low .req q2 - carry .req q3 - qmask .req q4 // - mod_p_vect .req q5 // - q_off .req q6 - tmp .req q7 - - push {r4-r12,lr} - vpush {d8-d15} - - mov loop_cnt, size, LSR #2 - - adr addr, 
crt_s32_chunk_dechunk_reduce_canonical_data - vldrw.u32 q_off, [addr], #+16 - vmul.i32 q_off, q_off, loop_cnt - sub loop_cnt, loop_cnt, #1 - ldrd mod_p, mod_q_neg, [addr], #+8 - ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8 - vdup.u32 mod_p_vect, mod_p - neg mod_p_neg, mod_p - ldrd init_tmp, const_shift22, [addr], #+8 - ldrd const_shift10, mod_p_tw, [addr], #+8 - vdup.u32 qmask, init_tmp - - /* Save original destination pointer and mask for later */ - push {dst, init_tmp} - .unreq addr - - /* Load address of additional inputs from stack */ - ldrd src0p, src1p, [sp, #(4*16+12*4)] - - veor carry, carry, carry - - movs.n const1, #1 - subs.n dst, #4 - - vldrw.u32 in0, [src0, q_off] - - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - vldrw.u32 tmp, [src0p, q_off] - vsub.s32 in0, in0, tmp - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p, q_off] - vsub.s32 in1, in1, tmp - adds.n src0, #4 - adds src0p, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - adds src1p, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vldrw.u32 in0, [src0, q_off] - vstrw.32 quot_low, [dst, q_off] - le loop_cnt, 1b -2: - vldrw.u32 tmp, [src0p, q_off] - vsub.s32 in0, in0, tmp - vqdmulh.s32 tmp, in0, mod_p_tw - vldrw.u32 in1, [src1, q_off] - vrshr.s32 tmp, tmp, #(CRT_32_P_REFINED_BARRETT_SHIFT+1) - vmla.s32 in0, tmp, mod_p_neg - vldrw.u32 tmp, [src1p, q_off] - vsub.s32 in1, in1, tmp - adds.n src0, #4 - adds src0p, #4 - vsub.s32 diff, in1, in0 - vqdmulh.s32 tmp, diff, p_inv_mod_q_tw - adds.n dst, #4 - vmul.u32 
diff, diff, p_inv_mod_q - vrshr.s32 tmp, tmp, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1) - vmla.s32 diff, tmp, mod_q_neg - adds.n src1, #4 - adds src1p, #4 - vmul.u32 quot_low, diff, mod_p_vect - vadd.u32 in0, in0, carry - vmulh.s32 tmp, diff, mod_p_vect - vshr.u32 carry, quot_low, #22 - vmla.s32 carry, tmp, const_shift10 - vand.u32 quot_low, quot_low, qmask - vmla.s32 quot_low, in0, const1 - vstrw.32 quot_low, [dst, q_off] - - /* Use dummy loop for the sake of tail predication */ - adds.n dst, #4 - movs.n const1, #3 - dlstp.32 loop_cnt, const1 -1: - vldrw.32 in0, [dst, q_off] - vadd.u32 in0, in0, carry - vstrw.32 in0, [dst, q_off] - letp loop_cnt, 1b - - /* At this point, we have non-canonical limbs of 32-bit. - * Iterate over them in scalar for reduction to canonical form. */ - - /* Restore mask and original destination pointer */ - pop {dst, mask} - mov rcarry, #0 - mov loop_cnt, #(CRT_32_SIZE/2) - wls loop_cnt, loop_cnt, 2 -1: - ldm dst, {curA, curB} - add rcarry, curA, rcarry, ASR #22 - and curA, rcarry, mask - add rcarry, curB, rcarry, ASR #22 - and curB, rcarry, mask - stm dst!, {curA, curB} - le loop_cnt, 1b -2: - - vpop {d8-d15} - pop {r4-r12,lr} - bx lr - - .unreq curA - .unreq curB - .unreq mask - .unreq rcarry - .unreq loop_cnt - .unreq dst - .unreq src0 - .unreq src1 - .unreq mod_p - .unreq mod_p_tw - .unreq mod_p_neg - .unreq mod_q_neg - .unreq p_inv_mod_q - .unreq p_inv_mod_q_tw - .unreq init_tmp - .unreq const1 - .unreq const_shift22 - .unreq const_shift10 - .unreq in0 - .unreq in1 - .unreq diff - .unreq carry - .unreq quot_low - .unreq quot_high - .unreq qmask - .unreq mod_p_vect - .unreq q_off - - -.type crt_s32_pure, %function -.global crt_s32_pure - .align 4 -crt_s32_pure_data: - .word CRT_32_P - .word CRT_32_Q - .word -CRT_32_Q - .word CRT_32_P_INV_MOD_Q - .word CRT_32_P_INV_MOD_Q_TWISTED -crt_s32_pure: - - loop_cnt .req r14 - dst .req r0 - src0 .req r1 - src1 .req r2 - size .req r3 - - mod_p .req r3 - mod_q .req r4 - mod_q_neg .req r5 - p_inv_mod_q 
.req r6
- p_inv_mod_q_tw .req r7
- mask .req r8
- const1 .req r9
- const0 .req r10
-
- in0 .req q0
- in1 .req q1
- diff .req in1
- quot_low .req q3
- quot_high .req q4
- quot .req q5
- mod_p_vect .req q6
-
- push {r4-r11,lr}
- vpush {d8-d15}
-
- mov loop_cnt, size, LSR #2
-
- addr .req r11
- adr addr, crt_s32_pure_data
- ldrd mod_p, mod_q, [addr], #+8
- ldr mod_q_neg, [addr], #+4
- ldrd p_inv_mod_q, p_inv_mod_q_tw, [addr], #+8
- vdup.u32 mod_p_vect, mod_p
- .unreq addr
-
- mov const1, #1
- mov const0, #0
-
- wls loop_cnt, loop_cnt, 2
-1:
- // PRELIMINARY ASSUMPTION:
- // x and y are already scaled and reduced
-
- vldrw.u32 in0, [src0], #+16
- vldrw.u32 in1, [src1], #+16
-
- /* CRT interpolation of (x mod p) and (y mod q)
-  *
-  * x + ((y-x)*(p mod q)^{-1} mod q)*p
-  */
- vsub.s32 diff, in1, in0
-
- /* Unsigned (!) refined Barrett multiplication */
- vqdmulh.s32 quot, diff, p_inv_mod_q_tw
- vrshr.s32 quot, quot, #(CRT_32_P_Q_REFINED_BARRETT_SHIFT+1)
- vmul.u32 diff, diff, p_inv_mod_q
- vmla.s32 diff, quot, mod_q_neg
-
- /* Compute high and low products separately */
- vmul.u32 quot_low, diff, mod_p_vect
- vmulh.s32 quot_high, diff, mod_p_vect
-
- /* Need to do a 64-bit addition to quot_high and quot_low */
- /* Add as u32, and manually add the carry to the upper lanes */
- vadd.u32 quot_low, quot_low, in0
- vpt.u32 HI, in0, quot_low
- vaddt.i32 quot_high, quot_high, const1
- /* Need to add the sign bit of in0 */
- vqdmlah.s32 quot_high, in0, const1
-
- vst20.32 {quot_low, quot_high}, [dst]
- vst21.32 {quot_low, quot_high}, [dst]!
-
- le loop_cnt, 1b
- 2:
-
- vpop {d8-d15}
- pop {r4-r11,lr}
- bx lr
-
- .unreq loop_cnt
- .unreq dst
- .unreq src0
- .unreq src1
-
- .unreq mod_p
- .unreq mod_q
- .unreq mod_q_neg
- .unreq p_inv_mod_q
- .unreq p_inv_mod_q_tw
- .unreq mask
- .unreq const1
-
- .unreq in0
- .unreq in1
- .unreq diff
- .unreq quot_low
- .unreq quot_high
- .unreq quot
- .unreq mod_p_vect
diff --git a/tests/intmulntt/intmulntt.mk b/tests/intmulntt/intmulntt.mk
index a7250a0..5624209 100644
--- a/tests/intmulntt/intmulntt.mk
+++ b/tests/intmulntt/intmulntt.mk
@@ -11,31 +11,12 @@ INTMULNTT_PLATFORMS += m85-an555
 INTMULNTT_SOURCES += main.c
 
 # Assembly sources required for this test
-INTMULNTT_ASMS += crt.s
+INTMULNTT_ASMS += ../../asm/manual/crt/crt.s
 INTMULNTT_ASMS += montgomery.s
-INTMULNTT_ASMS += ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_192_u32_33556993_27792935_incomplete_good.s
-INTMULNTT_ASMS += ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_192_u32_45387457_16877098_incomplete_good.s
-INTMULNTT_ASMS += ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s
-INTMULNTT_ASMS += ntt_192_u32_88299073_9670361_incomplete_good.s
-INTMULNTT_ASMS += ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_192_u32_106117153_62524596_incomplete_good.s
-INTMULNTT_ASMS += ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
-INTMULNTT_ASMS += ntt_192_u32_108643009_1793055_incomplete_good.s
-INTMULNTT_ASMS += ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_384_u32_33556993_15047299_incomplete_good.s
-INTMULNTT_ASMS += ntt_384_u32_45387457_923104_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_384_u32_45387457_923104_incomplete_good.s
-INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s
-INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good_oop.s
-INTMULNTT_ASMS += ntt_384_u32_88299073_4883425_incomplete_good.s
-INTMULNTT_ASMS += ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_384_u32_106117153_1392340_incomplete_good.s
-INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good_bitrev.s
-INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s
-INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good_oop.s
-INTMULNTT_ASMS += ntt_384_u32_108643009_640922_incomplete_good.s
+INTMULNTT_ASM_DIR = ../../asm/auto/ntt_384
+INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s
+INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s
+INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good.s
+INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s
+INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s
+INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_108643009_640922_incomplete_good.s
diff --git a/tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good.s b/tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good.s
deleted file mode 100644
index 85dcc66..0000000
--- a/tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good.s
+++ /dev/null
@@ -1,1390 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to
permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_106117153_62524596_incomplete_good_twiddles -ntt_192_u32_106117153_62524596_incomplete_good_twiddles: // For base multiplication -.word 181897243 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 3242424133 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 112049651 // zeta^160 * 2^31 = 62524596^160 * 2^31 = 54660581 * 2^31 -.word 3804748909 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 62524596^160 * 2586463201 * 2^31 -.word 21893595 // zeta^ 80 * 2^31 = 62524596^ 80 * 2^31 = 91733486 * 2^31 -.word 1329145221 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 80 * 2586463201 * 2^31 -.word 167711653 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31 -.word 16540923 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31 -.word 200606947 // zeta^136 * 2^31 = 62524596^136 * 2^31 = 105862549 * 2^31 -.word 1509569405 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 62524596^136 * 2586463201 * 2^31 -.word 139952163 // zeta^104 * 2^31 = 62524596^104 * 2^31 = 37582414 * 2^31 -.word 4107280445 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 62524596^104 * 2586463201 * 
2^31 -.word 31068557 // zeta^ 24 * 2^31 = 62524596^ 24 * 2^31 = 36202838 * 2^31 -.word 3373286419 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 24 * 2586463201 * 2^31 -.word 105143799 // zeta^184 * 2^31 = 62524596^184 * 2^31 = 52822457 * 2^31 -.word 1977355497 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 62524596^184 * 2586463201 * 2^31 -.word 55615889 // zeta^ 68 * 2^31 = 62524596^ 68 * 2^31 = 39384089 * 2^31 -.word 3137504399 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 68 * 2586463201 * 2^31 -.word 146912053 // zeta^ 36 * 2^31 = 62524596^ 36 * 2^31 = 101908685 * 2^31 -.word 1166997355 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 36 * 2586463201 * 2^31 -.word 122622335 // zeta^148 * 2^31 = 62524596^148 * 2^31 = 17280056 * 2^31 -.word 613116513 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 62524596^148 * 2586463201 * 2^31 -.word 164824309 // zeta^116 * 2^31 = 62524596^116 * 2^31 = 51886295 * 2^31 -.word 1491262891 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 62524596^116 * 2586463201 * 2^31 -.word 138976211 // zeta^ 12 * 2^31 = 62524596^ 12 * 2^31 = 87659826 * 2^31 -.word 4092213901 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 12 * 2586463201 * 2^31 -.word 96098887 // zeta^172 * 2^31 = 62524596^172 * 2^31 = 27892831 * 2^31 -.word 3642612377 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 62524596^172 * 2586463201 * 2^31 -.word 60018025 // zeta^ 92 * 2^31 = 62524596^ 92 * 2^31 = 45785556 * 2^31 -.word 2964370359 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 92 * 2586463201 * 2^31 -.word 54839591 // zeta^ 60 * 2^31 = 62524596^ 60 * 2^31 = 66124790 * 2^31 -.word 2201405369 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 60 * 2586463201 * 2^31 -.word 100184655 // zeta^ 64 * 2^31 = 62524596^ 64 * 2^31 = 51456572 * 2^31 -.word 490218385 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 64 * 2586463201 * 2^31 -.word 175964745 // zeta^ 32 * 2^31 = 62524596^ 32 * 2^31 = 51456573 * 2^31 -.word 3732642519 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 
62524596^ 32 * 2586463201 * 2^31 -.word 44522653 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 4278426371 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 172533401 // zeta^112 * 2^31 = 62524596^112 * 2^31 = 34864379 * 2^31 -.word 1312604295 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 62524596^112 * 2586463201 * 2^31 -.word 72282143 // zeta^ 8 * 2^31 = 62524596^ 8 * 2^31 = 68534739 * 2^31 -.word 187686849 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 8 * 2586463201 * 2^31 -.word 166771937 // zeta^168 * 2^31 = 62524596^168 * 2^31 = 68280135 * 2^31 -.word 1697256255 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 62524596^168 * 2586463201 * 2^31 -.word 107090507 // zeta^ 88 * 2^31 = 62524596^ 88 * 2^31 = 53294696 * 2^31 -.word 2317611797 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 88 * 2586463201 * 2^31 -.word 32041911 // zeta^ 56 * 2^31 = 62524596^ 56 * 2^31 = 89497534 * 2^31 -.word 1395930921 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 56 * 2586463201 * 2^31 -.word 65322253 // zeta^132 * 2^31 = 62524596^132 * 2^31 = 4208468 * 2^31 -.word 3127969939 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 62524596^132 * 2586463201 * 2^31 -.word 14820989 // zeta^100 * 2^31 = 62524596^100 * 2^31 = 43592557 * 2^31 -.word 1970507043 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 62524596^100 * 2586463201 * 2^31 -.word 47409997 // zeta^ 20 * 2^31 = 62524596^ 20 * 2^31 = 54230858 * 2^31 -.word 2803704403 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 20 * 2586463201 * 2^31 -.word 63915179 // zeta^180 * 2^31 = 62524596^180 * 2^31 = 71510914 * 2^31 -.word 3416820917 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 62524596^180 * 2586463201 * 2^31 -.word 116135419 // zeta^ 76 * 2^31 = 62524596^ 76 * 2^31 = 78224322 * 2^31 -.word 652354917 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 76 * 2586463201 * 2^31 -.word 148994477 // zeta^ 44 * 2^31 = 62524596^ 44 * 2^31 = 59766995 * 2^31 -.word 449601523 // zeta^ 44 * f(q^(-1) mod 
2^32) * 2^31 = 62524596^ 44 * 2586463201 * 2^31 -.word 157394715 // zeta^156 * 2^31 = 62524596^156 * 2^31 = 39992363 * 2^31 -.word 2093561925 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 62524596^156 * 2586463201 * 2^31 -.word 111295587 // zeta^124 * 2^31 = 62524596^124 * 2^31 = 85777919 * 2^31 -.word 762964989 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 62524596^124 * 2586463201 * 2^31 -.word 36269561 // zeta^128 * 2^31 = 62524596^128 * 2^31 = 54660580 * 2^31 -.word 562324775 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 62524596^128 * 2586463201 * 2^31 -.word 30337063 // zeta^ 96 * 2^31 = 62524596^ 96 * 2^31 = 106117152 * 2^31 -.word 1052543161 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 96 * 2586463201 * 2^31 -.word 39700905 // zeta^ 16 * 2^31 = 62524596^ 16 * 2^31 = 71252774 * 2^31 -.word 2982362999 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 16 * 2586463201 * 2^31 -.word 190340711 // zeta^176 * 2^31 = 62524596^176 * 2^31 = 14383667 * 2^31 -.word 2965822073 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 62524596^176 * 2586463201 * 2^31 -.word 45462369 // zeta^ 72 * 2^31 = 62524596^ 72 * 2^31 = 37837018 * 2^31 -.word 2597711039 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 72 * 2586463201 * 2^31 -.word 11627359 // zeta^ 40 * 2^31 = 62524596^ 40 * 2^31 = 254604 * 2^31 -.word 2785397889 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 40 * 2586463201 * 2^31 -.word 180192395 // zeta^152 * 2^31 = 62524596^152 * 2^31 = 16619619 * 2^31 -.word 2899036373 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 62524596^152 * 2586463201 * 2^31 -.word 181165749 // zeta^120 * 2^31 = 62524596^120 * 2^31 = 69914315 * 2^31 -.word 921680875 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 62524596^120 * 2586463201 * 2^31 -.word 197413317 // zeta^ 4 * 2^31 = 62524596^ 4 * 2^31 = 62524596 * 2^31 -.word 2324460251 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 4 * 2586463201 * 2^31 -.word 156618417 // zeta^164 * 2^31 = 62524596^164 * 2^31 = 66733064 * 2^31 -.word 1157462895 // zeta^164 
* f(q^(-1) mod 2^32) * 2^31 = 62524596^164 * 2586463201 * 2^31 -.word 148319127 // zeta^ 84 * 2^31 = 62524596^ 84 * 2^31 = 34606239 * 2^31 -.word 878146377 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 84 * 2586463201 * 2^31 -.word 89611971 // zeta^ 52 * 2^31 = 62524596^ 52 * 2^31 = 88837097 * 2^31 -.word 3681850781 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 52 * 2586463201 * 2^31 -.word 63239829 // zeta^140 * 2^31 = 62524596^140 * 2^31 = 46350158 * 2^31 -.word 3845365771 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 62524596^140 * 2586463201 * 2^31 -.word 73258095 // zeta^108 * 2^31 = 62524596^108 * 2^31 = 18457327 * 2^31 -.word 202753393 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 62524596^108 * 2586463201 * 2^31 -.word 100938719 // zeta^ 28 * 2^31 = 62524596^ 28 * 2^31 = 20339234 * 2^31 -.word 3532002305 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 28 * 2586463201 * 2^31 -.word 152216281 // zeta^188 * 2^31 = 62524596^188 * 2^31 = 60331597 * 2^31 -.word 1330596935 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 62524596^188 * 2586463201 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_106117153_62524596_incomplete_good_scale -ntt_192_u32_106117153_62524596_incomplete_good_scale: // Constants for scaling by 1/N -.word 181897243 // 1/48 -.word 3242424133 // 1/48 twisted -.data -roots: -.word 50789515 /// zeta^ 64 * 2^31 = 62524596^ 64 * 2^31 = 51456572 * 2^31 -.word 1041322197 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 64 * 2586463201 * 2^31 -.word 136304203 /// zeta^128 * 2^31 = 62524596^128 * 2^31 = 54660580 * 2^31 -.word 1106161429 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 62524596^128 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 
-.word 86500417 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31 -.word 86500417 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 86500417 // zeta^144 * 2^31 = 62524596^144 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 62524596^144 * 2586463201 * 2^31 -.word 3362131 // zeta^ 72 * 2^31 = 62524596^ 72 * 2^31 = 37837018 * 2^31 -.word 765704461 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 72 * 2586463201 * 2^31 -.word 74219771 // zeta^ 24 * 2^31 = 62524596^ 24 * 2^31 = 36202838 * 2^31 -.word 732633701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 24 * 2586463201 * 2^31 -.word 3362131 // zeta^ 72 * 2^31 = 62524596^ 72 * 2^31 = 37837018 * 2^31 -.word 765704461 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 72 * 2586463201 * 2^31 -.word 207754911 // zeta^132 * 2^31 = 62524596^132 * 2^31 = 4208468 * 2^31 -.word 85166401 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 62524596^132 * 2586463201 * 2^31 -.word 86384727 // zeta^ 84 * 2^31 = 62524596^ 84 * 2^31 = 34606239 * 2^31 -.word 2847807113 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 84 * 2586463201 * 2^31 -.word 74219771 // zeta^ 24 * 2^31 = 62524596^ 24 * 2^31 = 36202838 * 2^31 -.word 732633701 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 24 * 2586463201 * 2^31 -.word 77895747 // zeta^ 12 * 2^31 = 62524596^ 12 * 2^31 = 87659826 * 2^31 -.word 1773964317 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 12 * 2586463201 * 2^31 -.word 
42168601 // zeta^156 * 2^31 = 62524596^156 * 2^31 = 39992363 * 2^31 -.word 2956805639 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 62524596^156 * 2586463201 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_106117153_62524596_incomplete_good, %function -.global ntt_192_u32_106117153_62524596_incomplete_good -ntt_192_u32_106117153_62524596_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 106117153 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// 
input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 
Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release 
input[160] from Q5 -vmul.u32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 
-vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release 
input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] 
-// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// 
input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 
-vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, 
Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 
-vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(160)]
-// Release input[40] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[44]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r10
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vmul.u32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r14,#(-368)]
-// Release input[160] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r12
-// input[32]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(176)]
-// Release input[44] from Q0
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r10
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vmul.u32 Q1, Q1, r9
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(128)]
-// Release input[32] from Q2
-vqrdmulh.s32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r12
-// input[48]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 48)]
-vqrdmulh.s32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r12
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(480)]
-// Release input[120] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-288)]
-// Release input[180] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r10
-// input[184]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -68)]
-vmul.u32 Q2, Q2, r9
-// input[52]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(192)]
-// Release input[48] from Q0
-vqrdmulh.s32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r12
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vqrdmlah.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-272)]
-// Release input[184] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(208)]
-// Release input[52] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r10
-// input[56]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 56)]
-vmul.u32 Q0, Q0, r9
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r12
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(224)]
-// Release input[56] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(464)]
-// Release input[116] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-.equ modulus_inv, 1708504095
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1357
-// Instruction count: 998
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s b/tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
deleted file mode 100644
index 25257ff..0000000
--- a/tests/intmulntt/ntt_192_u32_106117153_62524596_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,1285 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 54660580 /// zeta^128 * 2^31 = 62524596^128 * 2^31 = 54660580 * 2^31
-.word 1106161430 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 62524596^128 * 2586463201 * 2^31
-.word 51456572 /// zeta^ 64 * 2^31 = 62524596^ 64 * 2^31 = 51456572 * 2^31
-.word 1041322197 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 64 * 2586463201 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31
-.word 56869107 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31
-.word 1150855200 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 62524596^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 0 * 2586463201 * 2^31
-.word 56869107 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31
-.word 1150855200 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31
-.word 56869107 // zeta^ 48 * 2^31 = 62524596^ 48 * 2^31 = 56869107 * 2^31
-.word 1150855200 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 48 * 2586463201 * 2^31
-.word 69914315 // zeta^120 * 2^31 = 62524596^120 * 2^31 = 69914315 * 2^31
-.word 1414849946 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 62524596^120 * 2586463201 * 2^31
-.word 68280135 // zeta^168 * 2^31 = 62524596^168 * 2^31 = 68280135 * 2^31
-.word 1381779187 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 62524596^168 * 2586463201 * 2^31
-.word 69914315 // zeta^120 * 2^31 = 62524596^120 * 2^31 = 69914315 * 2^31
-.word 1414849946 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 62524596^120 * 2586463201 * 2^31
-.word 66124790 // zeta^ 60 * 2^31 = 62524596^ 60 * 2^31 = 66124790 * 2^31
-.word 1338161657 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 60 * 2586463201 * 2^31
-.word 18457327 // zeta^108 * 2^31 = 62524596^108 * 2^31 = 18457327 * 2^31
-.word 373519330 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 62524596^108 * 2586463201 * 2^31
-.word 68280135 // zeta^168 * 2^31 = 62524596^168 * 2^31 = 68280135 * 2^31
-.word 1381779187 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 62524596^168 * 2586463201 * 2^31
-.word 71510914 // zeta^180 * 2^31 = 62524596^180 * 2^31 = 71510914 * 2^31
-.word 1447160182 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 62524596^180 * 2586463201 * 2^31
-.word 101908685 // zeta^ 36 * 2^31 = 62524596^ 36 * 2^31 = 101908685 * 2^31
-.word 2062317245 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 62524596^ 36 * 2586463201 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_106117153_62524596_incomplete_good_bitrev, %function
-.global ntt_192_u32_106117153_62524596_incomplete_good_bitrev
-ntt_192_u32_106117153_62524596_incomplete_good_bitrev:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-.equ modulus, -106117153
-movw r12, #:lower16:modulus
-movt r12, #:upper16:modulus
-ldr r11, roots_addr
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r10
-vadd.s32 Q5, Q0, Q1
-// Release input[64] from Q0
-vqrdmulh.s32 Q4, Q2, r9
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r12
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[160]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -92)]
-vadd.s32 Q6, Q4, Q3
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, 
Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 
-vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] 
-vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// 
input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from 
Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, 
Q5 -vmul.u32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 
-vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release 
input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// 
Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q2, Q2, r9 -// input[160]: Load 
as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] 
-vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vmul.u32 Q1, Q3, r10 
-vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 
Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 
Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 
-vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 1708504095 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good.s b/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good.s deleted file mode 100644 index b077a5d..0000000 --- a/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_108643009_1793055_incomplete_good_twiddles -ntt_192_u32_108643009_1793055_incomplete_good_twiddles: // For base multiplication -.word 125819369 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 3200325335 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 7219049 // zeta^160 * 2^31 = 1793055^160 * 2^31 = 40973034 * 2^31 -.word 3635407191 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 1793055^160 * 3479293249 * 2^31 -.word 20524789 // zeta^ 80 * 2^31 = 1793055^ 80 * 2^31 = 13028154 * 2^31 -.word 778955 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 80 * 3479293249 * 2^31 -.word 41573363 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 255463501 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 90655441 // zeta^136 * 2^31 = 1793055^136 * 2^31 = 21310129 * 2^31 -.word 2443944943 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 1793055^136 * 3479293249 * 2^31 -.word 147417303 // zeta^104 * 2^31 = 1793055^104 * 2^31 = 26332312 * 2^31 -.word 3916510825 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 1793055^104 * 3479293249 * 2^31 -.word 11354681 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31 -.word 3881929351 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31 -.word 183168985 // zeta^184 * 2^31 = 1793055^184 * 2^31 = 38250802 * 2^31 -.word 3222230247 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 1793055^184 * 3479293249 * 2^31 -.word 10759601 // zeta^ 68 * 2^31 = 1793055^ 68 * 2^31 = 106639146 * 2^31 -.word 4004030735 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 68 * 
3479293249 * 2^31 -.word 48748081 // zeta^ 36 * 2^31 = 1793055^ 36 * 2^31 = 108432201 * 2^31 -.word 3574358159 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 36 * 3479293249 * 2^31 -.word 118657223 // zeta^148 * 2^31 = 1793055^148 * 2^31 = 62017780 * 2^31 -.word 2448614009 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 1793055^148 * 3479293249 * 2^31 -.word 135399931 // zeta^116 * 2^31 = 1793055^116 * 2^31 = 56179088 * 2^31 -.word 3293739077 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 1793055^116 * 3479293249 * 2^31 -.word 22236245 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31 -.word 1317309803 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31 -.word 173577835 // zeta^172 * 2^31 = 1793055^172 * 2^31 = 42747918 * 2^31 -.word 3951239125 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 1793055^172 * 3479293249 * 2^31 -.word 97528185 // zeta^ 92 * 2^31 = 1793055^ 92 * 2^31 = 105229554 * 2^31 -.word 1192661831 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 92 * 3479293249 * 2^31 -.word 73825049 // zeta^ 60 * 2^31 = 1793055^ 60 * 2^31 = 14289518 * 2^31 -.word 4126063015 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 60 * 3479293249 * 2^31 -.word 210066969 // zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31 -.word 659560103 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31 -.word 9957311 // zeta^ 32 * 2^31 = 1793055^ 32 * 2^31 = 67669976 * 2^31 -.word 3859885441 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 32 * 3479293249 * 2^31 -.word 175712655 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 4039503793 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31 -.word 87594435 // zeta^112 * 2^31 = 1793055^112 * 2^31 = 100073230 * 2^31 -.word 4040282749 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 1793055^112 * 3479293249 * 2^31 -.word 69868715 // zeta^ 8 * 2^31 = 1793055^ 8 * 2^31 = 82310697 * 2^31 -.word 378456469 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 8 * 
3479293249 * 2^31 -.word 51881147 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31 -.word 2822401413 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31 -.word 34117033 // zeta^ 88 * 2^31 = 1793055^ 88 * 2^31 = 70392207 * 2^31 -.word 1072737047 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 88 * 3479293249 * 2^31 -.word 154114723 // zeta^ 56 * 2^31 = 1793055^ 56 * 2^31 = 44058032 * 2^31 -.word 659699101 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 56 * 3479293249 * 2^31 -.word 168537937 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31 -.word 720609135 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31 -.word 70654529 // zeta^100 * 2^31 = 1793055^100 * 2^31 = 106849954 * 2^31 -.word 429672575 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 1793055^100 * 3479293249 * 2^31 -.word 81886087 // zeta^ 20 * 2^31 = 1793055^ 20 * 2^31 = 52463921 * 2^31 -.word 1001228217 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 20 * 3479293249 * 2^31 -.word 91900301 // zeta^180 * 2^31 = 1793055^180 * 2^31 = 5838692 * 2^31 -.word 3449842227 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1793055^180 * 3479293249 * 2^31 -.word 43708183 // zeta^ 76 * 2^31 = 1793055^ 76 * 2^31 = 65895091 * 2^31 -.word 343728169 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 76 * 3479293249 * 2^31 -.word 174587437 // zeta^ 44 * 2^31 = 1793055^ 44 * 2^31 = 56126250 * 2^31 -.word 1661037971 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 44 * 3479293249 * 2^31 -.word 143460969 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31 -.word 168904279 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31 -.word 132346145 // zeta^124 * 2^31 = 1793055^124 * 2^31 = 90940036 * 2^31 -.word 1361566111 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 1793055^124 * 3479293249 * 2^31 -.word 207328707 // zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31 -.word 435081853 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 
3479293249 * 2^31 -.word 91466649 // zeta^ 96 * 2^31 = 1793055^ 96 * 2^31 = 108643008 * 2^31 -.word 1094641959 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 96 * 3479293249 * 2^31 -.word 129691583 // zeta^ 16 * 2^31 = 1793055^ 16 * 2^31 = 8569779 * 2^31 -.word 254684545 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 16 * 3479293249 * 2^31 -.word 196761229 // zeta^176 * 2^31 = 1793055^176 * 2^31 = 95614855 * 2^31 -.word 4294188339 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 1793055^176 * 3479293249 * 2^31 -.word 165404871 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31 -.word 1472565881 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31 -.word 126630577 // zeta^ 40 * 2^31 = 1793055^ 40 * 2^31 = 87332880 * 2^31 -.word 1851022351 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 40 * 3479293249 * 2^31 -.word 63171295 // zeta^152 * 2^31 = 1793055^152 * 2^31 = 64584977 * 2^31 -.word 3635268193 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 1793055^152 * 3479293249 * 2^31 -.word 205931337 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31 -.word 413037943 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31 -.word 146631489 // zeta^ 4 * 2^31 = 1793055^ 4 * 2^31 = 1793055 * 2^31 -.word 3865294719 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 4 * 3479293249 * 2^31 -.word 206526417 // zeta^164 * 2^31 = 1793055^164 * 2^31 = 2003863 * 2^31 -.word 290936559 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 1793055^164 * 3479293249 * 2^31 -.word 125385717 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31 -.word 845125067 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31 -.word 98628795 // zeta^ 52 * 2^31 = 1793055^ 52 * 2^31 = 46625229 * 2^31 -.word 1846353285 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 52 * 3479293249 * 2^31 -.word 42698581 // zeta^140 * 2^31 = 1793055^140 * 2^31 = 52516759 * 2^31 -.word 2633929323 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 1793055^140 * 
3479293249 * 2^31 -.word 195049773 // zeta^108 * 2^31 = 1793055^108 * 2^31 = 9768841 * 2^31 -.word 2977657491 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1793055^108 * 3479293249 * 2^31 -.word 84939873 // zeta^ 28 * 2^31 = 1793055^ 28 * 2^31 = 17702973 * 2^31 -.word 2933401183 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 28 * 3479293249 * 2^31 -.word 119757833 // zeta^188 * 2^31 = 1793055^188 * 2^31 = 3413455 * 2^31 -.word 3102305463 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 1793055^188 * 3479293249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_108643009_1793055_incomplete_good_scale -ntt_192_u32_108643009_1793055_incomplete_good_scale: // Constants for scaling by 1/N -.word 125819369 // 1/48 -.word 3200325335 // 1/48 twisted -.data -roots: -.word 67669975 /// zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31 -.word 40973033 /// zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 
3479293249 * 2^31 -.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31 -.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31 -.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31 -.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31 -.word 210808 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31 -.word 102804317 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31 -.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31 -.word 98874168 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31 -.word 94353491 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_108643009_1793055_incomplete_good, %function -.global ntt_192_u32_108643009_1793055_incomplete_good -ntt_192_u32_108643009_1793055_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -108643009 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, 
r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// 
input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, 
Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as 
Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 
-// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] 
-vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 
-vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, 
Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 
-vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 
Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, 
Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[92]: Load as Q0 
-vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] 
-vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, 
[r0,#(176)] -// Release input[44] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release 
input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 815674047 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s b/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s deleted file mode 100644 index a66e58d..0000000 --- a/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software 
without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 40973033 /// zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31 -.word 67669975 /// zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 21597933 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31 -.word 26334175 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31 -.word 103620826 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31 -.word 26334175 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31 -.word 14289518 // zeta^ 60 * 2^31 = 1793055^ 60 * 2^31 = 14289518 * 2^31 -.word 
282452654 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 60 * 3479293249 * 2^31 -.word 9768841 // zeta^108 * 2^31 = 1793055^108 * 2^31 = 9768841 * 2^31 -.word 193095041 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1793055^108 * 3479293249 * 2^31 -.word 103620826 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31 -.word 5838692 // zeta^180 * 2^31 = 1793055^180 * 2^31 = 5838692 * 2^31 -.word 115410055 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1793055^180 * 3479293249 * 2^31 -.word 108432201 // zeta^ 36 * 2^31 = 1793055^ 36 * 2^31 = 108432201 * 2^31 -.word 2143316728 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 36 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_108643009_1793055_incomplete_good_bitrev, %function -.global ntt_192_u32_108643009_1793055_incomplete_good_bitrev -ntt_192_u32_108643009_1793055_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -108643009 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 
-// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 
-vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// 
input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 
Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, 
[r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 
Q4, Q1, Q0
-// input[168]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -84)]
-vmla.s32 Q3, Q2, r12
-vstrw.u32 Q4, [r0,#(288)]
-vadd.s32 Q1, Q1, Q0
-// Release input[72] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[168]: Already loaded as Q7
-// input[108]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r6
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q7, Q7, Q6
-// Release input[108] from Q6
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q7
-// Release input[168] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[120]: Already loaded as Q7
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q7, Q7, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[136]: Already loaded as Q6
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[40]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[40]: Already loaded as Q7
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[88]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[40] from Q7
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[88]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[184]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q6
-// Release input[88] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[184]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q7
-// Release input[184] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[104]: Already loaded as Q7
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[152]: Already loaded as Q6
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r6
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q6
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[56]: Already loaded as Q7
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r6
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vqrdmulh.s32 Q0, Q0, r5
-vsub.s32 Q2, Q3, Q7
-// input[96]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 96)]
-vmla.s32 Q1, Q0, r12
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[56] from Q7
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[48]: Already loaded as Q5
-vmul.u32 Q0, Q5, r10
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r9
-// input[96]: Already loaded as Q6
-vmla.s32 Q0, Q5, r12
-vmul.u32 Q2, Q1, r10
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r9
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r12
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r6
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r12
-// input[112]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 112)]
-vmul.u32 Q4, Q6, r8
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r7
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vmla.s32 Q4, Q6, r12
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r0,#(384)]
-// Release input[96] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[112]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[64]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 64)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[176]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -76)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(448)]
-// Release input[112] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[176]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(256)]
-// Release input[64] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-304)]
-// Release input[176] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[72]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 72)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[184]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[184]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(288)]
-// Release input[72] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[56]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 56)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-272)]
-// Release input[184] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[56]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(224)]
-// Release input[56] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[180]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[52]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(208)]
-// Release input[52] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(16)]
-// Release input[4] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r10, r9, [r11], #+8
-ldrd r8, r7, [r11], #+8
-ldrd r6, r5, [r11], #+8
-// input[60]: Already loaded as Q1
-vmul.u32 Q0, Q1, r10
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q1, Q1, r9
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q0, Q1, r12
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vmul.u32 Q2, Q3, r10
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r12
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vmul.u32 Q5, Q1, r6
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r5
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r12
-// input[124]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 124)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[124]: Already loaded as Q2
-vmul.u32 Q1, Q2, r10
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q2, Q2, r9
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q1, Q2, r12
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vmul.u32 Q0, Q3, r10
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r12
-// input[76]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 76)]
-vmul.u32 Q5, Q2, r6
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r5
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r12
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vmul.u32 Q6, Q4, r8
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(496)]
-// Release input[124] from Q2
-vmla.s32 Q6, Q4, r12
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r10
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r9
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vmla.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vmul.u32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vmul.u32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r12
-vmul.u32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vmla.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-.equ modulus_inv, 815674047
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1253
-// Instruction count: 895
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s b/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
deleted file mode 100644
index 8f2c6ae..0000000
--- a/tests/intmulntt/ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input.s
+++ /dev/null
@@ -1,1237 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_twiddles
-ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_twiddles: // For base multiplication
-.word 125819369 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 3200325335 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 7219049 // zeta^160 * 2^31 = 1793055^160 * 2^31 = 40973034 * 2^31
-.word 3635407191 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 1793055^160 * 3479293249 * 2^31
-.word 20524789 // zeta^ 80 * 2^31 = 1793055^ 80 * 2^31 = 13028154 * 2^31
-.word 778955 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 80 * 3479293249 * 2^31
-.word 41573363 // zeta^ 48 * 2^31 = 1793055^ 48 * 2^31 = 21597933 * 2^31
-.word 255463501 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 48 * 3479293249 * 2^31
-.word 90655441 // zeta^136 * 2^31 = 1793055^136 * 2^31 = 21310129 * 2^31
-.word 2443944943 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 1793055^136 * 3479293249 * 2^31
-.word 147417303 // zeta^104 * 2^31 = 1793055^104 * 2^31 = 26332312 * 2^31
-.word 3916510825 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 1793055^104 * 3479293249 * 2^31
-.word 11354681 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31
-.word 3881929351 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31
-.word 183168985 // zeta^184 * 2^31 = 1793055^184 * 2^31 = 38250802 * 2^31
-.word 3222230247 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 1793055^184 * 3479293249 * 2^31
-.word 10759601 // zeta^ 68 * 2^31 = 1793055^ 68 * 2^31 = 106639146 * 2^31
-.word 4004030735 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 68 * 3479293249 * 2^31
-.word 48748081 // zeta^ 36 * 2^31 = 1793055^ 36 * 2^31 = 108432201 * 2^31
-.word 3574358159 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 36 * 3479293249 * 2^31
-.word 118657223 // zeta^148 * 2^31 = 1793055^148 * 2^31 = 62017780 * 2^31
-.word 2448614009 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 1793055^148 * 3479293249 * 2^31
-.word 135399931 // zeta^116 * 2^31 = 1793055^116 * 2^31 = 56179088 * 2^31
-.word 3293739077 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 1793055^116 * 3479293249 * 2^31
-.word 22236245 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31
-.word 1317309803 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31
-.word 173577835 // zeta^172 * 2^31 = 1793055^172 * 2^31 = 42747918 * 2^31
-.word 3951239125 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 1793055^172 * 3479293249 * 2^31
-.word 97528185 // zeta^ 92 * 2^31 = 1793055^ 92 * 2^31 = 105229554 * 2^31
-.word 1192661831 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 92 * 3479293249 * 2^31
-.word 73825049 // zeta^ 60 * 2^31 = 1793055^ 60 * 2^31 = 14289518 * 2^31
-.word 4126063015 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 60 * 3479293249 * 2^31
-.word 210066969 // zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31
-.word 659560103 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31
-.word 9957311 // zeta^ 32 * 2^31 = 1793055^ 32 * 2^31 = 67669976 * 2^31
-.word 3859885441 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 32 * 3479293249 * 2^31
-.word 175712655 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 4039503793 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 87594435 // zeta^112 * 2^31 = 1793055^112 * 2^31 = 100073230 * 2^31
-.word 4040282749 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 1793055^112 * 3479293249 * 2^31
-.word 69868715 // zeta^ 8 * 2^31 = 1793055^ 8 * 2^31 = 82310697 * 2^31
-.word 378456469 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 8 * 3479293249 * 2^31
-.word 51881147 // zeta^168 * 2^31 = 1793055^168 * 2^31 = 103620826 * 2^31
-.word 2822401413 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1793055^168 * 3479293249 * 2^31
-.word 34117033 // zeta^ 88 * 2^31 = 1793055^ 88 * 2^31 = 70392207 * 2^31
-.word 1072737047 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 88 * 3479293249 * 2^31
-.word 154114723 // zeta^ 56 * 2^31 = 1793055^ 56 * 2^31 = 44058032 * 2^31
-.word 659699101 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 56 * 3479293249 * 2^31
-.word 168537937 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31
-.word 720609135 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31
-.word 70654529 // zeta^100 * 2^31 = 1793055^100 * 2^31 = 106849954 * 2^31
-.word 429672575 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 1793055^100 * 3479293249 * 2^31
-.word 81886087 // zeta^ 20 * 2^31 = 1793055^ 20 * 2^31 = 52463921 * 2^31
-.word 1001228217 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 20 * 3479293249 * 2^31
-.word 91900301 // zeta^180 * 2^31 = 1793055^180 * 2^31 = 5838692 * 2^31
-.word 3449842227 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1793055^180 * 3479293249 * 2^31
-.word 43708183 // zeta^ 76 * 2^31 = 1793055^ 76 * 2^31 = 65895091 * 2^31
-.word 343728169 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 76 * 3479293249 * 2^31
-.word 174587437 // zeta^ 44 * 2^31 = 1793055^ 44 * 2^31 = 56126250 * 2^31
-.word 1661037971 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 44 * 3479293249 * 2^31
-.word 143460969 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31
-.word 168904279 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31
-.word 132346145 // zeta^124 * 2^31 = 1793055^124 * 2^31 = 90940036 * 2^31
-.word 1361566111 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 1793055^124 * 3479293249 * 2^31
-.word 207328707 // zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31
-.word 435081853 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31
-.word 91466649 // zeta^ 96 * 2^31 = 1793055^ 96 * 2^31 = 108643008 * 2^31
-.word 1094641959 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 96 * 3479293249 * 2^31
-.word 129691583 // zeta^ 16 * 2^31 = 1793055^ 16 * 2^31 = 8569779 * 2^31
-.word 254684545 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 16 * 3479293249 * 2^31
-.word 196761229 // zeta^176 * 2^31 = 1793055^176 * 2^31 = 95614855 * 2^31
-.word 4294188339 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 1793055^176 * 3479293249 * 2^31
-.word 165404871 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31
-.word 1472565881 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31
-.word 126630577 // zeta^ 40 * 2^31 = 1793055^ 40 * 2^31 = 87332880 * 2^31
-.word 1851022351 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 40 * 3479293249 * 2^31
-.word 63171295 // zeta^152 * 2^31 = 1793055^152 * 2^31 = 64584977 * 2^31
-.word 3635268193 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 1793055^152 * 3479293249 * 2^31
-.word 205931337 // zeta^120 * 2^31 = 1793055^120 * 2^31 = 26334175 * 2^31
-.word 413037943 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1793055^120 * 3479293249 * 2^31
-.word 146631489 // zeta^ 4 * 2^31 = 1793055^ 4 * 2^31 = 1793055 * 2^31
-.word 3865294719 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 4 * 3479293249 * 2^31
-.word 206526417 // zeta^164 * 2^31 = 1793055^164 * 2^31 = 2003863 * 2^31
-.word 290936559 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 1793055^164 * 3479293249 * 2^31
-.word 125385717 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31
-.word 845125067 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31
-.word 98628795 // zeta^ 52 * 2^31 = 1793055^ 52 * 2^31 = 46625229 * 2^31
-.word 1846353285 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 52 * 3479293249 * 2^31
-.word 42698581 // zeta^140 * 2^31 = 1793055^140 * 2^31 = 52516759 * 2^31
-.word 2633929323 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 1793055^140 * 3479293249 * 2^31
-.word 195049773 // zeta^108 * 2^31 = 1793055^108 * 2^31 = 9768841 * 2^31
-.word 2977657491 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1793055^108 * 3479293249 * 2^31
-.word 84939873 // zeta^ 28 * 2^31 = 1793055^ 28 * 2^31 = 17702973 * 2^31
-.word 2933401183 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 28 * 3479293249 * 2^31
-.word 119757833 // zeta^188 * 2^31 = 1793055^188 * 2^31 = 3413455 * 2^31
-.word 3102305463 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 1793055^188 * 3479293249 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_scale
-ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N
-.word 125819369 // 1/48
-.word 3200325335 // 1/48 twisted
-.data
-roots:
-.word 67669975 /// zeta^ 64 * 2^31 = 1793055^ 64 * 2^31 = 67669975 * 2^31
-.word 1337593335 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 64 * 3479293249 * 2^31
-.word 40973033 /// zeta^128 * 2^31 = 1793055^128 * 2^31 = 40973033 * 2^31
-.word 809890293 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1793055^128 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 1 // zeta^ 0 * 2^31 = 1793055^ 0 * 2^31 = 1 * 2^31
-.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 0 * 3479293249 * 2^31
-.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 87045076 // zeta^144 * 2^31 = 1793055^144 * 2^31 = 87045076 * 2^31
-.word 1720569773 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1793055^144 * 3479293249 * 2^31
-.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31
-.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31
-.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31
-.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31
-.word 5022183 // zeta^ 72 * 2^31 = 1793055^ 72 * 2^31 = 5022183 * 2^31
-.word 99270592 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 72 * 3479293249 * 2^31
-.word 210808 // zeta^132 * 2^31 = 1793055^132 * 2^31 = 210808 * 2^31
-.word 4166920 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1793055^132 * 3479293249 * 2^31
-.word 102804317 // zeta^ 84 * 2^31 = 1793055^ 84 * 2^31 = 102804317 * 2^31
-.word 2032073593 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 84 * 3479293249 * 2^31
-.word 82308834 // zeta^ 24 * 2^31 = 1793055^ 24 * 2^31 = 82308834 * 2^31
-.word 1626951211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 24 * 3479293249 * 2^31
-.word 98874168 // zeta^ 12 * 2^31 = 1793055^ 12 * 2^31 = 98874168 * 2^31
-.word 1954388607 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1793055^ 12 * 3479293249 * 2^31
-.word 94353491 // zeta^156 * 2^31 = 1793055^156 * 2^31 = 94353491 * 2^31
-.word 1865030994 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1793055^156 * 3479293249 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input, %function
-.global ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input
-ntt_192_u32_108643009_1793055_incomplete_good_oop_half_input:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 256
-add r14, r0, #256
-// Use r12 as marker for r0 + 512
-add r12, r14, #256
-// Use r11 as marker for r1 + 1008
-add r11, r1, #1008
-.equ modulus, -108643009
-movw r10, #:lower16:modulus
-movt r10, #:upper16:modulus
-ldr r9, roots_addr
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vmul.u32 Q2, Q0, r8
-vadd.s32 Q4, Q1, Q0
-// input[4]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 4)]
-vqrdmulh.s32 Q3, Q0, r7
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q5, Q1, Q0
-vmla.s32 Q2, Q3, r10
-vstrw.u32 Q4, [r1,#(0)]
-vadd.s32 Q3, Q1, Q2
-vstrw.u32 Q3, [r1,#(256)]
-vsub.s32 Q5, Q5, Q2
-vstrw.u32 Q5, [r11,#(-496)]
-// Release input[0] from Q1
-// Release input[64] from Q0
-// input[4]: Already loaded as Q6
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q6, Q7
-vmul.u32 Q1, Q0, r8
-// input[72]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 72)]
-vadd.s32 Q2, Q6, Q7
-vqrdmulh.s32 Q0, Q0, r7
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q6
-// Release input[4] from Q6
-vstrw.u32 Q2, [r11,#(-480)]
-vsub.s32 Q5, Q1, Q7
-// Release input[68] from Q7
-vstrw.u32 Q5, [r1,#(16)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(272)]
-// input[8]: Already loaded as Q4
-// input[72]: Already loaded as Q3
-vmul.u32 Q0, Q4, r8
-vadd.s32 Q2, Q3, Q4
-// input[12]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 12)]
-vqrdmulh.s32 Q1, Q4, r7
-// input[76]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 76)]
-vsub.s32 Q5, Q3, Q4
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(288)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r1,#(32)]
-vsub.s32 Q5, Q5, Q0
-vstrw.u32 Q5, [r11,#(-464)]
-// Release input[72] from Q3
-// Release input[8] from Q4
-// input[76]: Already loaded as Q7
-// input[12]: Already loaded as Q6
-vmul.u32 Q0, Q7, r8
-vadd.s32 Q2, Q6, Q7
-// input[16]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 16)]
-vqrdmulh.s32 Q1, Q7, r7
-// input[80]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 80)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(48)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(304)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(-448)]
-// Release input[12] from Q6
-// Release input[76] from Q7
-// input[16]: Already loaded as Q4
-// input[80]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r8
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r7
-// input[20]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 20)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q4
-// Release input[16] from Q4
-vstrw.u32 Q2, [r11,#(-432)]
-vsub.s32 Q4, Q1, Q5
-// Release input[80] from Q5
-vstrw.u32 Q4, [r1,#(64)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(320)]
-// input[20]: Already loaded as Q6
-// input[84]: Already loaded as Q3
-vmul.u32 Q0, Q6, r8
-vadd.s32 Q2, Q3, Q6
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q6, r7
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vsub.s32 Q4, Q3, Q6
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(336)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r1,#(80)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(-416)]
-// Release input[84] from Q3
-// Release input[20] from Q6
-// input[88]: Already loaded as Q7
-// input[24]: Already loaded as Q5
-vmul.u32 Q0, Q7, r8
-vadd.s32 Q2, Q5, Q7
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmulh.s32 Q1, Q7, r7
-// input[92]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 92)]
-vsub.s32 Q3, Q5, Q7
-vmla.s32 Q0, Q1, r10
-vstrw.u32 Q2, [r1,#(96)]
-vadd.s32 Q1, Q5, Q0
-vstrw.u32 Q1, [r1,#(352)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(-400)]
-// Release input[24] from Q5
-// Release input[88] from Q7
-// input[28]: Already loaded as Q4
-// input[92]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-vmul.u32 Q1, Q0, r8
-vadd.s32 Q2, Q4, Q6
-vqrdmulh.s32 Q0, Q0, r7
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r10
-vneg.s32 Q0, Q4
-// Release input[28] from Q4
-vstrw.u32 Q2, [r11,#(-384)]
-vsub.s32 Q4, Q1, Q6
-// Release input[92] from Q6
-vstrw.u32 Q4, [r1,#(112)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(368)]
-// input[32]: Already loaded as Q3
-vmul.u32 Q0, Q3, r8
-vneg.s32 Q1, Q3
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmulh.s32 Q2, Q3, r7
-vstrw.u32 Q3, [r1,#(384)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(128)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-368)]
-// Release input[32] from Q3
-// input[36]: Already loaded as Q4
-vstrw.u32 Q4, [r1,#(144)]
-vstrw.u32 Q4, [r1,#(400)]
-vstrw.u32 Q4, [r11,#(-352)]
-// Release input[36] from Q4
-// input[40]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 40)]
-vmul.u32 Q1, Q0, r8
-vneg.s32 Q2, Q0
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmulh.s32 Q3, Q0, r7
-vstrw.u32 Q0, [r11,#(-336)]
-vmla.s32 Q1, Q3, r10
-vstrw.u32 Q1, [r1,#(160)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r1,#(416)]
-// Release input[40] from Q0
-// input[44]: Already loaded as Q4
-vmul.u32 Q0, Q4, r8
-vneg.s32 Q1, Q4
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vqrdmulh.s32 Q2, Q4, r7
-vstrw.u32 Q4, [r1,#(432)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(176)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-320)]
-// Release input[44] from Q4
-// input[48]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(192)]
-vstrw.u32 Q3, [r1,#(448)]
-vstrw.u32 Q3, [r11,#(-304)]
-// Release input[48] from Q3
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vmul.u32 Q1, Q0, r8
-vneg.s32 Q2, Q0
-// input[56]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 56)]
-vqrdmulh.s32 Q3, Q0, r7
-vstrw.u32 Q0, [r11,#(-288)]
-vmla.s32 Q1, Q3, r10
-vstrw.u32 Q1, [r1,#(208)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r1,#(464)]
-// Release input[52] from Q0
-// input[56]: Already loaded as Q4
-vmul.u32 Q0, Q4, r8
-vneg.s32 Q1, Q4
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vqrdmulh.s32 Q2, Q4, r7
-vstrw.u32 Q4, [r1,#(480)]
-vmla.s32 Q0, Q2, r10
-vstrw.u32 Q0, [r1,#(224)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-272)]
-// Release input[56] from Q4
-// input[60]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(240)]
-vstrw.u32 Q3, [r1,#(496)]
-vstrw.u32 Q3, [r11,#(-256)]
-// Release input[60] from Q3
-//////////// END OF RADIX 3 //////////////////////////
-ldrd r8, r7, [r9], #+8
-ldrd r6, r5, [r9], #+8
-ldrd r4, r3, [r9], #+8
-// output[144]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -108)]
-// output[48]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 48)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r4
-// output[96]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 96)]
-vadd.s32 Q0, Q0, Q1
-// Release output[48] from Q1
-// output[0]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// output[180]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -72)]
-vadd.s32 Q1, Q1, Q4
-// Release output[96] from Q4
-vqrdmulh.s32 Q2, Q2, r3
-vsub.s32 Q4, Q1, Q0
-// output[84]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 84)]
-vmla.s32 Q3, Q2, r10
-vstrw.u32 Q4, [r11,#(-432)]
-vadd.s32 Q1, Q1, Q0
-// Release output[144] from Q0
-vstrw.u32 Q1, [r1,#(0)]
-// Release output[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r1,#(384)]
-// output[84]: Already loaded as Q7
-// output[180]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r4
-// output[36]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 36)]
-vadd.s32 Q7, Q7, Q6
-// Release output[180] from Q6
-// output[132]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// output[120]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[36] from Q2
-vqrdmulh.s32 Q0, Q0, r3
-vsub.s32 Q2, Q3, Q7
-// output[24]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 24)]
-vmla.s32 Q1, Q0, r10
-vstrw.u32 Q2, [r1,#(336)]
-vadd.s32 Q3, Q3, Q7
-// Release output[84] from Q7
-vstrw.u32 Q3, [r11,#(-480)]
-// Release output[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-288)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(144)]
-// output[24]: Already loaded as Q6
-// output[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r4
-//
output[168]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[72]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// output[60]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release output[168] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[156]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release output[24] from Q6 -vstrw.u32 Q3, [r1,#(288)] -// Release output[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-336)] -// output[156]: Already loaded as Q7 -// output[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[108]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release output[60] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[108] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[16]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 16)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release output[156] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(432)] -// output[16]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[64]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// output[52]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release output[160] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[148]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vmla.s32 Q1, Q0, r10 
-vstrw.u32 Q2, [r1,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release output[16] from Q6 -vstrw.u32 Q3, [r1,#(256)] -// Release output[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-368)] -// output[148]: Already loaded as Q7 -// output[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[100]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release output[52] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[184]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release output[100] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[88]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 88)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release output[148] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(400)] -// output[88]: Already loaded as Q6 -// output[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release output[184] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[40] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[28]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 28)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release output[88] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(160)] -// output[28]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 
-// Release output[124] from Q5 -// output[76]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// output[176]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release output[172] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[80]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 80)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release output[28] from Q7 -vstrw.u32 Q3, [r1,#(304)] -// Release output[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// output[80]: Already loaded as Q6 -// output[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release output[176] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[32] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[20]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release output[80] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(128)] -// output[20]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[68]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// output[56]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release output[164] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[152]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -100)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release output[20] from 
Q7 -vstrw.u32 Q3, [r1,#(272)] -// Release output[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-352)] -// output[152]: Already loaded as Q6 -// output[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[104]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release output[56] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[188]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release output[104] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[92]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 92)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release output[152] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(416)] -// output[92]: Already loaded as Q7 -// output[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[44]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release output[188] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[12]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release output[44] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[132]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -120)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release output[92] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(176)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r8 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r7 -// output[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, 
r10 -vmul.u32 Q2, Q1, r8 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r7 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r10 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r4 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r10 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vmul.u32 Q4, Q6, r6 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r5 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(48)] -// Release output[12] from Q5 -vmla.s32 Q4, Q6, r10 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(-480)] -// Release output[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r7 -// output[4]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 4)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[140]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(304)] -// Release output[76] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(16)] -// Release output[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r7 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 
-// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[156]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -96)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-448)] -// Release output[140] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r7 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[28]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-384)] -// Release output[156] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[88]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r7 -// output[148]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-432)] -// Release output[144] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[16]: Load 
as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 
96)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 
-vmla.s32 Q5, Q0, r10 -// output[60]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 
-// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(496)] -// Release output[124] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release output[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(208)] -// Release output[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[56]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r7 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -vmul.u32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r1,#(224)] -// Release output[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -.equ modulus_inv, 815674047 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1204 -// Instruction count: 900 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good.s b/tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good.s deleted file mode 100644 index 2f9ae0e..0000000 --- a/tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission 
is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_192_u32_33556993_27792935_incomplete_good_twiddles -ntt_192_u32_33556993_27792935_incomplete_good_twiddles: // For base multiplication -.word 56716939 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2862874485 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 25646259 // zeta^160 * 2^31 = 27792935^160 * 2^31 = 25038562 * 2^31 -.word 3487279437 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 27792935^160 * 375649793 * 2^31 -.word 17110297 // zeta^ 80 * 2^31 = 27792935^ 80 * 2^31 = 2013241 * 2^31 -.word 3456754919 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 80 * 375649793 * 2^31 -.word 35519885 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 11895923 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 43957141 // zeta^136 * 2^31 = 27792935^136 * 2^31 = 29356361 * 2^31 -.word 478221931 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 27792935^136 * 375649793 * 2^31 -.word 35166687 // zeta^104 * 2^31 = 27792935^104 * 2^31 = 32616688 * 2^31 -.word 2932743201 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 27792935^104 * 375649793 * 2^31 -.word 17906339 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 3247252317 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 2473265 // zeta^184 * 2^31 = 27792935^184 * 2^31 = 23624597 * 2^31 -.word 1545219279 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 27792935^184 * 375649793 * 2^31 -.word 62160579 // zeta^ 68 * 2^31 = 27792935^ 68 * 2^31 = 2711401 * 2^31 -.word 3272810301 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 68 * 375649793 * 2^31 -.word 21606357 // zeta^ 36 * 2^31 = 27792935^ 36 * 2^31 = 26823146 * 2^31 -.word 3797983787 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 36 * 375649793 * 2^31 -.word 65117653 // zeta^148 * 2^31 = 27792935^148 * 2^31 = 21166324 * 2^31 -.word 3608458283 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 27792935^148 * 375649793 * 2^31 -.word 14684899 
// zeta^116 * 2^31 = 27792935^116 * 2^31 = 518908 * 2^31 -.word 3710962461 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 27792935^116 * 375649793 * 2^31 -.word 57316651 // zeta^ 12 * 2^31 = 27792935^ 12 * 2^31 = 14745691 * 2^31 -.word 798824661 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 12 * 375649793 * 2^31 -.word 2740735 // zeta^172 * 2^31 = 27792935^172 * 2^31 = 15739856 * 2^31 -.word 2960008193 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 27792935^172 * 375649793 * 2^31 -.word 1288749 // zeta^ 92 * 2^31 = 27792935^ 92 * 2^31 = 33153165 * 2^31 -.word 1828591571 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 92 * 375649793 * 2^31 -.word 50827131 // zeta^ 60 * 2^31 = 27792935^ 60 * 2^31 = 20044445 * 2^31 -.word 2860990085 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 60 * 375649793 * 2^31 -.word 41467727 // zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 807687857 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 64627673 // zeta^ 32 * 2^31 = 27792935^ 32 * 2^31 = 8518432 * 2^31 -.word 3670562343 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 32 * 375649793 * 2^31 -.word 31594101 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31 -.word 4283071371 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31 -.word 15147405 // zeta^112 * 2^31 = 27792935^112 * 2^31 = 19715532 * 2^31 -.word 3444858995 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 27792935^112 * 375649793 * 2^31 -.word 31947299 // zeta^ 8 * 2^31 = 27792935^ 8 * 2^31 = 940305 * 2^31 -.word 1362224093 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 8 * 375649793 * 2^31 -.word 42347447 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1840446025 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 64640721 // zeta^ 88 * 2^31 = 27792935^ 88 * 2^31 = 9932396 * 2^31 -.word 2749748015 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 88 * 375649793 * 2^31 -.word 48990067 // zeta^ 
56 * 2^31 = 27792935^ 56 * 2^31 = 24511972 * 2^31 -.word 1702033037 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 56 * 375649793 * 2^31 -.word 45507629 // zeta^132 * 2^31 = 27792935^132 * 2^31 = 6733847 * 2^31 -.word 496983507 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 27792935^132 * 375649793 * 2^31 -.word 6997229 // zeta^100 * 2^31 = 27792935^100 * 2^31 = 9445248 * 2^31 -.word 3769793811 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 27792935^100 * 375649793 * 2^31 -.word 52429087 // zeta^ 20 * 2^31 = 27792935^ 20 * 2^31 = 33038085 * 2^31 -.word 584004833 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 20 * 375649793 * 2^31 -.word 16875761 // zeta^180 * 2^31 = 27792935^180 * 2^31 = 20647416 * 2^31 -.word 4192463119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 27792935^180 * 375649793 * 2^31 -.word 64373251 // zeta^ 76 * 2^31 = 27792935^ 76 * 2^31 = 17817137 * 2^31 -.word 1334959101 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 76 * 375649793 * 2^31 -.word 21018923 // zeta^ 44 * 2^31 = 27792935^ 44 * 2^31 = 32562828 * 2^31 -.word 2133783765 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 44 * 375649793 * 2^31 -.word 16286855 // zeta^156 * 2^31 = 27792935^156 * 2^31 = 13512548 * 2^31 -.word 1433977209 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 27792935^156 * 375649793 * 2^31 -.word 51132597 // zeta^124 * 2^31 = 27792935^124 * 2^31 = 13108720 * 2^31 -.word 3262568779 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 27792935^124 * 375649793 * 2^31 -.word 2486313 // zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 624404951 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 10397047 // zeta^ 96 * 2^31 = 27792935^ 96 * 2^31 = 33556992 * 2^31 -.word 1432092809 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 96 * 375649793 * 2^31 -.word 51966581 // zeta^ 16 * 2^31 = 27792935^ 16 * 2^31 = 13841461 * 2^31 -.word 850108299 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 16 * 375649793 * 2^31 -.word 50003689 // zeta^176 
* 2^31 = 27792935^176 * 2^31 = 31543752 * 2^31 -.word 838212375 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 27792935^176 * 375649793 * 2^31 -.word 24766539 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2454521269 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 23156845 // zeta^ 40 * 2^31 = 27792935^ 40 * 2^31 = 4200632 * 2^31 -.word 3816745363 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 40 * 375649793 * 2^31 -.word 18123919 // zeta^152 * 2^31 = 27792935^152 * 2^31 = 9045021 * 2^31 -.word 2592934257 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 27792935^152 * 375649793 * 2^31 -.word 49207647 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 1047714977 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 60116757 // zeta^ 4 * 2^31 = 27792935^ 4 * 2^31 = 24111745 * 2^31 -.word 525173483 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 4 * 375649793 * 2^31 -.word 4953407 // zeta^164 * 2^31 = 27792935^164 * 2^31 = 30845592 * 2^31 -.word 1022156993 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 27792935^164 * 375649793 * 2^31 -.word 50238225 // zeta^ 84 * 2^31 = 27792935^ 84 * 2^31 = 12909577 * 2^31 -.word 102504175 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 84 * 375649793 * 2^31 -.word 1996333 // zeta^ 52 * 2^31 = 27792935^ 52 * 2^31 = 12390669 * 2^31 -.word 686509011 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 52 * 375649793 * 2^31 -.word 46095063 // zeta^140 * 2^31 = 27792935^140 * 2^31 = 994165 * 2^31 -.word 2161183529 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 27792935^140 * 375649793 * 2^31 -.word 9797335 // zeta^108 * 2^31 = 27792935^108 * 2^31 = 18811302 * 2^31 -.word 3496142633 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 27792935^108 * 375649793 * 2^31 -.word 15981389 // zeta^ 28 * 2^31 = 27792935^ 28 * 2^31 = 20448273 * 2^31 -.word 1032398515 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 28 * 375649793 * 2^31 -.word 65825237 // zeta^188 * 2^31 = 
27792935^188 * 2^31 = 403828 * 2^31
-.word 2466375723 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 27792935^188 * 375649793 * 2^31
-// End of twiddles for base multiplication
-
-.global ntt_192_u32_33556993_27792935_incomplete_good_scale
-ntt_192_u32_33556993_27792935_incomplete_good_scale: // Constants for scaling by 1/N
-.word 56716939 // 1/48
-.word 2862874485 // 1/48 twisted
-.data
-roots:
-.word 893127 /// zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31
-.word 66384763 /// zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31
-.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31
-.word 29095681 // zeta^144 * 2^31 = 27792935^144 * 2^31 = 17702291 * 2^31
-.word 3280343807 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 27792935^144 * 375649793 * 2^31
-.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31
-.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^
72 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 14476917 // zeta^ 72 * 2^31 = 27792935^ 72 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 72 * 375649793 * 2^31 -.word 18598075 // zeta^132 * 2^31 = 27792935^132 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 27792935^132 * 375649793 * 2^31 -.word 4885007 // zeta^ 84 * 2^31 = 27792935^ 84 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 84 * 375649793 * 2^31 -.word 43317805 // zeta^ 24 * 2^31 = 27792935^ 24 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 24 * 375649793 * 2^31 -.word 64683161 // zeta^ 12 * 2^31 = 27792935^ 12 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 12 * 375649793 * 2^31 -.word 34427601 // zeta^156 * 2^31 = 27792935^156 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 27792935^156 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_33556993_27792935_incomplete_good, %function -.global ntt_192_u32_33556993_27792935_incomplete_good -ntt_192_u32_33556993_27792935_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] 
-vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// 
input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// 
input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, 
Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// 
input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release 
input[168] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already 
loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, 
[r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, 
[r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, 
[r0, #(4 * 76)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 
-vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vmul.u32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from 
Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s b/tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s deleted file mode 100644 index 0107319..0000000 --- a/tests/intmulntt/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software 
without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 66384763 /// zeta^128 * 2^31 = 27792935^128 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 27792935^128 * 375649793 * 2^31 -.word 893127 /// zeta^ 64 * 2^31 = 27792935^ 64 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 64 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 27792935^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 0 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 38018305 // zeta^ 48 * 2^31 = 27792935^ 48 * 2^31 = 15854702 * 2^31 -.word 1014623487 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 48 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 23796181 // zeta^120 * 2^31 = 27792935^120 * 2^31 = 18977417 * 2^31 -.word 3361945643 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 27792935^120 * 375649793 * 2^31 -.word 32686385 
// zeta^ 60 * 2^31 = 27792935^ 60 * 2^31 = 20044445 * 2^31 -.word 3430230223 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 60 * 375649793 * 2^31 -.word 2430825 // zeta^108 * 2^31 = 27792935^108 * 2^31 = 18811302 * 2^31 -.word 1203831447 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 27792935^108 * 375649793 * 2^31 -.word 52637069 // zeta^168 * 2^31 = 27792935^168 * 2^31 = 30296666 * 2^31 -.word 1938838643 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 27792935^168 * 375649793 * 2^31 -.word 62228979 // zeta^180 * 2^31 = 27792935^180 * 2^31 = 20647416 * 2^31 -.word 1321333773 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 27792935^180 * 375649793 * 2^31 -.word 48515911 // zeta^ 36 * 2^31 = 27792935^ 36 * 2^31 = 26823146 * 2^31 -.word 1716550329 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 27792935^ 36 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_33556993_27792935_incomplete_good_bitrev, %function -.global ntt_192_u32_33556993_27792935_incomplete_good_bitrev -ntt_192_u32_33556993_27792935_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 33556993 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release 
input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vmul.u32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded 
as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 
Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 
Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] 
-vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: 
Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, 
[r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, 
Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 
-vqrdmulh.s32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, 
Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: 
Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: 
Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[148]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[156]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r9 -// input[44]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 3919317503 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good.s b/tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good.s deleted file mode 100644 index bb1b225..0000000 --- a/tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good.s +++ /dev/null @@ -1,1390 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or 
substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_45387457_16877098_incomplete_good_twiddles -ntt_192_u32_45387457_16877098_incomplete_good_twiddles: // For base multiplication -.word 3050923 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 1370315925 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 58792077 // zeta^160 * 2^31 = 16877098^160 * 2^31 = 27201077 * 2^31 -.word 2040215347 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 16877098^160 * 450429249 * 2^31 -.word 9560897 // zeta^ 80 * 2^31 = 16877098^ 80 * 2^31 = 43749424 * 2^31 -.word 217647999 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 80 * 450429249 * 2^31 -.word 16322801 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 4050174415 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 65472817 // zeta^136 * 2^31 = 16877098^136 * 2^31 = 6908982 * 2^31 -.word 1455716751 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 16877098^136 * 450429249 * 2^31 -.word 34650023 // zeta^104 * 2^31 = 16877098^104 * 2^31 = 38432301 * 2^31 -.word 3329931161 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 16877098^104 * 450429249 * 2^31 -.word 22720737 // zeta^ 24 * 2^31 = 16877098^ 24 * 2^31 = 40340716 * 2^31 -.word 2223977951 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 24 * 450429249 * 2^31 -.word 41602881 // zeta^184 * 2^31 = 
16877098^184 * 2^31 = 24079121 * 2^31 -.word 2615746431 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 16877098^184 * 450429249 * 2^31 -.word 16163777 // zeta^ 68 * 2^31 = 16877098^ 68 * 2^31 = 4138342 * 2^31 -.word 1646504703 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 68 * 450429249 * 2^31 -.word 34282441 // zeta^ 36 * 2^31 = 16877098^ 36 * 2^31 = 21015440 * 2^31 -.word 2024876279 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 36 * 450429249 * 2^31 -.word 63685223 // zeta^148 * 2^31 = 16877098^148 * 2^31 = 12104035 * 2^31 -.word 4264705241 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 16877098^148 * 450429249 * 2^31 -.word 43044525 // zeta^116 * 2^31 = 16877098^116 * 2^31 = 41757216 * 2^31 -.word 96983315 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 16877098^116 * 450429249 * 2^31 -.word 87208369 // zeta^ 12 * 2^31 = 16877098^ 12 * 2^31 = 38013065 * 2^31 -.word 3798348047 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 12 * 450429249 * 2^31 -.word 8670255 // zeta^172 * 2^31 = 16877098^172 * 2^31 = 4764854 * 2^31 -.word 4552977 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 16877098^172 * 450429249 * 2^31 -.word 26490773 // zeta^ 92 * 2^31 = 16877098^ 92 * 2^31 = 34257499 * 2^31 -.word 3319352875 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 92 * 450429249 * 2^31 -.word 25361455 // zeta^ 60 * 2^31 = 16877098^ 60 * 2^31 = 20563366 * 2^31 -.word 2431306001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 60 * 450429249 * 2^31 -.word 31982837 // zeta^ 64 * 2^31 = 16877098^ 64 * 2^31 = 18186380 * 2^31 -.word 2254751947 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 64 * 450429249 * 2^31 -.word 80421217 // zeta^ 32 * 2^31 = 16877098^ 32 * 2^31 = 18186381 * 2^31 -.word 3625067871 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 32 * 450429249 * 2^31 -.word 74452113 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 244792879 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 38625553 // zeta^112 * 2^31 = 
16877098^112 * 2^31 = 29011006 * 2^31 -.word 462440879 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 16877098^112 * 450429249 * 2^31 -.word 56124891 // zeta^ 8 * 2^31 = 16877098^ 8 * 2^31 = 6955156 * 2^31 -.word 965036133 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 8 * 450429249 * 2^31 -.word 76210251 // zeta^168 * 2^31 = 16877098^168 * 2^31 = 13864138 * 2^31 -.word 2420752885 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 16877098^168 * 450429249 * 2^31 -.word 49172033 // zeta^ 88 * 2^31 = 16877098^ 88 * 2^31 = 21308336 * 2^31 -.word 1679220863 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 88 * 450429249 * 2^31 -.word 26505313 // zeta^ 56 * 2^31 = 16877098^ 56 * 2^31 = 16261595 * 2^31 -.word 3903198815 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 56 * 450429249 * 2^31 -.word 56492473 // zeta^132 * 2^31 = 16877098^132 * 2^31 = 24372017 * 2^31 -.word 2270091015 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 16877098^132 * 450429249 * 2^31 -.word 27268793 // zeta^100 * 2^31 = 16877098^100 * 2^31 = 28510359 * 2^31 -.word 3916595719 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 16877098^100 * 450429249 * 2^31 -.word 47730389 // zeta^ 20 * 2^31 = 16877098^ 20 * 2^31 = 3630241 * 2^31 -.word 4197983979 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 20 * 450429249 * 2^31 -.word 66028155 // zeta^180 * 2^31 = 16877098^180 * 2^31 = 15734276 * 2^31 -.word 4167721925 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 16877098^180 * 450429249 * 2^31 -.word 82104659 // zeta^ 76 * 2^31 = 16877098^ 76 * 2^31 = 40622603 * 2^31 -.word 4290414317 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 76 * 450429249 * 2^31 -.word 33150657 // zeta^ 44 * 2^31 = 16877098^ 44 * 2^31 = 33248211 * 2^31 -.word 3793795071 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 44 * 450429249 * 2^31 -.word 65413459 // zeta^156 * 2^31 = 16877098^156 * 2^31 = 24824091 * 2^31 -.word 1863661293 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 16877098^156 * 450429249 * 2^31 -.word 46516775 // zeta^124 * 2^31 = 
16877098^124 * 2^31 = 13694133 * 2^31 -.word 888046873 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 16877098^124 * 450429249 * 2^31 -.word 10353697 // zeta^128 * 2^31 = 16877098^128 * 2^31 = 27201076 * 2^31 -.word 669899423 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 16877098^128 * 450429249 * 2^31 -.word 87723991 // zeta^ 96 * 2^31 = 16877098^ 96 * 2^31 = 45387456 * 2^31 -.word 2924651369 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 96 * 450429249 * 2^31 -.word 52149361 // zeta^ 16 * 2^31 = 16877098^ 16 * 2^31 = 16376451 * 2^31 -.word 3832526415 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 16 * 450429249 * 2^31 -.word 81214017 // zeta^176 * 2^31 = 16877098^176 * 2^31 = 1638033 * 2^31 -.word 4077319295 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 16877098^176 * 450429249 * 2^31 -.word 14564663 // zeta^ 72 * 2^31 = 16877098^ 72 * 2^31 = 31523319 * 2^31 -.word 1874214409 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 72 * 450429249 * 2^31 -.word 25302097 // zeta^ 40 * 2^31 = 16877098^ 40 * 2^31 = 38478475 * 2^31 -.word 2839250543 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 40 * 450429249 * 2^31 -.word 64269601 // zeta^152 * 2^31 = 16877098^152 * 2^31 = 29125862 * 2^31 -.word 391768479 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 16877098^152 * 450429249 * 2^31 -.word 68054177 // zeta^120 * 2^31 = 16877098^120 * 2^31 = 5046741 * 2^31 -.word 2070989343 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 16877098^120 * 450429249 * 2^31 -.word 63506121 // zeta^ 4 * 2^31 = 16877098^ 4 * 2^31 = 16877098 * 2^31 -.word 378371575 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 4 * 450429249 * 2^31 -.word 74611137 // zeta^164 * 2^31 = 16877098^164 * 2^31 = 41249115 * 2^31 -.word 2648462591 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 16877098^164 * 450429249 * 2^31 -.word 24746759 // zeta^ 84 * 2^31 = 16877098^ 84 * 2^31 = 29653181 * 2^31 -.word 127245369 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 84 * 450429249 * 2^31 -.word 27089691 // zeta^ 52 * 2^31 = 
16877098^ 52 * 2^31 = 33283422 * 2^31 -.word 30262053 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 52 * 450429249 * 2^31 -.word 57624257 // zeta^140 * 2^31 = 16877098^140 * 2^31 = 12139246 * 2^31 -.word 501172223 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 16877098^140 * 450429249 * 2^31 -.word 3566545 // zeta^108 * 2^31 = 16877098^108 * 2^31 = 7374392 * 2^31 -.word 496619247 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 16877098^108 * 450429249 * 2^31 -.word 44258139 // zeta^ 28 * 2^31 = 16877098^ 28 * 2^31 = 31693324 * 2^31 -.word 3406920421 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 28 * 450429249 * 2^31 -.word 64284141 // zeta^188 * 2^31 = 16877098^188 * 2^31 = 11129958 * 2^31 -.word 975614419 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 16877098^188 * 450429249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_45387457_16877098_incomplete_good_scale -ntt_192_u32_45387457_16877098_incomplete_good_scale: // Constants for scaling by 1/N -.word 3050923 // 1/48 -.word 1370315925 // 1/48 twisted -.data -roots: -.word 9023783 /// zeta^ 64 * 2^31 = 16877098^ 64 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 64 * 450429249 * 2^31 -.word 22090505 /// zeta^128 * 2^31 = 16877098^128 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 16877098^128 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 78782351 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 
16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 78782351 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 78782351 // zeta^144 * 2^31 = 16877098^144 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 16877098^144 * 450429249 * 2^31 -.word 88323005 // zeta^ 72 * 2^31 = 16877098^ 72 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 72 * 450429249 * 2^31 -.word 84188761 // zeta^ 24 * 2^31 = 16877098^ 24 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 24 * 450429249 * 2^31 -.word 88323005 // zeta^ 72 * 2^31 = 16877098^ 72 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 72 * 450429249 * 2^31 -.word 16804439 // zeta^132 * 2^31 = 16877098^132 * 2^31 = 24372017 * 2^31 -.word 3300632809 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 16877098^132 * 450429249 * 2^31 -.word 19157039 // zeta^ 84 * 2^31 = 16877098^ 84 * 2^31 = 29653181 * 2^31 -.word 3550508305 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 84 * 450429249 * 2^31 -.word 84188761 // zeta^ 24 * 2^31 = 16877098^ 24 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 24 * 450429249 * 2^31 -.word 65804887 // zeta^ 12 * 2^31 = 16877098^ 12 * 2^31 = 38013065 * 2^31 -.word 3946051817 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 12 * 450429249 * 2^31 -.word 82969997 // zeta^156 * 2^31 = 16877098^156 * 2^31 = 24824091 * 2^31 -.word 3322022451 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 16877098^156 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_45387457_16877098_incomplete_good, %function -.global 
ntt_192_u32_45387457_16877098_incomplete_good -ntt_192_u32_45387457_16877098_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 45387457 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// 
Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 
88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, 
#(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] 
-vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, 
Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] 
-vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] 
-vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 
-vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 
* 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vmul.u32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as 
Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vmul.u32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 
-vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 
-vqrdmulh.s32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 
104)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmul.u32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 
* -68)] -vmul.u32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vmul.u32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -vqrdmulh.s32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 3844538047 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - 
-// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s b/tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s deleted file mode 100644 index b2d471f..0000000 --- a/tests/intmulntt/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 22090505 /// zeta^128 * 2^31 = 16877098^128 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 16877098^128 * 450429249 * 2^31 -.word 9023783 /// zeta^ 64 * 2^31 = 16877098^ 64 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 64 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 16877098^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 11992563 // zeta^ 48 * 2^31 = 16877098^ 48 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 48 * 450429249 * 2^31 -.word 6586153 // zeta^120 * 2^31 = 16877098^120 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 16877098^120 * 450429249 * 2^31 -.word 2451909 // zeta^168 * 2^31 = 16877098^168 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 16877098^168 * 450429249 * 2^31 -.word 6586153 // zeta^120 * 2^31 = 16877098^120 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 16877098^120 * 450429249 * 2^31 -.word 7804917 // zeta^ 
60 * 2^31 = 16877098^ 60 * 2^31 = 20563366 * 2^31 -.word 972944843 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 60 * 450429249 * 2^31 -.word 24970027 // zeta^108 * 2^31 = 16877098^108 * 2^31 = 7374392 * 2^31 -.word 348915477 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 16877098^108 * 450429249 * 2^31 -.word 2451909 // zeta^168 * 2^31 = 16877098^168 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 16877098^168 * 450429249 * 2^31 -.word 71617875 // zeta^180 * 2^31 = 16877098^180 * 2^31 = 15734276 * 2^31 -.word 744458989 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 16877098^180 * 450429249 * 2^31 -.word 73970475 // zeta^ 36 * 2^31 = 16877098^ 36 * 2^31 = 21015440 * 2^31 -.word 994334485 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 16877098^ 36 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_45387457_16877098_incomplete_good_bitrev, %function -.global ntt_192_u32_45387457_16877098_incomplete_good_bitrev -ntt_192_u32_45387457_16877098_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, 45387457 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vmul.u32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 
-vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vmul.u32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vqrdmlah.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// 
input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, 
r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, 
r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] 
-vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: 
Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vqrdmlah.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, 
[r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, 
Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 
-vqrdmulh.s32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, 
Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vmul.u32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r12 -vqrdmulh.s32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vqrdmulh.s32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vqrdmlah.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: 
Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: 
Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[152]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[148]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vqrdmulh.s32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r10 -// input[156]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vqrdmlah.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r9 -// input[44]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r12
-vstrw.u32 Q1, [r0,#(304)]
-// Release input[76] from Q1
-vqrdmulh.s32 Q1, Q3, r10
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r9
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r12
-// input[140]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q0, r6
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r5
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r12
-vqrdmulh.s32 Q1, Q4, r8
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r7
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q1, Q4, r12
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-448)]
-// Release input[140] from Q2
-.equ modulus_inv, 3844538047
-movw r10, #:lower16:modulus_inv
-movt r10, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 1253
-// Instruction count: 895
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good.s b/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good.s
deleted file mode 100644
index 6713e33..0000000
--- a/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good.s
+++ /dev/null
@@ -1,1390 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_192_u32_88299073_9670361_incomplete_good_twiddles
-ntt_192_u32_88299073_9670361_incomplete_good_twiddles: // For base multiplication
-.word 62163489 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31
-.word 3607823391 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31
-.word 31257951 // zeta^160 * 2^31 = 9670361^160 * 2^31 = 2534357 * 2^31
-.word 3835714657 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 9670361^160 * 2066201025 * 2^31
-.word 149330681 // zeta^ 80 * 2^31 = 9670361^ 80 * 2^31 = 5579523 * 2^31
-.word 1698183495 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 80 * 2066201025 * 2^31
-.word 12706985 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31
-.word 1188395927 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31
-.word 128346011 // zeta^136 * 2^31 = 9670361^136 * 2^31 = 41822566 * 2^31
-.word 3104825637 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 9670361^136 * 2066201025 * 2^31
-.word 75665841 // zeta^104 * 2^31 = 9670361^104 * 2^31 = 76960665 * 2^31
-.word 22036623 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 9670361^104 * 2066201025 * 2^31
-.word 153143541 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31
-.word 415003211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31
-.word 63785177 // zeta^184 * 2^31 = 9670361^184 * 2^31
= 22220342 * 2^31 -.word 1820869479 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 9670361^184 * 2066201025 * 2^31 -.word 135693903 // zeta^ 68 * 2^31 = 9670361^ 68 * 2^31 = 55309930 * 2^31 -.word 887924593 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 68 * 2066201025 * 2^31 -.word 159293731 // zeta^ 36 * 2^31 = 9670361^ 36 * 2^31 = 64980291 * 2^31 -.word 4156635549 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 36 * 2066201025 * 2^31 -.word 96170719 // zeta^148 * 2^31 = 9670361^148 * 2^31 = 62204288 * 2^31 -.word 2461155041 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 9670361^148 * 2066201025 * 2^31 -.word 63279659 // zeta^116 * 2^31 = 9670361^116 * 2^31 = 32274711 * 2^31 -.word 1943976597 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 9670361^116 * 2066201025 * 2^31 -.word 53905215 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31 -.word 1284407937 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31 -.word 146011323 // zeta^172 * 2^31 = 9670361^172 * 2^31 = 41675533 * 2^31 -.word 1123834885 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 9670361^172 * 2066201025 * 2^31 -.word 24870969 // zeta^ 92 * 2^31 = 9670361^ 92 * 2^31 = 67630520 * 2^31 -.word 3062637575 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 92 * 2066201025 * 2^31 -.word 150915321 // zeta^ 60 * 2^31 = 9670361^ 60 * 2^31 = 78801296 * 2^31 -.word 3619654471 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 60 * 2066201025 * 2^31 -.word 145340195 // zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 459252637 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 119204611 // zeta^ 32 * 2^31 = 9670361^ 32 * 2^31 = 85764717 * 2^31 -.word 4067076029 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 32 * 2066201025 * 2^31 -.word 163891161 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 3106571367 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 48324623 // zeta^112 * 2^31 = 9670361^112 * 2^31 = 
69154324 * 2^31 -.word 509787569 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 9670361^112 * 2066201025 * 2^31 -.word 100932305 // zeta^ 8 * 2^31 = 9670361^ 8 * 2^31 = 11338408 * 2^31 -.word 4272930671 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 8 * 2066201025 * 2^31 -.word 140979243 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 3082789013 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 112812969 // zeta^ 88 * 2^31 = 9670361^ 88 * 2^31 = 66078731 * 2^31 -.word 2474097815 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 88 * 2066201025 * 2^31 -.word 1059291 // zeta^ 56 * 2^31 = 9670361^ 56 * 2^31 = 43898970 * 2^31 -.word 2889101029 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 56 * 2066201025 * 2^31 -.word 17304415 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31 -.word 138331745 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31 -.word 64699245 // zeta^100 * 2^31 = 9670361^100 * 2^31 = 78628712 * 2^31 -.word 1026256339 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 9670361^100 * 2066201025 * 2^31 -.word 113318487 // zeta^ 20 * 2^31 = 9670361^ 20 * 2^31 = 56024362 * 2^31 -.word 2350990697 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 20 * 2066201025 * 2^31 -.word 121190133 // zeta^180 * 2^31 = 9670361^180 * 2^31 = 29929577 * 2^31 -.word 517178443 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 9670361^180 * 2066201025 * 2^31 -.word 30586823 // zeta^ 76 * 2^31 = 9670361^ 76 * 2^31 = 46623540 * 2^31 -.word 3171132409 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 76 * 2066201025 * 2^31 -.word 172791111 // zeta^ 44 * 2^31 = 9670361^ 44 * 2^31 = 23363129 * 2^31 -.word 160573049 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 44 * 2066201025 * 2^31 -.word 25682825 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31 -.word 675312823 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31 -.word 138852867 // zeta^124 * 2^31 = 9670361^124 * 2^31 = 77128297 * 
2^31 -.word 3737950397 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 9670361^124 * 2066201025 * 2^31 -.word 57393535 // zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 227891265 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 114434657 // zeta^ 96 * 2^31 = 9670361^ 96 * 2^31 = 88299072 * 2^31 -.word 687143903 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 96 * 2066201025 * 2^31 -.word 128273523 // zeta^ 16 * 2^31 = 9670361^ 16 * 2^31 = 19144749 * 2^31 -.word 3785179725 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 16 * 2066201025 * 2^31 -.word 27267465 // zeta^176 * 2^31 = 9670361^176 * 2^31 = 82719550 * 2^31 -.word 2596783799 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 9670361^176 * 2066201025 * 2^31 -.word 35618903 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 1212178281 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 48252135 // zeta^ 40 * 2^31 = 9670361^ 40 * 2^31 = 46476507 * 2^31 -.word 1190141657 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 40 * 2066201025 * 2^31 -.word 175538855 // zeta^152 * 2^31 = 9670361^152 * 2^31 = 44400103 * 2^31 -.word 1405866265 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 9670361^152 * 2066201025 * 2^31 -.word 23454605 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 3879964083 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 111898901 // zeta^ 4 * 2^31 = 9670361^ 4 * 2^31 = 9670361 * 2^31 -.word 3268710955 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 4 * 2066201025 * 2^31 -.word 40904243 // zeta^164 * 2^31 = 9670361^164 * 2^31 = 32989143 * 2^31 -.word 3407042701 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 9670361^164 * 2066201025 * 2^31 -.word 55408013 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31 -.word 3777788851 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31 -.word 80427427 // zeta^ 52 * 2^31 = 9670361^ 52 * 2^31 = 26094785 * 2^31 -.word 
1833812253 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 52 * 2066201025 * 2^31 -.word 3807035 // zeta^140 * 2^31 = 9670361^140 * 2^31 = 64935944 * 2^31 -.word 4134394245 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 9670361^140 * 2066201025 * 2^31 -.word 122692931 // zeta^108 * 2^31 = 9670361^108 * 2^31 = 23260411 * 2^31 -.word 3010559357 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 9670361^108 * 2066201025 * 2^31 -.word 37745279 // zeta^ 28 * 2^31 = 9670361^ 28 * 2^31 = 11170776 * 2^31 -.word 557016897 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 28 * 2066201025 * 2^31 -.word 151727177 // zeta^188 * 2^31 = 9670361^188 * 2^31 = 20668553 * 2^31 -.word 1232329719 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 9670361^188 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_88299073_9670361_incomplete_good_scale -ntt_192_u32_88299073_9670361_incomplete_good_scale: // Constants for scaling by 1/N -.word 62163489 // 1/48 -.word 3607823391 // 1/48 twisted -.data -roots: -.word 85764716 /// zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 2534356 /// zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 
// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 23318782 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31 -.word 58369496 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31 -.word 1419579322 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31 -.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 65038662 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31 -.word 1581777230 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31 -.word 9497777 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_88299073_9670361_incomplete_good, %function -.global ntt_192_u32_88299073_9670361_incomplete_good -ntt_192_u32_88299073_9670361_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE 
vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -88299073 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[68] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, 
#(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 
-vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 
Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[176]: Already 
loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 
-// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release input[48] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release input[96] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// input[84]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r14,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release input[144] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r0,#(384)] -// input[84]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[180] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release input[84] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-288)] 
-vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[168]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[168] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[156]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r0,#(288)] -// Release input[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-336)] -// input[156]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[108]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[108] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[16]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 16)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release input[156] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(432)] -// input[16]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release input[160] 
from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[148]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release input[16] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-368)] -// input[148]: Already loaded as Q7 -// input[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[52] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release input[148] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[184] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[40] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(160)] -// input[28]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 
Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release input[172] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[80]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 80)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release input[28] from Q7 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-320)] -// input[80]: Already loaded as Q6 -// input[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release input[176] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[32] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release input[80] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(128)] -// input[20]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 
Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release input[20] from Q7 -vstrw.u32 Q3, [r0,#(272)] -// Release input[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[104]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release input[56] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[104] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(416)] -// input[92]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[44]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[12]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release input[44] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[92] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(176)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 
Q5, Q5, r9 -// input[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(48)] -// Release input[12] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(-480)] -// Release input[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r9 -// input[4]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 4)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[140]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -112)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(304)] -// Release input[76] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(16)] -// Release input[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r9 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 
Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[156]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-448)] -// Release input[140] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[144]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -108)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[28]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-384)] -// Release input[156] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-432)] -// Release input[144] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, 
Q1 -vmla.s32 Q0, Q3, r12 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(112)] -// Release input[28] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-416)] -// Release input[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[108]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[168]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[96]: 
Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(432)] -// Release input[108] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-336)] -// Release input[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[40]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[44]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 44)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(160)] -// Release input[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, 
r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(176)] -// Release input[44] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[48]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 48)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[184]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r9 -// input[52]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 52)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(192)] -// Release input[48] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 
Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-272)] -// Release input[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(208)] -// Release input[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[56]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r9 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(224)] -// Release input[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -.equ modulus_inv, 2228766271 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1357 -// Instruction count: 998 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s b/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s deleted file mode 100644 index 4e2abfa..0000000 --- a/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_bitrev.s +++ /dev/null @@ -1,1285 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT 
-/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 2534356 /// zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 85764716 /// zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 24724272 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 22179761 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 53160974 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 22179761 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 78801296 // zeta^ 60 * 2^31 = 9670361^ 60 * 2^31 = 78801296 * 2^31 -.word 1916492312 // 
zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 60 * 2066201025 * 2^31 -.word 23260411 // zeta^108 * 2^31 = 9670361^108 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 9670361^108 * 2066201025 * 2^31 -.word 53160974 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 29929577 // zeta^180 * 2^31 = 9670361^180 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 9670361^180 * 2066201025 * 2^31 -.word 64980291 // zeta^ 36 * 2^31 = 9670361^ 36 * 2^31 = 64980291 * 2^31 -.word 1580357614 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 36 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_88299073_9670361_incomplete_good_bitrev, %function -.global ntt_192_u32_88299073_9670361_incomplete_good_bitrev -ntt_192_u32_88299073_9670361_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -.equ modulus, -88299073 -movw r12, #:lower16:modulus -movt r12, #:upper16:modulus -ldr r11, roots_addr -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r10 -vadd.s32 Q5, Q0, Q1 -// Release input[64] from Q0 -vqrdmulh.s32 Q4, Q2, r9 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r12 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vadd.s32 Q6, Q4, Q3 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[160]: Already loaded as Q1 -// input[32]: 
Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r10 -vadd.s32 Q4, Q1, Q7 -// Release input[160] from Q1 -vqrdmulh.s32 Q3, Q0, r9 -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vmla.s32 Q2, Q3, r12 -vsub.s32 Q3, Q1, Q7 -// Release input[32] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q3, Q2 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r0,#(128)] -vadd.s32 Q4, Q4, Q1 -// Release input[96] from Q1 -vstrw.u32 Q4, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[112]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 
-// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[88]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[120]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[100]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release 
input[20] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[76]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 
Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[124]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r10 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r9 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r12 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r6 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[108]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 108)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r5 -vsub.s32 Q4, Q1, Q0 -// 
input[168]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s32 Q3, Q2, r12 -vstrw.u32 Q4, [r0,#(288)] -vadd.s32 Q1, Q1, Q0 -// Release input[72] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[168]: Already loaded as Q7 -// input[108]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r6 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release input[108] from Q6 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q7 -// Release input[168] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[120]: Already loaded as Q7 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[180]: Load as Q2 
-vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q7, Q7, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[136]: Already loaded as Q6 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[40]: Already loaded as Q7 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[88]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 88)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q7 -// 
Release input[40] from Q7 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[88]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[184]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release input[88] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[184]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q7 -// Release input[184] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, 
[r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[104]: Already loaded as Q7 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[152]: Already loaded as Q6 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r6 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q6 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[56]: Already loaded as Q7 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r6 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vqrdmulh.s32 Q0, Q0, r5 -vsub.s32 Q2, Q3, Q7 -// input[96]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 96)] -vmla.s32 Q1, Q0, r12 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[56] from Q7 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[48]: Already loaded as Q5 -vmul.u32 Q0, Q5, r10 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r9 -// input[96]: Already loaded as Q6 -vmla.s32 Q0, Q5, r12 -vmul.u32 Q2, Q1, r10 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r9 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r12 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r6 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r12 -// input[112]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 112)] -vmul.u32 Q4, Q6, r8 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r7 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vmla.s32 Q4, Q6, r12 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r0,#(384)] -// Release input[96] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[112]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q2, Q2, r9 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vmla.s32 Q1, Q2, r12 
-vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[176]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -76)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(448)] -// Release input[112] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[176]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q0, Q0, r9 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-304)] -// Release input[176] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r9 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r14,#(-496)] -// 
Release input[128] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[72]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 72)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[184]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r9 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(288)] -// Release input[72] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[56]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[56]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r9 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, 
Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(224)] -// Release input[56] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[180]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r9 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[52]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q2, Q2, r9 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// 
input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r9 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r12 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r10, r9, [r11], #+8 -ldrd r8, r7, [r11], #+8 -ldrd r6, r5, [r11], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r10 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r9 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q0, Q1, r12 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r10 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r12 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, 
#(4 * 12)] -vmul.u32 Q5, Q1, r6 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r5 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r12 -// input[124]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r10 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q2, Q2, r9 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q1, Q2, r12 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r10 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r12 -// input[76]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vmul.u32 Q5, Q2, r6 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r5 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r12 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r8 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(496)] -// Release input[124] from Q2 -vmla.s32 Q6, Q4, r12 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r10 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r9 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r12 -vstrw.u32 Q1, [r0,#(304)] -// Release input[76] from Q1 -vmul.u32 Q1, Q3, r10 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r9 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r12 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q5, Q0, r6 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r5 -vadd.s32 Q2, Q2, Q1 -vmla.s32 
Q5, Q0, r12 -vmul.u32 Q1, Q4, r8 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r7 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q1, Q4, r12 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -.equ modulus_inv, 2228766271 -movw r10, #:lower16:modulus_inv -movt r10, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1253 -// Instruction count: 895 \ No newline at end of file diff --git a/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s b/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s deleted file mode 100644 index e258229..0000000 --- a/tests/intmulntt/ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input.s +++ /dev/null @@ -1,1237 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_twiddles -ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_twiddles: // For base multiplication -.word 62163489 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 3607823391 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 31257951 // zeta^160 * 2^31 = 9670361^160 * 2^31 = 2534357 * 2^31 -.word 3835714657 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 9670361^160 * 2066201025 * 2^31 -.word 149330681 // zeta^ 80 * 2^31 = 9670361^ 80 * 2^31 = 5579523 * 2^31 -.word 1698183495 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 80 * 2066201025 * 2^31 -.word 12706985 // zeta^ 48 * 2^31 = 9670361^ 48 * 2^31 = 24724272 * 2^31 -.word 1188395927 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 48 * 2066201025 * 2^31 -.word 128346011 // zeta^136 * 2^31 = 9670361^136 * 2^31 = 41822566 * 2^31 -.word 3104825637 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 9670361^136 * 2066201025 * 2^31 -.word 75665841 // zeta^104 * 2^31 = 9670361^104 * 2^31 = 76960665 * 2^31 -.word 22036623 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 9670361^104 * 2066201025 * 2^31 -.word 153143541 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 415003211 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 63785177 // zeta^184 * 2^31 = 9670361^184 * 2^31 = 22220342 * 2^31 -.word 1820869479 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 9670361^184 * 2066201025 * 2^31 -.word 135693903 // zeta^ 68 * 2^31 = 9670361^ 68 * 2^31 = 55309930 * 2^31 -.word 887924593 // zeta^ 68 * f(q^(-1) mod 
2^32) * 2^31 = 9670361^ 68 * 2066201025 * 2^31 -.word 159293731 // zeta^ 36 * 2^31 = 9670361^ 36 * 2^31 = 64980291 * 2^31 -.word 4156635549 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 36 * 2066201025 * 2^31 -.word 96170719 // zeta^148 * 2^31 = 9670361^148 * 2^31 = 62204288 * 2^31 -.word 2461155041 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 9670361^148 * 2066201025 * 2^31 -.word 63279659 // zeta^116 * 2^31 = 9670361^116 * 2^31 = 32274711 * 2^31 -.word 1943976597 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 9670361^116 * 2066201025 * 2^31 -.word 53905215 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31 -.word 1284407937 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31 -.word 146011323 // zeta^172 * 2^31 = 9670361^172 * 2^31 = 41675533 * 2^31 -.word 1123834885 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 9670361^172 * 2066201025 * 2^31 -.word 24870969 // zeta^ 92 * 2^31 = 9670361^ 92 * 2^31 = 67630520 * 2^31 -.word 3062637575 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 92 * 2066201025 * 2^31 -.word 150915321 // zeta^ 60 * 2^31 = 9670361^ 60 * 2^31 = 78801296 * 2^31 -.word 3619654471 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 60 * 2066201025 * 2^31 -.word 145340195 // zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 459252637 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 119204611 // zeta^ 32 * 2^31 = 9670361^ 32 * 2^31 = 85764717 * 2^31 -.word 4067076029 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 32 * 2066201025 * 2^31 -.word 163891161 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 3106571367 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 48324623 // zeta^112 * 2^31 = 9670361^112 * 2^31 = 69154324 * 2^31 -.word 509787569 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 9670361^112 * 2066201025 * 2^31 -.word 100932305 // zeta^ 8 * 2^31 = 9670361^ 8 * 2^31 = 11338408 * 2^31 -.word 4272930671 // zeta^ 8 * f(q^(-1) mod 2^32) 
* 2^31 = 9670361^ 8 * 2066201025 * 2^31 -.word 140979243 // zeta^168 * 2^31 = 9670361^168 * 2^31 = 53160974 * 2^31 -.word 3082789013 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 9670361^168 * 2066201025 * 2^31 -.word 112812969 // zeta^ 88 * 2^31 = 9670361^ 88 * 2^31 = 66078731 * 2^31 -.word 2474097815 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 88 * 2066201025 * 2^31 -.word 1059291 // zeta^ 56 * 2^31 = 9670361^ 56 * 2^31 = 43898970 * 2^31 -.word 2889101029 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 56 * 2066201025 * 2^31 -.word 17304415 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31 -.word 138331745 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31 -.word 64699245 // zeta^100 * 2^31 = 9670361^100 * 2^31 = 78628712 * 2^31 -.word 1026256339 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 9670361^100 * 2066201025 * 2^31 -.word 113318487 // zeta^ 20 * 2^31 = 9670361^ 20 * 2^31 = 56024362 * 2^31 -.word 2350990697 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 20 * 2066201025 * 2^31 -.word 121190133 // zeta^180 * 2^31 = 9670361^180 * 2^31 = 29929577 * 2^31 -.word 517178443 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 9670361^180 * 2066201025 * 2^31 -.word 30586823 // zeta^ 76 * 2^31 = 9670361^ 76 * 2^31 = 46623540 * 2^31 -.word 3171132409 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 76 * 2066201025 * 2^31 -.word 172791111 // zeta^ 44 * 2^31 = 9670361^ 44 * 2^31 = 23363129 * 2^31 -.word 160573049 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 44 * 2066201025 * 2^31 -.word 25682825 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31 -.word 675312823 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31 -.word 138852867 // zeta^124 * 2^31 = 9670361^124 * 2^31 = 77128297 * 2^31 -.word 3737950397 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 9670361^124 * 2066201025 * 2^31 -.word 57393535 // zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 227891265 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 
9670361^128 * 2066201025 * 2^31 -.word 114434657 // zeta^ 96 * 2^31 = 9670361^ 96 * 2^31 = 88299072 * 2^31 -.word 687143903 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 96 * 2066201025 * 2^31 -.word 128273523 // zeta^ 16 * 2^31 = 9670361^ 16 * 2^31 = 19144749 * 2^31 -.word 3785179725 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 16 * 2066201025 * 2^31 -.word 27267465 // zeta^176 * 2^31 = 9670361^176 * 2^31 = 82719550 * 2^31 -.word 2596783799 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 9670361^176 * 2066201025 * 2^31 -.word 35618903 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 1212178281 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 48252135 // zeta^ 40 * 2^31 = 9670361^ 40 * 2^31 = 46476507 * 2^31 -.word 1190141657 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 40 * 2066201025 * 2^31 -.word 175538855 // zeta^152 * 2^31 = 9670361^152 * 2^31 = 44400103 * 2^31 -.word 1405866265 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 9670361^152 * 2066201025 * 2^31 -.word 23454605 // zeta^120 * 2^31 = 9670361^120 * 2^31 = 22179761 * 2^31 -.word 3879964083 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 9670361^120 * 2066201025 * 2^31 -.word 111898901 // zeta^ 4 * 2^31 = 9670361^ 4 * 2^31 = 9670361 * 2^31 -.word 3268710955 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 4 * 2066201025 * 2^31 -.word 40904243 // zeta^164 * 2^31 = 9670361^164 * 2^31 = 32989143 * 2^31 -.word 3407042701 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 9670361^164 * 2066201025 * 2^31 -.word 55408013 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31 -.word 3777788851 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31 -.word 80427427 // zeta^ 52 * 2^31 = 9670361^ 52 * 2^31 = 26094785 * 2^31 -.word 1833812253 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 52 * 2066201025 * 2^31 -.word 3807035 // zeta^140 * 2^31 = 9670361^140 * 2^31 = 64935944 * 2^31 -.word 4134394245 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 
9670361^140 * 2066201025 * 2^31 -.word 122692931 // zeta^108 * 2^31 = 9670361^108 * 2^31 = 23260411 * 2^31 -.word 3010559357 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 9670361^108 * 2066201025 * 2^31 -.word 37745279 // zeta^ 28 * 2^31 = 9670361^ 28 * 2^31 = 11170776 * 2^31 -.word 557016897 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 28 * 2066201025 * 2^31 -.word 151727177 // zeta^188 * 2^31 = 9670361^188 * 2^31 = 20668553 * 2^31 -.word 1232329719 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 9670361^188 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_scale -ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 62163489 // 1/48 -.word 3607823391 // 1/48 twisted -.data -roots: -.word 85764716 /// zeta^ 64 * 2^31 = 9670361^ 64 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 64 * 2066201025 * 2^31 -.word 2534356 /// zeta^128 * 2^31 = 9670361^128 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 9670361^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 9670361^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) 
mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 63574801 // zeta^144 * 2^31 = 9670361^144 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 9670361^144 * 2066201025 * 2^31 -.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 35138099 // zeta^ 72 * 2^31 = 9670361^ 72 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 72 * 2066201025 * 2^31 -.word 23318782 // zeta^132 * 2^31 = 9670361^132 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 9670361^132 * 2066201025 * 2^31 -.word 58369496 // zeta^ 84 * 2^31 = 9670361^ 84 * 2^31 = 58369496 * 2^31 -.word 1419579322 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 84 * 2066201025 * 2^31 -.word 66119312 // zeta^ 24 * 2^31 = 9670361^ 24 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 24 * 2066201025 * 2^31 -.word 65038662 // zeta^ 12 * 2^31 = 9670361^ 12 * 2^31 = 65038662 * 2^31 -.word 1581777230 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 9670361^ 12 * 2066201025 * 2^31 -.word 9497777 // zeta^156 * 2^31 = 9670361^156 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 9670361^156 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input, %function -.global ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input -ntt_192_u32_88299073_9670361_incomplete_good_oop_half_input: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 256 -add r14, r0, #256 -// Use r12 as marker for r0 + 512 -add r12, r14, #256 -// 
Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -.equ modulus, -88299073 -movw r10, #:lower16:modulus -movt r10, #:upper16:modulus -ldr r9, roots_addr -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r8 -vadd.s32 Q4, Q1, Q0 -// input[4]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r7 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r10 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r1,#(256)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(-496)] -// Release input[0] from Q1 -// Release input[64] from Q0 -// input[4]: Already loaded as Q6 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q6, Q7 -vmul.u32 Q1, Q0, r8 -// input[72]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 72)] -vadd.s32 Q2, Q6, Q7 -vqrdmulh.s32 Q0, Q0, r7 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q6 -// Release input[4] from Q6 -vstrw.u32 Q2, [r11,#(-480)] -vsub.s32 Q5, Q1, Q7 -// Release input[68] from Q7 -vstrw.u32 Q5, [r1,#(16)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(272)] -// input[8]: Already loaded as Q4 -// input[72]: Already loaded as Q3 -vmul.u32 Q0, Q4, r8 -vadd.s32 Q2, Q3, Q4 -// input[12]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 12)] -vqrdmulh.s32 Q1, Q4, r7 -// input[76]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 76)] -vsub.s32 Q5, Q3, Q4 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(288)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r1,#(32)] -vsub.s32 Q5, Q5, Q0 -vstrw.u32 Q5, [r11,#(-464)] -// Release input[72] from Q3 -// Release input[8] from Q4 -// input[76]: Already loaded as Q7 -// input[12]: Already loaded as Q6 -vmul.u32 Q0, Q7, r8 -vadd.s32 Q2, Q6, Q7 -// input[16]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 16)] -vqrdmulh.s32 Q1, Q7, r7 -// input[80]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 80)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r10 -vstrw.u32 
Q2, [r1,#(48)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release input[12] from Q6 -// Release input[76] from Q7 -// input[16]: Already loaded as Q4 -// input[80]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r8 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r7 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q4 -// Release input[16] from Q4 -vstrw.u32 Q2, [r11,#(-432)] -vsub.s32 Q4, Q1, Q5 -// Release input[80] from Q5 -vstrw.u32 Q4, [r1,#(64)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(320)] -// input[20]: Already loaded as Q6 -// input[84]: Already loaded as Q3 -vmul.u32 Q0, Q6, r8 -vadd.s32 Q2, Q3, Q6 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vqrdmulh.s32 Q1, Q6, r7 -// input[88]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 88)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r1,#(80)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(-416)] -// Release input[84] from Q3 -// Release input[20] from Q6 -// input[88]: Already loaded as Q7 -// input[24]: Already loaded as Q5 -vmul.u32 Q0, Q7, r8 -vadd.s32 Q2, Q5, Q7 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmulh.s32 Q1, Q7, r7 -// input[92]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 92)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(-400)] -// Release input[24] from Q5 -// Release input[88] from Q7 -// input[28]: Already loaded as Q4 -// input[92]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r8 -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r7 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vmla.s32 Q1, Q0, r10 -vneg.s32 Q0, Q4 -// Release input[28] from Q4 -vstrw.u32 Q2, [r11,#(-384)] -vsub.s32 Q4, Q1, Q6 -// Release input[92] from Q6 
-vstrw.u32 Q4, [r1,#(112)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(368)] -// input[32]: Already loaded as Q3 -vmul.u32 Q0, Q3, r8 -vneg.s32 Q1, Q3 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmulh.s32 Q2, Q3, r7 -vstrw.u32 Q3, [r1,#(384)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(128)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-368)] -// Release input[32] from Q3 -// input[36]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(144)] -vstrw.u32 Q4, [r1,#(400)] -vstrw.u32 Q4, [r11,#(-352)] -// Release input[36] from Q4 -// input[40]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 40)] -vmul.u32 Q1, Q0, r8 -vneg.s32 Q2, Q0 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q0, r7 -vstrw.u32 Q0, [r11,#(-336)] -vmla.s32 Q1, Q3, r10 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r1,#(416)] -// Release input[40] from Q0 -// input[44]: Already loaded as Q4 -vmul.u32 Q0, Q4, r8 -vneg.s32 Q1, Q4 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q2, Q4, r7 -vstrw.u32 Q4, [r1,#(432)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(176)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-320)] -// Release input[44] from Q4 -// input[48]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(192)] -vstrw.u32 Q3, [r1,#(448)] -vstrw.u32 Q3, [r11,#(-304)] -// Release input[48] from Q3 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q1, Q0, r8 -vneg.s32 Q2, Q0 -// input[56]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q0, r7 -vstrw.u32 Q0, [r11,#(-288)] -vmla.s32 Q1, Q3, r10 -vstrw.u32 Q1, [r1,#(208)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r1,#(464)] -// Release input[52] from Q0 -// input[56]: Already loaded as Q4 -vmul.u32 Q0, Q4, r8 -vneg.s32 Q1, Q4 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmulh.s32 Q2, Q4, r7 -vstrw.u32 Q4, [r1,#(480)] -vmla.s32 Q0, Q2, r10 -vstrw.u32 Q0, [r1,#(224)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-272)] -// Release input[56] from Q4 -// input[60]: Already 
loaded as Q3 -vstrw.u32 Q3, [r1,#(240)] -vstrw.u32 Q3, [r1,#(496)] -vstrw.u32 Q3, [r11,#(-256)] -// Release input[60] from Q3 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -// output[48]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 48)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r4 -// output[96]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 96)] -vadd.s32 Q0, Q0, Q1 -// Release output[48] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[180]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -72)] -vadd.s32 Q1, Q1, Q4 -// Release output[96] from Q4 -vqrdmulh.s32 Q2, Q2, r3 -vsub.s32 Q4, Q1, Q0 -// output[84]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 84)] -vmla.s32 Q3, Q2, r10 -vstrw.u32 Q4, [r11,#(-432)] -vadd.s32 Q1, Q1, Q0 -// Release output[144] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r1,#(384)] -// output[84]: Already loaded as Q7 -// output[180]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r4 -// output[36]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 36)] -vadd.s32 Q7, Q7, Q6 -// Release output[180] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[36] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[24]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 24)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(336)] -vadd.s32 Q3, Q3, Q7 -// Release output[84] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-288)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(144)] -// output[24]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[168]: 
Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -84)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[72]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 72)] -vsub.s32 Q4, Q3, Q2 -// output[60]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release output[168] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[156]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -96)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release output[24] from Q6 -vstrw.u32 Q3, [r1,#(288)] -// Release output[72] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-336)] -// output[156]: Already loaded as Q7 -// output[60]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[108]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 108)] -vadd.s32 Q7, Q7, Q5 -// Release output[60] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[108] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[16]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 16)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q3, Q3, Q7 -// Release output[156] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(432)] -// output[16]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[64]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// output[52]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 52)] -vadd.s32 Q3, Q3, Q2 -// Release output[160] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[148]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -104)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, 
[r1,#(64)] -vadd.s32 Q3, Q3, Q6 -// Release output[16] from Q6 -vstrw.u32 Q3, [r1,#(256)] -// Release output[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-368)] -// output[148]: Already loaded as Q7 -// output[52]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[100]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release output[52] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[184]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -68)] -vadd.s32 Q3, Q3, Q2 -// Release output[100] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[88]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 88)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-416)] -vadd.s32 Q3, Q3, Q7 -// Release output[148] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(208)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(400)] -// output[88]: Already loaded as Q6 -// output[184]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release output[184] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[40] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[28]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 28)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(352)] -vadd.s32 Q3, Q3, Q6 -// Release output[88] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-272)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(160)] -// output[28]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vadd.s32 Q7, Q7, Q5 -// Release 
output[124] from Q5 -// output[76]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 76)] -vsub.s32 Q4, Q3, Q2 -// output[176]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -76)] -vadd.s32 Q3, Q3, Q2 -// Release output[172] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[80]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 80)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(112)] -vadd.s32 Q3, Q3, Q7 -// Release output[28] from Q7 -vstrw.u32 Q3, [r1,#(304)] -// Release output[76] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-320)] -// output[80]: Already loaded as Q6 -// output[176]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vadd.s32 Q6, Q6, Q5 -// Release output[176] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[32] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[20]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 20)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q6 -// Release output[80] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(128)] -// output[20]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[68]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// output[56]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 56)] -vadd.s32 Q3, Q3, Q2 -// Release output[164] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[152]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -100)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q7 -// Release output[20] from Q7 
-vstrw.u32 Q3, [r1,#(272)] -// Release output[68] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-352)] -// output[152]: Already loaded as Q6 -// output[56]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r4 -// output[104]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 104)] -vadd.s32 Q6, Q6, Q5 -// Release output[56] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[188]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release output[104] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q6 -// output[92]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 92)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release output[152] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(224)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(416)] -// output[92]: Already loaded as Q7 -// output[188]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r4 -// output[44]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 44)] -vadd.s32 Q7, Q7, Q5 -// Release output[188] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[12]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 12)] -vadd.s32 Q3, Q3, Q2 -// Release output[44] from Q2 -vqrdmulh.s32 Q0, Q0, r3 -vsub.s32 Q2, Q3, Q7 -// output[132]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -120)] -vmla.s32 Q1, Q0, r10 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release output[92] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(176)] -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[12]: Already loaded as Q5 -vmul.u32 Q0, Q5, r8 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q5, Q5, r7 -// output[132]: Already loaded as Q6 -vmla.s32 Q0, Q5, r10 
-vmul.u32 Q2, Q1, r8 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r7 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r10 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r4 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r10 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vmul.u32 Q4, Q6, r6 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r5 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(48)] -// Release output[12] from Q5 -vmla.s32 Q4, Q6, r10 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(-480)] -// Release output[132] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[76]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vqrdmulh.s32 Q2, Q2, r7 -// output[4]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 4)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[140]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -112)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(304)] -// Release output[76] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(16)] -// Release output[4] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[140]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vqrdmulh.s32 Q0, Q0, r7 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// 
output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[156]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -96)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-448)] -// Release output[140] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[156]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vqrdmulh.s32 Q1, Q1, r7 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[144]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -108)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[28]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-384)] -// Release output[156] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[28]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[88]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 88)] -vqrdmulh.s32 Q2, Q2, r7 -// output[148]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -104)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r11,#(-432)] -// Release output[144] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[16]: Load as 
Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(112)] -// Release output[28] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(352)] -// Release output[88] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-416)] -// Release output[148] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vqrdmulh.s32 Q0, Q0, r7 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -// output[108]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 108)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[108]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[168]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -84)] -vqrdmulh.s32 Q1, Q1, r7 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 
96)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[172]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -80)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(432)] -// Release output[108] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-336)] -// Release output[168] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[172]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[40]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 40)] -vqrdmulh.s32 Q2, Q2, r7 -// output[100]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 100)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 -// output[44]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 44)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-320)] -// Release output[172] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(160)] -// Release output[40] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(400)] -// Release output[100] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[44]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[104]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 104)] -vqrdmulh.s32 Q0, Q0, r7 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 
-vmla.s32 Q5, Q0, r10 -// output[60]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(176)] -// Release output[44] from Q0 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(416)] -// Release output[104] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r8, r7, [r9], #+8 -ldrd r6, r5, [r9], #+8 -ldrd r4, r3, [r9], #+8 -// output[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r8 -// output[120]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 120)] -vqrdmulh.s32 Q1, Q1, r7 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q1, r10 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vmul.u32 Q2, Q3, r8 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r10 -// output[48]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 48)] -vmul.u32 Q5, Q1, r4 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r3 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r10 -// output[124]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(240)] -// Release output[60] from Q1 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r1,#(480)] -// Release output[120] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[124]: Already loaded as Q2 -vmul.u32 Q1, Q2, r8 -// output[184]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -68)] -vqrdmulh.s32 Q2, Q2, r7 -// output[52]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 52)] -vmla.s32 Q1, Q2, r10 -vstrw.u32 Q0, [r1,#(192)] -// Release output[48] from Q0 -vmul.u32 Q0, Q3, r8 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r10 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vmul.u32 Q5, Q2, r4 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r3 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r10 
-// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r6 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(496)] -// Release output[124] from Q2 -vmla.s32 Q6, Q4, r10 -vstrw.u32 Q3, [r11,#(-272)] -// Release output[184] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(208)] -// Release output[52] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r8 -// output[56]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r7 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q2, Q0, r10 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vmul.u32 Q1, Q3, r8 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r7 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r10 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vmul.u32 Q5, Q0, r4 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r3 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r10 -vmul.u32 Q1, Q4, r6 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r5 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q1, Q4, r10 -vstrw.u32 Q3, [r1,#(224)] -// Release output[56] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -.equ modulus_inv, 2228766271 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 1204 -// Instruction count: 900 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good.s b/tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good.s deleted file mode 100644 index ff18dcc..0000000 --- a/tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// 
Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -.global ntt_384_u32_106117153_1392340_incomplete_good_twiddles -ntt_384_u32_106117153_1392340_incomplete_good_twiddles: // For base multiplication -.word 37890045 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 3768695715 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 141040949 // zeta^ 64 * 2^31 = 1392340^ 64 * 2^31 = 51456573 * 2^31 -.word 1866321259 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 64 * 2586463201 * 2^31 -.word 72909029 // zeta^ 32 * 2^31 = 1392340^ 32 * 2^31 = 71252774 * 2^31 -.word 1491181499 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 32 * 2586463201 * 2^31 -.word 136914403 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31 -.word 8270461 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31 -.word 195316801 // zeta^ 16 * 2^31 = 1392340^ 16 * 2^31 = 68534739 * 2^31 -.word 93843423 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 16 * 2586463201 * 2^31 -.word 164989409 // zeta^ 80 * 2^31 = 1392340^ 80 * 2^31 = 254604 * 2^31 -.word 3540182591 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 80 * 2586463201 * 2^31 -.word 68592855 // zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31 -.word 1686643209 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31 -.word 175196685 // zeta^112 * 2^31 = 1392340^112 * 2^31 = 89497534 * 2^31 -.word 2845449107 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 1392340^112 * 2586463201 * 2^31 -.word 151765235 // zeta^ 8 * 2^31 = 1392340^ 8 * 2^31 = 62524596 * 2^31 -.word 3309713773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 8 * 2586463201 * 2^31 -.word 126514603 // zeta^ 72 * 2^31 = 1392340^ 72 * 2^31 = 101908685 * 2^31 -.word 2730982325 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 72 * 2586463201 * 2^31 -.word 76763575 // zeta^ 40 * 2^31 = 1392340^ 40 * 2^31 = 54230858 * 2^31 -.word 1401852201 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 40 * 2586463201 * 2^31 -.word 203981715 // 
zeta^104 * 2^31 = 1392340^104 * 2^31 = 88837097 * 2^31 -.word 1840925389 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 1392340^104 * 2586463201 * 2^31 -.word 16429529 // zeta^ 24 * 2^31 = 1392340^ 24 * 2^31 = 87659826 * 2^31 -.word 4193590599 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 24 * 2586463201 * 2^31 -.word 127555815 // zeta^ 88 * 2^31 = 1392340^ 88 * 2^31 = 59766995 * 2^31 -.word 2372284409 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 88 * 2586463201 * 2^31 -.word 209645089 // zeta^ 56 * 2^31 = 1392340^ 56 * 2^31 = 20339234 * 2^31 -.word 3913484799 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 56 * 2586463201 * 2^31 -.word 186595525 // zeta^120 * 2^31 = 1392340^120 * 2^31 = 66124790 * 2^31 -.word 1100702683 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1392340^120 * 2586463201 * 2^31 -.word 992809 // zeta^ 4 * 2^31 = 1392340^ 4 * 2^31 = 1392340 * 2^31 -.word 2512876279 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 4 * 2586463201 * 2^31 -.word 53454909 // zeta^ 68 * 2^31 = 1392340^ 68 * 2^31 = 49002870 * 2^31 -.word 4040246115 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 68 * 2586463201 * 2^31 -.word 48183541 // zeta^ 36 * 2^31 = 1392340^ 36 * 2^31 = 9948684 * 2^31 -.word 1508714923 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 36 * 2586463201 * 2^31 -.word 105529301 // zeta^100 * 2^31 = 1392340^100 * 2^31 = 14737829 * 2^31 -.word 488144587 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 1392340^100 * 2586463201 * 2^31 -.word 11656863 // zeta^ 20 * 2^31 = 1392340^ 20 * 2^31 = 37124223 * 2^31 -.word 459063617 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 20 * 2586463201 * 2^31 -.word 108201343 // zeta^ 84 * 2^31 = 1392340^ 84 * 2^31 = 64042340 * 2^31 -.word 1433794145 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 84 * 2586463201 * 2^31 -.word 93085077 // zeta^ 52 * 2^31 = 1392340^ 52 * 2^31 = 56731543 * 2^31 -.word 63248651 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 52 * 2586463201 * 2^31 -.word 48800199 // zeta^116 * 
2^31 = 1392340^116 * 2^31 = 64416179 * 2^31 -.word 159286041 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 1392340^116 * 2586463201 * 2^31 -.word 161225519 // zeta^ 12 * 2^31 = 1392340^ 12 * 2^31 = 61070877 * 2^31 -.word 371152561 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 12 * 2586463201 * 2^31 -.word 157992763 // zeta^ 76 * 2^31 = 1392340^ 76 * 2^31 = 64736387 * 2^31 -.word 1125817381 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 76 * 2586463201 * 2^31 -.word 117865359 // zeta^ 44 * 2^31 = 1392340^ 44 * 2^31 = 26493417 * 2^31 -.word 2711913041 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 44 * 2586463201 * 2^31 -.word 58891053 // zeta^108 * 2^31 = 1392340^108 * 2^31 = 16694344 * 2^31 -.word 526216819 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1392340^108 * 2586463201 * 2^31 -.word 134087109 // zeta^ 28 * 2^31 = 1392340^ 28 * 2^31 = 46852595 * 2^31 -.word 3270097627 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 28 * 2586463201 * 2^31 -.word 106564557 // zeta^ 92 * 2^31 = 1392340^ 92 * 2^31 = 73724383 * 2^31 -.word 3351548371 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 92 * 2586463201 * 2^31 -.word 47641089 // zeta^ 60 * 2^31 = 1392340^ 60 * 2^31 = 68915062 * 2^31 -.word 973406751 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 60 * 2586463201 * 2^31 -.word 16048813 // zeta^124 * 2^31 = 1392340^124 * 2^31 = 99228576 * 2^31 -.word 670701299 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 1392340^124 * 2586463201 * 2^31 -.word 209268057 // zeta^128 * 2^31 = 1392340^128 * 2^31 = 51456572 * 2^31 -.word 2392592839 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1392340^128 * 2586463201 * 2^31 -.word 174344261 // zeta^192 * 2^31 = 1392340^192 * 2^31 = 106117152 * 2^31 -.word 526271579 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 1392340^192 * 2586463201 * 2^31 -.word 170122527 // zeta^160 * 2^31 = 1392340^160 * 2^31 = 91733486 * 2^31 -.word 2812056257 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 1392340^160 * 2586463201 * 2^31 -.word 139325277 // zeta^224 * 
2^31 = 1392340^224 * 2^31 = 34864379 * 2^31 -.word 2803785795 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 1392340^224 * 2586463201 * 2^31 -.word 75789761 // zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31 -.word 3446339167 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31 -.word 16917505 // zeta^208 * 2^31 = 1392340^208 * 2^31 = 37582414 * 2^31 -.word 4201123871 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 1392340^208 * 2586463201 * 2^31 -.word 486677 // zeta^176 * 2^31 = 1392340^176 * 2^31 = 53294696 * 2^31 -.word 1158805899 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 1392340^176 * 2586463201 * 2^31 -.word 143641451 // zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31 -.word 2608324085 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31 -.word 80866521 // zeta^136 * 2^31 = 1392340^136 * 2^31 = 39384089 * 2^31 -.word 3716235847 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 1392340^136 * 2586463201 * 2^31 -.word 60469071 // zeta^200 * 2^31 = 1392340^200 * 2^31 = 43592557 * 2^31 -.word 985253521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 1392340^200 * 2586463201 * 2^31 -.word 21100987 // zeta^168 * 2^31 = 1392340^168 * 2^31 = 34606239 * 2^31 -.word 439073189 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1392340^168 * 2586463201 * 2^31 -.word 135470731 // zeta^232 * 2^31 = 1392340^232 * 2^31 = 51886295 * 2^31 -.word 2893115093 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 1392340^232 * 2586463201 * 2^31 -.word 5009133 // zeta^152 * 2^31 = 1392340^152 * 2^31 = 78224322 * 2^31 -.word 2473661107 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 1392340^152 * 2586463201 * 2^31 -.word 195804777 // zeta^216 * 2^31 = 1392340^216 * 2^31 = 18457327 * 2^31 -.word 101376695 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 1392340^216 * 2586463201 * 2^31 -.word 83067589 // zeta^184 * 2^31 = 1392340^184 * 2^31 = 45785556 * 2^31 -.word 1482185179 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 1392340^184 * 2586463201 * 2^31 -.word 2589217 // zeta^248 * 2^31 = 
1392340^248 * 2^31 = 85777919 * 2^31 -.word 381482495 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 1392340^248 * 2586463201 * 2^31 -.word 158579253 // zeta^132 * 2^31 = 1392340^132 * 2^31 = 47610530 * 2^31 -.word 1527369835 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1392340^132 * 2586463201 * 2^31 -.word 211241497 // zeta^196 * 2^31 = 1392340^196 * 2^31 = 104724813 * 2^31 -.word 1782091015 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 1392340^196 * 2586463201 * 2^31 -.word 163462913 // zeta^164 * 2^31 = 1392340^164 * 2^31 = 4789145 * 2^31 -.word 3274396959 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 1392340^164 * 2586463201 * 2^31 -.word 164050765 // zeta^228 * 2^31 = 1392340^228 * 2^31 = 96168469 * 2^31 -.word 2786252371 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 1392340^228 * 2586463201 * 2^31 -.word 202661633 // zeta^148 * 2^31 = 1392340^148 * 2^31 = 26918117 * 2^31 -.word 974730527 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 1392340^148 * 2586463201 * 2^31 -.word 200577443 // zeta^212 * 2^31 = 1392340^212 * 2^31 = 68992930 * 2^31 -.word 3835903677 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 1392340^212 * 2586463201 * 2^31 -.word 61832275 // zeta^180 * 2^31 = 1392340^180 * 2^31 = 7684636 * 2^31 -.word 96037389 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1392340^180 * 2586463201 * 2^31 -.word 119149229 // zeta^244 * 2^31 = 1392340^244 * 2^31 = 49385610 * 2^31 -.word 4231718643 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 1392340^244 * 2586463201 * 2^31 -.word 102884397 // zeta^140 * 2^31 = 1392340^140 * 2^31 = 3665510 * 2^31 -.word 754664819 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 1392340^140 * 2586463201 * 2^31 -.word 51008787 // zeta^204 * 2^31 = 1392340^204 * 2^31 = 45046276 * 2^31 -.word 3923814733 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 1392340^204 * 2586463201 * 2^31 -.word 47142847 // zeta^172 * 2^31 = 1392340^172 * 2^31 = 96318080 * 2^31 -.word 2109271073 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 1392340^172 * 2586463201 * 2^31 -.word 94368947 // zeta^236 * 2^31 = 
1392340^236 * 2^31 = 79623736 * 2^31 -.word 1583054253 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 1392340^236 * 2586463201 * 2^31 -.word 78594601 // zeta^156 * 2^31 = 1392340^156 * 2^31 = 26871788 * 2^31 -.word 81450743 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1392340^156 * 2586463201 * 2^31 -.word 78147197 // zeta^220 * 2^31 = 1392340^220 * 2^31 = 59264558 * 2^31 -.word 1024869667 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 1392340^220 * 2586463201 * 2^31 -.word 74524877 // zeta^188 * 2^31 = 1392340^188 * 2^31 = 30313514 * 2^31 -.word 3992261843 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 1392340^188 * 2586463201 * 2^31 -.word 164593217 // zeta^252 * 2^31 = 1392340^252 * 2^31 = 37202091 * 2^31 -.word 3321560543 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 1392340^252 * 2586463201 * 2^31 -.word 71193357 // zeta^256 * 2^31 = 1392340^256 * 2^31 = 54660580 * 2^31 -.word 2428646035 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 1392340^256 * 2586463201 * 2^31 -.word 2966249 // zeta^320 * 2^31 = 1392340^320 * 2^31 = 54660581 * 2^31 -.word 1902374455 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 1392340^320 * 2586463201 * 2^31 -.word 75319903 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31 -.word 4286696833 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31 -.word 42111779 // zeta^352 * 2^31 = 1392340^352 * 2^31 = 14383667 * 2^31 -.word 1482911037 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 1392340^352 * 2586463201 * 2^31 -.word 47244897 // zeta^272 * 2^31 = 1392340^272 * 2^31 = 105862549 * 2^31 -.word 754784703 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 1392340^272 * 2586463201 * 2^31 -.word 136444545 // zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31 -.word 848628127 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31 -.word 37037621 // zeta^304 * 2^31 = 1392340^304 * 2^31 = 16619619 * 2^31 -.word 1449518187 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 1392340^304 * 2586463201 * 2^31 -.word 211747629 // zeta^368 * 2^31 = 
1392340^368 * 2^31 = 52822457 * 2^31 -.word 3136161395 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 1392340^368 * 2586463201 * 2^31 -.word 85719703 // zeta^264 * 2^31 = 1392340^264 * 2^31 = 4208468 * 2^31 -.word 1563984969 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 1392340^264 * 2586463201 * 2^31 -.word 131367785 // zeta^328 * 2^31 = 1392340^328 * 2^31 = 66733064 * 2^31 -.word 578731447 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 1392340^328 * 2586463201 * 2^31 -.word 8252591 // zeta^296 * 2^31 = 1392340^296 * 2^31 = 17280056 * 2^31 -.word 2454041905 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 1392340^296 * 2586463201 * 2^31 -.word 191133319 // zeta^360 * 2^31 = 1392340^360 * 2^31 = 71510914 * 2^31 -.word 3855894105 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 1392340^360 * 2586463201 * 2^31 -.word 84678491 // zeta^280 * 2^31 = 1392340^280 * 2^31 = 46350158 * 2^31 -.word 1922682885 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 1392340^280 * 2586463201 * 2^31 -.word 207225173 // zeta^344 * 2^31 = 1392340^344 * 2^31 = 27892831 * 2^31 -.word 1821306187 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 1392340^344 * 2586463201 * 2^31 -.word 25638781 // zeta^312 * 2^31 = 1392340^312 * 2^31 = 39992363 * 2^31 -.word 3194264611 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 1392340^312 * 2586463201 * 2^31 -.word 129166717 // zeta^376 * 2^31 = 1392340^376 * 2^31 = 60331597 * 2^31 -.word 2812782115 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 1392340^376 * 2586463201 * 2^31 -.word 158779397 // zeta^260 * 2^31 = 1392340^260 * 2^31 = 57114283 * 2^31 -.word 254721179 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 1392340^260 * 2586463201 * 2^31 -.word 53655053 // zeta^324 * 2^31 = 1392340^324 * 2^31 = 58506623 * 2^31 -.word 2767597459 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 1392340^324 * 2586463201 * 2^31 -.word 106705005 // zeta^292 * 2^31 = 1392340^292 * 2^31 = 91379324 * 2^31 -.word 3806822707 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 1392340^292 * 2586463201 * 2^31 -.word 48771393 // zeta^356 * 2^31 = 
1392340^356 * 2^31 = 101328008 * 2^31 -.word 1020570335 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 1392340^356 * 2586463201 * 2^31 -.word 104032963 // zeta^276 * 2^31 = 1392340^276 * 2^31 = 42074813 * 2^31 -.word 2861173149 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 1392340^276 * 2586463201 * 2^31 -.word 9572673 // zeta^340 * 2^31 = 1392340^340 * 2^31 = 79199036 * 2^31 -.word 3320236767 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 1392340^340 * 2586463201 * 2^31 -.word 163434107 // zeta^308 * 2^31 = 1392340^308 * 2^31 = 41700974 * 2^31 -.word 4135681253 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 1392340^308 * 2586463201 * 2^31 -.word 150402031 // zeta^372 * 2^31 = 1392340^372 * 2^31 = 98432517 * 2^31 -.word 4198929905 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 1392340^372 * 2586463201 * 2^31 -.word 54241543 // zeta^268 * 2^31 = 1392340^268 * 2^31 = 41380766 * 2^31 -.word 3169149913 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 1392340^268 * 2586463201 * 2^31 -.word 109349909 // zeta^332 * 2^31 = 1392340^332 * 2^31 = 102451643 * 2^31 -.word 3540302475 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 1392340^332 * 2586463201 * 2^31 -.word 153343253 // zeta^300 * 2^31 = 1392340^300 * 2^31 = 89422809 * 2^31 -.word 3768750475 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 1392340^300 * 2586463201 * 2^31 -.word 165091459 // zeta^364 * 2^31 = 1392340^364 * 2^31 = 9799073 * 2^31 -.word 2185696221 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 1392340^364 * 2586463201 * 2^31 -.word 105669749 // zeta^284 * 2^31 = 1392340^284 * 2^31 = 32392770 * 2^31 -.word 943418923 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 1392340^284 * 2586463201 * 2^31 -.word 133639705 // zeta^348 * 2^31 = 1392340^348 * 2^31 = 79245365 * 2^31 -.word 4213516551 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 1392340^348 * 2586463201 * 2^31 -.word 196185493 // zeta^316 * 2^31 = 1392340^316 * 2^31 = 6888577 * 2^31 -.word 3624265995 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 1392340^316 * 2586463201 * 2^31 -.word 137709429 // zeta^380 * 2^31 = 
1392340^380 * 2^31 = 75803639 * 2^31 -.word 302705451 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 1392340^380 * 2586463201 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_106117153_1392340_incomplete_good_scale -ntt_384_u32_106117153_1392340_incomplete_good_scale: // Constants for scaling by 1/N -.word 37890045 // 1/96 -.word 3768695715 // 1/96 twisted -.data -roots: -.word 136304203 /// zeta^256 * 2^31 = 1392340^256 * 2^31 = 54660580 * 2^31 -.word 1106161429 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 1392340^256 * 2586463201 * 2^31 -.word 50789515 /// zeta^128 * 2^31 = 1392340^128 * 2^31 = 51456572 * 2^31 -.word 1041322197 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1392340^128 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 86500417 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 86500417 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31 -.word 86500417 // zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31 -.word 996628447 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31 -.word 3362131 // zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31 -.word 765704461 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 
2586463201 * 2^31 -.word 74219771 // zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31 -.word 732633701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31 -.word 3362131 // zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31 -.word 765704461 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31 -.word 207754911 // zeta^264 * 2^31 = 1392340^264 * 2^31 = 4208468 * 2^31 -.word 85166401 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 1392340^264 * 2586463201 * 2^31 -.word 86384727 // zeta^168 * 2^31 = 1392340^168 * 2^31 = 34606239 * 2^31 -.word 2847807113 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1392340^168 * 2586463201 * 2^31 -.word 74219771 // zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31 -.word 732633701 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31 -.word 77895747 // zeta^ 24 * 2^31 = 1392340^ 24 * 2^31 = 87659826 * 2^31 -.word 1773964317 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 24 * 2586463201 * 2^31 -.word 42168601 // zeta^312 * 2^31 = 1392340^312 * 2^31 = 39992363 * 2^31 -.word 2956805639 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 1392340^312 * 2586463201 * 2^31 -.word 131257741 // XX: zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31 -.word 2147483667 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31 -.word 86500417 // XX: zeta^288 * 2^31 = 1392340^288 * 2^31 = 49248046 * 2^31 -.word 996628447 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 1392340^288 * 2586463201 * 2^31 -.word 3362131 // XX: zeta^144 * 2^31 = 1392340^144 * 2^31 = 37837018 * 2^31 -.word 765704461 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 1392340^144 * 2586463201 * 2^31 -.word 74219771 // XX: zeta^ 48 * 2^31 = 1392340^ 48 * 2^31 = 36202838 * 2^31 -.word 732633701 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 48 * 2586463201 * 2^31 -.word 207754911 // XX: zeta^264 * 2^31 = 1392340^264 * 2^31 = 4208468 * 2^31 -.word 85166401 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 1392340^264 * 
2586463201 * 2^31 -.word 86384727 // XX: zeta^168 * 2^31 = 1392340^168 * 2^31 = 34606239 * 2^31 -.word 2847807113 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 1392340^168 * 2586463201 * 2^31 -.word 77895747 // XX: zeta^ 24 * 2^31 = 1392340^ 24 * 2^31 = 87659826 * 2^31 -.word 1773964317 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 24 * 2586463201 * 2^31 -.word 42168601 // XX: zeta^312 * 2^31 = 1392340^312 * 2^31 = 39992363 * 2^31 -.word 2956805639 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 1392340^312 * 2586463201 * 2^31 -.word 120907359 // XX: zeta^132 * 2^31 = 1392340^132 * 2^31 = 47610530 * 2^31 -.word 963490177 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 1392340^132 * 2586463201 * 2^31 -.word 76659711 // XX: zeta^ 36 * 2^31 = 1392340^ 36 * 2^31 = 9948684 * 2^31 -.word 201330657 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 36 * 2586463201 * 2^31 -.word 101045121 // XX: zeta^276 * 2^31 = 1392340^276 * 2^31 = 42074813 * 2^31 -.word 2998947999 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 1392340^276 * 2586463201 * 2^31 -.word 121674239 // XX: zeta^180 * 2^31 = 1392340^180 * 2^31 = 7684636 * 2^31 -.word 155513313 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 1392340^180 * 2586463201 * 2^31 -.word 137517881 // XX: zeta^ 12 * 2^31 = 1392340^ 12 * 2^31 = 61070877 * 2^31 -.word 3383369703 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 12 * 2586463201 * 2^31 -.word 131387629 // XX: zeta^300 * 2^31 = 1392340^300 * 2^31 = 89422809 * 2^31 -.word 3957125299 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 1392340^300 * 2586463201 * 2^31 -.word 87076127 // XX: zeta^156 * 2^31 = 1392340^156 * 2^31 = 26871788 * 2^31 -.word 543802049 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 1392340^156 * 2586463201 * 2^31 -.word 80366379 // XX: zeta^ 60 * 2^31 = 1392340^ 60 * 2^31 = 68915062 * 2^31 -.word 1394628149 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 60 * 2586463201 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type 
ntt_384_u32_106117153_1392340_incomplete_good, %function -.global ntt_384_u32_106117153_1392340_incomplete_good -ntt_384_u32_106117153_1392340_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 106117153 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r8 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 
-vmul.u32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vmul.u32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vmul.u32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vmul.u32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vmul.u32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vmul.u32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vmul.u32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vmul.u32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vmul.u32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vmul.u32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vmul.u32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vmul.u32 Q2, Q0, r8
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vmul.u32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vmul.u32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vmul.u32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vmul.u32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vmul.u32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vmul.u32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vmul.u32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vmul.u32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vmul.u32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vmul.u32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vmul.u32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vqrdmlah.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32
Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(368)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[24]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r8 -// input[264]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r11 -vqrdmulh.s32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmulh.s32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r11 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(96)] -// Release input[24] from Q5 -vqrdmlah.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(48)] -// Release input[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[156]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q2, Q2, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[280]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(48)] -// Release input[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[280]: Already loaded as Q0 -vqrdmulh.s32 
Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q0, Q0, r8 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q1, Q1, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[152]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] 
-vmul.u32 Q2, Q2, r8 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q0, Q0, r8 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] 
-vmul.u32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vmul.u32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vmul.u32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 
40)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmul.u32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vmul.u32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release 
input[292] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vmul.u32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vmul.u32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release 
input[164] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmul.u32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 
-vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// 
input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmul.u32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// 
input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vmul.u32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, 
Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmul.u32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 
-vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, 
[r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vmul.u32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release 
input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vmul.u32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// 
Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vmul.u32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 
-vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], 
#+8 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 
-vqrdmulh.s32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vmul.u32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, 
r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vmul.u32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// 
input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, 
[r14, #(4 * -4)]
-vmul.u32 Q3, Q3, r8
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(496)]
-// Release input[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-.equ modulus_inv, 1708504095
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3350
-// Instruction count: 2395
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s b/tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s
deleted file mode 100644
index 06677c5..0000000
--- a/tests/intmulntt/ntt_384_u32_106117153_1392340_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,3182 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 50789515 /// zeta^128 * 2^31 = 1392340^128 * 2^31 = 51456572 * 2^31
-.word 1041322197 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 1392340^128 * 2586463201 * 2^31
-.word 136304203 /// zeta^256 * 2^31 = 1392340^256 * 2^31 = 54660580 * 2^31
-.word 1106161429 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 1392340^256 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 125733889 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31
-.word 3298338847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 131257741 // zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 125733889 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31
-.word 3298338847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31
-.word 125733889 // zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31
-.word 3298338847 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31
-.word 138014535 // zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31
-.word 3562333593 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31
-.word 208872175 // zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31
-.word 3529262833 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31
-.word 138014535 // zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31
-.word 3562333593 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31
-.word 170065705 // zeta^120 * 2^31 = 1392340^120 * 2^31 = 66124790 * 2^31
-.word 1338161655 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1392340^120 * 2586463201 * 2^31
-.word 134338559 // zeta^216 * 2^31 = 1392340^216 * 2^31 = 18457327 * 2^31
-.word 2521002977 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 1392340^216 * 2586463201 * 2^31
-.word 208872175 // zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31
-.word 3529262833 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31
-.word 125849579 // zeta^360 * 2^31 = 1392340^360 * 2^31 = 71510914 * 2^31
-.word 1447160181 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 1392340^360 * 2586463201 * 2^31
-.word 4479395 // zeta^ 72 * 2^31 = 1392340^ 72 * 2^31 = 101908685 * 2^31
-.word 4209800893 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 72 * 2586463201 * 2^31
-.word 131257741 // XX: zeta^ 0 * 2^31 = 1392340^ 0 * 2^31 = 1 * 2^31
-.word 2147483667 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 0 * 2586463201 * 2^31
-.word 125733889 // XX: zeta^ 96 * 2^31 = 1392340^ 96 * 2^31 = 56869107 * 2^31
-.word 3298338847 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 96 * 2586463201 * 2^31
-.word 138014535 // XX: zeta^240 * 2^31 = 1392340^240 * 2^31 = 69914315 * 2^31
-.word 3562333593 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 1392340^240 * 2586463201 * 2^31
-.word 208872175 // XX: zeta^336 * 2^31 = 1392340^336 * 2^31 = 68280135 * 2^31
-.word 3529262833 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 1392340^336 * 2586463201 * 2^31
-.word 170065705 // XX: zeta^120 * 2^31 = 1392340^120 * 2^31 = 66124790 * 2^31
-.word 1338161655 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 1392340^120 * 2586463201 * 2^31
-.word 134338559 // XX: zeta^216 * 2^31 = 1392340^216 * 2^31 = 18457327 * 2^31
-.word 2521002977 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 1392340^216 * 2586463201 * 2^31
-.word 125849579 // XX: zeta^360 * 2^31 = 1392340^360 * 2^31 = 71510914 * 2^31
-.word 1447160181 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 1392340^360 * 2586463201 * 2^31
-.word 4479395 // XX: zeta^ 72 * 2^31 = 1392340^ 72 * 2^31 = 101908685 * 2^31
-.word 4209800893 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 72 * 2586463201 * 2^31
-.word 131867927 // XX: zeta^252 * 2^31 = 1392340^252 * 2^31 = 37202091 * 2^31
-.word 2900339145 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 1392340^252 * 2586463201 * 2^31
-.word 125158179 // XX: zeta^348 * 2^31 = 1392340^348 * 2^31 = 79245365 * 2^31
-.word 3751165245 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 1392340^348 * 2586463201 * 2^31
-.word 80846677 // XX: zeta^108 * 2^31 = 1392340^108 * 2^31 = 16694344 * 2^31
-.word 337841995 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 1392340^108 * 2586463201 * 2^31
-.word 74716425 // XX: zeta^204 * 2^31 = 1392340^204 * 2^31 = 45046276 * 2^31
-.word 911597591 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 1392340^204 * 2586463201 * 2^31
-.word 90560067 // XX: zeta^372 * 2^31 = 1392340^372 * 2^31 = 98432517 * 2^31
-.word 4139453981 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 1392340^372 * 2586463201 * 2^31
-.word 111189185 // XX: zeta^ 84 * 2^31 = 1392340^ 84 * 2^31 = 64042340 * 2^31
-.word 1296019295 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 1392340^ 84 * 2586463201 * 2^31
-.word 135574595 // XX: zeta^228 * 2^31 = 1392340^228 * 2^31 = 96168469 * 2^31
-.word 4093636637 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 1392340^228 * 2586463201 * 2^31
-.word 91326947 // XX: zeta^324 * 2^31 = 1392340^324 * 2^31 = 58506623 * 2^31
-.word 3331477117 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 1392340^324 * 2586463201 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
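The deleted assembly repeats one pattern throughout: a Montgomery-style multiplication of a coefficient vector by a twiddle factor (the vqrdmulh/vmul/vqrdmlah triple, fed by the `.word` pairs in `roots`), followed by a vsub/vadd pair that combines the product with a second coefficient. Mathematically that is a Cooley-Tukey butterfly, with the Gentleman-Sande form as its inverse. A minimal scalar sketch of the underlying arithmetic, using assumed toy parameters (q = 3329, zeta = 17) rather than this file's q = 106117153 and its 2^31-scaled twiddle encoding:

```python
# Illustrative sketch only; parameters are assumptions, not this file's constants.
q = 3329  # toy modulus (Kyber-style), NOT the 106117153 used in the patch

def ct_butterfly(x, y, zeta, q):
    """Cooley-Tukey butterfly: (x, y) -> (x + zeta*y, x - zeta*y) mod q.
    In the assembly, zeta*y is the vqrdmulh/vmul/vqrdmlah Montgomery product,
    and the vadd/vsub pair produces the two outputs."""
    t = (zeta * y) % q
    return (x + t) % q, (x - t) % q

def gs_butterfly(x, y, zeta_inv, q):
    """Gentleman-Sande butterfly (the inverse direction):
    (x, y) -> (x + y, (x - y) * zeta_inv) mod q."""
    return (x + y) % q, ((x - y) * zeta_inv) % q

zeta = 17
zeta_inv = pow(zeta, -1, q)
a, b = 5, 7
u, v = ct_butterfly(a, b, zeta, q)
x, y = gs_butterfly(u, v, zeta_inv, q)
# The inverse butterfly recovers the inputs up to a factor of 2.
assert (x, y) == ((2 * a) % q, (2 * b) % q)
```

In the assembly, the separate `% q` step disappears: each twiddle is stored as a pair of precomputed words (the value scaled by 2^31 and a companion involving q^(-1) mod 2^32, per the comments above), so the vqrdmulh/vqrdmlah sequence folds the modular reduction into the multiplication itself while keeping coefficients in signed 32-bit form.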
-.type ntt_384_u32_106117153_1392340_incomplete_good_bitrev, %function -.global ntt_384_u32_106117153_1392340_incomplete_good_bitrev -ntt_384_u32_106117153_1392340_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 106117153 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vmul.u32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 
-// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// 
input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] 
-vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release 
input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, 
#(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 
Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vmul.u32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vmul.u32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vmul.u32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vmul.u32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vmul.u32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vmul.u32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vmul.u32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vmul.u32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vmul.u32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vmul.u32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vmul.u32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vmul.u32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vmul.u32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r5
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[204]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -48)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vmul.u32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vqrdmlah.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release
input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q2, Q2, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, 
Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, 
Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[268]: Load as 
Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 
Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 
* 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q4, Q4, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] 
-vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vmul.u32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] 
-// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release 
input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 
-vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] 
-// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release 
input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 
-vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], 
#+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 
-vqrdmulh.s32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 1708504095 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good.s b/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good.s deleted file mode 100644 index 33ec980..0000000 --- a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good.s +++ /dev/null @@ -1,3383 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_108643009_640922_incomplete_good_twiddles -ntt_384_u32_108643009_640922_incomplete_good_twiddles: // For base multiplication -.word 117231189 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 3747646315 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 167943169 // zeta^ 64 * 2^31 = 640922^ 64 * 2^31 = 67669976 * 2^31 -.word 1929942719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 64 * 3479293249 * 2^31 -.word 10524287 // zeta^ 32 * 2^31 = 640922^ 32 * 2^31 = 8569779 * 2^31 -.word 2274825921 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 32 * 3479293249 * 2^31 -.word 183751195 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 2275215397 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 197898871 // zeta^ 16 * 2^31 = 640922^ 16 * 2^31 = 82310697 * 2^31 -.word 189228233 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 16 * 3479293249 * 2^31 -.word 117636793 // zeta^ 80 * 2^31 = 640922^ 80 * 2^31 = 87332880 * 2^31 -.word 3072994823 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 80 * 3479293249 * 2^31 -.word 59998845 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1940964675 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 22735857 // zeta^112 * 2^31 = 640922^112 * 2^31 = 44058032 * 2^31 -.word 2477333199 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 640922^112 * 3479293249 * 2^31 -.word 127637249 // zeta^ 8 * 2^31 = 640922^ 8 * 2^31 = 1793055 * 2^31 -.word 1932647359 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 8 * 3479293249 * 2^31 
-.word 78695545 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 3934662727 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 203907557 // zeta^ 40 * 2^31 = 640922^ 40 * 2^31 = 52463921 * 2^31 -.word 500614107 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 40 * 3479293249 * 2^31 -.word 212278911 // zeta^104 * 2^31 = 640922^104 * 2^31 = 46625229 * 2^31 -.word 3070660289 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 640922^104 * 3479293249 * 2^31 -.word 65439627 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 2806138549 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 141615223 // zeta^ 88 * 2^31 = 640922^ 88 * 2^31 = 56126250 * 2^31 -.word 830518985 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 88 * 3479293249 * 2^31 -.word 96791441 // zeta^ 56 * 2^31 = 640922^ 56 * 2^31 = 17702973 * 2^31 -.word 1466700591 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 56 * 3479293249 * 2^31 -.word 91234029 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 2063031507 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 172736993 // zeta^ 4 * 2^31 = 640922^ 4 * 2^31 = 640922 * 2^31 -.word 1396807903 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 4 * 3479293249 * 2^31 -.word 84666041 // zeta^ 68 * 2^31 = 640922^ 68 * 2^31 = 18021000 * 2^31 -.word 757024263 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 68 * 3479293249 * 2^31 -.word 145858849 // zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 3495799199 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 183858709 // zeta^100 * 2^31 = 640922^100 * 2^31 = 58708509 * 2^31 -.word 4012454827 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 640922^100 * 3479293249 * 2^31 -.word 177838823 // zeta^ 20 * 2^31 = 640922^ 20 * 2^31 = 81518432 * 2^31 -.word 3547181145 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 20 * 3479293249 * 2^31 -.word 41900335 // zeta^ 84 * 
2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 2540746769 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 60770513 // zeta^ 52 * 2^31 = 640922^ 52 * 2^31 = 82553845 * 2^31 -.word 4044236271 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 52 * 3479293249 * 2^31 -.word 167358029 // zeta^116 * 2^31 = 640922^116 * 2^31 = 31587287 * 2^31 -.word 953816435 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 640922^116 * 3479293249 * 2^31 -.word 51201803 // zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 3348244277 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 80521231 // zeta^ 76 * 2^31 = 640922^ 76 * 2^31 = 40418220 * 2^31 -.word 382095665 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 76 * 3479293249 * 2^31 -.word 99504283 // zeta^ 44 * 2^31 = 640922^ 44 * 2^31 = 52603644 * 2^31 -.word 3359009189 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 44 * 3479293249 * 2^31 -.word 40810197 // zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 935723755 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 171634653 // zeta^ 28 * 2^31 = 640922^ 28 * 2^31 = 31497268 * 2^31 -.word 2671255523 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 28 * 3479293249 * 2^31 -.word 139731691 // zeta^ 92 * 2^31 = 640922^ 92 * 2^31 = 87621537 * 2^31 -.word 1117909845 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 92 * 3479293249 * 2^31 -.word 62594557 // zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1184680387 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.word 164673767 // zeta^124 * 2^31 = 640922^124 * 2^31 = 78082914 * 2^31 -.word 2238255705 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 640922^124 * 3479293249 * 2^31 -.word 159354989 // zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 2477263699 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 100054829 // zeta^192 * 2^31 = 640922^192 * 2^31 = 
108643008 * 2^31 -.word 547320979 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 640922^192 * 3479293249 * 2^31 -.word 64583899 // zeta^160 * 2^31 = 640922^160 * 2^31 = 13028154 * 2^31 -.word 389477 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 640922^160 * 3479293249 * 2^31 -.word 206761731 // zeta^224 * 2^31 = 640922^224 * 2^31 = 100073230 * 2^31 -.word 2020141373 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 640922^224 * 3479293249 * 2^31 -.word 28380931 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 2883766589 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 19387147 // zeta^208 * 2^31 = 640922^208 * 2^31 = 26332312 * 2^31 -.word 4105739061 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 640922^208 * 3479293249 * 2^31 -.word 71380021 // zeta^176 * 2^31 = 640922^176 * 2^31 = 70392207 * 2^31 -.word 536368523 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 640922^176 * 3479293249 * 2^31 -.word 157287173 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 2354002619 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 59701305 // zeta^136 * 2^31 = 640922^136 * 2^31 = 106639146 * 2^31 -.word 2002015367 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 640922^136 * 3479293249 * 2^31 -.word 89648769 // zeta^200 * 2^31 = 640922^200 * 2^31 = 106849954 * 2^31 -.word 2362319935 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 640922^200 * 3479293249 * 2^31 -.word 117014363 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2570046181 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 13378461 // zeta^232 * 2^31 = 640922^232 * 2^31 = 56179088 * 2^31 -.word 3794353187 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 640922^232 * 3479293249 * 2^31 -.word 184818605 // zeta^152 * 2^31 = 640922^152 * 2^31 = 65895091 * 2^31 -.word 2319347731 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 640922^152 * 3479293249 * 2^31 -.word 151846391 // zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31 -.word 
1488828745 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103085597 // zeta^184 * 2^31 = 640922^184 * 2^31 = 105229554 * 2^31 -.word 596330915 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 640922^184 * 3479293249 * 2^31 -.word 120494577 // zeta^248 * 2^31 = 640922^248 * 2^31 = 90940036 * 2^31 -.word 2828266703 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 640922^248 * 3479293249 * 2^31 -.word 20572057 // zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 3655183655 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 44549025 // zeta^196 * 2^31 = 640922^196 * 2^31 = 108002087 * 2^31 -.word 2898159391 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 640922^196 * 3479293249 * 2^31 -.word 146642869 // zeta^164 * 2^31 = 640922^164 * 2^31 = 54775275 * 2^31 -.word 516655627 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 640922^164 * 3479293249 * 2^31 -.word 71427169 // zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 799168095 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 189990539 // zeta^148 * 2^31 = 640922^148 * 2^31 = 61145083 * 2^31 -.word 3288532917 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 640922^148 * 3479293249 * 2^31 -.word 39447195 // zeta^212 * 2^31 = 640922^212 * 2^31 = 27124577 * 2^31 -.word 747786149 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 640922^212 * 3479293249 * 2^31 -.word 215230525 // zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1204547459 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 156515505 // zeta^244 * 2^31 = 640922^244 * 2^31 = 26089164 * 2^31 -.word 250731023 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 640922^244 * 3479293249 * 2^31 -.word 137962437 // zeta^140 * 2^31 = 640922^140 * 2^31 = 57770712 * 2^31 -.word 1328818683 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 640922^140 * 3479293249 * 2^31 -.word 166084215 // zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 946723017 // zeta^204 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 49948923 // zeta^172 * 2^31 = 640922^172 * 2^31 = 62290981 * 2^31 -.word 1871681861 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 640922^172 * 3479293249 * 2^31 -.word 117781735 // zeta^236 * 2^31 = 640922^236 * 2^31 = 56039365 * 2^31 -.word 935958105 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 640922^236 * 3479293249 * 2^31 -.word 76740047 // zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 2741621617 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 45651365 // zeta^220 * 2^31 = 640922^220 * 2^31 = 77145741 * 2^31 -.word 1623711771 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 640922^220 * 3479293249 * 2^31 -.word 210722219 // zeta^188 * 2^31 = 640922^188 * 2^31 = 94509732 * 2^31 -.word 1053575317 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 640922^188 * 3479293249 * 2^31 -.word 154691461 // zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 3110286907 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 49342849 // zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 2365024575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 57931029 // zeta^320 * 2^31 = 640922^320 * 2^31 = 40973034 * 2^31 -.word 1817703595 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 640922^320 * 3479293249 * 2^31 -.word 33534823 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 2019751897 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 152702119 // zeta^352 * 2^31 = 640922^352 * 2^31 = 95614855 * 2^31 -.word 4294577817 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 640922^352 * 3479293249 * 2^31 -.word 99649225 // zeta^272 * 2^31 = 640922^272 * 2^31 = 21310129 * 2^31 -.word 1221972471 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 640922^272 * 3479293249 * 2^31 -.word 188905087 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 1411200705 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 
= 640922^336 * 3479293249 * 2^31 -.word 194550161 // zeta^304 * 2^31 = 640922^304 * 2^31 = 64584977 * 2^31 -.word 1817634095 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 640922^304 * 3479293249 * 2^31 -.word 145905997 // zeta^368 * 2^31 = 640922^368 * 2^31 = 38250802 * 2^31 -.word 3758598771 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 640922^368 * 3479293249 * 2^31 -.word 138590473 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 360304567 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 157584713 // zeta^328 * 2^31 = 640922^328 * 2^31 = 2003863 * 2^31 -.word 2292951927 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 640922^328 * 3479293249 * 2^31 -.word 5007107 // zeta^296 * 2^31 = 640922^296 * 2^31 = 62017780 * 2^31 -.word 1224307005 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 640922^296 * 3479293249 * 2^31 -.word 100271655 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 1724921113 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 75670795 // zeta^280 * 2^31 = 640922^280 * 2^31 = 52516759 * 2^31 -.word 3464448309 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 640922^280 * 3479293249 * 2^31 -.word 32467413 // zeta^344 * 2^31 = 640922^344 * 2^31 = 42747918 * 2^31 -.word 1975619563 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 640922^344 * 3479293249 * 2^31 -.word 126051989 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 2231935787 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 114200421 // zeta^376 * 2^31 = 640922^376 * 2^31 = 3413455 * 2^31 -.word 3698636379 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 640922^376 * 3479293249 * 2^31 -.word 132619977 // zeta^260 * 2^31 = 640922^260 * 2^31 = 90622009 * 2^31 -.word 3537943031 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 640922^260 * 3479293249 * 2^31 -.word 196713961 // zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31 -.word 639783639 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 
2^31 -.word 33427309 // zeta^292 * 2^31 = 640922^292 * 2^31 = 49934500 * 2^31 -.word 282512467 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 640922^292 * 3479293249 * 2^31 -.word 70643149 // zeta^356 * 2^31 = 640922^356 * 2^31 = 53867734 * 2^31 -.word 3778311667 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 640922^356 * 3479293249 * 2^31 -.word 175385683 // zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1754220525 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 27295479 // zeta^340 * 2^31 = 640922^340 * 2^31 = 47497926 * 2^31 -.word 1006434377 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 640922^340 * 3479293249 * 2^31 -.word 49927989 // zeta^308 * 2^31 = 640922^308 * 2^31 = 77055722 * 2^31 -.word 3341150859 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 640922^308 * 3479293249 * 2^31 -.word 2055493 // zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31 -.word 3090419835 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31 -.word 136764787 // zeta^268 * 2^31 = 640922^268 * 2^31 = 68224789 * 2^31 -.word 3912871629 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 640922^268 * 3479293249 * 2^31 -.word 79323581 // zeta^332 * 2^31 = 640922^332 * 2^31 = 50872297 * 2^31 -.word 2966148611 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 640922^332 * 3479293249 * 2^31 -.word 176475821 // zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 3359243539 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 167337095 // zeta^364 * 2^31 = 640922^364 * 2^31 = 46352028 * 2^31 -.word 2423285433 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 640922^364 * 3479293249 * 2^31 -.word 77554327 // zeta^284 * 2^31 = 640922^284 * 2^31 = 21021472 * 2^31 -.word 3177057449 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 640922^284 * 3479293249 * 2^31 -.word 140545971 // zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31 -.word 1553345677 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31 -.word 52612251 // 
zeta^316 * 2^31 = 640922^316 * 2^31 = 30560095 * 2^31 -.word 2056711589 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 640922^316 * 3479293249 * 2^31 -.word 6563799 // zeta^380 * 2^31 = 640922^380 * 2^31 = 14133277 * 2^31 -.word 3241391977 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 640922^380 * 3479293249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_108643009_640922_incomplete_good_scale -ntt_384_u32_108643009_640922_incomplete_good_scale: // Constants for scaling by 1/N -.word 117231189 // 1/96 -.word 3747646315 // 1/96 twisted -.data -roots: -.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 
2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 210808 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 98874168 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // XX: zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // XX: zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // XX: zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 210808 // XX: zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 /// zeta^264 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // XX: zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 98874168 // XX: zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // XX: zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 17380078 // XX: zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 343541970 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 3933234 // XX: zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 77745966 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 74622503 // XX: zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1475019943 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 57676451 // XX: zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1140057115 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 91290517 // XX: zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 1804486955 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 102391393 // XX: zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 2023911563 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 56124269 // XX: zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 1109376029 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 92216191 // XX: zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1822784218 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type 
ntt_384_u32_108643009_640922_incomplete_good, %function -.global ntt_384_u32_108643009_640922_incomplete_good -ntt_384_u32_108643009_640922_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, -108643009 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r8 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmla.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, 
Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q1, 
Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load 
as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: 
Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 
Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 
Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// 
input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// 
Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[120]: 
Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r5 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release input[96] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release input[192] from Q4 -vqrdmulh.s32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[36]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 36)] -vmla.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release input[288] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-240)] -// input[36]: Already loaded as Q7 -// input[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 
Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[228] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release input[36] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release input[360] from Q5 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[72] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[300]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(288)] -// input[300]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[240]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release input[204] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[48]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 48)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(192)] -vadd.s32 Q3, Q3, 
Q7 -// Release input[300] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-192)] -// input[48]: Already loaded as Q6 -// input[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[240] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[336] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[180]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -72)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release input[48] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(336)] -// input[180]: Already loaded as Q7 -// input[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[372] from Q5 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release input[180] from Q7 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[312]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release input[120] from Q5 -// input[24]: Load 
as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[252]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release input[216] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[60]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 60)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release input[312] from Q6 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-144)] -// input[60]: Already loaded as Q7 -// input[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release input[252] from Q5 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release input[348] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[160]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -92)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release input[60] from Q7 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(384)] -// input[160]: Already loaded as Q6 -// input[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release input[352] from Q5 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release input[64] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[292]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 40)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release input[160] from Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 
Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(256)] -// input[292]: Already loaded as Q7 -// input[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release input[100] from Q5 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release input[196] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[40]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 40)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release input[292] from Q7 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -// input[40]: Already loaded as Q6 -// input[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release input[232] from Q5 -// input[136]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[328] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[172]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -80)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release input[40] from Q6 -vstrw.u32 Q3, [r14,#(-464)] -// Release input[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(304)] -// input[172]: Already loaded as Q7 -// input[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[76]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release input[364] from Q5 -// input[268]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, 
#(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[76] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[304]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 52)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release input[172] from Q7 -vstrw.u32 Q3, [r14,#(64)] -// Release input[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(304)] -// input[304]: Already loaded as Q6 -// input[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[208]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release input[112] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release input[208] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release input[304] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-176)] -// input[52]: Already loaded as Q7 -// input[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release input[244] from Q5 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[340] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[184]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release input[52] from Q7 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(352)] -// 
input[184]: Already loaded as Q6 -// input[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release input[376] from Q5 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[88] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[316]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 64)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release input[184] from Q6 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(352)] -// input[316]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[220]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// input[224]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release input[220] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[32]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 32)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release input[316] from Q7 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-128)] -// input[32]: Already loaded as Q6 -// input[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release input[224] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[356]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release input[320] from Q2 -vqrdmulh.s32 Q0, Q0, r4 
-vsub.s32 Q2, Q3, Q6 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release input[32] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(272)] -// input[164]: Already loaded as Q7 -// input[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release input[356] from Q5 -// input[260]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[104]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[296]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 44)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release input[164] from Q7 -vstrw.u32 Q3, [r14,#(32)] -// Release input[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[296]: Already loaded as Q6 -// input[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release input[104] from Q5 -// input[8]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release input[200] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release input[296] from Q6 -vstrw.u32 Q3, [r0,#(32)] -// Release input[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-208)] -// input[44]: Already loaded as Q7 -// input[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, 
Q0, r5 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release input[236] from Q5 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// input[368]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[332] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[176]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -76)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release input[44] from Q7 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(320)] -// input[176]: Already loaded as Q6 -// input[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release input[368] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[116]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release input[80] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release input[176] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(320)] -// input[308]: Already loaded as Q7 -// input[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[116] from Q5 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[248]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vmla.s32 Q1, Q0, r11 
-vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release input[308] from Q7 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already loaded as Q6 -// input[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r5 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release input[248] from Q5 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[344] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(368)] -// input[188]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r5 -// input[92]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] from Q5 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[24]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release input[92] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[264]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release input[188] from Q7 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(368)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] 
-vqrdmulh.s32 Q5, Q5, r8 -// input[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r11 -vmul.u32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmul.u32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r11 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(96)] -// Release input[24] from Q5 -vmla.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(48)] -// Release input[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[280]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(48)] -// Release input[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r8 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-480)] -// Release input[132] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 
Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 
Q1, [r14, #(4 * -124)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r8 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vmul.u32 
Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[60]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// 
input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 
-vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vmla.s32 Q6, Q4, r11 
-vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] 
-// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 
-vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[244]: Load as 
Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 
Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[264]: Load as Q1 
-vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// 
input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
-88)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, 
[r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 
-vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release 
input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, 
Q0 -// input[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 
-vmul.u32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[368]: Load as Q2 
-vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 815674047 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s b/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s deleted file mode 100644 index 9c316f8..0000000 --- 
a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 21597933 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 21597933 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 26334175 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 103620826 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 26334175 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 520532437 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 14289518 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 282452654 // zeta^120 * f(q^(-1) 
mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 9768841 // zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31 -.word 193095041 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103620826 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 2048213056 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 5838692 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 115410055 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 108432201 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 2143316728 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 21597933 // XX: zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 426913875 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 26334175 // XX: zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 520532437 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 103620826 // XX: zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 2048213056 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 14289518 // XX: zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 282452654 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 9768841 // XX: zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31 -.word 193095041 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 5838692 // XX: zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 115410055 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 108432201 // XX: zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 2143316728 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 
640922^ 72 * 3479293249 * 2^31 -.word 16426818 // XX: zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 324699430 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 52518740 // XX: zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31 -.word 1038107619 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31 -.word 6251616 // XX: zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 123572085 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 17352492 // XX: zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 342996693 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 50966558 // XX: zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31 -.word 1007426533 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31 -.word 34020506 // XX: zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 672463705 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 104709775 // XX: zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 2069737682 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 91262931 // XX: zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31 -.word 1803941678 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_108643009_640922_incomplete_good_bitrev, %function -.global ntt_384_u32_108643009_640922_incomplete_good_bitrev -ntt_384_u32_108643009_640922_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, -108643009 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vqrdmulh.s32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vmla.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already 
loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// 
Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[72]: Load as Q4 
-vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 
-// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, 
#(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, 
#(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, 
Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] 
from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// 
input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, 
r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r5 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[204]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vqrdmulh.s32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[72]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 72)] -vmla.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(48)] -vadd.s32 Q1, Q1, Q0 -// Release input[264] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[72]: Already loaded as Q7 -// input[204]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[204] from Q6 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vsub.s32 Q4, Q3, Q2 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vqrdmulh.s32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vmla.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q7 -// Release input[72] from Q7 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q0, Q4, 
Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3,
Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// 
Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q4, Q4, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 
-vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, 
Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 
-vmul.u32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// 
input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 
24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[204]: Load as Q4 
-vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 
Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// 
Release input[284] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(0)]
-// Release input[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[124]: Already loaded as Q4
-vmul.u32 Q0, Q4, r9
-// input[316]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 64)]
-vqrdmulh.s32 Q4, Q4, r8
-// input[380]: Load as Q3
-vldrw.u32 Q3, [r12, #(4 * -124)]
-vmla.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(496)]
-// Release input[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vqrdmulh.s32 Q3, Q3, r8
-vmla.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-.equ modulus_inv, 815674047
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3150
-// Instruction count: 2196
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop.s b/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop.s
deleted file mode 100644
index 00d4e3a..0000000
--- a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop.s
+++ /dev/null
@@ -1,3388 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-.global ntt_384_u32_108643009_640922_incomplete_good_oop_twiddles
-ntt_384_u32_108643009_640922_incomplete_good_oop_twiddles: // For base multiplication
-.word 117231189 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31
-.word 3747646315 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31
-.word 167943169 // zeta^ 64 * 2^31 = 640922^ 64 * 2^31 = 67669976 * 2^31
-.word 1929942719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 64 * 3479293249 * 2^31
-.word 10524287 // zeta^ 32 * 2^31 = 640922^ 32 * 2^31 = 8569779 * 2^31
-.word 2274825921 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 32 * 3479293249 * 2^31
-.word 183751195 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31
-.word 2275215397 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31
-.word 197898871 // zeta^ 16 * 2^31 = 640922^ 16 * 2^31 = 82310697 * 2^31
-.word 189228233 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 16 * 3479293249 * 2^31
-.word 117636793 // zeta^ 80 * 2^31 = 640922^ 80 * 2^31 = 87332880 * 2^31
-.word 3072994823 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 80 * 3479293249 * 2^31
-.word 59998845 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31
-.word 1940964675 // zeta^ 48 *
f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 22735857 // zeta^112 * 2^31 = 640922^112 * 2^31 = 44058032 * 2^31 -.word 2477333199 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 640922^112 * 3479293249 * 2^31 -.word 127637249 // zeta^ 8 * 2^31 = 640922^ 8 * 2^31 = 1793055 * 2^31 -.word 1932647359 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 8 * 3479293249 * 2^31 -.word 78695545 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 3934662727 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 203907557 // zeta^ 40 * 2^31 = 640922^ 40 * 2^31 = 52463921 * 2^31 -.word 500614107 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 40 * 3479293249 * 2^31 -.word 212278911 // zeta^104 * 2^31 = 640922^104 * 2^31 = 46625229 * 2^31 -.word 3070660289 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 640922^104 * 3479293249 * 2^31 -.word 65439627 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 2806138549 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 141615223 // zeta^ 88 * 2^31 = 640922^ 88 * 2^31 = 56126250 * 2^31 -.word 830518985 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 88 * 3479293249 * 2^31 -.word 96791441 // zeta^ 56 * 2^31 = 640922^ 56 * 2^31 = 17702973 * 2^31 -.word 1466700591 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 56 * 3479293249 * 2^31 -.word 91234029 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 2063031507 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 172736993 // zeta^ 4 * 2^31 = 640922^ 4 * 2^31 = 640922 * 2^31 -.word 1396807903 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 4 * 3479293249 * 2^31 -.word 84666041 // zeta^ 68 * 2^31 = 640922^ 68 * 2^31 = 18021000 * 2^31 -.word 757024263 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 68 * 3479293249 * 2^31 -.word 145858849 // zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 3495799199 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 
* 3479293249 * 2^31 -.word 183858709 // zeta^100 * 2^31 = 640922^100 * 2^31 = 58708509 * 2^31 -.word 4012454827 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 640922^100 * 3479293249 * 2^31 -.word 177838823 // zeta^ 20 * 2^31 = 640922^ 20 * 2^31 = 81518432 * 2^31 -.word 3547181145 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 20 * 3479293249 * 2^31 -.word 41900335 // zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 2540746769 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 60770513 // zeta^ 52 * 2^31 = 640922^ 52 * 2^31 = 82553845 * 2^31 -.word 4044236271 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 52 * 3479293249 * 2^31 -.word 167358029 // zeta^116 * 2^31 = 640922^116 * 2^31 = 31587287 * 2^31 -.word 953816435 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 640922^116 * 3479293249 * 2^31 -.word 51201803 // zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 3348244277 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 80521231 // zeta^ 76 * 2^31 = 640922^ 76 * 2^31 = 40418220 * 2^31 -.word 382095665 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 76 * 3479293249 * 2^31 -.word 99504283 // zeta^ 44 * 2^31 = 640922^ 44 * 2^31 = 52603644 * 2^31 -.word 3359009189 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 44 * 3479293249 * 2^31 -.word 40810197 // zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 935723755 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 171634653 // zeta^ 28 * 2^31 = 640922^ 28 * 2^31 = 31497268 * 2^31 -.word 2671255523 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 28 * 3479293249 * 2^31 -.word 139731691 // zeta^ 92 * 2^31 = 640922^ 92 * 2^31 = 87621537 * 2^31 -.word 1117909845 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 92 * 3479293249 * 2^31 -.word 62594557 // zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1184680387 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.word 
164673767 // zeta^124 * 2^31 = 640922^124 * 2^31 = 78082914 * 2^31 -.word 2238255705 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 640922^124 * 3479293249 * 2^31 -.word 159354989 // zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 2477263699 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 100054829 // zeta^192 * 2^31 = 640922^192 * 2^31 = 108643008 * 2^31 -.word 547320979 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 640922^192 * 3479293249 * 2^31 -.word 64583899 // zeta^160 * 2^31 = 640922^160 * 2^31 = 13028154 * 2^31 -.word 389477 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 640922^160 * 3479293249 * 2^31 -.word 206761731 // zeta^224 * 2^31 = 640922^224 * 2^31 = 100073230 * 2^31 -.word 2020141373 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 640922^224 * 3479293249 * 2^31 -.word 28380931 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 2883766589 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 19387147 // zeta^208 * 2^31 = 640922^208 * 2^31 = 26332312 * 2^31 -.word 4105739061 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 640922^208 * 3479293249 * 2^31 -.word 71380021 // zeta^176 * 2^31 = 640922^176 * 2^31 = 70392207 * 2^31 -.word 536368523 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 640922^176 * 3479293249 * 2^31 -.word 157287173 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 2354002619 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 59701305 // zeta^136 * 2^31 = 640922^136 * 2^31 = 106639146 * 2^31 -.word 2002015367 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 640922^136 * 3479293249 * 2^31 -.word 89648769 // zeta^200 * 2^31 = 640922^200 * 2^31 = 106849954 * 2^31 -.word 2362319935 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 640922^200 * 3479293249 * 2^31 -.word 117014363 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2570046181 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 13378461 // zeta^232 * 2^31 
= 640922^232 * 2^31 = 56179088 * 2^31 -.word 3794353187 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 640922^232 * 3479293249 * 2^31 -.word 184818605 // zeta^152 * 2^31 = 640922^152 * 2^31 = 65895091 * 2^31 -.word 2319347731 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 640922^152 * 3479293249 * 2^31 -.word 151846391 // zeta^216 * 2^31 = 640922^216 * 2^31 = 9768841 * 2^31 -.word 1488828745 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103085597 // zeta^184 * 2^31 = 640922^184 * 2^31 = 105229554 * 2^31 -.word 596330915 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 640922^184 * 3479293249 * 2^31 -.word 120494577 // zeta^248 * 2^31 = 640922^248 * 2^31 = 90940036 * 2^31 -.word 2828266703 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 640922^248 * 3479293249 * 2^31 -.word 20572057 // zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 3655183655 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 44549025 // zeta^196 * 2^31 = 640922^196 * 2^31 = 108002087 * 2^31 -.word 2898159391 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 640922^196 * 3479293249 * 2^31 -.word 146642869 // zeta^164 * 2^31 = 640922^164 * 2^31 = 54775275 * 2^31 -.word 516655627 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 640922^164 * 3479293249 * 2^31 -.word 71427169 // zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 799168095 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 189990539 // zeta^148 * 2^31 = 640922^148 * 2^31 = 61145083 * 2^31 -.word 3288532917 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 640922^148 * 3479293249 * 2^31 -.word 39447195 // zeta^212 * 2^31 = 640922^212 * 2^31 = 27124577 * 2^31 -.word 747786149 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 640922^212 * 3479293249 * 2^31 -.word 215230525 // zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1204547459 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 156515505 // zeta^244 * 2^31 = 640922^244 * 2^31 = 
26089164 * 2^31 -.word 250731023 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 640922^244 * 3479293249 * 2^31 -.word 137962437 // zeta^140 * 2^31 = 640922^140 * 2^31 = 57770712 * 2^31 -.word 1328818683 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 640922^140 * 3479293249 * 2^31 -.word 166084215 // zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 946723017 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 49948923 // zeta^172 * 2^31 = 640922^172 * 2^31 = 62290981 * 2^31 -.word 1871681861 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 640922^172 * 3479293249 * 2^31 -.word 117781735 // zeta^236 * 2^31 = 640922^236 * 2^31 = 56039365 * 2^31 -.word 935958105 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 640922^236 * 3479293249 * 2^31 -.word 76740047 // zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 2741621617 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 45651365 // zeta^220 * 2^31 = 640922^220 * 2^31 = 77145741 * 2^31 -.word 1623711771 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 640922^220 * 3479293249 * 2^31 -.word 210722219 // zeta^188 * 2^31 = 640922^188 * 2^31 = 94509732 * 2^31 -.word 1053575317 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 640922^188 * 3479293249 * 2^31 -.word 154691461 // zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 3110286907 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 49342849 // zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 2365024575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 57931029 // zeta^320 * 2^31 = 640922^320 * 2^31 = 40973034 * 2^31 -.word 1817703595 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 640922^320 * 3479293249 * 2^31 -.word 33534823 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 2019751897 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 152702119 // zeta^352 * 2^31 = 640922^352 * 2^31 = 95614855 * 2^31 -.word 
4294577817 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 640922^352 * 3479293249 * 2^31 -.word 99649225 // zeta^272 * 2^31 = 640922^272 * 2^31 = 21310129 * 2^31 -.word 1221972471 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 640922^272 * 3479293249 * 2^31 -.word 188905087 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 1411200705 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 194550161 // zeta^304 * 2^31 = 640922^304 * 2^31 = 64584977 * 2^31 -.word 1817634095 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 640922^304 * 3479293249 * 2^31 -.word 145905997 // zeta^368 * 2^31 = 640922^368 * 2^31 = 38250802 * 2^31 -.word 3758598771 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 640922^368 * 3479293249 * 2^31 -.word 138590473 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 360304567 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 157584713 // zeta^328 * 2^31 = 640922^328 * 2^31 = 2003863 * 2^31 -.word 2292951927 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 640922^328 * 3479293249 * 2^31 -.word 5007107 // zeta^296 * 2^31 = 640922^296 * 2^31 = 62017780 * 2^31 -.word 1224307005 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 640922^296 * 3479293249 * 2^31 -.word 100271655 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 1724921113 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 75670795 // zeta^280 * 2^31 = 640922^280 * 2^31 = 52516759 * 2^31 -.word 3464448309 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 640922^280 * 3479293249 * 2^31 -.word 32467413 // zeta^344 * 2^31 = 640922^344 * 2^31 = 42747918 * 2^31 -.word 1975619563 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 640922^344 * 3479293249 * 2^31 -.word 126051989 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 2231935787 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 114200421 // zeta^376 * 2^31 = 640922^376 * 2^31 = 3413455 * 2^31 -.word 3698636379 // zeta^376 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^376 * 3479293249 * 2^31 -.word 132619977 // zeta^260 * 2^31 = 640922^260 * 2^31 = 90622009 * 2^31 -.word 3537943031 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 640922^260 * 3479293249 * 2^31 -.word 196713961 // zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31 -.word 639783639 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 640922^324 * 3479293249 * 2^31 -.word 33427309 // zeta^292 * 2^31 = 640922^292 * 2^31 = 49934500 * 2^31 -.word 282512467 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 640922^292 * 3479293249 * 2^31 -.word 70643149 // zeta^356 * 2^31 = 640922^356 * 2^31 = 53867734 * 2^31 -.word 3778311667 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 640922^356 * 3479293249 * 2^31 -.word 175385683 // zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1754220525 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 27295479 // zeta^340 * 2^31 = 640922^340 * 2^31 = 47497926 * 2^31 -.word 1006434377 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 640922^340 * 3479293249 * 2^31 -.word 49927989 // zeta^308 * 2^31 = 640922^308 * 2^31 = 77055722 * 2^31 -.word 3341150859 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 640922^308 * 3479293249 * 2^31 -.word 2055493 // zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31 -.word 3090419835 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31 -.word 136764787 // zeta^268 * 2^31 = 640922^268 * 2^31 = 68224789 * 2^31 -.word 3912871629 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 640922^268 * 3479293249 * 2^31 -.word 79323581 // zeta^332 * 2^31 = 640922^332 * 2^31 = 50872297 * 2^31 -.word 2966148611 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 640922^332 * 3479293249 * 2^31 -.word 176475821 // zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 3359243539 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 167337095 // zeta^364 * 2^31 = 640922^364 * 2^31 = 46352028 * 2^31 -.word 2423285433 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 
640922^364 * 3479293249 * 2^31 -.word 77554327 // zeta^284 * 2^31 = 640922^284 * 2^31 = 21021472 * 2^31 -.word 3177057449 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 640922^284 * 3479293249 * 2^31 -.word 140545971 // zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31 -.word 1553345677 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 2^31 -.word 52612251 // zeta^316 * 2^31 = 640922^316 * 2^31 = 30560095 * 2^31 -.word 2056711589 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 640922^316 * 3479293249 * 2^31 -.word 6563799 // zeta^380 * 2^31 = 640922^380 * 2^31 = 14133277 * 2^31 -.word 3241391977 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 640922^380 * 3479293249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_108643009_640922_incomplete_good_oop_scale -ntt_384_u32_108643009_640922_incomplete_good_oop_scale: // Constants for scaling by 1/N -.word 117231189 // 1/96 -.word 3747646315 // 1/96 twisted -.data -roots: -.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 
* 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 210808 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 98874168 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // XX: zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // XX: zeta^144 * 2^31 = 640922^144 * 
2^31 = 5022183 * 2^31 -.word 99270592 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // XX: zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 210808 // XX: zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // XX: zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 98874168 // XX: zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // XX: zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 17380078 // XX: zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 343541970 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 3933234 // XX: zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 77745966 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 74622503 // XX: zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1475019943 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 57676451 // XX: zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1140057115 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 91290517 // XX: zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 1804486955 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 102391393 // XX: zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 2023911563 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 56124269 // XX: zeta^156 * 
2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 1109376029 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 92216191 // XX: zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1822784218 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_108643009_640922_incomplete_good_oop, %function -.global ntt_384_u32_108643009_640922_incomplete_good_oop -ntt_384_u32_108643009_640922_incomplete_good_oop: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -108643009 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r7 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r6 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r9 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r6 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmla.s32 Q2, Q3, r9 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: 
Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r11,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r1,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 
-vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r11,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(112)] -vsub.s32 
Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r11,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r1,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] 
from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r11,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r1,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[184]: Already loaded as Q5 -// 
input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r11,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r1,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 
Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r11,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r1,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[336]: 
Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r11,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r1,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 
-// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r11,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r1,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, 
[r12, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r11,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r1,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r11,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 
-vldrw.u32 Q7, [r14, #(4 * 120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r11,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r1,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r1,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r10,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r11,#(0)] -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 
-vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 
Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 
Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 
-vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, 
[r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// 
Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release 
output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// 
output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] 
-// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] 
-vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[132]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// 
output[280]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-384)] -// Release output[156] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(48)] -// Release output[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r6 -// output[136]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-480)] -// Release output[132] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[256]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[28]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(112)] -// Release output[280] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release output[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(16)] -// Release output[256] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[4]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 4)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[152]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 
-vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(112)] -// Release output[28] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r6 -// output[8]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release output[4] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[128]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[284]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-400)] -// Release output[152] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(32)] -// Release output[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r6 -// output[140]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-496)] -// Release output[128] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[260]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(128)] -// Release output[284] from 
Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release output[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[48]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r6 -// output[168]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(32)] -// Release output[260] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[60]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(192)] -// Release output[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release output[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[180]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r6 -// output[300]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release output[288] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[36]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 36)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[184]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(240)] -// Release output[60] from Q2 -vmla.s32 Q6, Q4, r9 
-vstrw.u32 Q3, [r11,#(-288)] -// Release output[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release output[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[304]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r6 -// output[40]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 40)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(144)] -// Release output[36] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[316]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release output[184] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release output[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(160)] -// Release output[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-368)] -// Release output[160] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[292]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[56]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 56)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// Release output[316] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, 
[r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[176]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r6 -// output[296]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release output[292] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[32]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 32)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(224)] -// Release output[56] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release output[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(176)] -// Release output[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[308]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r6 -// output[44]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 44)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(128)] -// Release output[32] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release output[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(176)] -// Release output[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd 
r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[336]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r6 -// output[72]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 72)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-352)] -// Release output[164] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[192]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release output[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(288)] -// Release output[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[84]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r6 -// output[204]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-240)] -// Release output[192] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[324]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, 
Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// 
output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as 
Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release 
output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 
-vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// 
output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 
-vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 
-// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load 
as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, 
#(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 
-vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] 
-// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 
-vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, 
[r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 815674047 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3355 -// Instruction count: 2397 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s b/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s deleted file mode 100644 index 509f96a..0000000 --- a/tests/intmulntt/ntt_384_u32_108643009_640922_incomplete_good_oop_half_input.s +++ /dev/null @@ -1,3075 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_twiddles -ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_twiddles: // For base multiplication -.word 117231189 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 3747646315 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 167943169 // zeta^ 64 * 2^31 = 640922^ 64 * 2^31 = 67669976 * 2^31 -.word 1929942719 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 64 * 3479293249 * 2^31 -.word 10524287 // zeta^ 32 * 2^31 = 640922^ 32 * 2^31 = 8569779 * 2^31 -.word 2274825921 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 32 * 3479293249 * 2^31 -.word 183751195 // zeta^ 96 * 2^31 = 640922^ 96 * 2^31 = 21597933 * 2^31 -.word 2275215397 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 96 * 3479293249 * 2^31 -.word 197898871 // zeta^ 16 * 2^31 = 640922^ 16 * 2^31 = 82310697 * 2^31 -.word 189228233 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 16 * 3479293249 * 2^31 -.word 117636793 // zeta^ 80 * 2^31 = 640922^ 80 * 2^31 = 87332880 * 2^31 -.word 3072994823 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 80 * 3479293249 * 2^31 -.word 59998845 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1940964675 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 22735857 // zeta^112 * 2^31 = 640922^112 * 2^31 = 44058032 * 2^31 -.word 2477333199 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 640922^112 * 3479293249 * 2^31 -.word 127637249 // zeta^ 8 * 2^31 = 640922^ 8 * 2^31 = 1793055 * 2^31 -.word 1932647359 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 
640922^ 8 * 3479293249 * 2^31 -.word 78695545 // zeta^ 72 * 2^31 = 640922^ 72 * 2^31 = 108432201 * 2^31 -.word 3934662727 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 72 * 3479293249 * 2^31 -.word 203907557 // zeta^ 40 * 2^31 = 640922^ 40 * 2^31 = 52463921 * 2^31 -.word 500614107 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 40 * 3479293249 * 2^31 -.word 212278911 // zeta^104 * 2^31 = 640922^104 * 2^31 = 46625229 * 2^31 -.word 3070660289 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 640922^104 * 3479293249 * 2^31 -.word 65439627 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 2806138549 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 141615223 // zeta^ 88 * 2^31 = 640922^ 88 * 2^31 = 56126250 * 2^31 -.word 830518985 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 88 * 3479293249 * 2^31 -.word 96791441 // zeta^ 56 * 2^31 = 640922^ 56 * 2^31 = 17702973 * 2^31 -.word 1466700591 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 56 * 3479293249 * 2^31 -.word 91234029 // zeta^120 * 2^31 = 640922^120 * 2^31 = 14289518 * 2^31 -.word 2063031507 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 640922^120 * 3479293249 * 2^31 -.word 172736993 // zeta^ 4 * 2^31 = 640922^ 4 * 2^31 = 640922 * 2^31 -.word 1396807903 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 4 * 3479293249 * 2^31 -.word 84666041 // zeta^ 68 * 2^31 = 640922^ 68 * 2^31 = 18021000 * 2^31 -.word 757024263 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 68 * 3479293249 * 2^31 -.word 145858849 // zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 3495799199 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 183858709 // zeta^100 * 2^31 = 640922^100 * 2^31 = 58708509 * 2^31 -.word 4012454827 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 640922^100 * 3479293249 * 2^31 -.word 177838823 // zeta^ 20 * 2^31 = 640922^ 20 * 2^31 = 81518432 * 2^31 -.word 3547181145 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 20 * 3479293249 * 2^31 
-.word 41900335 // zeta^ 84 * 2^31 = 640922^ 84 * 2^31 = 34020506 * 2^31 -.word 2540746769 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 84 * 3479293249 * 2^31 -.word 60770513 // zeta^ 52 * 2^31 = 640922^ 52 * 2^31 = 82553845 * 2^31 -.word 4044236271 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 52 * 3479293249 * 2^31 -.word 167358029 // zeta^116 * 2^31 = 640922^116 * 2^31 = 31587287 * 2^31 -.word 953816435 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 640922^116 * 3479293249 * 2^31 -.word 51201803 // zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 3348244277 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 80521231 // zeta^ 76 * 2^31 = 640922^ 76 * 2^31 = 40418220 * 2^31 -.word 382095665 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 76 * 3479293249 * 2^31 -.word 99504283 // zeta^ 44 * 2^31 = 640922^ 44 * 2^31 = 52603644 * 2^31 -.word 3359009189 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 44 * 3479293249 * 2^31 -.word 40810197 // zeta^108 * 2^31 = 640922^108 * 2^31 = 6251616 * 2^31 -.word 935723755 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 640922^108 * 3479293249 * 2^31 -.word 171634653 // zeta^ 28 * 2^31 = 640922^ 28 * 2^31 = 31497268 * 2^31 -.word 2671255523 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 28 * 3479293249 * 2^31 -.word 139731691 // zeta^ 92 * 2^31 = 640922^ 92 * 2^31 = 87621537 * 2^31 -.word 1117909845 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 92 * 3479293249 * 2^31 -.word 62594557 // zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1184680387 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.word 164673767 // zeta^124 * 2^31 = 640922^124 * 2^31 = 78082914 * 2^31 -.word 2238255705 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 640922^124 * 3479293249 * 2^31 -.word 159354989 // zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 2477263699 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 100054829 // zeta^192 * 
2^31 = 640922^192 * 2^31 = 108643008 * 2^31 -.word 547320979 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 640922^192 * 3479293249 * 2^31 -.word 64583899 // zeta^160 * 2^31 = 640922^160 * 2^31 = 13028154 * 2^31 -.word 389477 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 640922^160 * 3479293249 * 2^31 -.word 206761731 // zeta^224 * 2^31 = 640922^224 * 2^31 = 100073230 * 2^31 -.word 2020141373 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 640922^224 * 3479293249 * 2^31 -.word 28380931 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 2883766589 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 19387147 // zeta^208 * 2^31 = 640922^208 * 2^31 = 26332312 * 2^31 -.word 4105739061 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 640922^208 * 3479293249 * 2^31 -.word 71380021 // zeta^176 * 2^31 = 640922^176 * 2^31 = 70392207 * 2^31 -.word 536368523 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 640922^176 * 3479293249 * 2^31 -.word 157287173 // zeta^240 * 2^31 = 640922^240 * 2^31 = 26334175 * 2^31 -.word 2354002619 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 640922^240 * 3479293249 * 2^31 -.word 59701305 // zeta^136 * 2^31 = 640922^136 * 2^31 = 106639146 * 2^31 -.word 2002015367 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 640922^136 * 3479293249 * 2^31 -.word 89648769 // zeta^200 * 2^31 = 640922^200 * 2^31 = 106849954 * 2^31 -.word 2362319935 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 640922^200 * 3479293249 * 2^31 -.word 117014363 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2570046181 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 13378461 // zeta^232 * 2^31 = 640922^232 * 2^31 = 56179088 * 2^31 -.word 3794353187 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 640922^232 * 3479293249 * 2^31 -.word 184818605 // zeta^152 * 2^31 = 640922^152 * 2^31 = 65895091 * 2^31 -.word 2319347731 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 640922^152 * 3479293249 * 2^31 -.word 151846391 // zeta^216 * 2^31 = 640922^216 * 2^31 = 
9768841 * 2^31 -.word 1488828745 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 640922^216 * 3479293249 * 2^31 -.word 103085597 // zeta^184 * 2^31 = 640922^184 * 2^31 = 105229554 * 2^31 -.word 596330915 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 640922^184 * 3479293249 * 2^31 -.word 120494577 // zeta^248 * 2^31 = 640922^248 * 2^31 = 90940036 * 2^31 -.word 2828266703 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 640922^248 * 3479293249 * 2^31 -.word 20572057 // zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 3655183655 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 44549025 // zeta^196 * 2^31 = 640922^196 * 2^31 = 108002087 * 2^31 -.word 2898159391 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 640922^196 * 3479293249 * 2^31 -.word 146642869 // zeta^164 * 2^31 = 640922^164 * 2^31 = 54775275 * 2^31 -.word 516655627 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 640922^164 * 3479293249 * 2^31 -.word 71427169 // zeta^228 * 2^31 = 640922^228 * 2^31 = 104709775 * 2^31 -.word 799168095 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 640922^228 * 3479293249 * 2^31 -.word 189990539 // zeta^148 * 2^31 = 640922^148 * 2^31 = 61145083 * 2^31 -.word 3288532917 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 640922^148 * 3479293249 * 2^31 -.word 39447195 // zeta^212 * 2^31 = 640922^212 * 2^31 = 27124577 * 2^31 -.word 747786149 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 640922^212 * 3479293249 * 2^31 -.word 215230525 // zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1204547459 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 156515505 // zeta^244 * 2^31 = 640922^244 * 2^31 = 26089164 * 2^31 -.word 250731023 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 640922^244 * 3479293249 * 2^31 -.word 137962437 // zeta^140 * 2^31 = 640922^140 * 2^31 = 57770712 * 2^31 -.word 1328818683 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 640922^140 * 3479293249 * 2^31 -.word 166084215 // zeta^204 * 2^31 = 640922^204 * 2^31 = 17352492 * 2^31 -.word 
946723017 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 640922^204 * 3479293249 * 2^31 -.word 49948923 // zeta^172 * 2^31 = 640922^172 * 2^31 = 62290981 * 2^31 -.word 1871681861 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 640922^172 * 3479293249 * 2^31 -.word 117781735 // zeta^236 * 2^31 = 640922^236 * 2^31 = 56039365 * 2^31 -.word 935958105 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 640922^236 * 3479293249 * 2^31 -.word 76740047 // zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 2741621617 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 45651365 // zeta^220 * 2^31 = 640922^220 * 2^31 = 77145741 * 2^31 -.word 1623711771 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 640922^220 * 3479293249 * 2^31 -.word 210722219 // zeta^188 * 2^31 = 640922^188 * 2^31 = 94509732 * 2^31 -.word 1053575317 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 640922^188 * 3479293249 * 2^31 -.word 154691461 // zeta^252 * 2^31 = 640922^252 * 2^31 = 16426818 * 2^31 -.word 3110286907 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 640922^252 * 3479293249 * 2^31 -.word 49342849 // zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 2365024575 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 57931029 // zeta^320 * 2^31 = 640922^320 * 2^31 = 40973034 * 2^31 -.word 1817703595 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 640922^320 * 3479293249 * 2^31 -.word 33534823 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 2019751897 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 152702119 // zeta^352 * 2^31 = 640922^352 * 2^31 = 95614855 * 2^31 -.word 4294577817 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 640922^352 * 3479293249 * 2^31 -.word 99649225 // zeta^272 * 2^31 = 640922^272 * 2^31 = 21310129 * 2^31 -.word 1221972471 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 640922^272 * 3479293249 * 2^31 -.word 188905087 // zeta^336 * 2^31 = 640922^336 * 2^31 = 103620826 * 2^31 -.word 1411200705 // zeta^336 * 
f(q^(-1) mod 2^32) * 2^31 = 640922^336 * 3479293249 * 2^31 -.word 194550161 // zeta^304 * 2^31 = 640922^304 * 2^31 = 64584977 * 2^31 -.word 1817634095 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 640922^304 * 3479293249 * 2^31 -.word 145905997 // zeta^368 * 2^31 = 640922^368 * 2^31 = 38250802 * 2^31 -.word 3758598771 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 640922^368 * 3479293249 * 2^31 -.word 138590473 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 360304567 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 157584713 // zeta^328 * 2^31 = 640922^328 * 2^31 = 2003863 * 2^31 -.word 2292951927 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 640922^328 * 3479293249 * 2^31 -.word 5007107 // zeta^296 * 2^31 = 640922^296 * 2^31 = 62017780 * 2^31 -.word 1224307005 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 640922^296 * 3479293249 * 2^31 -.word 100271655 // zeta^360 * 2^31 = 640922^360 * 2^31 = 5838692 * 2^31 -.word 1724921113 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 640922^360 * 3479293249 * 2^31 -.word 75670795 // zeta^280 * 2^31 = 640922^280 * 2^31 = 52516759 * 2^31 -.word 3464448309 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 640922^280 * 3479293249 * 2^31 -.word 32467413 // zeta^344 * 2^31 = 640922^344 * 2^31 = 42747918 * 2^31 -.word 1975619563 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 640922^344 * 3479293249 * 2^31 -.word 126051989 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 2231935787 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 114200421 // zeta^376 * 2^31 = 640922^376 * 2^31 = 3413455 * 2^31 -.word 3698636379 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 640922^376 * 3479293249 * 2^31 -.word 132619977 // zeta^260 * 2^31 = 640922^260 * 2^31 = 90622009 * 2^31 -.word 3537943031 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 640922^260 * 3479293249 * 2^31 -.word 196713961 // zeta^324 * 2^31 = 640922^324 * 2^31 = 91262931 * 2^31 -.word 639783639 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 
640922^324 * 3479293249 * 2^31 -.word 33427309 // zeta^292 * 2^31 = 640922^292 * 2^31 = 49934500 * 2^31 -.word 282512467 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 640922^292 * 3479293249 * 2^31 -.word 70643149 // zeta^356 * 2^31 = 640922^356 * 2^31 = 53867734 * 2^31 -.word 3778311667 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 640922^356 * 3479293249 * 2^31 -.word 175385683 // zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1754220525 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 27295479 // zeta^340 * 2^31 = 640922^340 * 2^31 = 47497926 * 2^31 -.word 1006434377 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 640922^340 * 3479293249 * 2^31 -.word 49927989 // zeta^308 * 2^31 = 640922^308 * 2^31 = 77055722 * 2^31 -.word 3341150859 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 640922^308 * 3479293249 * 2^31 -.word 2055493 // zeta^372 * 2^31 = 640922^372 * 2^31 = 50966558 * 2^31 -.word 3090419835 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 640922^372 * 3479293249 * 2^31 -.word 136764787 // zeta^268 * 2^31 = 640922^268 * 2^31 = 68224789 * 2^31 -.word 3912871629 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 640922^268 * 3479293249 * 2^31 -.word 79323581 // zeta^332 * 2^31 = 640922^332 * 2^31 = 50872297 * 2^31 -.word 2966148611 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 640922^332 * 3479293249 * 2^31 -.word 176475821 // zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 3359243539 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 167337095 // zeta^364 * 2^31 = 640922^364 * 2^31 = 46352028 * 2^31 -.word 2423285433 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 640922^364 * 3479293249 * 2^31 -.word 77554327 // zeta^284 * 2^31 = 640922^284 * 2^31 = 21021472 * 2^31 -.word 3177057449 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 640922^284 * 3479293249 * 2^31 -.word 140545971 // zeta^348 * 2^31 = 640922^348 * 2^31 = 52518740 * 2^31 -.word 1553345677 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 640922^348 * 3479293249 * 
2^31 -.word 52612251 // zeta^316 * 2^31 = 640922^316 * 2^31 = 30560095 * 2^31 -.word 2056711589 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 640922^316 * 3479293249 * 2^31 -.word 6563799 // zeta^380 * 2^31 = 640922^380 * 2^31 = 14133277 * 2^31 -.word 3241391977 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 640922^380 * 3479293249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_scale -ntt_384_u32_108643009_640922_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 117231189 // 1/96 -.word 3747646315 // 1/96 twisted -.data -roots: -.word 40973033 /// zeta^256 * 2^31 = 640922^256 * 2^31 = 40973033 * 2^31 -.word 809890293 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 640922^256 * 3479293249 * 2^31 -.word 67669975 /// zeta^128 * 2^31 = 640922^128 * 2^31 = 67669975 * 2^31 -.word 1337593335 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 640922^128 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 87045076 // zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 
2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 5022183 // zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 210808 // zeta^264 * 2^31 = 640922^264 * 2^31 = 210808 * 2^31 -.word 4166920 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 82308834 // zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 98874168 // zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 640922^ 0 * 2^31 = 1 * 2^31 -.word 20 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 0 * 3479293249 * 2^31 -.word 87045076 // XX: zeta^288 * 2^31 = 640922^288 * 2^31 = 87045076 * 2^31 -.word 1720569773 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 640922^288 * 3479293249 * 2^31 -.word 5022183 // XX: zeta^144 * 2^31 = 640922^144 * 2^31 = 5022183 * 2^31 -.word 99270592 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 640922^144 * 3479293249 * 2^31 -.word 82308834 // XX: zeta^ 48 * 2^31 = 640922^ 48 * 2^31 = 82308834 * 2^31 -.word 1626951211 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 48 * 3479293249 * 2^31 -.word 210808 // XX: zeta^264 * 2^31 = 640922^264 
* 2^31 = 210808 * 2^31 -.word 4166920 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 640922^264 * 3479293249 * 2^31 -.word 102804317 // XX: zeta^168 * 2^31 = 640922^168 * 2^31 = 102804317 * 2^31 -.word 2032073593 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 640922^168 * 3479293249 * 2^31 -.word 98874168 // XX: zeta^ 24 * 2^31 = 640922^ 24 * 2^31 = 98874168 * 2^31 -.word 1954388607 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 24 * 3479293249 * 2^31 -.word 94353491 // XX: zeta^312 * 2^31 = 640922^312 * 2^31 = 94353491 * 2^31 -.word 1865030994 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 640922^312 * 3479293249 * 2^31 -.word 17380078 // XX: zeta^132 * 2^31 = 640922^132 * 2^31 = 17380078 * 2^31 -.word 343541970 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 640922^132 * 3479293249 * 2^31 -.word 3933234 // XX: zeta^ 36 * 2^31 = 640922^ 36 * 2^31 = 3933234 * 2^31 -.word 77745966 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 36 * 3479293249 * 2^31 -.word 74622503 // XX: zeta^276 * 2^31 = 640922^276 * 2^31 = 74622503 * 2^31 -.word 1475019943 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 640922^276 * 3479293249 * 2^31 -.word 57676451 // XX: zeta^180 * 2^31 = 640922^180 * 2^31 = 57676451 * 2^31 -.word 1140057115 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 640922^180 * 3479293249 * 2^31 -.word 91290517 // XX: zeta^ 12 * 2^31 = 640922^ 12 * 2^31 = 91290517 * 2^31 -.word 1804486955 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 12 * 3479293249 * 2^31 -.word 102391393 // XX: zeta^300 * 2^31 = 640922^300 * 2^31 = 102391393 * 2^31 -.word 2023911563 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 640922^300 * 3479293249 * 2^31 -.word 56124269 // XX: zeta^156 * 2^31 = 640922^156 * 2^31 = 56124269 * 2^31 -.word 1109376029 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 640922^156 * 3479293249 * 2^31 -.word 92216191 // XX: zeta^ 60 * 2^31 = 640922^ 60 * 2^31 = 92216191 * 2^31 -.word 1822784218 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 640922^ 60 * 3479293249 * 2^31 -.text -.align 4 -roots_addr: 
.word roots -.syntax unified -.type ntt_384_u32_108643009_640922_incomplete_good_oop_half_input, %function -.global ntt_384_u32_108643009_640922_incomplete_good_oop_half_input -ntt_384_u32_108643009_640922_incomplete_good_oop_half_input: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -108643009 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q0 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r6 -// input[4]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 4)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r9 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r11,#(-496)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(16)] -// Release input[0] from Q1 -// Release input[128] from Q0 -// input[4]: Already loaded as Q7 -// input[132]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmulh.s32 Q1, Q7, r6 -// input[8]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 8)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -// Release input[132] from Q6 -// Release input[4] from Q7 -// input[136]: Already loaded as Q4 -// input[8]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[140]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 
-// Release input[136] from Q4 -vstrw.u32 Q2, [r11,#(48)] -vsub.s32 Q4, Q1, Q5 -// Release input[8] from Q5 -vstrw.u32 Q4, [r11,#(-464)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(32)] -// input[140]: Already loaded as Q6 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vqrdmulh.s32 Q1, Q6, r6 -// input[16]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-448)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release input[12] from Q3 -// Release input[140] from Q6 -// input[16]: Already loaded as Q7 -// input[144]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmulh.s32 Q1, Q7, r6 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(64)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(80)] -// Release input[144] from Q5 -// Release input[16] from Q7 -// input[148]: Already loaded as Q4 -// input[20]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[148] from Q4 -vstrw.u32 Q2, [r11,#(96)] -vsub.s32 Q4, Q1, Q6 -// Release input[20] from Q6 -vstrw.u32 Q4, [r11,#(-416)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(80)] -// input[152]: Already loaded as Q5 -// input[24]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[156]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmulh.s32 Q1, Q5, r6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q3, Q0 
-vstrw.u32 Q1, [r11,#(-400)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(112)] -// Release input[24] from Q3 -// Release input[152] from Q5 -// input[28]: Already loaded as Q7 -// input[156]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmulh.s32 Q1, Q7, r6 -// input[32]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 32)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(112)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release input[156] from Q6 -// Release input[28] from Q7 -// input[160]: Already loaded as Q4 -// input[32]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[160] from Q4 -vstrw.u32 Q2, [r11,#(144)] -vsub.s32 Q4, Q1, Q5 -// Release input[32] from Q5 -vstrw.u32 Q4, [r11,#(-368)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(128)] -// input[164]: Already loaded as Q6 -// input[36]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vqrdmulh.s32 Q1, Q6, r6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-352)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(160)] -// Release input[36] from Q3 -// Release input[164] from Q6 -// input[40]: Already loaded as Q7 -// input[168]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmulh.s32 Q1, Q7, r6 -// input[44]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 44)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(160)] -vsub.s32 Q3, 
Q3, Q0 -vstrw.u32 Q3, [r11,#(176)] -// Release input[168] from Q5 -// Release input[40] from Q7 -// input[172]: Already loaded as Q4 -// input[44]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[176]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[172] from Q4 -vstrw.u32 Q2, [r11,#(192)] -vsub.s32 Q4, Q1, Q6 -// Release input[44] from Q6 -vstrw.u32 Q4, [r11,#(-320)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(176)] -// input[176]: Already loaded as Q5 -// input[48]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[180]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 52)] -vqrdmulh.s32 Q1, Q5, r6 -// input[52]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 52)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-304)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(208)] -// Release input[48] from Q3 -// Release input[176] from Q5 -// input[52]: Already loaded as Q7 -// input[180]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[184]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmulh.s32 Q1, Q7, r6 -// input[56]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 56)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(208)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(224)] -// Release input[180] from Q6 -// Release input[52] from Q7 -// input[184]: Already loaded as Q4 -// input[56]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[188]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[184] from Q4 -vstrw.u32 Q2, [r11,#(240)] -vsub.s32 Q4, Q1, Q5 -// Release input[56] from Q5 
-vstrw.u32 Q4, [r11,#(-272)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(224)] -// input[188]: Already loaded as Q6 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -vqrdmulh.s32 Q1, Q6, r6 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(240)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-256)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release input[60] from Q3 -// Release input[188] from Q6 -// input[64]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -vneg.s32 Q1, Q5 -// input[68]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 68)] -vqrdmulh.s32 Q2, Q5, r6 -vstrw.u32 Q5, [r11,#(-240)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(256)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(272)] -// Release input[64] from Q5 -// input[68]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -vneg.s32 Q1, Q3 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmulh.s32 Q2, Q3, r6 -vstrw.u32 Q3, [r11,#(288)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(272)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-224)] -// Release input[68] from Q3 -// input[72]: Already loaded as Q4 -vstrw.u32 Q4, [r1,#(288)] -vstrw.u32 Q4, [r11,#(304)] -vstrw.u32 Q4, [r11,#(-208)] -// Release input[72] from Q4 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-192)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(304)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(320)] -// Release input[76] from Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(336)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(320)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-176)] -// Release input[80] from Q4 -// input[84]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(336)] -vstrw.u32 Q3, 
[r11,#(352)] -vstrw.u32 Q3, [r11,#(-160)] -// Release input[84] from Q3 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-144)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(352)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(368)] -// Release input[88] from Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(384)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(368)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-128)] -// Release input[92] from Q4 -// input[96]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(384)] -vstrw.u32 Q3, [r11,#(400)] -vstrw.u32 Q3, [r11,#(-112)] -// Release input[96] from Q3 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-96)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(400)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release input[100] from Q0 -// input[104]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(432)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(416)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-80)] -// Release input[104] from Q4 -// input[108]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(432)] -vstrw.u32 Q3, [r11,#(448)] -vstrw.u32 Q3, [r11,#(-64)] -// Release input[108] from Q3 -// input[112]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 112)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(-48)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(448)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(464)] -// Release input[112] 
from Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -vneg.s32 Q1, Q4 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmulh.s32 Q2, Q4, r6 -vstrw.u32 Q4, [r11,#(480)] -vmla.s32 Q0, Q2, r9 -vstrw.u32 Q0, [r1,#(464)] -vsub.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-32)] -// Release input[116] from Q4 -// input[120]: Already loaded as Q3 -vstrw.u32 Q3, [r1,#(480)] -vstrw.u32 Q3, [r11,#(496)] -vstrw.u32 Q3, [r11,#(-16)] -// Release input[120] from Q3 -// input[124]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 124)] -vmul.u32 Q1, Q0, r7 -vneg.s32 Q2, Q0 -vqrdmulh.s32 Q3, Q0, r6 -vstrw.u32 Q0, [r11,#(0)] -vmla.s32 Q1, Q3, r9 -vstrw.u32 Q1, [r1,#(496)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r10,#(-496)] -// Release input[124] from Q0 -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] 
-vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 
-vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release output[216] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[60]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release output[312] from Q6
-vstrw.u32 Q3, [r1,#(96)]
-// Release output[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-144)]
-// output[60]: Already loaded as Q7
-// output[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release output[252] from Q5
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// output[352]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[348] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[160]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -92)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release output[60] from Q7
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(384)]
-// output[160]: Already loaded as Q6
-// output[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release output[352] from Q5
-// output[256]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[100]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[64] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[292]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release output[160] from Q6
-vstrw.u32 Q3, [r11,#(16)]
-// Release output[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(256)]
-// output[292]: Already loaded as Q7
-// output[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[196]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release output[100] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[232]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release output[196] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[40]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release output[292] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// output[40]: Already loaded as Q6
-// output[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release output[232] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[364]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[328] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[172]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -80)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release output[40] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(304)]
-// output[172]: Already loaded as Q7
-// output[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release output[364] from Q5
-// output[268]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[76] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[304]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release output[172] from Q7
-vstrw.u32 Q3, [r11,#(64)]
-// Release output[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(304)]
-// output[304]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[208]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[244]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release output[208] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[52]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release output[304] from Q6
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-176)]
-// output[52]: Already loaded as Q7
-// output[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[340]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[244] from Q5
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// output[376]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[340] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[184]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -68)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release output[52] from Q7
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(352)]
-// output[184]: Already loaded as Q6
-// output[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[88]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release output[376] from Q5
-// output[280]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[88] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[316]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release output[184] from Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release output[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(352)]
-// output[316]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[220]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[28]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[224]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release output[220] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[32]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 32)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[316] from Q7
-vstrw.u32 Q3, [r1,#(112)]
-// Release output[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-128)]
-// output[32]: Already loaded as Q6
-// output[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release output[224] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[356]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[320] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[164]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -88)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release output[32] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(272)]
-// output[164]: Already loaded as Q7
-// output[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release output[356] from Q5
-// output[260]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[104]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[68] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[296]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release output[164] from Q7
-vstrw.u32 Q3, [r11,#(32)]
-// Release output[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(272)]
-// output[296]: Already loaded as Q6
-// output[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[200]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release output[104] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[236]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release output[200] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[44]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release output[296] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-208)]
-// output[44]: Already loaded as Q7
-// output[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[332]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[236] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[368]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[332] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[176]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -76)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release output[44] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(320)]
-// output[176]: Already loaded as Q6
-// output[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release output[368] from Q5
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[80] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[308]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release output[176] from Q6
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(320)]
-// output[308]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[212]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[248]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release output[212] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[56]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release output[308] from Q7
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-160)]
-// output[56]: Already loaded as Q6
-// output[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[248] from Q5
-// output[152]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// output[380]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[344] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[188]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release output[56] from Q6
-vstrw.u32 Q3, [r11,#(-400)]
-// Release output[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(368)]
-// output[188]: Already loaded as Q7
-// output[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[92]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release output[380] from Q5
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// output[24]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release output[92] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[264]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[188] from Q7
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r10,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(368)]
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r7
-// output[144]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r6
-// output[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r9
-vmul.u32 Q2, Q1, r7
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r6
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r9
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r3
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r9
-// output[156]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -96)]
-vmul.u32 Q4, Q6, r5
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r4
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(96)]
-// Release output[24] from Q5
-vmla.s32 Q4, Q6, r9
-vstrw.u32 Q1, [r11,#(-432)]
-// Release output[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(48)]
-// Release output[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[12]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 12)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[132]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[280]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-384)]
-// Release output[156] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(48)]
-// Release output[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[136]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-480)]
-// Release output[132] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[256]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[28]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(112)]
-// Release output[280] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release output[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(16)]
-// Release output[256] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[4]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 4)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[152]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(112)]
-// Release output[28] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[8]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 8)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(16)]
-// Release output[4] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[128]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[284]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-400)]
-// Release output[152] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(32)]
-// Release output[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[140]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-496)]
-// Release output[128] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[260]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(128)]
-// Release output[284] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release output[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 36)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[184]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(240)]
-// Release output[60] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release output[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release output[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[304]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[40]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 40)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(144)]
-// Release output[36] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[316]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release output[184] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release output[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(160)]
-// Release output[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-368)]
-// Release output[160] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[292]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 40)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[56]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 56)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(256)]
-// Release output[316] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[176]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[296]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(160)]
-// Release output[292] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[32]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 32)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(224)]
-// Release output[56] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release output[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release output[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[308]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[44]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 44)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(128)]
-// Release output[32] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release output[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(176)]
-// Release output[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[336]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[72]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 72)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-352)]
-// Release output[164] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[192]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release output[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(288)]
-// Release output[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[84]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[204]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-240)]
-// Release output[192] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[324]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[88]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 88)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(384)]
-// Release output[348] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(336)]
-// Release output[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release output[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[88]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[208]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -44)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[328]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 76)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(288)]
-// Release output[324] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[220]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(352)]
-// Release output[88] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-176)]
-// Release output[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(304)]
-// Release output[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[220]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[340]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[76]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 76)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(256)]
-// Release output[64] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[196]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -56)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-128)]
-// Release output[220] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(352)]
-// Release output[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(304)]
-// Release output[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[344]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[80]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 80)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[200]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -52)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-224)]
-// Release output[196] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[320]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 68)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[92]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 92)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(368)]
-// Release output[344] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(320)]
-// Release output[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-208)]
-// Release output[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[92]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[212]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -40)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[332]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 80)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(272)]
-// Release output[320] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[120]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 120)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r1,#(368)]
-// Release output[92] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-160)]
-// Release output[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(320)]
-// Release output[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[120]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[240]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -12)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[360]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 108)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(272)]
-// Release output[68] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[96]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 96)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[252]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 0)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(480)]
-// Release output[120] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-48)]
-// Release output[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(432)]
-// Release output[360] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[252]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[372]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 120)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[108]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 108)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(384)]
-// Release output[96] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[228]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -24)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[376]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(0)]
-// Release output[252] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(480)]
-// Release output[372] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(432)]
-// Release output[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[376]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[112]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 112)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[232]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -20)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-96)]
-// Release output[228] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[352]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 100)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[124]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(496)]
-// Release output[376] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(448)]
-// Release output[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-80)]
-// Release output[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[124]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[244]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -8)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[364]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 112)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(400)]
-// Release output[352] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[100]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 100)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[248]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -4)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(496)]
-// Release output[124] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-32)]
-// Release output[244] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(448)]
-// Release output[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[248]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[368]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 116)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[104]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 104)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(400)]
-// Release output[100] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[224]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -28)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[380]: Load as Q0
-vldrw.u32 Q0, [r10, #(4 * -124)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-16)]
-// Release output[248] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(464)]
-// Release output[368] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(416)]
-// Release output[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[380]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[116]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 116)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[236]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -16)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-112)]
-// Release output[224] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[356]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 104)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-vmul.u32 Q1, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r10,#(-496)]
-// Release output[380] from Q0
-vmla.s32 Q1, Q4, r9
-vstrw.u32 Q3, [r1,#(464)]
-// Release output[116] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r11,#(-64)]
-// Release output[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(416)]
-// Release output[356] from Q2
-ldrd r7, r6, [r8], #+8
-// output[132]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -120)]
-vmul.u32 Q1, Q0, r7
-// output[0]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 0)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vmla.s32 Q1, Q0, r9
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r11,#(-480)]
-// Release output[132] from Q0
-vadd.s32 Q2, Q2, Q1
-// output[4]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[256]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 4)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[260]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 8)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(0)]
-// Release output[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[260]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[128]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -124)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(16)]
-// Release output[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32
Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// 
Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 
-vadd.s32 Q2, Q2, Q0
-// output[292]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[160]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -92)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[164]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -88)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(144)]
-// Release output[288] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(160)]
-// Release output[292] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[164]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[32]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 32)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[300]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 48)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-368)]
-// Release output[160] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-352)]
-// Release output[164] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[300]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[168]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -84)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(128)]
-// Release output[32] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(192)]
-// Release output[300] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[172]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[40]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 40)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[44]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 44)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r11,#(-336)]
-// Release output[168] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[44]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[296]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 44)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[180]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -72)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r1,#(160)]
-// Release output[40] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r1,#(176)]
-// Release output[44] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r7, r6, [r8], #+8
-// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 
Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// 
output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as 
Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 
Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 
-// output[116]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 116)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(-48)]
-// Release output[240] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(-32)]
-// Release output[244] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[116]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[368]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 116)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[252]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 0)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(448)]
-// Release output[112] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(464)]
-// Release output[116] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r7, r6, [r8], #+8
-// output[252]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[120]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 120)]
-vqrdmulh.s32 Q3, Q3, r6
-// output[124]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 124)]
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(464)]
-// Release output[368] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r11,#(0)]
-// Release output[252] from Q3
-vadd.s32 Q1, Q1, Q0
-// output[124]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-// output[376]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 124)]
-vqrdmulh.s32 Q4, Q4, r6
-// output[380]: Load as Q3
-vldrw.u32 Q3, [r10, #(4 * -124)]
-vmla.s32 Q0, Q4, r9
-vstrw.u32 Q1, [r1,#(480)]
-// Release output[120] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r1,#(496)]
-// Release output[124] from Q4
-vadd.s32 Q2, Q2, Q0
-// output[380]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[248]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q3, Q3, r6
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(496)]
-// Release output[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r10,#(-496)]
-// Release output[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-16)]
-// Release output[248] from Q1
-.equ modulus_inv, 815674047
-movw r14, #:lower16:modulus_inv
-movt r14, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3042
-// Instruction count: 2201
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good.s b/tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good.s
deleted file mode 100644
index e32f3e8..0000000
--- a/tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good.s
+++ /dev/null
@@ -1,3383 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -.global ntt_384_u32_33556993_15047299_incomplete_good_twiddles -ntt_384_u32_33556993_15047299_incomplete_good_twiddles: // For base multiplication -.word 11579973 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 1431437243 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 49092333 // zeta^ 64 * 2^31 = 15047299^ 64 * 2^31 = 8518432 * 2^31 -.word 3982764819 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 64 * 375649793 * 2^31 -.word 42761787 // zeta^ 32 * 2^31 = 15047299^ 32 * 2^31 = 13841461 * 2^31 -.word 425054149 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 32 * 375649793 * 2^31 -.word 34538439 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31 -.word 5947961 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31 -.word 66309139 // zeta^ 16 * 2^31 = 15047299^ 16 * 2^31 = 940305 * 2^31 -.word 681112045 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 16 * 375649793 * 2^31 -.word 28356919 // zeta^ 80 * 2^31 = 15047299^ 80 * 2^31 = 4200632 * 2^31 -.word 4055856329 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 80 * 375649793 * 2^31 -.word 59288659 // zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 3771109805 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 7716537 // zeta^112 * 2^31 = 15047299^112 * 2^31 = 24511972 * 2^31 -.word 851016519 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 15047299^112 * 375649793 * 2^31 -.word 46836875 // zeta^ 8 * 2^31 = 15047299^ 8 * 2^31 = 24111745 * 2^31 -.word 2410070389 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 8 * 375649793 * 2^31 -.word 27581675 // zeta^ 72 * 2^31 = 15047299^ 72 * 2^31 = 26823146 * 2^31 -.word 4046475541 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 72 * 375649793 * 2^31 -.word 9436047 // zeta^ 40 * 2^31 = 15047299^ 40 * 2^31 = 33038085 * 2^31 -.word 292002417 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 40 * 375649793 * 2^31 -.word 17776663 // zeta^104 
* 2^31 = 15047299^104 * 2^31 = 12390669 * 2^31 -.word 2490738153 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 15047299^104 * 375649793 * 2^31 -.word 11879829 // zeta^ 24 * 2^31 = 15047299^ 24 * 2^31 = 14745691 * 2^31 -.word 399412331 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 24 * 375649793 * 2^31 -.word 60844951 // zeta^ 88 * 2^31 = 15047299^ 88 * 2^31 = 32562828 * 2^31 -.word 1066891881 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 88 * 375649793 * 2^31 -.word 24769191 // zeta^ 56 * 2^31 = 15047299^ 56 * 2^31 = 20448273 * 2^31 -.word 2663682905 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 56 * 375649793 * 2^31 -.word 8635069 // zeta^120 * 2^31 = 15047299^120 * 2^31 = 20044445 * 2^31 -.word 3577978691 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 15047299^120 * 375649793 * 2^31 -.word 16277701 // zeta^ 4 * 2^31 = 15047299^ 4 * 2^31 = 22098973 * 2^31 -.word 4158345531 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 4 * 375649793 * 2^31 -.word 7436455 // zeta^ 68 * 2^31 = 15047299^ 68 * 2^31 = 8970055 * 2^31 -.word 4077456729 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 68 * 375649793 * 2^31 -.word 9212309 // zeta^ 36 * 2^31 = 15047299^ 36 * 2^31 = 14626653 * 2^31 -.word 3505340523 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 36 * 375649793 * 2^31 -.word 23812275 // zeta^100 * 2^31 = 15047299^100 * 2^31 = 7111893 * 2^31 -.word 673162573 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 15047299^100 * 375649793 * 2^31 -.word 55105631 // zeta^ 20 * 2^31 = 15047299^ 20 * 2^31 = 9575431 * 2^31 -.word 638508449 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 20 * 375649793 * 2^31 -.word 63845407 // zeta^ 84 * 2^31 = 15047299^ 84 * 2^31 = 3819232 * 2^31 -.word 69140961 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 84 * 375649793 * 2^31 -.word 45155211 // zeta^ 52 * 2^31 = 15047299^ 52 * 2^31 = 13583150 * 2^31 -.word 2468768373 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 52 * 375649793 * 2^31 -.word 31892597 // zeta^116 * 2^31 = 
15047299^116 * 2^31 = 10311346 * 2^31 -.word 3033656715 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 15047299^116 * 375649793 * 2^31 -.word 44632483 // zeta^ 12 * 2^31 = 15047299^ 12 * 2^31 = 21289485 * 2^31 -.word 3523957853 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 12 * 375649793 * 2^31 -.word 20599243 // zeta^ 76 * 2^31 = 15047299^ 76 * 2^31 = 33421816 * 2^31 -.word 3769343029 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 76 * 375649793 * 2^31 -.word 34994515 // zeta^ 44 * 2^31 = 15047299^ 44 * 2^31 = 30222420 * 2^31 -.word 1396393133 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 44 * 375649793 * 2^31 -.word 50418895 // zeta^108 * 2^31 = 15047299^108 * 2^31 = 23642097 * 2^31 -.word 527614257 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 15047299^108 * 375649793 * 2^31 -.word 26517879 // zeta^ 28 * 2^31 = 15047299^ 28 * 2^31 = 17233810 * 2^31 -.word 2151548041 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 28 * 375649793 * 2^31 -.word 5031613 // zeta^ 92 * 2^31 = 15047299^ 92 * 2^31 = 6280499 * 2^31 -.word 530750275 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 92 * 375649793 * 2^31 -.word 67003163 // zeta^ 60 * 2^31 = 15047299^ 60 * 2^31 = 16204162 * 2^31 -.word 3813976805 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 60 * 375649793 * 2^31 -.word 20694533 // zeta^124 * 2^31 = 15047299^124 * 2^31 = 12410931 * 2^31 -.word 2358012923 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 15047299^124 * 375649793 * 2^31 -.word 3955367 // zeta^128 * 2^31 = 15047299^128 * 2^31 = 8518431 * 2^31 -.word 2551327577 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 15047299^128 * 375649793 * 2^31 -.word 55534013 // zeta^192 * 2^31 = 15047299^192 * 2^31 = 33556992 * 2^31 -.word 2863530051 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 15047299^192 * 375649793 * 2^31 -.word 25333645 // zeta^160 * 2^31 = 15047299^160 * 2^31 = 2013241 * 2^31 -.word 3875861107 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 15047299^160 * 375649793 * 2^31 -.word 24352199 // zeta^224 * 2^31 = 
15047299^224 * 2^31 = 19715532 * 2^31 -.word 3869913145 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 15047299^224 * 375649793 * 2^31 -.word 62718759 // zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 3374744281 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 804847 // zeta^208 * 2^31 = 15047299^208 * 2^31 = 32616688 * 2^31 -.word 3613855249 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 15047299^208 * 375649793 * 2^31 -.word 49098857 // zeta^176 * 2^31 = 15047299^176 * 2^31 = 9932396 * 2^31 -.word 1374874007 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 15047299^176 * 375649793 * 2^31 -.word 7825327 // zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31 -.word 523857489 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31 -.word 14301793 // zeta^136 * 2^31 = 15047299^136 * 2^31 = 2711401 * 2^31 -.word 1636405151 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 15047299^136 * 375649793 * 2^31 -.word 20277111 // zeta^200 * 2^31 = 15047299^200 * 2^31 = 9445248 * 2^31 -.word 1884896905 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 15047299^200 * 375649793 * 2^31 -.word 41897609 // zeta^168 * 2^31 = 15047299^168 * 2^31 = 12909577 * 2^31 -.word 2198735735 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 15047299^168 * 375649793 * 2^31 -.word 57677939 // zeta^232 * 2^31 = 15047299^232 * 2^31 = 518908 * 2^31 -.word 4002964877 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 15047299^232 * 375649793 * 2^31 -.word 15408129 // zeta^152 * 2^31 = 15047299^152 * 2^31 = 17817137 * 2^31 -.word 667479551 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 15047299^152 * 375649793 * 2^31 -.word 55234157 // zeta^216 * 2^31 = 15047299^216 * 2^31 = 18811302 * 2^31 -.word 3895554963 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 15047299^216 * 375649793 * 2^31 -.word 17422871 // zeta^184 * 2^31 = 15047299^184 * 2^31 = 33153165 * 2^31 -.word 914295785 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 15047299^184 * 375649793 * 2^31 -.word 42344795 // zeta^248 * 2^31 = 
15047299^248 * 2^31 = 13108720 * 2^31 -.word 1631284389 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 15047299^248 * 375649793 * 2^31 -.word 24715747 // zeta^132 * 2^31 = 15047299^132 * 2^31 = 20428075 * 2^31 -.word 4214078493 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 15047299^132 * 375649793 * 2^31 -.word 50836285 // zeta^196 * 2^31 = 15047299^196 * 2^31 = 11458020 * 2^31 -.word 136621763 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 15047299^196 * 375649793 * 2^31 -.word 48156959 // zeta^164 * 2^31 = 15047299^164 * 2^31 = 26042233 * 2^31 -.word 1462789345 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 15047299^164 * 375649793 * 2^31 -.word 57901677 // zeta^228 * 2^31 = 15047299^228 * 2^31 = 18930340 * 2^31 -.word 789626771 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 15047299^228 * 375649793 * 2^31 -.word 42296769 // zeta^148 * 2^31 = 15047299^148 * 2^31 = 27800794 * 2^31 -.word 3725599807 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 15047299^148 * 375649793 * 2^31 -.word 12008355 // zeta^212 * 2^31 = 15047299^212 * 2^31 = 23981562 * 2^31 -.word 3656458845 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 15047299^212 * 375649793 * 2^31 -.word 20294379 // zeta^180 * 2^31 = 15047299^180 * 2^31 = 30285189 * 2^31 -.word 564888341 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 15047299^180 * 375649793 * 2^31 -.word 21958775 // zeta^244 * 2^31 = 15047299^244 * 2^31 = 19973843 * 2^31 -.word 1826198921 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 15047299^244 * 375649793 * 2^31 -.word 9523753 // zeta^140 * 2^31 = 15047299^140 * 2^31 = 12132331 * 2^31 -.word 245385175 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 15047299^140 * 375649793 * 2^31 -.word 22481503 // zeta^204 * 2^31 = 15047299^204 * 2^31 = 12267508 * 2^31 -.word 771009441 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 15047299^204 * 375649793 * 2^31 -.word 48981373 // zeta^172 * 2^31 = 15047299^172 * 2^31 = 26976670 * 2^31 -.word 3426188419 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 15047299^172 * 375649793 * 2^31 -.word 32119471 // zeta^236 * 2^31 = 
15047299^236 * 2^31 = 3334573 * 2^31 -.word 2898574161 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 15047299^236 * 375649793 * 2^31 -.word 12070727 // zeta^156 * 2^31 = 15047299^156 * 2^31 = 22603682 * 2^31 -.word 2674169529 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 15047299^156 * 375649793 * 2^31 -.word 40596107 // zeta^220 * 2^31 = 15047299^220 * 2^31 = 16323183 * 2^31 -.word 2143419253 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 15047299^220 * 375649793 * 2^31 -.word 54362349 // zeta^188 * 2^31 = 15047299^188 * 2^31 = 29763762 * 2^31 -.word 2839003411 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 15047299^188 * 375649793 * 2^31 -.word 110823 // zeta^252 * 2^31 = 15047299^252 * 2^31 = 17352831 * 2^31 -.word 480990489 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 15047299^252 * 375649793 * 2^31 -.word 18021653 // zeta^256 * 2^31 = 15047299^256 * 2^31 = 25038561 * 2^31 -.word 312202475 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 15047299^256 * 375649793 * 2^31 -.word 63158619 // zeta^320 * 2^31 = 15047299^320 * 2^31 = 25038562 * 2^31 -.word 1743639717 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 15047299^320 * 375649793 * 2^31 -.word 32575547 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 4289019333 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 41780341 // zeta^352 * 2^31 = 15047299^352 * 2^31 = 31543752 * 2^31 -.word 419106187 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 15047299^352 * 375649793 * 2^31 -.word 38757067 // zeta^272 * 2^31 = 15047299^272 * 2^31 = 29356361 * 2^31 -.word 239110965 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 15047299^272 * 375649793 * 2^31 -.word 4395227 // zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31 -.word 920223013 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31 -.word 59397449 // zeta^304 * 2^31 = 15047299^304 * 2^31 = 9045021 * 2^31 -.word 3443950775 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 15047299^304 * 375649793 * 2^31 -.word 18015129 // zeta^368 * 2^31 = 
15047299^368 * 2^31 = 23624597 * 2^31 -.word 2920093287 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 15047299^368 * 375649793 * 2^31 -.word 39532311 // zeta^264 * 2^31 = 15047299^264 * 2^31 = 6733847 * 2^31 -.word 248491753 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 15047299^264 * 375649793 * 2^31 -.word 52812193 // zeta^328 * 2^31 = 15047299^328 * 2^31 = 30845592 * 2^31 -.word 2658562143 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 15047299^328 * 375649793 * 2^31 -.word 49337323 // zeta^296 * 2^31 = 15047299^296 * 2^31 = 21166324 * 2^31 -.word 1804229141 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 15047299^296 * 375649793 * 2^31 -.word 25216377 // zeta^360 * 2^31 = 15047299^360 * 2^31 = 20647416 * 2^31 -.word 2096231559 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 15047299^360 * 375649793 * 2^31 -.word 6269035 // zeta^280 * 2^31 = 15047299^280 * 2^31 = 994165 * 2^31 -.word 3228075413 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 15047299^280 * 375649793 * 2^31 -.word 51705857 // zeta^344 * 2^31 = 15047299^344 * 2^31 = 15739856 * 2^31 -.word 3627487743 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 15047299^344 * 375649793 * 2^31 -.word 58478917 // zeta^312 * 2^31 = 15047299^312 * 2^31 = 13512548 * 2^31 -.word 716988603 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 15047299^312 * 375649793 * 2^31 -.word 49691115 // zeta^376 * 2^31 = 15047299^376 * 2^31 = 403828 * 2^31 -.word 3380671509 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 15047299^376 * 375649793 * 2^31 -.word 59677531 // zeta^260 * 2^31 = 15047299^260 * 2^31 = 24586938 * 2^31 -.word 217510565 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 15047299^260 * 375649793 * 2^31 -.word 42398239 // zeta^324 * 2^31 = 15047299^324 * 2^31 = 13128918 * 2^31 -.word 80888801 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 15047299^324 * 375649793 * 2^31 -.word 43301711 // zeta^292 * 2^31 = 15047299^292 * 2^31 = 26445100 * 2^31 -.word 3621804721 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 15047299^292 * 375649793 * 2^31 -.word 18957027 // zeta^356 * 2^31 = 
15047299^356 * 2^31 = 7514760 * 2^31 -.word 2832177949 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 15047299^356 * 375649793 * 2^31 -.word 3268579 // zeta^276 * 2^31 = 15047299^276 * 2^31 = 29737761 * 2^31 -.word 4225826333 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 15047299^276 * 375649793 * 2^31 -.word 24817217 // zeta^340 * 2^31 = 15047299^340 * 2^31 = 5756199 * 2^31 -.word 569367487 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 15047299^340 * 375649793 * 2^31 -.word 35221389 // zeta^308 * 2^31 = 15047299^308 * 2^31 = 23245647 * 2^31 -.word 1261310579 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 15047299^308 * 375649793 * 2^31 -.word 46819607 // zeta^372 * 2^31 = 15047299^372 * 2^31 = 3271804 * 2^31 -.word 3730078953 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 15047299^372 * 375649793 * 2^31 -.word 46514743 // zeta^268 * 2^31 = 15047299^268 * 2^31 = 135177 * 2^31 -.word 525624265 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 15047299^268 * 375649793 * 2^31 -.word 57590233 // zeta^332 * 2^31 = 15047299^332 * 2^31 = 21424662 * 2^31 -.word 4049582119 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 15047299^332 * 375649793 * 2^31 -.word 16695091 // zeta^300 * 2^31 = 15047299^300 * 2^31 = 9914896 * 2^31 -.word 3767353037 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 15047299^300 * 375649793 * 2^31 -.word 18132613 // zeta^364 * 2^31 = 15047299^364 * 2^31 = 6580323 * 2^31 -.word 868778875 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 15047299^364 * 375649793 * 2^31 -.word 62082373 // zeta^284 * 2^31 = 15047299^284 * 2^31 = 27276494 * 2^31 -.word 3764217019 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 15047299^284 * 375649793 * 2^31 -.word 55043259 // zeta^348 * 2^31 = 15047299^348 * 2^31 = 10953311 * 2^31 -.word 1620797765 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 15047299^348 * 375649793 * 2^31 -.word 46419453 // zeta^316 * 2^31 = 15047299^316 * 2^31 = 21146062 * 2^31 -.word 1936954371 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 15047299^316 * 375649793 * 2^31 -.word 12751637 // zeta^380 * 2^31 = 
15047299^380 * 2^31 = 3793231 * 2^31 -.word 1455963883 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 15047299^380 * 375649793 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_33556993_15047299_incomplete_good_scale -ntt_384_u32_33556993_15047299_incomplete_good_scale: // Constants for scaling by 1/N -.word 11579973 // 1/96 -.word 1431437243 // 1/96 twisted -.data -roots: -.word 66384763 /// zeta^256 * 2^31 = 15047299^256 * 2^31 = 25038561 * 2^31 -.word 3749829253 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 15047299^256 * 375649793 * 2^31 -.word 893127 /// zeta^128 * 2^31 = 15047299^128 * 2^31 = 8518431 * 2^31 -.word 2692621625 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 15047299^128 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 29095681 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 29095681 // zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 3280343807 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 14476917 // zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 
15047299^144 * 375649793 * 2^31 -.word 43317805 // zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 14476917 // zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 2356128651 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 18598075 // zeta^264 * 2^31 = 15047299^264 * 2^31 = 6733847 * 2^31 -.word 2578416965 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 15047299^264 * 375649793 * 2^31 -.word 4885007 // zeta^168 * 2^31 = 15047299^168 * 2^31 = 12909577 * 2^31 -.word 2973633521 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 15047299^168 * 375649793 * 2^31 -.word 43317805 // zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 933021651 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 64683161 // zeta^ 24 * 2^31 = 15047299^ 24 * 2^31 = 14745691 * 2^31 -.word 3091135847 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 24 * 375649793 * 2^31 -.word 34427601 // zeta^312 * 2^31 = 15047299^312 * 2^31 = 13512548 * 2^31 -.word 864737071 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 15047299^312 * 375649793 * 2^31 -.word 33393089 // XX: zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31 -.word 2147483711 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31 -.word 29095681 // XX: zeta^288 * 2^31 = 15047299^288 * 2^31 = 17702291 * 2^31 -.word 3280343807 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 15047299^288 * 375649793 * 2^31 -.word 14476917 // XX: zeta^144 * 2^31 = 15047299^144 * 2^31 = 3260327 * 2^31 -.word 2356128651 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 15047299^144 * 375649793 * 2^31 -.word 43317805 // XX: zeta^ 48 * 2^31 = 15047299^ 48 * 2^31 = 14579576 * 2^31 -.word 933021651 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 48 * 375649793 * 2^31 -.word 18598075 // XX: zeta^264 * 2^31 = 15047299^264 * 2^31 = 6733847 * 2^31 -.word 2578416965 /// zeta^264 * f(q^(-1) mod 2^32) 
* 2^31 = 15047299^264 * 375649793 * 2^31 -.word 4885007 // XX: zeta^168 * 2^31 = 15047299^168 * 2^31 = 12909577 * 2^31 -.word 2973633521 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 15047299^168 * 375649793 * 2^31 -.word 64683161 // XX: zeta^ 24 * 2^31 = 15047299^ 24 * 2^31 = 14745691 * 2^31 -.word 3091135847 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 24 * 375649793 * 2^31 -.word 34427601 // XX: zeta^312 * 2^31 = 15047299^312 * 2^31 = 13512548 * 2^31 -.word 864737071 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 15047299^312 * 375649793 * 2^31 -.word 39999747 // XX: zeta^132 * 2^31 = 15047299^132 * 2^31 = 20428075 * 2^31 -.word 3454780669 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 15047299^132 * 375649793 * 2^31 -.word 45317587 // XX: zeta^ 36 * 2^31 = 15047299^ 36 * 2^31 = 14626653 * 2^31 -.word 3083517997 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 36 * 375649793 * 2^31 -.word 48811299 // XX: zeta^276 * 2^31 = 15047299^276 * 2^31 = 29737761 * 2^31 -.word 4050555101 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 15047299^276 * 375649793 * 2^31 -.word 54571669 // XX: zeta^180 * 2^31 = 15047299^180 * 2^31 = 30285189 * 2^31 -.word 4085587819 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 15047299^180 * 375649793 * 2^31 -.word 59281651 // XX: zeta^ 12 * 2^31 = 15047299^ 12 * 2^31 = 21289485 * 2^31 -.word 3509906701 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 12 * 375649793 * 2^31 -.word 40500013 // XX: zeta^300 * 2^31 = 15047299^300 * 2^31 = 9914896 * 2^31 -.word 634504915 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 15047299^300 * 375649793 * 2^31 -.word 25917637 // XX: zeta^156 * 2^31 = 15047299^156 * 2^31 = 22603682 * 2^31 -.word 1446525243 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 15047299^156 * 375649793 * 2^31 -.word 8356523 // XX: zeta^ 60 * 2^31 = 15047299^ 60 * 2^31 = 16204162 * 2^31 -.word 1036987221 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 60 * 375649793 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type 
ntt_384_u32_33556993_15047299_incomplete_good, %function -.global ntt_384_u32_33556993_15047299_incomplete_good -ntt_384_u32_33556993_15047299_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 33556993 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r8 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 
-vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, 
#(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// 
Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 
Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as 
Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vmul.u32 Q2, Q0, r8 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 
-vstrw.u32 Q3, [r14,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[364]: Already 
loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vmul.u32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vmul.u32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vmul.u32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vmul.u32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vqrdmulh.s32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vmul.u32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vqrdmlah.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vqrdmulh.s32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vmul.u32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vqrdmlah.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vqrdmulh.s32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[28]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[152]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[284]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q0, Q0, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vmul.u32 Q1, Q1, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vmul.u32 Q0, Q0, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q1, Q1, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[348]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release input[348] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[88]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q0, Q0, r8
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(352)]
-// Release input[88] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[220]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[344]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q2, Q2, r8
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[92]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(368)]
-// Release input[344] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[92]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vmul.u32 Q0, Q0, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(368)]
-// Release input[92] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmul.u32 Q1, Q1, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// 
input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vmul.u32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vqrdmulh.s32 Q5, 
Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmul.u32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 
-vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, 
[r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vmul.u32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vmul.u32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vmul.u32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release 
input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vmul.u32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vmul.u32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vmul.u32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// 
Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vmul.u32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 
-vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], 
#+8 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vmul.u32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 
-vqrdmulh.s32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vmul.u32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, 
r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vmul.u32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// 
input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, 
[r14, #(4 * -4)]
-vmul.u32 Q3, Q3, r8
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(496)]
-// Release input[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-16)]
-// Release input[248] from Q1
-.equ modulus_inv, 3919317503
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3350
-// Instruction count: 2395
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s b/tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
deleted file mode 100644
index ebd349d..0000000
--- a/tests/intmulntt/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s
+++ /dev/null
@@ -1,3182 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-///
-
-.data
-roots:
-.word 893127 /// zeta^128 * 2^31 = 15047299^128 * 2^31 = 8518431 * 2^31
-.word 2692621625 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 15047299^128 * 375649793 * 2^31
-.word 66384763 /// zeta^256 * 2^31 = 15047299^256 * 2^31 = 25038561 * 2^31
-.word 3749829253 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 15047299^256 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 33393089 // zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 38018305 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 38018305 // zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 23796181 // zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31
-.word 52637069 // zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31
-.word 23796181 // zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31
-.word 3361945643 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31
-.word 32686385 // zeta^120 * 2^31 = 15047299^120 * 2^31 = 20044445 * 2^31
-.word 3430230223 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 15047299^120 * 375649793 * 2^31
-.word 2430825 // zeta^216 * 2^31 = 15047299^216 * 2^31 = 18811302 * 2^31
-.word 1203831447 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 15047299^216 * 375649793 * 2^31
-.word 52637069 // zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31
-.word 1938838643 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31
-.word 62228979 // zeta^360 * 2^31 = 15047299^360 * 2^31 = 20647416 * 2^31
-.word 1321333773 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 15047299^360 * 375649793 * 2^31
-.word 48515911 // zeta^ 72 * 2^31 = 15047299^ 72 * 2^31 = 26823146 * 2^31
-.word 1716550329 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 72 * 375649793 * 2^31
-.word 33393089 // XX: zeta^ 0 * 2^31 = 15047299^ 0 * 2^31 = 1 * 2^31
-.word 2147483711 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 0 * 375649793 * 2^31
-.word 38018305 // XX: zeta^ 96 * 2^31 = 15047299^ 96 * 2^31 = 15854702 * 2^31
-.word 1014623487 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 96 * 375649793 * 2^31
-.word 23796181 // XX: zeta^240 * 2^31 = 15047299^240 * 2^31 = 18977417 * 2^31
-.word 3361945643 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 15047299^240 * 375649793 * 2^31
-.word 52637069 // XX: zeta^336 * 2^31 = 15047299^336 * 2^31 = 30296666 * 2^31
-.word 1938838643 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 15047299^336 * 375649793 * 2^31
-.word 32686385 // XX: zeta^120 * 2^31 = 15047299^120 * 2^31 = 20044445 * 2^31
-.word 3430230223 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 15047299^120 * 375649793 * 2^31
-.word 2430825 // XX: zeta^216 * 2^31 = 15047299^216 * 2^31 = 18811302 * 2^31
-.word 1203831447 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 15047299^216 * 375649793 * 2^31
-.word 62228979 // XX: zeta^360 * 2^31 = 15047299^360 * 2^31 = 20647416 * 2^31
-.word 1321333773 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 15047299^360 * 375649793 * 2^31
-.word 48515911 // XX: zeta^ 72 * 2^31 = 15047299^ 72 * 2^31 = 26823146 * 2^31
-.word 1716550329 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 72 * 375649793 * 2^31
-.word 58757463 // XX: zeta^252 * 2^31 = 15047299^252 * 2^31 = 17352831 * 2^31
-.word 3257980073 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 15047299^252 * 375649793 * 2^31
-.word 41196349 // XX: zeta^348 * 2^31 = 15047299^348 * 2^31 = 10953311 * 2^31
-.word 2848442051 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 15047299^348 * 375649793 * 2^31
-.word 26613973 // XX: zeta^108 * 2^31 = 15047299^108 * 2^31 = 23642097 * 2^31
-.word 3660462379 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 15047299^108 * 375649793 * 2^31
-.word 7832335 // XX: zeta^204 * 2^31 = 15047299^204 * 2^31 = 12267508 * 2^31
-.word 785060593 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 15047299^204 * 375649793 * 2^31
-.word 12542317 // XX: zeta^372 * 2^31 = 15047299^372 * 2^31 = 3271804 * 2^31
-.word 209379475 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 15047299^372 * 375649793 * 2^31
-.word 18302687 // XX: zeta^ 84 * 2^31 = 15047299^ 84 * 2^31 = 3819232 * 2^31
-.word 244412193 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 15047299^ 84 * 375649793 * 2^31
-.word 21796399 // XX: zeta^228 * 2^31 = 15047299^228 * 2^31 = 18930340 * 2^31
-.word 1211449297 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 15047299^228 * 375649793 * 2^31
-.word 27114239 // XX: zeta^324 * 2^31 = 15047299^324 * 2^31 = 13128918 * 2^31
-.word 840186625 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 15047299^324 * 375649793 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_33556993_15047299_incomplete_good_bitrev, %function -.global ntt_384_u32_33556993_15047299_incomplete_good_bitrev -ntt_384_u32_33556993_15047299_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 33556993 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vmul.u32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 
-// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// 
input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] 
-vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release 
input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, 
#(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 
Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, 
[r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] 
-vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, 
[r14,#(192)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[316]: Already loaded as Q5 -// 
input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[204]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[72]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(48)] -vadd.s32 Q1, Q1, Q0 -// Release input[264] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 
-vstrw.u32 Q5, [r14,#(-480)] -// input[72]: Already loaded as Q7 -// input[204]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[204] from Q6 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vsub.s32 Q4, Q3, Q2 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q7 -// Release input[72] from Q7 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[300]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q6, Q6, Q5 -// Release input[300] from Q5 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[360]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[360]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// 
Release input[228] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(432)] -vadd.s32 Q3, Q3, Q7 -// Release input[360] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-96)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q3, Q3, Q2 -// Release input[276] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[216]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(96)] -// input[216]: Already loaded as Q7 -// input[348]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[348] from Q5 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-144)] -vadd.s32 Q3, Q3, Q7 -// Release input[216] from Q7 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[312]: Already loaded as Q6 -// 
input[60]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q6, Q6, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[252]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release input[312] from Q6 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[120]: Already loaded as Q7 -// input[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vadd.s32 Q7, Q7, Q5 -// Release input[252] from Q5 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vsub.s32 Q4, Q3, Q2 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q3, Q3, Q2 -// Release input[372] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -// input[136]: Already loaded as Q6 -// input[268]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[268] from Q5 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[328]: 
Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[328]: Already loaded as Q7 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[196] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[40]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(304)] -vadd.s32 Q3, Q3, Q7 -// Release input[328] from Q7 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -// input[40]: Already loaded as Q6 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[292] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[232]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -20)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release input[40] from Q6 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -// input[232]: Already loaded as Q7 -// input[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// 
input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[364] from Q5 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[280]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-80)] -vadd.s32 Q3, Q3, Q7 -// Release input[232] from Q7 -vstrw.u32 Q3, [r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[280]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[88]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(112)] -vadd.s32 Q3, Q3, Q6 -// Release input[280] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[88]: Already loaded as Q7 -// input[220]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release input[220] from Q5 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vsub.s32 Q4, Q3, Q2 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q3, Q3, Q2 -// Release input[340] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[184]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r11 
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vmul.u32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[52]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 52)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(480)]
-// Release input[372] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-96)]
-// Release input[228] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[52]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q0, Q0, r8
-// input[292]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[244]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -8)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(208)]
-// Release input[52] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(160)]
-// Release input[292] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[244]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[100]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 100)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(16)]
-// Release input[4] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-32)]
-// Release input[244] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(400)]
-// Release input[100] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[308]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[164]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[260]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[116]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 116)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(224)]
-// Release input[308] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-352)]
-// Release input[164] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[116]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vmul.u32 Q0, Q0, r8
-// input[356]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 104)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(32)]
-// Release input[260] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[60]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(464)]
-// Release input[116] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(416)]
-// Release input[356] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[60]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vmul.u32 Q1, Q1, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[12]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 12)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(240)]
-// Release input[60] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[348]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 96)]
-vmul.u32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(48)]
-// Release input[12] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[204]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -48)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[316]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(384)]
-// Release input[348] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[316]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vmul.u32 Q0, Q0, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-192)]
-// Release input[204] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[268]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 16)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(256)]
-// Release input[316] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[220]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -32)]
-vmul.u32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(64)]
-// Release input[268] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[188]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-128)]
-// Release input[220] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[188]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vmul.u32 Q2, Q2, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(304)]
-// Release input[76] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[140]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -112)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-256)]
-// Release input[188] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[92]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 92)]
-vmul.u32 Q0, Q0, r8
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-448)]
-// Release input[140] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vqrdmlah.s32 Q1, Q4, r11
-vstrw.u32 Q3, [r0,#(368)]
-// Release input[92] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(320)]
-// Release input[332] from Q2
-ldrd r9, r8, [r10], #+8
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q1, Q0, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vmul.u32 Q0, Q0, r8
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vadd.s32 Q2, Q2, Q1
-// input[64]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vmul.u32 Q3, Q3, r8
-// input[320]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 68)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[320]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q4, Q4, r8
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(272)]
-// Release input[320] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[96]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[288]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 36)]
-vmul.u32 Q3, Q3, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[352]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vmul.u32 Q4, Q4, r8
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(144)]
-// Release input[288] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[224]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vmul.u32 Q3, Q3, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[336]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vmul.u32 Q4, Q4, r8
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(336)]
-// Release input[336] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[208]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vmul.u32 Q3, Q3, r8
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[80]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vmul.u32 Q4, Q4, r8
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(320)]
-// Release input[80] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[240]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[48]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 48)]
-vmul.u32 Q3, Q3, r8
-// input[112]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 112)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[112]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[304]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 52)]
-vmul.u32 Q4, Q4, r8
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(192)]
-// Release input[48] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(448)]
-// Release input[112] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[368]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[176]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -76)]
-vmul.u32 Q3, Q3, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(208)]
-// Release input[304] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[72]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[264]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 12)]
-vmul.u32 Q4, Q4, r8
-// input[328]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 76)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-304)]
-// Release input[176] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[328]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[136]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -116)]
-vmul.u32 Q3, Q3, r8
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(48)]
-// Release input[264] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(304)]
-// Release input[328] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[200]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[8]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 8)]
-vmul.u32 Q4, Q4, r8
-// input[360]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 108)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-464)]
-// Release input[136] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[360]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[168]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -84)]
-vmul.u32 Q3, Q3, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(32)]
-// Release input[8] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(432)]
-// Release input[360] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[232]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[40]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 40)]
-vmul.u32 Q4, Q4, r8
-// input[104]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 104)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(-336)]
-// Release input[168] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[104]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[296]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 44)]
-vmul.u32 Q3, Q3, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(160)]
-// Release input[40] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(416)]
-// Release input[104] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[216]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[24]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 24)]
-vmul.u32 Q4, Q4, r8
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(176)]
-// Release input[296] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(-144)]
-// Release input[216] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[88]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[280]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 28)]
-vmul.u32 Q3, Q3, r8
-// input[344]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 92)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(96)]
-// Release input[24] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[344]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vmul.u32 Q4, Q4, r8
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(112)]
-// Release input[280] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(368)]
-// Release input[344] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[120]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q3, Q3, r8
-// input[376]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 124)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] 
-// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release 
input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 
-vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], 
#+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 
-vqrdmulh.s32 Q0, Q3, r9
-// input[188]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -64)]
-vmul.u32 Q3, Q3, r8
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(256)]
-// Release input[316] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r12,#(-496)]
-// Release input[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-256)]
-// Release input[188] from Q1
-.equ modulus_inv, 3919317503
-movw r9, #:lower16:modulus_inv
-movt r9, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3150
-// Instruction count: 2196
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good.s b/tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good.s
deleted file mode 100644
index 6275846..0000000
--- a/tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good.s
+++ /dev/null
@@ -1,3383 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_45387457_923104_incomplete_good_twiddles -ntt_384_u32_45387457_923104_incomplete_good_twiddles: // For base multiplication -.word 69606647 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 685157961 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 62904337 // zeta^ 64 * 2^31 = 923104^ 64 * 2^31 = 18186381 * 2^31 -.word 1812533935 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 64 * 450429249 * 2^31 -.word 48768409 // zeta^ 32 * 2^31 = 923104^ 32 * 2^31 = 16376451 * 2^31 -.word 4063746855 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 32 * 450429249 * 2^31 -.word 30855129 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 2025087207 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 5368717 // zeta^ 16 * 2^31 = 923104^ 16 * 2^31 = 6955156 * 2^31 -.word 2630001715 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 16 * 450429249 * 2^31 -.word 35344777 // zeta^ 80 * 2^31 = 923104^ 80 * 2^31 = 38478475 * 2^31 -.word 1419625271 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 80 * 450429249 * 2^31 -.word 34054097 // zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 3259472623 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 35946385 // zeta^112 * 2^31 = 923104^112 * 2^31 = 16261595 * 2^31 -.word 1951599407 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 923104^112 * 450429249 * 2^31 -.word 54446789 // zeta^ 8 * 2^31 = 923104^ 8 * 2^31 = 16877098 * 2^31 -.word 189185787 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 8 * 450429249 * 2^31 -.word 39834949 // 
zeta^ 72 * 2^31 = 923104^ 72 * 2^31 = 21015440 * 2^31 -.word 1012438139 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 72 * 450429249 * 2^31 -.word 46558923 // zeta^ 40 * 2^31 = 923104^ 40 * 2^31 = 3630241 * 2^31 -.word 4246475637 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 40 * 450429249 * 2^31 -.word 81626031 // zeta^104 * 2^31 = 923104^104 * 2^31 = 33283422 * 2^31 -.word 2162614673 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 923104^104 * 450429249 * 2^31 -.word 66297913 // zeta^ 24 * 2^31 = 923104^ 24 * 2^31 = 38013065 * 2^31 -.word 1899174023 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 24 * 450429249 * 2^31 -.word 39269057 // zeta^ 88 * 2^31 = 923104^ 88 * 2^31 = 33248211 * 2^31 -.word 1896897535 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 88 * 450429249 * 2^31 -.word 90210255 // zeta^ 56 * 2^31 = 923104^ 56 * 2^31 = 31693324 * 2^31 -.word 3850943857 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 56 * 450429249 * 2^31 -.word 80761913 // zeta^120 * 2^31 = 923104^120 * 2^31 = 20563366 * 2^31 -.word 3363136647 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 923104^120 * 450429249 * 2^31 -.word 13759071 // zeta^ 4 * 2^31 = 923104^ 4 * 2^31 = 923104 * 2^31 -.word 3761772257 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 4 * 450429249 * 2^31 -.word 30402329 // zeta^ 68 * 2^31 = 923104^ 68 * 2^31 = 8451464 * 2^31 -.word 1277049255 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 68 * 450429249 * 2^31 -.word 83384231 // zeta^ 36 * 2^31 = 923104^ 36 * 2^31 = 12508371 * 2^31 -.word 2181765017 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 36 * 450429249 * 2^31 -.word 2847179 // zeta^100 * 2^31 = 923104^100 * 2^31 = 20823894 * 2^31 -.word 3061010549 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 923104^100 * 450429249 * 2^31 -.word 73095195 // zeta^ 20 * 2^31 = 923104^ 20 * 2^31 = 4206832 * 2^31 -.word 479430181 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 20 * 450429249 * 2^31 -.word 86175901 // zeta^ 84 * 2^31 = 923104^ 84 * 2^31 = 375141 * 2^31 
-.word 2820360995 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 84 * 450429249 * 2^31 -.word 75051431 // zeta^ 52 * 2^31 = 923104^ 52 * 2^31 = 37944787 * 2^31 -.word 1467596185 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 52 * 450429249 * 2^31 -.word 72003281 // zeta^116 * 2^31 = 923104^116 * 2^31 = 13574899 * 2^31 -.word 892455919 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 923104^116 * 450429249 * 2^31 -.word 21266821 // zeta^ 12 * 2^31 = 923104^ 12 * 2^31 = 26669485 * 2^31 -.word 492607547 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 12 * 450429249 * 2^31 -.word 17786721 // zeta^ 76 * 2^31 = 923104^ 76 * 2^31 = 20629734 * 2^31 -.word 813064031 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 76 * 450429249 * 2^31 -.word 28787439 // zeta^ 44 * 2^31 = 923104^ 44 * 2^31 = 43262840 * 2^31 -.word 3600683601 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 44 * 450429249 * 2^31 -.word 9793529 // zeta^108 * 2^31 = 923104^108 * 2^31 = 19489792 * 2^31 -.word 277715143 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 923104^108 * 450429249 * 2^31 -.word 11700093 // zeta^ 28 * 2^31 = 923104^ 28 * 2^31 = 16210463 * 2^31 -.word 2502892611 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 28 * 450429249 * 2^31 -.word 50248023 // zeta^ 92 * 2^31 = 923104^ 92 * 2^31 = 13494060 * 2^31 -.word 1306171881 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 92 * 450429249 * 2^31 -.word 35962109 // zeta^ 60 * 2^31 = 923104^ 60 * 2^31 = 24024980 * 2^31 -.word 1803159235 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 60 * 450429249 * 2^31 -.word 68955489 // zeta^124 * 2^31 = 923104^124 * 2^31 = 1591696 * 2^31 -.word 2272401759 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 923104^124 * 450429249 * 2^31 -.word 38685147 // zeta^128 * 2^31 = 923104^128 * 2^31 = 18186380 * 2^31 -.word 1127375973 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 923104^128 * 450429249 * 2^31 -.word 21168267 // zeta^192 * 2^31 = 923104^192 * 2^31 = 45387456 * 2^31 -.word 3609809333 // zeta^192 * f(q^(-1) mod 
2^32) * 2^31 = 923104^192 * 450429249 * 2^31 -.word 27474177 // zeta^160 * 2^31 = 923104^160 * 2^31 = 43749424 * 2^31 -.word 2256307647 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 923104^160 * 450429249 * 2^31 -.word 42006505 // zeta^224 * 2^31 = 923104^224 * 2^31 = 29011006 * 2^31 -.word 231220439 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 923104^224 * 450429249 * 2^31 -.word 75363517 // zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3084590851 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 85406197 // zeta^208 * 2^31 = 923104^208 * 2^31 = 38432301 * 2^31 -.word 1664965579 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 923104^208 * 450429249 * 2^31 -.word 47279745 // zeta^176 * 2^31 = 923104^176 * 2^31 = 21308336 * 2^31 -.word 2987094079 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 923104^176 * 450429249 * 2^31 -.word 56720817 // zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 1035494671 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 30775617 // zeta^136 * 2^31 = 923104^136 * 2^31 = 4138342 * 2^31 -.word 823252351 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 923104^136 * 450429249 * 2^31 -.word 36328125 // zeta^200 * 2^31 = 923104^200 * 2^31 = 28510359 * 2^31 -.word 4105781507 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 923104^200 * 450429249 * 2^31 -.word 80454565 // zeta^168 * 2^31 = 923104^168 * 2^31 = 29653181 * 2^31 -.word 2211106331 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 923104^168 * 450429249 * 2^31 -.word 44215991 // zeta^232 * 2^31 = 923104^232 * 2^31 = 41757216 * 2^31 -.word 48491657 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 923104^232 * 450429249 * 2^31 -.word 18358601 // zeta^152 * 2^31 = 923104^152 * 2^31 = 40622603 * 2^31 -.word 4292690807 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 923104^152 * 450429249 * 2^31 -.word 24477001 // zeta^216 * 2^31 = 923104^216 * 2^31 = 7374392 * 2^31 -.word 2395793271 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 923104^216 * 450429249 * 2^31 
-.word 35939115 // zeta^184 * 2^31 = 923104^184 * 2^31 = 34257499 * 2^31 -.word 3807160085 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 923104^184 * 450429249 * 2^31 -.word 564659 // zeta^248 * 2^31 = 923104^248 * 2^31 = 13694133 * 2^31 -.word 444023437 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 923104^248 * 450429249 * 2^31 -.word 62030715 // zeta^132 * 2^31 = 923104^132 * 2^31 = 7528360 * 2^31 -.word 1810244293 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 923104^132 * 450429249 * 2^31 -.word 77015843 // zeta^196 * 2^31 = 923104^196 * 2^31 = 44464353 * 2^31 -.word 533195037 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 923104^196 * 450429249 * 2^31 -.word 55625319 // zeta^164 * 2^31 = 923104^164 * 2^31 = 8315523 * 2^31 -.word 879245529 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 923104^164 * 450429249 * 2^31 -.word 7390683 // zeta^228 * 2^31 = 923104^228 * 2^31 = 32879086 * 2^31 -.word 2113202277 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 923104^228 * 450429249 * 2^31 -.word 58468163 // zeta^148 * 2^31 = 923104^148 * 2^31 = 41555766 * 2^31 -.word 2340930813 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 923104^148 * 450429249 * 2^31 -.word 17679719 // zeta^212 * 2^31 = 923104^212 * 2^31 = 41180625 * 2^31 -.word 3815537113 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 923104^212 * 450429249 * 2^31 -.word 42339307 // zeta^180 * 2^31 = 923104^180 * 2^31 = 21017569 * 2^31 -.word 3719827029 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 923104^180 * 450429249 * 2^31 -.word 15723483 // zeta^244 * 2^31 = 923104^244 * 2^31 = 7442670 * 2^31 -.word 2827371109 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 923104^244 * 450429249 * 2^31 -.word 41907357 // zeta^140 * 2^31 = 923104^140 * 2^31 = 39347706 * 2^31 -.word 320456483 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 923104^140 * 450429249 * 2^31 -.word 69508093 // zeta^204 * 2^31 = 923104^204 * 2^31 = 18717972 * 2^31 -.word 3802359747 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 923104^204 * 450429249 * 2^31 -.word 26393547 // zeta^172 * 2^31 = 923104^172 * 
2^31 = 21614409 * 2^31 -.word 971998837 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 923104^172 * 450429249 * 2^31 -.word 61987475 // zeta^236 * 2^31 = 923104^236 * 2^31 = 2124617 * 2^31 -.word 694283693 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 923104^236 * 450429249 * 2^31 -.word 83935387 // zeta^156 * 2^31 = 923104^156 * 2^31 = 42671054 * 2^31 -.word 3098246565 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 923104^156 * 450429249 * 2^31 -.word 79074821 // zeta^220 * 2^31 = 923104^220 * 2^31 = 29176994 * 2^31 -.word 1792074683 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 923104^220 * 450429249 * 2^31 -.word 78380837 // zeta^188 * 2^31 = 923104^188 * 2^31 = 22954173 * 2^31 -.word 469242523 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 923104^188 * 450429249 * 2^31 -.word 54812805 // zeta^252 * 2^31 = 923104^252 * 2^31 = 21362477 * 2^31 -.word 2491808059 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 923104^252 * 450429249 * 2^31 -.word 27870577 // zeta^256 * 2^31 = 923104^256 * 2^31 = 27201076 * 2^31 -.word 2482433359 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 923104^256 * 450429249 * 2^31 -.word 52089767 // zeta^320 * 2^31 = 923104^320 * 2^31 = 27201077 * 2^31 -.word 3167591321 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 923104^320 * 450429249 * 2^31 -.word 59919785 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 2269880087 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 63300737 // zeta^352 * 2^31 = 923104^352 * 2^31 = 1638033 * 2^31 -.word 2038659647 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 923104^352 * 450429249 * 2^31 -.word 55430137 // zeta^272 * 2^31 = 923104^272 * 2^31 = 6908982 * 2^31 -.word 2875342023 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 923104^272 * 450429249 * 2^31 -.word 15411397 // zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 1210376443 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 54828529 // zeta^304 * 2^31 = 923104^304 * 2^31 = 29125862 * 2^31 -.word 2343367887 // 
zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 923104^304 * 450429249 * 2^31 -.word 43495169 // zeta^368 * 2^31 = 923104^368 * 2^31 = 24079121 * 2^31 -.word 1307873215 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 923104^368 * 450429249 * 2^31 -.word 50939965 // zeta^264 * 2^31 = 923104^264 * 2^31 = 24372017 * 2^31 -.word 3282529155 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 923104^264 * 450429249 * 2^31 -.word 59999297 // zeta^328 * 2^31 = 923104^328 * 2^31 = 41249115 * 2^31 -.word 3471714943 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 923104^328 * 450429249 * 2^31 -.word 9148883 // zeta^296 * 2^31 = 923104^296 * 2^31 = 12104035 * 2^31 -.word 2132352621 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 923104^296 * 450429249 * 2^31 -.word 10320349 // zeta^360 * 2^31 = 923104^360 * 2^31 = 15734276 * 2^31 -.word 2083860963 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 923104^360 * 450429249 * 2^31 -.word 51505857 // zeta^280 * 2^31 = 923104^280 * 2^31 = 12139246 * 2^31 -.word 2398069759 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 923104^280 * 450429249 * 2^31 -.word 72416313 // zeta^344 * 2^31 = 923104^344 * 2^31 = 4764854 * 2^31 -.word 2276487 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 923104^344 * 450429249 * 2^31 -.word 10013001 // zeta^312 * 2^31 = 923104^312 * 2^31 = 24824091 * 2^31 -.word 931830647 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 923104^312 * 450429249 * 2^31 -.word 54835799 // zeta^376 * 2^31 = 923104^376 * 2^31 = 11129958 * 2^31 -.word 487807209 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 923104^376 * 450429249 * 2^31 -.word 60372585 // zeta^260 * 2^31 = 923104^260 * 2^31 = 36935993 * 2^31 -.word 3017918039 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 923104^260 * 450429249 * 2^31 -.word 28744199 // zeta^324 * 2^31 = 923104^324 * 2^31 = 37859097 * 2^31 -.word 2484723001 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 923104^324 * 450429249 * 2^31 -.word 87927735 // zeta^292 * 2^31 = 923104^292 * 2^31 = 24563563 * 2^31 -.word 1233956745 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 923104^292 
* 450429249 * 2^31 -.word 35149595 // zeta^356 * 2^31 = 923104^356 * 2^31 = 37071934 * 2^31 -.word 3415721765 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 923104^356 * 450429249 * 2^31 -.word 4599013 // zeta^276 * 2^31 = 923104^276 * 2^31 = 45012316 * 2^31 -.word 1474606299 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 923104^276 * 450429249 * 2^31 -.word 32306751 // zeta^340 * 2^31 = 923104^340 * 2^31 = 3831691 * 2^31 -.word 1954036481 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 923104^340 * 450429249 * 2^31 -.word 18771633 // zeta^308 * 2^31 = 923104^308 * 2^31 = 31812558 * 2^31 -.word 3402511375 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 923104^308 * 450429249 * 2^31 -.word 48435607 // zeta^372 * 2^31 = 923104^372 * 2^31 = 24369888 * 2^31 -.word 575140265 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 923104^372 * 450429249 * 2^31 -.word 72988193 // zeta^268 * 2^31 = 923104^268 * 2^31 = 24757723 * 2^31 -.word 3481903263 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 923104^268 * 450429249 * 2^31 -.word 48867557 // zeta^332 * 2^31 = 923104^332 * 2^31 = 6039751 * 2^31 -.word 3974510811 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 923104^332 * 450429249 * 2^31 -.word 80981385 // zeta^300 * 2^31 = 923104^300 * 2^31 = 25897665 * 2^31 -.word 4017252151 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 923104^300 * 450429249 * 2^31 -.word 64381367 // zeta^364 * 2^31 = 923104^364 * 2^31 = 23773048 * 2^31 -.word 3322968457 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 923104^364 * 450429249 * 2^31 -.word 40526891 // zeta^284 * 2^31 = 923104^284 * 2^31 = 31893397 * 2^31 -.word 2988795413 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 923104^284 * 450429249 * 2^31 -.word 6839527 // zeta^348 * 2^31 = 923104^348 * 2^31 = 2716403 * 2^31 -.word 1196720729 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 923104^348 * 450429249 * 2^31 -.word 21819425 // zeta^316 * 2^31 = 923104^316 * 2^31 = 43795761 * 2^31 -.word 2022565535 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 923104^316 * 450429249 * 2^31 -.word 12394077 // zeta^380 * 
2^31 = 923104^380 * 2^31 = 22433284 * 2^31 -.word 3825724771 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 923104^380 * 450429249 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_45387457_923104_incomplete_good_scale -ntt_384_u32_45387457_923104_incomplete_good_scale: // Constants for scaling by 1/N -.word 69606647 // 1/96 -.word 685157961 // 1/96 twisted -.data -roots: -.word 22090505 /// zeta^256 * 2^31 = 923104^256 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 923104^256 * 450429249 * 2^31 -.word 9023783 /// zeta^128 * 2^31 = 923104^128 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 923104^128 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 78782351 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 78782351 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 78782351 // zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 3597626801 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 88323005 // zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 84188761 
// zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 88323005 // zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3638992899 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 16804439 // zeta^264 * 2^31 = 923104^264 * 2^31 = 24372017 * 2^31 -.word 3300632809 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 923104^264 * 450429249 * 2^31 -.word 19157039 // zeta^168 * 2^31 = 923104^168 * 2^31 = 29653181 * 2^31 -.word 3550508305 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 923104^168 * 450429249 * 2^31 -.word 84188761 // zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 1908699751 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 65804887 // zeta^ 24 * 2^31 = 923104^ 24 * 2^31 = 38013065 * 2^31 -.word 3946051817 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 24 * 450429249 * 2^31 -.word 82969997 // zeta^312 * 2^31 = 923104^312 * 2^31 = 24824091 * 2^31 -.word 3322022451 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 923104^312 * 450429249 * 2^31 -.word 14273169 // XX: zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 78782351 // XX: zeta^288 * 2^31 = 923104^288 * 2^31 = 30649039 * 2^31 -.word 3597626801 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 923104^288 * 450429249 * 2^31 -.word 88323005 // XX: zeta^144 * 2^31 = 923104^144 * 2^31 = 31523319 * 2^31 -.word 3638992899 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 923104^144 * 450429249 * 2^31 -.word 84188761 // XX: zeta^ 48 * 2^31 = 923104^ 48 * 2^31 = 40340716 * 2^31 -.word 1908699751 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 48 * 450429249 * 2^31 -.word 16804439 // XX: zeta^264 * 2^31 = 923104^264 * 2^31 = 24372017 * 2^31 -.word 3300632809 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 923104^264 * 450429249 * 2^31 -.word 19157039 // XX: zeta^168 * 2^31 = 
923104^168 * 2^31 = 29653181 * 2^31 -.word 3550508305 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 923104^168 * 450429249 * 2^31 -.word 65804887 // XX: zeta^ 24 * 2^31 = 923104^ 24 * 2^31 = 38013065 * 2^31 -.word 3946051817 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 24 * 450429249 * 2^31 -.word 82969997 // XX: zeta^312 * 2^31 = 923104^312 * 2^31 = 24824091 * 2^31 -.word 3322022451 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 923104^312 * 450429249 * 2^31 -.word 66361593 // XX: zeta^132 * 2^31 = 923104^132 * 2^31 = 7528360 * 2^31 -.word 356200391 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 923104^132 * 450429249 * 2^31 -.word 80165521 // XX: zeta^ 36 * 2^31 = 923104^ 36 * 2^31 = 12508371 * 2^31 -.word 2739310639 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 36 * 450429249 * 2^31 -.word 88960289 // XX: zeta^276 * 2^31 = 923104^276 * 2^31 = 45012316 * 2^31 -.word 2129734047 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 923104^276 * 450429249 * 2^31 -.word 6563629 // XX: zeta^180 * 2^31 = 923104^180 * 2^31 = 21017569 * 2^31 -.word 3141918867 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 923104^180 * 450429249 * 2^31 -.word 482773 // XX: zeta^ 12 * 2^31 = 923104^ 12 * 2^31 = 26669485 * 2^31 -.word 3409336299 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 12 * 450429249 * 2^31 -.word 35973319 // XX: zeta^300 * 2^31 = 923104^300 * 2^31 = 25897665 * 2^31 -.word 3372818041 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 923104^300 * 450429249 * 2^31 -.word 11401659 // XX: zeta^156 * 2^31 = 923104^156 * 2^31 = 42671054 * 2^31 -.word 2018958469 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 923104^156 * 450429249 * 2^31 -.word 59173881 // XX: zeta^ 60 * 2^31 = 923104^ 60 * 2^31 = 24024980 * 2^31 -.word 1136729287 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 60 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_45387457_923104_incomplete_good, %function -.global ntt_384_u32_45387457_923104_incomplete_good 
-ntt_384_u32_45387457_923104_incomplete_good: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 45387457 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vmul.u32 Q3, Q0, r8 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r14,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, 
Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 
-vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: 
Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 
-vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vmul.u32 Q2, Q0, r8 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r14,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 
-vstrw.u32 Q3, [r14,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[340]: Already loaded as Q5 
-// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 
-vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release 
input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: 
Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -// input[96]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release input[96] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[228]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release input[192] from Q4 -vmul.u32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[36]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 36)] -vqrdmlah.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release input[288] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-240)] -// input[36]: Already loaded as Q7 -// input[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r5 -// input[324]: Load 
as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[228] from Q6 -// input[132]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// input[360]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release input[36] from Q7 -vstrw.u32 Q3, [r14,#(-480)] -// Release input[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[72]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release input[360] from Q5 -// input[264]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[72] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[300]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(48)] -// Release input[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(288)] -// input[300]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[204]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// input[240]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release input[204] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[48]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, 
[r14,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release input[300] from Q7 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-192)] -// input[48]: Already loaded as Q6 -// input[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release input[240] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[372]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[336] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[180]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release input[48] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(336)] -// input[180]: Already loaded as Q7 -// input[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[372] from Q5 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// input[120]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release input[180] from Q7 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[312]: Already loaded as Q6 -// input[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[216]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// 
Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vqrdmulh.s32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vmul.u32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vqrdmlah.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vqrdmulh.s32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vmul.u32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vqrdmlah.s32 Q0, Q5, r11
-vqrdmulh.s32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vmul.u32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vqrdmlah.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vqrdmulh.s32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vmul.u32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vqrdmulh.s32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vmul.u32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vqrdmlah.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vmul.u32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vmul.u32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[28]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 28)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(112)]
-// Release input[280] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-464)]
-// Release input[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[28]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vmul.u32 Q1, Q1, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[4]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 4)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[152]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -100)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(112)]
-// Release input[28] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[152]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vmul.u32 Q2, Q2, r8
-// input[8]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 8)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(16)]
-// Release input[4] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[284]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-400)]
-// Release input[152] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(32)]
-// Release input[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[284]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vmul.u32 Q0, Q0, r8
-// input[140]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -112)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(128)]
-// Release input[284] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-448)]
-// Release input[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vmul.u32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(32)]
-// Release input[260] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[60]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 60)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[60]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[180]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -72)]
-vmul.u32 Q2, Q2, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(144)]
-// Release input[288] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[36]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 36)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(240)]
-// Release input[60] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-288)]
-// Release input[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(192)]
-// Release input[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vmul.u32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(144)]
-// Release input[36] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[160]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -92)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[316]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[316]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[52]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 52)]
-vmul.u32 Q1, Q1, r8
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -80)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-368)]
-// Release input[160] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[292]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 40)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(256)]
-// Release input[316] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(208)]
-// Release input[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-320)]
-// Release input[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vmul.u32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(160)]
-// Release input[292] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[32]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[188]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -64)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[188]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[308]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 56)]
-vmul.u32 Q0, Q0, r8
-// input[44]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 44)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(128)]
-// Release input[32] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[216]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -36)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-256)]
-// Release input[188] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(224)]
-// Release input[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(176)]
-// Release input[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[216]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vmul.u32 Q1, Q1, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-352)]
-// Release input[164] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[192]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -60)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-144)]
-// Release input[216] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(288)]
-// Release input[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[348]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vmul.u32 Q2, Q2, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-240)]
-// Release input[192] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[324]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 72)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(384)]
-// Release input[348] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(336)]
-// Release input[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-192)]
-// Release input[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[88]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vmul.u32 Q0, Q0, r8
-// input[328]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 76)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(288)]
-// Release input[324] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[220]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -32)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(352)]
-// Release input[88] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(304)]
-// Release input[328] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[220]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[340]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 88)]
-vmul.u32 Q1, Q1, r8
-// input[76]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 76)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(256)]
-// Release input[64] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[196]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -56)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-128)]
-// Release input[220] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(352)]
-// Release input[340] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(304)]
-// Release input[76] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[344]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vmul.u32 Q2, Q2, r8
-// input[200]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -52)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(-224)]
-// Release input[196] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[320]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 68)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[92]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 92)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(368)]
-// Release input[344] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(-208)]
-// Release input[200] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[92]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[212]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -40)]
-vmul.u32 Q0, Q0, r8
-// input[332]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 80)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(272)]
-// Release input[320] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[120]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 120)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r0,#(368)]
-// Release input[92] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-160)]
-// Release input[212] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(320)]
-// Release input[332] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[120]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vmul.u32 Q1, Q1, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r0,#(272)]
-// Release input[68] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[96]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 96)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[252]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 0)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(480)]
-// Release input[120] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[252]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[372]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 120)]
-vmul.u32 Q2, Q2, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(384)]
-// Release input[96] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[228]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -24)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[376]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(0)]
-// Release input[252] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(480)]
-// Release input[372] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(432)]
-// Release input[108] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[376]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vmul.u32 Q0, Q0, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-96)]
-// Release input[228] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[352]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 100)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-// input[124]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(496)]
-// Release input[376] from Q0
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[124]: Already loaded as Q1
-vqrdmulh.s32 Q0, Q1, r9
-// input[244]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -8)]
-vmul.u32 Q1, Q1, r8
-// input[364]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 112)]
-vqrdmlah.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(400)]
-// Release input[352] from Q2
-vqrdmulh.s32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vqrdmlah.s32 Q2, Q3, r11
-// input[100]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 100)]
-vqrdmulh.s32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vmul.u32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vqrdmlah.s32 Q5, Q1, r11
-// input[248]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -4)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(496)]
-// Release input[124] from Q1
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-32)]
-// Release input[244] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(448)]
-// Release input[364] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[248]: Already loaded as Q2
-vqrdmulh.s32 Q1, Q2, r9
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vmul.u32 Q2, Q2, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmlah.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(400)]
-// Release input[100] from Q0
-vqrdmulh.s32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vqrdmlah.s32 Q0, Q3, r11
-// input[224]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -28)]
-vqrdmulh.s32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vmul.u32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vqrdmlah.s32 Q5, Q2, r11
-// input[380]: Load as Q0
-vldrw.u32 Q0, [r12, #(4 * -124)]
-vqrdmulh.s32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-16)]
-// Release input[248] from Q2
-vqrdmlah.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[380]: Already loaded as Q0
-vqrdmulh.s32 Q2, Q0, r9
-// input[116]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 116)]
-vmul.u32 Q0, Q0, r8
-// input[236]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -16)]
-vqrdmlah.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-112)]
-// Release input[224] from Q1
-vqrdmulh.s32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vmul.u32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vqrdmlah.s32 Q1, Q3, r11
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vqrdmulh.s32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vmul.u32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vqrdmlah.s32 Q5, Q0, r11
-vqrdmulh.s32 Q1, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vmul.u32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r12,#(-496)]
-// Release input[380] from Q0
-vqrdmlah.s32 Q1, Q4, r11
-vstrw.u32 Q3, [r0,#(464)]
-// Release input[116] from Q3
-vsub.s32 Q4, Q2, Q1
-vstrw.u32 Q4, [r14,#(-64)]
-// Release input[236] from Q4
-vadd.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(416)]
-// Release input[356] from Q2
-ldrd r9, r8, [r10], #+8
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vqrdmulh.s32 Q1, Q0, r9
-// input[0]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 0)]
-vmul.u32 Q0, Q0, r8
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vqrdmlah.s32 Q1, Q0, r11
-vsub.s32 Q0, Q2, Q1
-vstrw.u32 Q0, [r14,#(-480)]
-// Release input[132] from Q0
-vadd.s32 Q2, Q2, Q1
-// input[4]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[256]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 4)]
-vmul.u32 Q3, Q3, r8
-// input[260]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 8)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r0,#(0)]
-// Release input[0] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[260]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[128]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -124)]
-vmul.u32 Q4, Q4, r8
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(16)]
-// Release input[256] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(32)]
-// Release input[260] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[12]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[264]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 12)]
-vmul.u32 Q3, Q3, r8
-// input[268]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 16)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-496)]
-// Release input[128] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[268]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vmul.u32 Q4, Q4, r8
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r14,#(48)]
-// Release input[264] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(64)]
-// Release input[268] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[140]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vmul.u32 Q3, Q3, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[276]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[144]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -108)]
-vmul.u32 Q4, Q4, r8
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r14,#(96)]
-// Release input[276] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[148]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[16]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 16)]
-vmul.u32 Q3, Q3, r8
-// input[20]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 20)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(-432)]
-// Release input[144] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[20]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[272]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 20)]
-vmul.u32 Q4, Q4, r8
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(64)]
-// Release input[16] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(80)]
-// Release input[20] from Q4
-vadd.s32 Q2, Q2, Q0
-ldrd r9, r8, [r10], #+8
-// input[156]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[24]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 24)]
-vmul.u32 Q3, Q3, r8
-// input[28]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 28)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(80)]
-// Release input[272] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vadd.s32 Q1, Q1, Q0
-// input[28]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9
-// input[280]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 28)]
-vmul.u32 Q4, Q4, r8
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vqrdmlah.s32 Q0, Q4, r11
-vstrw.u32 Q1, [r0,#(96)]
-// Release input[24] from Q1
-vsub.s32 Q4, Q2, Q0
-vstrw.u32 Q4, [r0,#(112)]
-// Release input[28] from Q4
-vadd.s32 Q2, Q2, Q0
-// input[284]: Already loaded as Q3
-vqrdmulh.s32 Q0, Q3, r9
-// input[152]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -100)]
-vmul.u32 Q3, Q3, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vqrdmlah.s32 Q0, Q3, r11
-vstrw.u32 Q2, [r14,#(112)]
-// Release input[280] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vadd.s32 Q1, Q1, Q0
-ldrd r9, r8, [r10], #+8
-// input[36]: Already loaded as Q4
-vqrdmulh.s32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vmul.u32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vmul.u32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: 
Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vmul.u32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vmul.u32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vmul.u32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, 
[r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vmul.u32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] 
-vmul.u32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vmul.u32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[200]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vmul.u32 Q4, Q4, r8 -// input[348]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] -vmul.u32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vmul.u32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vmul.u32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vmul.u32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, 
[r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vmul.u32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vmul.u32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vmul.u32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 
* -8)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vmul.u32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 
Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 3844538047 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s b/tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s deleted file mode 100644 index 9484257..0000000 --- a/tests/intmulntt/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. 
-/// Don't modify it directly. -/// - -.data -roots: -.word 9023783 /// zeta^128 * 2^31 = 923104^128 * 2^31 = 18186380 * 2^31 -.word 860479001 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 923104^128 * 450429249 * 2^31 -.word 22090505 /// zeta^256 * 2^31 = 923104^256 * 2^31 = 27201076 * 2^31 -.word 1287004599 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 923104^256 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 14273169 // zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 11992563 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 11992563 // zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 6586153 // zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 2451909 // zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 6586153 // zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 2386267543 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 7804917 // zeta^120 * 2^31 = 
923104^120 * 2^31 = 20563366 * 2^31 -.word 972944843 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 923104^120 * 450429249 * 2^31 -.word 24970027 // zeta^216 * 2^31 = 923104^216 * 2^31 = 7374392 * 2^31 -.word 348915477 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 923104^216 * 450429249 * 2^31 -.word 2451909 // zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 655974395 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 71617875 // zeta^360 * 2^31 = 923104^360 * 2^31 = 15734276 * 2^31 -.word 744458989 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 923104^360 * 450429249 * 2^31 -.word 73970475 // zeta^ 72 * 2^31 = 923104^ 72 * 2^31 = 21015440 * 2^31 -.word 994334485 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 72 * 450429249 * 2^31 -.word 14273169 // XX: zeta^ 0 * 2^31 = 923104^ 0 * 2^31 = 1 * 2^31 -.word 2147483695 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 0 * 450429249 * 2^31 -.word 11992563 // XX: zeta^ 96 * 2^31 = 923104^ 96 * 2^31 = 14738418 * 2^31 -.word 697340493 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 96 * 450429249 * 2^31 -.word 6586153 // XX: zeta^240 * 2^31 = 923104^240 * 2^31 = 5046741 * 2^31 -.word 2386267543 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 923104^240 * 450429249 * 2^31 -.word 2451909 // XX: zeta^336 * 2^31 = 923104^336 * 2^31 = 13864138 * 2^31 -.word 655974395 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 923104^336 * 450429249 * 2^31 -.word 7804917 // XX: zeta^120 * 2^31 = 923104^120 * 2^31 = 20563366 * 2^31 -.word 972944843 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 923104^120 * 450429249 * 2^31 -.word 24970027 // XX: zeta^216 * 2^31 = 923104^216 * 2^31 = 7374392 * 2^31 -.word 348915477 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 923104^216 * 450429249 * 2^31 -.word 71617875 // XX: zeta^360 * 2^31 = 923104^360 * 2^31 = 15734276 * 2^31 -.word 744458989 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 923104^360 * 450429249 * 2^31 -.word 73970475 // XX: zeta^ 72 * 2^31 = 923104^ 72 * 2^31 = 21015440 * 
2^31 -.word 994334485 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 72 * 450429249 * 2^31 -.word 31601033 // XX: zeta^252 * 2^31 = 923104^252 * 2^31 = 21362477 * 2^31 -.word 3158238007 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 923104^252 * 450429249 * 2^31 -.word 79373255 // XX: zeta^348 * 2^31 = 923104^348 * 2^31 = 2716403 * 2^31 -.word 2276008825 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 923104^348 * 450429249 * 2^31 -.word 54801595 // XX: zeta^108 * 2^31 = 923104^108 * 2^31 = 19489792 * 2^31 -.word 922149253 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 923104^108 * 450429249 * 2^31 -.word 90292141 // XX: zeta^204 * 2^31 = 923104^204 * 2^31 = 18717972 * 2^31 -.word 885630995 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 923104^204 * 450429249 * 2^31 -.word 84211285 // XX: zeta^372 * 2^31 = 923104^372 * 2^31 = 24369888 * 2^31 -.word 1153048427 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 923104^372 * 450429249 * 2^31 -.word 1814625 // XX: zeta^ 84 * 2^31 = 923104^ 84 * 2^31 = 375141 * 2^31 -.word 2165233247 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 923104^ 84 * 450429249 * 2^31 -.word 10609393 // XX: zeta^228 * 2^31 = 923104^228 * 2^31 = 32879086 * 2^31 -.word 1555656655 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 923104^228 * 450429249 * 2^31 -.word 24413321 // XX: zeta^324 * 2^31 = 923104^324 * 2^31 = 37859097 * 2^31 -.word 3938766903 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 923104^324 * 450429249 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_45387457_923104_incomplete_good_bitrev, %function -.global ntt_384_u32_45387457_923104_incomplete_good_bitrev -ntt_384_u32_45387457_923104_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, 45387457 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd r9, r8, [r10], #+8 -ldrd 
r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vmul.u32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vqrdmlah.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vqrdmulh.s32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vmul.u32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmlah.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vmul.u32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r14,#(144)] -// 
input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vmul.u32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vmul.u32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vmul.u32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as 
Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vmul.u32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vmul.u32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vmul.u32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, 
Q7 -// Release input[328] from Q5 -vmul.u32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vmul.u32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vmul.u32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vmul.u32 Q2, Q0, r8 -// 
input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vmul.u32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vmul.u32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vmul.u32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] 
-vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vmul.u32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vmul.u32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vmul.u32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] 
from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r0,#(144)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vmul.u32 Q2, Q0, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -104)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r14,#(-96)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vmul.u32 Q2, Q0, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r14,#(96)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vmul.u32 Q2, Q0, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 
52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r0,#(336)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vmul.u32 Q2, Q0, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -8)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r14,#(-288)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vmul.u32 Q2, Q0, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r14,#(480)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vmul.u32 Q2, Q0, r8 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 
-vldrw.u32 Q7, [r14, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r0,#(48)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vmul.u32 Q2, Q0, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r14,#(-192)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vmul.u32 Q2, Q0, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r14,#(192)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vmul.u32 Q2, Q0, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 
Q6, [r14,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r0,#(432)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from Q5 -vmul.u32 Q2, Q0, r8 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r14,#(-384)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vmul.u32 Q2, Q0, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r14,#(384)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vmul.u32 Q2, Q0, r8 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * -124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, 
[r14,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r0,#(240)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vqrdmulh.s32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vmul.u32 Q2, Q0, r8 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 0)] -vqrdmlah.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r0,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r14,#(0)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -// input[12]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 12)] -vsub.s32 Q2, Q0, Q1 -vqrdmulh.s32 Q3, Q2, r5 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vadd.s32 Q0, Q0, Q1 -// Release input[12] from Q1 -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// input[204]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vadd.s32 Q1, Q1, Q4 -// Release input[132] from Q4 -vmul.u32 Q2, Q2, r4 -vsub.s32 Q4, Q1, Q0 -// input[72]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 72)] -vqrdmlah.s32 Q3, Q2, r11 -vstrw.u32 Q4, [r14,#(48)] -vadd.s32 Q1, Q1, Q0 -// Release input[264] from Q0 -vstrw.u32 Q1, [r0,#(0)] -// Release input[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r0,#(48)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r14,#(-480)] -// input[72]: Already loaded as Q7 -// input[204]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vqrdmulh.s32 Q1, Q0, r5 -// input[324]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release input[204] from Q6 -// input[192]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -60)] -vsub.s32 Q4, Q3, Q2 -// input[300]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[324] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[168]: Load as 
Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(288)] -vadd.s32 Q3, Q3, Q7 -// Release input[72] from Q7 -vstrw.u32 Q3, [r14,#(-240)] -// Release input[192] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -// input[168]: Already loaded as Q6 -// input[300]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[36]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 36)] -vadd.s32 Q6, Q6, Q5 -// Release input[300] from Q5 -// input[288]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 36)] -vsub.s32 Q4, Q3, Q2 -// input[108]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release input[36] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[360]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release input[168] from Q6 -vstrw.u32 Q3, [r14,#(144)] -// Release input[288] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(192)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(144)] -// input[360]: Already loaded as Q7 -// input[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[228]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.s32 Q7, Q7, Q5 -// Release input[108] from Q5 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vsub.s32 Q4, Q3, Q2 -// input[156]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -96)] -vadd.s32 Q3, Q3, Q2 -// Release input[228] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[24]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 24)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(432)] -vadd.s32 Q3, Q3, Q7 -// Release input[360] from Q7 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-96)] -// input[24]: Already loaded as Q6 -// input[156]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[276]: 
Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.s32 Q6, Q6, Q5 -// Release input[156] from Q5 -// input[144]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// input[348]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 96)] -vadd.s32 Q3, Q3, Q2 -// Release input[276] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[216]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(96)] -vadd.s32 Q3, Q3, Q6 -// Release input[24] from Q6 -vstrw.u32 Q3, [r14,#(-432)] -// Release input[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(96)] -// input[216]: Already loaded as Q7 -// input[348]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[84]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release input[348] from Q5 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vsub.s32 Q4, Q3, Q2 -// input[60]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 60)] -vadd.s32 Q3, Q3, Q2 -// Release input[84] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[312]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 60)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-144)] -vadd.s32 Q3, Q3, Q7 -// Release input[216] from Q7 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(384)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(336)] -// input[312]: Already loaded as Q6 -// input[60]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[180]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.s32 Q6, Q6, Q5 -// Release input[60] from Q5 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vsub.s32 Q4, Q3, Q2 -// input[252]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release input[180] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[120]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 120)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, 
[r14,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release input[312] from Q6 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(240)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -// input[120]: Already loaded as Q7 -// input[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vadd.s32 Q7, Q7, Q5 -// Release input[252] from Q5 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vsub.s32 Q4, Q3, Q2 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vadd.s32 Q3, Q3, Q2 -// Release input[372] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[136]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -116)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(480)] -vadd.s32 Q3, Q3, Q7 -// Release input[120] from Q7 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -// input[136]: Already loaded as Q6 -// input[268]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vadd.s32 Q6, Q6, Q5 -// Release input[268] from Q5 -// input[256]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q3, Q3, Q2 -// Release input[4] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[328]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 76)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-464)] -vadd.s32 Q3, Q3, Q6 -// Release input[136] from Q6 -vstrw.u32 Q3, [r14,#(16)] -// Release input[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(16)] -// input[328]: Already loaded as Q7 -// input[76]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[196]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release 
input[76] from Q5 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vsub.s32 Q4, Q3, Q2 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -80)] -vadd.s32 Q3, Q3, Q2 -// Release input[196] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[40]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 40)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(304)] -vadd.s32 Q3, Q3, Q7 -// Release input[328] from Q7 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(304)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -// input[40]: Already loaded as Q6 -// input[172]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vadd.s32 Q6, Q6, Q5 -// Release input[172] from Q5 -// input[160]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -92)] -vsub.s32 Q4, Q3, Q2 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release input[292] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[232]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -20)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release input[40] from Q6 -vstrw.u32 Q3, [r14,#(-368)] -// Release input[160] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -// input[232]: Already loaded as Q7 -// input[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[100]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 100)] -vadd.s32 Q7, Q7, Q5 -// Release input[364] from Q5 -// input[352]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 100)] -vsub.s32 Q4, Q3, Q2 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q3, Q3, Q2 -// Release input[100] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[280]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-80)] -vadd.s32 Q3, Q3, Q7 -// Release input[232] from Q7 -vstrw.u32 Q3, 
[r14,#(400)] -// Release input[352] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(400)] -// input[280]: Already loaded as Q6 -// input[28]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[148]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.s32 Q6, Q6, Q5 -// Release input[28] from Q5 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -32)] -vadd.s32 Q3, Q3, Q2 -// Release input[148] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[88]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 88)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(112)] -vadd.s32 Q3, Q3, Q6 -// Release input[280] from Q6 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -// input[88]: Already loaded as Q7 -// input[220]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[340]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release input[220] from Q5 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vsub.s32 Q4, Q3, Q2 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 64)] -vadd.s32 Q3, Q3, Q2 -// Release input[340] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[184]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(352)] -vadd.s32 Q3, Q3, Q7 -// Release input[88] from Q7 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-128)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(352)] -// input[184]: Already loaded as Q6 -// input[316]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vadd.s32 Q6, Q6, Q5 -// Release input[316] from Q5 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] 
-vsub.s32 Q4, Q3, Q2 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release input[52] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[376]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 124)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release input[184] from Q6 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(208)] -// input[376]: Already loaded as Q7 -// input[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[244]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -8)] -vadd.s32 Q7, Q7, Q5 -// Release input[124] from Q5 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vsub.s32 Q4, Q3, Q2 -// input[140]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -112)] -vadd.s32 Q3, Q3, Q2 -// Release input[244] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[8]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 8)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(496)] -vadd.s32 Q3, Q3, Q7 -// Release input[376] from Q7 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-32)] -// input[8]: Already loaded as Q6 -// input[140]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vadd.s32 Q6, Q6, Q5 -// Release input[140] from Q5 -// input[128]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// input[332]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q3, Q3, Q2 -// Release input[260] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q6 -// Release input[8] from Q6 -vstrw.u32 Q3, [r14,#(-496)] -// Release input[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r14,#(-448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(32)] -// input[200]: Already loaded as Q7 -// input[332]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release input[332] from Q5 -// input[320]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 68)] -vsub.s32 Q4, Q3, Q2 -// input[44]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 44)] -vadd.s32 Q3, Q3, Q2 -// Release input[68] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[296]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q7 -// Release input[200] from Q7 -vstrw.u32 Q3, [r14,#(272)] -// Release input[320] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(320)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(272)] -// input[296]: Already loaded as Q6 -// input[44]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.s32 Q6, Q6, Q5 -// Release input[44] from Q5 -// input[32]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// input[236]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release input[164] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release input[296] from Q6 -vstrw.u32 Q3, [r0,#(128)] -// Release input[32] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(176)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -// input[104]: Already loaded as Q7 -// input[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vadd.s32 Q7, Q7, Q5 -// Release input[236] from Q5 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vsub.s32 Q4, Q3, Q2 -// input[284]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] 
-vadd.s32 Q3, Q3, Q2 -// Release input[356] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[152]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * -100)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q7 -// Release input[104] from Q7 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(416)] -// input[152]: Already loaded as Q6 -// input[284]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vadd.s32 Q6, Q6, Q5 -// Release input[284] from Q5 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// input[92]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 92)] -vadd.s32 Q3, Q3, Q2 -// Release input[20] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q6 -// Release input[152] from Q6 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(128)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(80)] -// input[344]: Already loaded as Q7 -// input[92]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[212]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release input[92] from Q5 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vsub.s32 Q4, Q3, Q2 -// input[188]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -64)] -vadd.s32 Q3, Q3, Q2 -// Release input[212] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[56]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 56)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q7 -// Release input[344] from Q7 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r0,#(368)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -// input[56]: Already 
loaded as Q6 -// input[188]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vadd.s32 Q6, Q6, Q5 -// Release input[188] from Q5 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vsub.s32 Q4, Q3, Q2 -// input[380]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release input[308] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q6 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release input[56] from Q6 -vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r14,#(-256)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(224)] -// input[248]: Already loaded as Q7 -// input[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vqrdmulh.s32 Q1, Q0, r5 -// input[116]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 116)] -vadd.s32 Q7, Q7, Q5 -// Release input[380] from Q5 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vsub.s32 Q4, Q3, Q2 -// input[48]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 48)] -vadd.s32 Q3, Q3, Q2 -// Release input[116] from Q2 -vmul.u32 Q0, Q0, r4 -vsub.s32 Q2, Q3, Q7 -// input[288]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vqrdmlah.s32 Q1, Q0, r11 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q7 -// Release input[248] from Q7 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r12,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(464)] -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[48]: Already loaded as Q5 -vqrdmulh.s32 Q0, Q5, r9 -// input[144]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vmul.u32 Q5, Q5, r8 -// input[288]: Already loaded as Q6 -vqrdmlah.s32 Q0, Q5, r11 -vqrdmulh.s32 Q2, Q1, r9 -vsub.s32 Q5, Q6, Q0 -vmul.u32 Q1, Q1, r8 -vadd.s32 Q6, Q6, Q0 -vqrdmlah.s32 Q2, Q1, r11 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] 
-vqrdmulh.s32 Q3, Q5, r5 -vsub.s32 Q1, Q0, Q2 -vmul.u32 Q5, Q5, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q3, Q5, r11 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q4, Q6, r7 -vsub.s32 Q5, Q1, Q3 -vmul.u32 Q6, Q6, r6 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r0,#(192)] -// Release input[48] from Q5 -vqrdmlah.s32 Q4, Q6, r11 -vstrw.u32 Q1, [r14,#(-432)] -// Release input[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r14,#(144)] -// Release input[288] from Q6 -vadd.s32 Q0, Q0, Q4 -// input[240]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vmul.u32 Q2, Q2, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(0)] -// Release input[0] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[304]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] -// Release input[336] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(384)] -// Release input[96] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[304]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[16]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 16)] -vmul.u32 Q0, Q0, r8 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -92)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-240)] -// Release input[192] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[256]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 
Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(208)] -// Release input[304] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-368)] -// Release input[160] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[112]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmul.u32 Q1, Q1, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[64]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 64)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[176]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vmul.u32 Q2, Q2, r8 -// input[32]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 32)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(256)] -// Release input[64] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[368]: Load as Q0 
-vldrw.u32 Q0, [r14, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(128)] -// Release input[32] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[368]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vmul.u32 Q0, Q0, r8 -// input[224]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -28)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(464)] -// Release input[368] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-112)] -// Release input[224] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vmul.u32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[264]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[120]: Load as 
Q2 -vldrw.u32 Q2, [r0, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(96)] -// Release input[24] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[120]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[216]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -36)] -vmul.u32 Q2, Q2, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(48)] -// Release input[264] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(480)] -// Release input[120] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-144)] -// Release input[216] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[280]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 28)] -vmul.u32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[376]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 
Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(112)] -// Release input[280] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[376]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[328]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(496)] -// Release input[376] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[152]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -100)] -vmul.u32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(304)] -// Release input[328] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[248]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -4)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, 
[r0,#(224)] -// Release input[56] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-400)] -// Release input[152] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[248]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[344]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 92)] -vmul.u32 Q0, Q0, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[200]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -52)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-16)] -// Release input[248] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(368)] -// Release input[344] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[180]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[276]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 24)] -vmul.u32 Q1, Q1, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-208)] -// Release input[200] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[372]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 120)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 
Q1, [r14,#(-288)] -// Release input[180] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(96)] -// Release input[276] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[372]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vmul.u32 Q2, Q2, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmul.u32 Q0, Q0, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vqrdmlah.s32 Q6, Q4, r11 
-vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmul.u32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vmul.u32 Q2, Q2, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q1, 
Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vmul.u32 Q0, Q0, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmul.u32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 
-vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmul.u32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vmul.u32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, 
Q2, Q6 -// input[124]: Already loaded as Q1 -vqrdmulh.s32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmul.u32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vqrdmulh.s32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vqrdmlah.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vqrdmulh.s32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vmul.u32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vqrdmlah.s32 Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vqrdmulh.s32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmul.u32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vqrdmlah.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vqrdmulh.s32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vqrdmlah.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vqrdmulh.s32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vmul.u32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vqrdmlah.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vqrdmulh.s32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vqrdmlah.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vqrdmulh.s32 Q2, Q0, r9 
-// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmul.u32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vqrdmlah.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vqrdmulh.s32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vmul.u32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vqrdmlah.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vqrdmulh.s32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vmul.u32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vqrdmlah.s32 Q5, Q0, r11 -vqrdmulh.s32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vmul.u32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vqrdmlah.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vqrdmulh.s32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vmul.u32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vqrdmlah.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vmul.u32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release 
input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q4, Q4, r8 -// input[224]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -28)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vmul.u32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] 
from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vmul.u32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, 
Q0 -// input[328]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vmul.u32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vmul.u32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: 
Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vmul.u32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, 
r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vmul.u32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmul.u32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[292]: Load as 
Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vmul.u32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vmul.u32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vmul.u32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vmul.u32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vmul.u32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 
Q1, [r14, #(4 * -72)] -vmul.u32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vmul.u32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vmul.u32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vmul.u32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vmul.u32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vmul.u32 Q4, Q4, r8 -// 
input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vmul.u32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -80)] -vmul.u32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vmul.u32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vmul.u32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q3, Q3, r8 -// input[92]: Load 
as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vmul.u32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vqrdmulh.s32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vmul.u32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vqrdmlah.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vqrdmulh.s32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vmul.u32 Q3, Q3, r8 -vqrdmlah.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 3844538047 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// 
Instruction count: 2196
\ No newline at end of file
diff --git a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good.s b/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good.s
deleted file mode 100644
index 81acbd2..0000000
--- a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good.s
+++ /dev/null
@@ -1,3383 +0,0 @@
-
-///
-/// Copyright (c) 2021 Arm Limited
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to deal
-/// in the Software without restriction, including without limitation the rights
-/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-/// copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-
-
-///
-/// This assembly code has been auto-generated.
-/// Don't modify it directly.
-/// - -.data -.global ntt_384_u32_88299073_4883425_incomplete_good_twiddles -ntt_384_u32_88299073_4883425_incomplete_good_twiddles: // For base multiplication -.word 75231281 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 3951395343 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 15452769 // zeta^ 64 * 2^31 = 4883425^ 64 * 2^31 = 85764717 * 2^31 -.word 2033538015 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 64 * 2066201025 * 2^31 -.word 19987225 // zeta^ 32 * 2^31 = 4883425^ 32 * 2^31 = 19144749 * 2^31 -.word 1892589863 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 32 * 2066201025 * 2^31 -.word 50503029 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 2741681611 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 81982457 // zeta^ 16 * 2^31 = 4883425^ 16 * 2^31 = 76960665 * 2^31 -.word 2158501959 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 16 * 2066201025 * 2^31 -.word 20023469 // zeta^ 80 * 2^31 = 4883425^ 80 * 2^31 = 41822566 * 2^31 -.word 1552412819 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 80 * 2066201025 * 2^31 -.word 55876839 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 1939982041 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 43619891 // zeta^112 * 2^31 = 4883425^112 * 2^31 = 44400103 * 2^31 -.word 2850416781 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4883425^112 * 2066201025 * 2^31 -.word 172662323 // zeta^ 8 * 2^31 = 4883425^ 8 * 2^31 = 26094785 * 2^31 -.word 3064389773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 8 * 2066201025 * 2^31 -.word 71853543 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 4036378073 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 23697415 // zeta^ 40 * 2^31 = 4883425^ 40 * 2^31 = 55309930 * 2^31 -.word 443962297 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 40 * 2066201025 * 2^31 -.word 76499159 // zeta^104 
* 2^31 = 4883425^104 * 2^31 = 78628712 * 2^31 -.word 2660611817 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4883425^104 * 2066201025 * 2^31 -.word 56990949 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 337656411 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 120013125 // zeta^ 88 * 2^31 = 4883425^ 88 * 2^31 = 20668553 * 2^31 -.word 616164859 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 88 * 2066201025 * 2^31 -.word 28856125 // zeta^ 56 * 2^31 = 4883425^ 56 * 2^31 = 41675533 * 2^31 -.word 561917443 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 56 * 2066201025 * 2^31 -.word 159401217 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 642203967 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 12190033 // zeta^ 4 * 2^31 = 4883425^ 4 * 2^31 = 4883425 * 2^31 -.word 3933894895 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 4 * 2066201025 * 2^31 -.word 108088419 // zeta^ 68 * 2^31 = 4883425^ 68 * 2^31 = 13818672 * 2^31 -.word 273473117 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 68 * 2066201025 * 2^31 -.word 142353279 // zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 2003400257 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 143392463 // zeta^100 * 2^31 = 4883425^100 * 2^31 = 35160276 * 2^31 -.word 482889457 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4883425^100 * 2066201025 * 2^31 -.word 119167385 // zeta^ 20 * 2^31 = 4883425^ 20 * 2^31 = 52712221 * 2^31 -.word 1897128615 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 20 * 2066201025 * 2^31 -.word 9268541 // zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 1847889923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 80397967 // zeta^ 52 * 2^31 = 4883425^ 52 * 2^31 = 81877099 * 2^31 -.word 3839489841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 52 * 2066201025 * 2^31 -.word 16520015 // zeta^116 * 2^31 = 
4883425^116 * 2^31 = 18306165 * 2^31 -.word 838359665 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4883425^116 * 2066201025 * 2^31 -.word 115982427 // zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 3605477477 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 55226367 // zeta^ 76 * 2^31 = 4883425^ 76 * 2^31 = 50302558 * 2^31 -.word 917047745 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 76 * 2066201025 * 2^31 -.word 136968867 // zeta^ 44 * 2^31 = 4883425^ 44 * 2^31 = 63650411 * 2^31 -.word 40189981 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 44 * 2066201025 * 2^31 -.word 68313423 // zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 3720973425 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 117342749 // zeta^ 28 * 2^31 = 4883425^ 28 * 2^31 = 32879858 * 2^31 -.word 212726563 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 28 * 2066201025 * 2^31 -.word 64009947 // zeta^ 92 * 2^31 = 4883425^ 92 * 2^31 = 70872893 * 2^31 -.word 925164005 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 92 * 2066201025 * 2^31 -.word 55029279 // zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1315460001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.word 99141453 // zeta^124 * 2^31 = 4883425^124 * 2^31 = 15592642 * 2^31 -.word 4156561907 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4883425^124 * 2066201025 * 2^31 -.word 28520561 // zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2377109967 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 101366865 // zeta^192 * 2^31 = 4883425^192 * 2^31 = 88299072 * 2^31 -.word 343571951 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4883425^192 * 2066201025 * 2^31 -.word 118814877 // zeta^160 * 2^31 = 4883425^160 * 2^31 = 5579523 * 2^31 -.word 849091747 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4883425^160 * 2066201025 * 2^31 -.word 156610921 // zeta^224 * 2^31 = 
4883425^224 * 2^31 = 69154324 * 2^31 -.word 2402377431 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4883425^224 * 2066201025 * 2^31 -.word 26340085 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 3688878155 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 94615689 // zeta^208 * 2^31 = 4883425^208 * 2^31 = 11338408 * 2^31 -.word 2136465335 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4883425^208 * 2066201025 * 2^31 -.word 76042125 // zeta^176 * 2^31 = 4883425^176 * 2^31 = 22220342 * 2^31 -.word 910434739 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4883425^176 * 2066201025 * 2^31 -.word 120721307 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 2354985253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 164088439 // zeta^136 * 2^31 = 4883425^136 * 2^31 = 32274711 * 2^31 -.word 971988297 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4883425^136 * 2066201025 * 2^31 -.word 3935823 // zeta^200 * 2^31 = 4883425^200 * 2^31 = 62204288 * 2^31 -.word 1230577521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4883425^200 * 2066201025 * 2^31 -.word 141100817 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 2216649519 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 152900731 // zeta^232 * 2^31 = 4883425^232 * 2^31 = 32989143 * 2^31 -.word 3851004997 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4883425^232 * 2066201025 * 2^31 -.word 151321249 // zeta^152 * 2^31 = 4883425^152 * 2^31 = 11170776 * 2^31 -.word 278508447 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4883425^152 * 2066201025 * 2^31 -.word 119607197 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 3957310883 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 42246019 // zeta^184 * 2^31 = 4883425^184 * 2^31 = 23363129 * 2^31 -.word 80286525 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4883425^184 * 2066201025 * 2^31 -.word 147742021 // zeta^248 * 2^31 = 
4883425^248 * 2^31 = 46623540 * 2^31 -.word 3733049851 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4883425^248 * 2066201025 * 2^31 -.word 7599313 // zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 634545519 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 164408113 // zeta^196 * 2^31 = 4883425^196 * 2^31 = 83415648 * 2^31 -.word 361072399 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4883425^196 * 2066201025 * 2^31 -.word 89338257 // zeta^164 * 2^31 = 4883425^164 * 2^31 = 30758081 * 2^31 -.word 2774456495 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4883425^164 * 2066201025 * 2^31 -.word 34244867 // zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2291567037 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 154998375 // zeta^148 * 2^31 = 4883425^148 * 2^31 = 54723088 * 2^31 -.word 4245728601 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4883425^148 * 2066201025 * 2^31 -.word 57430761 // zeta^212 * 2^31 = 4883425^212 * 2^31 = 35586852 * 2^31 -.word 2397838679 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4883425^212 * 2066201025 * 2^31 -.word 24421121 // zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 1293837119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 96200179 // zeta^244 * 2^31 = 4883425^244 * 2^31 = 6421974 * 2^31 -.word 455477453 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4883425^244 * 2066201025 * 2^31 -.word 27543013 // zeta^140 * 2^31 = 4883425^140 * 2^31 = 22531438 * 2^31 -.word 1606537563 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4883425^140 * 2066201025 * 2^31 -.word 60615719 // zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 689489817 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 19643629 // zeta^172 * 2^31 = 4883425^172 * 2^31 = 5400389 * 2^31 -.word 3680783443 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4883425^172 * 2066201025 * 2^31 -.word 39629279 // zeta^236 * 2^31 = 4883425^236 * 
2^31 = 24648662 * 2^31 -.word 4254777313 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4883425^236 * 2066201025 * 2^31 -.word 34966271 // zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 712437441 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 59255397 // zeta^220 * 2^31 = 4883425^220 * 2^31 = 55419215 * 2^31 -.word 4082240731 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4883425^220 * 2066201025 * 2^31 -.word 132411247 // zeta^188 * 2^31 = 4883425^188 * 2^31 = 61321868 * 2^31 -.word 2841101905 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4883425^188 * 2066201025 * 2^31 -.word 121568867 // zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 2979507293 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 161145377 // zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 2261429279 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 148077585 // zeta^320 * 2^31 = 4883425^320 * 2^31 = 2534357 * 2^31 -.word 1917857327 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4883425^320 * 2066201025 * 2^31 -.word 126095117 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1553285683 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 57783269 // zeta^352 * 2^31 = 4883425^352 * 2^31 = 82719550 * 2^31 -.word 3445875547 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4883425^352 * 2066201025 * 2^31 -.word 156574677 // zeta^272 * 2^31 = 4883425^272 * 2^31 = 46476507 * 2^31 -.word 2742554475 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4883425^272 * 2066201025 * 2^31 -.word 150258061 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 606089139 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 132978255 // zeta^304 * 2^31 = 4883425^304 * 2^31 = 43898970 * 2^31 -.word 1444550513 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4883425^304 * 2066201025 * 2^31 -.word 100556021 // zeta^368 * 2^31 = 4883425^368 * 
2^31 = 66078731 * 2^31 -.word 3384532555 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4883425^368 * 2066201025 * 2^31 -.word 104744603 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 258589221 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 12509707 // zeta^328 * 2^31 = 4883425^328 * 2^31 = 56024362 * 2^31 -.word 3322978997 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4883425^328 * 2066201025 * 2^31 -.word 100098987 // zeta^296 * 2^31 = 4883425^296 * 2^31 = 9670361 * 2^31 -.word 1634355477 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4883425^296 * 2066201025 * 2^31 -.word 35497329 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 2078317775 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 56585021 // zeta^280 * 2^31 = 4883425^280 * 2^31 = 67630520 * 2^31 -.word 3678802435 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4883425^280 * 2066201025 * 2^31 -.word 25276897 // zeta^344 * 2^31 = 4883425^344 * 2^31 = 77128297 * 2^31 -.word 4016458847 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4883425^344 * 2066201025 * 2^31 -.word 17196929 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 3652763327 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 134352127 // zeta^376 * 2^31 = 4883425^376 * 2^31 = 64935944 * 2^31 -.word 4214680769 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4883425^376 * 2066201025 * 2^31 -.word 68509727 // zeta^260 * 2^31 = 4883425^260 * 2^31 = 74480401 * 2^31 -.word 4021494177 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4883425^260 * 2066201025 * 2^31 -.word 168998833 // zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 3660421775 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.word 33205683 // zeta^292 * 2^31 = 4883425^292 * 2^31 = 53138797 * 2^31 -.word 3812077837 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4883425^292 * 2066201025 * 2^31 -.word 87259889 // zeta^356 * 2^31 = 4883425^356 * 2^31 
= 57540992 * 2^31 -.word 1520510799 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4883425^356 * 2066201025 * 2^31 -.word 167329605 // zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 2447077371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 21599771 // zeta^340 * 2^31 = 4883425^340 * 2^31 = 33575985 * 2^31 -.word 49238693 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4883425^340 * 2066201025 * 2^31 -.word 160078131 // zeta^308 * 2^31 = 4883425^308 * 2^31 = 69992908 * 2^31 -.word 3456607629 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4883425^308 * 2066201025 * 2^31 -.word 152177025 // zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 3001130175 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 121371779 // zeta^268 * 2^31 = 4883425^268 * 2^31 = 37996515 * 2^31 -.word 3377919549 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4883425^268 * 2066201025 * 2^31 -.word 149055133 // zeta^332 * 2^31 = 4883425^332 * 2^31 = 65767635 * 2^31 -.word 2688429731 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4883425^332 * 2066201025 * 2^31 -.word 108284723 // zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 573993869 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 156954517 // zeta^364 * 2^31 = 4883425^364 * 2^31 = 82898684 * 2^31 -.word 614183851 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4883425^364 * 2066201025 * 2^31 -.word 112588199 // zeta^284 * 2^31 = 4883425^284 * 2^31 = 17426180 * 2^31 -.word 3369803289 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4883425^284 * 2066201025 * 2^31 -.word 141631875 // zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 3582529853 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 77456693 // zeta^316 * 2^31 = 4883425^316 * 2^31 = 72706431 * 2^31 -.word 138405387 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4883425^316 * 2066201025 * 2^31 -.word 44186899 // zeta^380 * 2^31 = 4883425^380 * 2^31 = 
26977205 * 2^31 -.word 1453865389 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4883425^380 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_88299073_4883425_incomplete_good_scale -ntt_384_u32_88299073_4883425_incomplete_good_scale: // Constants for scaling by 1/N -.word 75231281 // 1/96 -.word 3951395343 // 1/96 twisted -.data -roots: -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 
2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 29929577 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 9497777 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // XX: zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // XX: zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // XX: zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 29929577 // XX: zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // XX: zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 
567126034 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31
-.word 9497777 // XX: zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31
-.word 230991336 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31
-.word 23260411 // XX: zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31
-.word 565706418 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31
-.word 8935247 // XX: zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31
-.word 217310286 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31
-.word 4402195 // XX: zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31
-.word 107063885 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31
-.word 69162837 // XX: zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31
-.word 1682079511 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31
-.word 24728139 // XX: zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31
-.word 601402397 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31
-.word 27771120 // XX: zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31
-.word 675409425 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31
-.word 19248273 // XX: zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31
-.word 468128941 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31
-.word 37993035 // XX: zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31
-.word 924012208 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31
-.word 42569847 // XX: zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31
-.word 1035322877 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31
-.text
-.align 4
-roots_addr: .word roots
-.syntax unified
-.type ntt_384_u32_88299073_4883425_incomplete_good, %function
-.global ntt_384_u32_88299073_4883425_incomplete_good
-ntt_384_u32_88299073_4883425_incomplete_good:
-// Save GPRs
-push {r4-r11,lr}
-// Save MVE vector registers
-vpush {d8-d15}
-// Use r14 as marker for r0 + 1008
-add r14, r0, #1008
-// Use r12 as marker for r0 + 2016
-add r12, r14, #1008
-.equ modulus, -88299073
-movw r11, #:lower16:modulus
-movt r11, #:upper16:modulus
-ldr r10, roots_addr
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-// input[256]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 4)]
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r9
-vadd.s32 Q5, Q0, Q1
-// Release input[256] from Q0
-vqrdmulh.s32 Q4, Q2, r8
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmla.s32 Q3, Q4, r11
-vsub.s32 Q4, Q0, Q1
-// Release input[128] from Q1
-// input[4]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 4)]
-vadd.s32 Q6, Q4, Q3
-// input[260]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 8)]
-vsub.s32 Q4, Q4, Q3
-vstrw.u32 Q6, [r14,#(16)]
-vsub.s32 Q4, Q4, Q2
-vstrw.u32 Q4, [r14,#(-496)]
-vadd.s32 Q5, Q5, Q0
-// Release input[0] from Q0
-vstrw.u32 Q5, [r0,#(0)]
-// input[4]: Already loaded as Q1
-// input[260]: Already loaded as Q7
-vsub.s32 Q0, Q1, Q7
-vmul.u32 Q2, Q0, r9
-vadd.s32 Q4, Q1, Q7
-// Release input[4] from Q1
-vqrdmulh.s32 Q3, Q0, r8
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmla.s32 Q2, Q3, r11
-vsub.s32 Q3, Q1, Q7
-// Release input[260] from Q7
-// input[136]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -116)]
-vadd.s32 Q6, Q3, Q2
-// input[8]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 8)]
-vsub.s32 Q3, Q3, Q2
-vstrw.u32 Q6, [r0,#(16)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r14,#(32)]
-vadd.s32 Q4, Q4, Q1
-// Release input[132] from Q1
-vstrw.u32 Q4, [r14,#(-480)]
-// input[136]: Already loaded as Q5
-// input[8]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[136] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[264]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[8] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-464)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q4
-// Release input[264] from Q4
-vstrw.u32 Q3, [r14,#(48)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[16]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[272]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[16]: Already loaded as Q5
-// input[272]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[16] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[144]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[272] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[144] from Q4
-vstrw.u32 Q3, [r14,#(-432)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[280]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[152]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -100)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[280]: Already loaded as Q5
-// input[152]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[280] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[24]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[152] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q4
-// Release input[24] from Q4
-vstrw.u32 Q3, [r0,#(96)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[160]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -92)]
-vadd.s32 Q6, Q2, Q1
-// input[32]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[160]: Already loaded as Q5
-// input[32]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[160] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[288]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[32] from Q7
-// input[292]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-368)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[288] from Q4
-vstrw.u32 Q3, [r14,#(144)]
-// input[292]: Already loaded as Q5
-// input[164]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[292] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[164] from Q7
-// input[40]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 40)]
-vadd.s32 Q6, Q2, Q1
-// input[296]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[40]: Already loaded as Q5
-// input[296]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[40] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[296] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[168] from Q4
-vstrw.u32 Q3, [r14,#(-336)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[304]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[176]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -76)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[304]: Already loaded as Q5
-// input[176]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[304] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[48]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[176] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q4
-// Release input[48] from Q4
-vstrw.u32 Q3, [r0,#(192)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[184]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -68)]
-vadd.s32 Q6, Q2, Q1
-// input[56]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[184]: Already loaded as Q5
-// input[56]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[184] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[312]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[56] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-272)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[312] from Q4
-vstrw.u32 Q3, [r14,#(240)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[320]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[64]: Already loaded as Q5
-// input[320]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[64] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[320] from Q7
-// input[196]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -56)]
-vadd.s32 Q6, Q2, Q1
-// input[68]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 68)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[192] from Q4
-vstrw.u32 Q3, [r14,#(-240)]
-// input[196]: Already loaded as Q5
-// input[68]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[196] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[324]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[68] from Q7
-// input[328]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-224)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(272)]
-vadd.s32 Q3, Q3, Q4
-// Release input[324] from Q4
-vstrw.u32 Q3, [r14,#(288)]
-// input[328]: Already loaded as Q5
-// input[200]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[328] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[200] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q4
-// Release input[72] from Q4
-vstrw.u32 Q3, [r0,#(288)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[208]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -44)]
-vadd.s32 Q6, Q2, Q1
-// input[80]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[208]: Already loaded as Q5
-// input[80]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[208] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[336]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[80] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-176)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[336] from Q4
-vstrw.u32 Q3, [r14,#(336)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[88]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[88]: Already loaded as Q5
-// input[344]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[88] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[216]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -36)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[344] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[216] from Q4
-vstrw.u32 Q3, [r14,#(-144)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[224]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -28)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[352]: Already loaded as Q5
-// input[224]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[352] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[224] from Q7
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-112)]
-vadd.s32 Q3, Q3, Q4
-// Release input[96] from Q4
-vstrw.u32 Q3, [r0,#(384)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q6, Q2, Q1
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[232]: Already loaded as Q5
-// input[104]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[232] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[104] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-80)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[360] from Q4
-vstrw.u32 Q3, [r14,#(432)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[368]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[112]: Already loaded as Q5
-// input[368]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[112] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[240]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[368] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[240] from Q4
-vstrw.u32 Q3, [r14,#(-48)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[376]: Already loaded as Q5
-// input[248]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[376] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[120]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[248] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q4
-// Release input[120] from Q4
-vstrw.u32 Q3, [r0,#(480)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[288]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 36)]
-// input[96]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[192]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release input[96] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[228]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release input[192] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[36]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 36)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release input[288] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-240)]
-// input[36]: Already loaded as Q7
-// input[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[228] from Q6
-// input[132]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// input[360]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[36] from Q7
-vstrw.u32 Q3, [r14,#(-480)]
-// Release input[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[72]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[360] from Q5
-// input[264]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[72] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[300]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(48)]
-// Release input[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(288)]
-// input[300]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[204]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[12]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// input[240]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release input[204] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[48]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 48)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release input[300] from Q7
-vstrw.u32 Q3, [r0,#(48)]
-// Release input[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-192)]
-// input[48]: Already loaded as Q6
-// input[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[336]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release input[240] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[372]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[336] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[180]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -72)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release input[48] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(336)]
-// input[180]: Already loaded as Q7
-// input[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[372] from Q5
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[120]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[180] from Q7
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[216]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[120] from Q5
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[216] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[60]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-144)]
-// input[60]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[348]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[156]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// input[352]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[348] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[160]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release input[60] from Q7
-vstrw.u32 Q3, [r14,#(-384)]
-// Release input[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(384)]
-// input[160]: Already loaded as Q6
-// input[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[64]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release input[352] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[100]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release input[64] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[292]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release input[160] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(256)]
-// input[292]: Already loaded as Q7
-// input[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[100] from Q5
-// input[4]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[232]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release input[292] from Q7
-vstrw.u32 Q3, [r0,#(16)]
-// Release input[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[328]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release input[232] from Q5
-// input[136]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[328] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[172]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -80)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-464)]
-// Release input[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(304)]
-// input[172]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[76]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[268]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[112]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[76] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[304]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release input[172] from Q7
-vstrw.u32 Q3, [r14,#(64)]
-// Release input[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(304)]
-// input[304]: Already loaded as Q6
-// input[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[208]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release input[112] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release input[208] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release input[304] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-176)]
-// input[52]: Already loaded as Q7
-// input[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[244] from Q5
-// input[148]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// input[376]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[52] from Q7
-vstrw.u32 Q3, [r14,#(-416)]
-// Release input[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[88]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[376] from Q5
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[88] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[316]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(352)]
-// input[316]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[220]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[28]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// input[224]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[220] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[32]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 32)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[316] from Q7
-vstrw.u32 Q3, [r0,#(112)]
-// Release input[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-128)]
-// input[32]: Already loaded as Q6
-// input[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release input[224] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[356]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[320] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[164]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release input[32] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(272)]
-// input[164]: Already loaded as Q7
-// input[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[356] from Q5
-// input[260]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[104]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[164] from Q7
-vstrw.u32 Q3, [r14,#(32)]
-// Release input[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[104] from Q5
-// input[8]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[200] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(32)]
-// Release input[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-208)]
-// input[44]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[332]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[140]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// input[368]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[332] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[176]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release input[44] from Q7
-vstrw.u32 Q3, [r14,#(-448)]
-// Release input[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(320)]
-// input[176]: Already loaded as Q6
-// input[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[80]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release input[368] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[116]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release input[80] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release input[176] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(320)]
-// input[308]: Already loaded as Q7
-// input[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[116] from Q5
-// input[20]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[248]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release input[308] from Q7
-vstrw.u32 Q3, [r0,#(80)]
-// Release input[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[344]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release input[248] from Q5
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[344] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(368)]
-// input[188]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[92]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[284]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[24]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release input[92] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[264]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 12)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release input[188] from Q7
-vstrw.u32 Q3, [r14,#(128)]
-// Release input[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(368)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[156]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -96)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(96)]
-// Release input[24] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(48)]
-// Release input[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[132]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -120)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[280]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 28)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-384)]
-// Release input[156] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(48)]
-// Release input[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[136]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -116)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-480)]
-// Release input[132] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, 
[r14, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(112)] -// Release input[280] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(64)] -// Release input[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-464)] -// Release input[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(16)] -// Release input[256] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[4]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 4)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[272]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[8]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 8)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(16)] -// Release input[4] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 
Q5, Q2, r11 -// input[284]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(80)] -// Release input[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(32)] -// Release input[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r8 -// input[140]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-496)] -// Release input[128] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(128)] -// Release input[284] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-448)] -// Release input[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[48]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[288]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 36)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[60]: Load as 
Q2 -vldrw.u32 Q2, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(192)] -// Release input[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-336)] -// Release input[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[180]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(144)] -// Release input[288] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[184]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -68)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(240)] -// Release input[60] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-288)] -// Release input[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[304]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r8 -// input[40]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[316]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 
-vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-272)] -// Release input[184] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(208)] -// Release input[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(160)] -// Release input[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[292]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 40)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[56]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(256)] -// Release input[316] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[176]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r8 -// input[296]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(160)] -// Release input[292] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[188]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r0,#(224)] -// Release input[56] from Q2 -vmla.s32 Q6, Q4, r11 
-vstrw.u32 Q3, [r14,#(-304)] -// Release input[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(176)] -// Release input[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[308]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[164]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -88)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(-256)] -// Release input[188] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(224)] -// Release input[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[336]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(-352)] -// Release input[164] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[348]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 96)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(336)] 
-// Release input[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[84]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[88]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 88)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(384)] -// Release input[348] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r8 -// input[328]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[64]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 64)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[220]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -32)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(352)] -// Release input[88] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(304)] -// Release input[328] from 
Q4 -vadd.s32 Q2, Q2, Q6 -// input[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(256)] -// Release input[64] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[344]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-128)] -// Release input[220] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[80]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[320]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 68)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[92]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 92)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(368)] -// Release input[344] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(320)] -// Release input[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(272)] -// Release input[320] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(368)] -// Release input[92] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[96]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 96)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(432)] -// Release input[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] 
-vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(384)] -// Release input[96] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[228]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -24)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[376]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[112]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-96)] -// Release input[228] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[352]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 100)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(496)] -// Release input[376] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(448)] -// Release input[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, 
r11 -vstrw.u32 Q2, [r14,#(400)] -// Release input[352] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[100]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 100)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[248]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -4)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r8 -// input[104]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 104)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(400)] -// Release input[100] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[224]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -28)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-16)] -// Release input[248] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(416)] -// Release input[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-112)] -// Release input[224] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, 
Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[356]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 104)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(416)] -// Release input[356] from Q2 -ldrd r9, r8, [r10], #+8 -// input[132]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -120)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[4]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 4)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[260]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(16)] -// Release input[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(32)] -// Release input[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[264]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r8 -// input[268]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q0, Q3, r11 
-vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(48)] -// Release input[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[136]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[140]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -112)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(48)] -// Release input[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(64)] -// Release input[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[8]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r8 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-464)] -// Release input[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-448)] -// Release input[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(32)] -// Release input[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(96)] -// Release input[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[20]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 
-vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(80)] -// Release input[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[24]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r8 -// input[28]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 28)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[280]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(96)] -// Release input[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(112)] -// Release input[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[152]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(112)] -// Release input[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[288]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r8 -// input[292]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 40)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-400)] -// Release input[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(144)] -// Release input[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[160]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(144)] -// Release input[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r14,#(160)] -// Release input[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[32]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[300]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-368)] -// Release input[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(128)] -// Release input[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(192)] -// Release input[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[44]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(176)] -// Release input[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[48]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r8 -// input[52]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 52)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-288)] -// Release input[180] from Q4 
-vadd.s32 Q2, Q2, Q0 -// input[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[304]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[308]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 56)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(192)] -// Release input[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(208)] -// Release input[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[176]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[60]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 60)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(208)] -// Release input[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(224)] -// Release input[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[316]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-304)] -// Release input[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(240)] -// Release input[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[188]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -64)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(256)] -// Release input[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-256)] -// Release input[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: 
Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[192]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-240)] -// Release input[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[320]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[204]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -48)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(256)] -// Release input[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[72]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[76]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 76)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(272)] -// Release input[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-192)] -// Release input[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[328]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r8 -// input[332]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 80)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(288)] -// Release input[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(304)] -// Release input[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[200]: Load as Q1 
-vldrw.u32 Q1, [r14, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(304)] -// Release input[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(320)] -// Release input[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[336]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-208)] -// Release input[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[208]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(336)] -// Release input[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[80]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-176)] -// Release input[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[216]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[220]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -32)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(320)] -// Release input[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[88]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 88)] 
-vqrdmulh.s32 Q4, Q4, r8 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-144)] -// Release input[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-128)] -// Release input[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[344]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r8 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -24)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(352)] -// Release input[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[96]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[100]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 100)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(368)] -// Release input[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[352]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(384)] -// Release input[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(400)] -// Release input[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[224]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(400)] -// Release input[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[360]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r8 -// 
input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-112)] -// Release input[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[232]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(432)] -// Release input[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[104]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 120)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-80)] -// Release input[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[240]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[244]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -8)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(416)] -// Release input[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(480)] -// Release input[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[112]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r8 -// input[116]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 116)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-48)] -// Release input[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-32)] -// Release input[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[368]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] 
-vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(448)] -// Release input[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(464)] -// Release input[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[120]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r8 -// input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(464)] -// Release input[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[376]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(480)] -// Release input[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[248]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(496)] -// Release input[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-16)] -// Release input[248] from Q1 -.equ modulus_inv, 2228766271 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3350 -// Instruction count: 2395 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s b/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s deleted file mode 100644 index e0ae9fe..0000000 --- a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s +++ /dev/null @@ -1,3182 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// 
SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. 
-/// - -.data -roots: -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 24724272 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 24724272 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 66119312 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 35138099 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 66119312 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 1608059253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 65038662 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 1581777230 
// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 78801296 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 1916492312 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 35138099 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 854578542 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 64980291 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 1580357614 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 58369496 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 1419579322 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 24724272 // XX: zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 601308349 /// zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 66119312 // XX: zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 1608059253 /// zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 35138099 // XX: zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 854578542 /// zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 65038662 // XX: zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 1581777230 /// zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 78801296 // XX: zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 1916492312 /// zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 64980291 // XX: zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 1580357614 /// zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 58369496 // XX: zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 
1419579322 /// zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 45729226 // XX: zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 1112160771 /// zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 50306038 // XX: zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 1223471440 /// zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 69050800 // XX: zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 1679354707 /// zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 60527953 // XX: zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 1472074223 /// zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 63570934 // XX: zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 1546081251 /// zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 19136236 // XX: zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 465404137 /// zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 83896878 // XX: zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2040419763 /// zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 79363826 // XX: zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 1930173362 /// zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good_bitrev, %function -.global ntt_384_u32_88299073_4883425_incomplete_good_bitrev -ntt_384_u32_88299073_4883425_incomplete_good_bitrev: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 1008 -add r14, r0, #1008 -// Use r12 as marker for r0 + 2016 -add r12, r14, #1008 -.equ modulus, -88299073 -movw r11, #:lower16:modulus -movt r11, #:upper16:modulus -ldr r10, roots_addr -ldrd 
r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 4)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r9 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r8 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r11 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[64]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 64)] -vadd.s32 Q6, Q4, Q3 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 68)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r14,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r14,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 -vstrw.u32 Q5, [r0,#(0)] -// input[64]: Already loaded as Q1 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r9 -vadd.s32 Q4, Q1, Q7 -// Release input[64] from Q1 -vqrdmulh.s32 Q3, Q0, r8 -// input[192]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vmla.s32 Q2, Q3, r11 -vsub.s32 Q3, Q1, Q7 -// Release input[320] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -92)] -vadd.s32 Q6, Q3, Q2 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r0,#(256)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r14,#(272)] -vadd.s32 Q4, Q4, Q1 -// Release input[192] from Q1 -vstrw.u32 Q4, [r14,#(-240)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, 
[r14,#(144)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r0,#(384)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -44)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r14,#(-432)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r14,#(336)] -// input[304]: Already loaded as Q5 -// input[176]: Already 
loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r0,#(192)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -116)] -vadd.s32 Q6, Q2, Q1 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r14,#(-48)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 12)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r14,#(48)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// 
Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r0,#(288)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -84)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -20)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r14,#(-336)] -// input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r14,#(432)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[24]: Load 
as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r0,#(96)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -68)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r14,#(-144)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 60)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -4)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r14,#(240)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, 
Q7 -// Release input[248] from Q7 -// input[4]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 4)] -vadd.s32 Q6, Q2, Q1 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 8)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r0,#(480)] -// input[4]: Already loaded as Q5 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[4] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[132]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[260] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * -56)] -vadd.s32 Q6, Q2, Q1 -// input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r0,#(16)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[132] from Q4 -vstrw.u32 Q3, [r14,#(-480)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r14,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r14,#(288)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r9 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r8 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r11 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 
* 100)]
-vadd.s32 Q6, Q2, Q1
-// input[356]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 104)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(160)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-352)]
-vadd.s32 Q3, Q3, Q4
-// Release input[36] from Q4
-vstrw.u32 Q3, [r0,#(144)]
-// input[100]: Already loaded as Q5
-// input[356]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[100] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[356] from Q7
-// input[148]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -104)]
-vadd.s32 Q6, Q2, Q1
-// input[20]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 20)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(400)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(416)]
-vadd.s32 Q3, Q3, Q4
-// Release input[228] from Q4
-vstrw.u32 Q3, [r14,#(-96)]
-// input[148]: Already loaded as Q5
-// input[20]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[148] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[276]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 24)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[20] from Q7
-// input[340]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 88)]
-vadd.s32 Q6, Q2, Q1
-// input[212]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -40)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-416)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(80)]
-vadd.s32 Q3, Q3, Q4
-// Release input[276] from Q4
-vstrw.u32 Q3, [r14,#(96)]
-// input[340]: Already loaded as Q5
-// input[212]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[340] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[84]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 84)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[212] from Q7
-// input[52]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 52)]
-vadd.s32 Q6, Q2, Q1
-// input[308]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 56)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(352)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-160)]
-vadd.s32 Q3, Q3, Q4
-// Release input[84] from Q4
-vstrw.u32 Q3, [r0,#(336)]
-// input[52]: Already loaded as Q5
-// input[308]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[52] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[180]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -72)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[308] from Q7
-// input[244]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -8)]
-vadd.s32 Q6, Q2, Q1
-// input[116]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 116)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(208)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(224)]
-vadd.s32 Q3, Q3, Q4
-// Release input[180] from Q4
-vstrw.u32 Q3, [r14,#(-288)]
-// input[244]: Already loaded as Q5
-// input[116]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[244] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[372]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 120)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[116] from Q7
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q6, Q2, Q1
-// input[140]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -112)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-32)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(464)]
-vadd.s32 Q3, Q3, Q4
-// Release input[372] from Q4
-vstrw.u32 Q3, [r14,#(480)]
-// input[268]: Already loaded as Q5
-// input[140]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[268] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[12]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 12)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[140] from Q7
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q6, Q2, Q1
-// input[332]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 80)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(64)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-448)]
-vadd.s32 Q3, Q3, Q4
-// Release input[12] from Q4
-vstrw.u32 Q3, [r0,#(48)]
-// input[76]: Already loaded as Q5
-// input[332]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[76] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[204]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[332] from Q7
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q6, Q2, Q1
-// input[44]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 44)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(304)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(320)]
-vadd.s32 Q3, Q3, Q4
-// Release input[204] from Q4
-vstrw.u32 Q3, [r14,#(-192)]
-// input[172]: Already loaded as Q5
-// input[44]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[172] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[300]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 48)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[44] from Q7
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q6, Q2, Q1
-// input[236]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -16)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-320)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(176)]
-vadd.s32 Q3, Q3, Q4
-// Release input[300] from Q4
-vstrw.u32 Q3, [r14,#(192)]
-// input[364]: Already loaded as Q5
-// input[236]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[364] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[108]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[236] from Q7
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q6, Q2, Q1
-// input[284]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 32)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(448)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-64)]
-vadd.s32 Q3, Q3, Q4
-// Release input[108] from Q4
-vstrw.u32 Q3, [r0,#(432)]
-// input[28]: Already loaded as Q5
-// input[284]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[28] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[156]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[284] from Q7
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q6, Q2, Q1
-// input[92]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 92)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r0,#(112)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(128)]
-vadd.s32 Q3, Q3, Q4
-// Release input[156] from Q4
-vstrw.u32 Q3, [r14,#(-384)]
-// input[220]: Already loaded as Q5
-// input[92]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[220] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[348]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[92] from Q7
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q6, Q2, Q1
-// input[188]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -64)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(-128)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(368)]
-vadd.s32 Q3, Q3, Q4
-// Release input[348] from Q4
-vstrw.u32 Q3, [r14,#(384)]
-// input[316]: Already loaded as Q5
-// input[188]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[316] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[60]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 60)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[188] from Q7
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q6, Q2, Q1
-// input[380]: Load as Q7
-vldrw.u32 Q7, [r12, #(4 * -124)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q6, [r14,#(256)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-256)]
-vadd.s32 Q3, Q3, Q4
-// Release input[60] from Q4
-vstrw.u32 Q3, [r0,#(240)]
-// input[124]: Already loaded as Q5
-// input[380]: Already loaded as Q7
-vsub.s32 Q0, Q5, Q7
-vmul.u32 Q1, Q0, r9
-vadd.s32 Q3, Q5, Q7
-// Release input[124] from Q5
-vqrdmulh.s32 Q2, Q0, r8
-// input[252]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 0)]
-vmla.s32 Q1, Q2, r11
-vsub.s32 Q2, Q4, Q7
-// Release input[380] from Q7
-vadd.s32 Q5, Q2, Q1
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q5, [r0,#(496)]
-vsub.s32 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-496)]
-vadd.s32 Q3, Q3, Q4
-// Release input[252] from Q4
-vstrw.u32 Q3, [r14,#(0)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-// input[12]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 12)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r5
-// input[132]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -120)]
-vadd.s32 Q0, Q0, Q1
-// Release input[12] from Q1
-// input[0]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// input[204]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -48)]
-vadd.s32 Q1, Q1, Q4
-// Release input[132] from Q4
-vqrdmulh.s32 Q2, Q2, r4
-vsub.s32 Q4, Q1, Q0
-// input[72]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 72)]
-vmla.s32 Q3, Q2, r11
-vstrw.u32 Q4, [r14,#(48)]
-vadd.s32 Q1, Q1, Q0
-// Release input[264] from Q0
-vstrw.u32 Q1, [r0,#(0)]
-// Release input[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r0,#(48)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r14,#(-480)]
-// input[72]: Already loaded as Q7
-// input[204]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r5
-// input[324]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release input[204] from Q6
-// input[192]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -60)]
-vsub.s32 Q4, Q3, Q2
-// input[300]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[324] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[168]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -84)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(288)]
-vadd.s32 Q3, Q3, Q7
-// Release input[72] from Q7
-vstrw.u32 Q3, [r14,#(-240)]
-// Release input[192] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(288)]
-// input[168]: Already loaded as Q6
-// input[300]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[36]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 36)]
-vadd.s32 Q6, Q6, Q5
-// Release input[300] from Q5
-// input[288]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 36)]
-vsub.s32 Q4, Q3, Q2
-// input[108]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release input[36] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[360]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 108)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release input[168] from Q6
-vstrw.u32 Q3, [r14,#(144)]
-// Release input[288] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(192)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(144)]
-// input[360]: Already loaded as Q7
-// input[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[228]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -24)]
-vadd.s32 Q7, Q7, Q5
-// Release input[108] from Q5
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vsub.s32 Q4, Q3, Q2
-// input[156]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[228] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[24]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 24)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(432)]
-vadd.s32 Q3, Q3, Q7
-// Release input[360] from Q7
-vstrw.u32 Q3, [r0,#(384)]
-// Release input[96] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-96)]
-// input[24]: Already loaded as Q6
-// input[156]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[276]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 24)]
-vadd.s32 Q6, Q6, Q5
-// Release input[156] from Q5
-// input[144]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// input[348]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 96)]
-vadd.s32 Q3, Q3, Q2
-// Release input[276] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[216]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(96)]
-vadd.s32 Q3, Q3, Q6
-// Release input[24] from Q6
-vstrw.u32 Q3, [r14,#(-432)]
-// Release input[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(96)]
-// input[216]: Already loaded as Q7
-// input[348]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[84]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release input[348] from Q5
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vsub.s32 Q4, Q3, Q2
-// input[60]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 60)]
-vadd.s32 Q3, Q3, Q2
-// Release input[84] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[312]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-144)]
-vadd.s32 Q3, Q3, Q7
-// Release input[216] from Q7
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(384)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(336)]
-// input[312]: Already loaded as Q6
-// input[60]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[180]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -72)]
-vadd.s32 Q6, Q6, Q5
-// Release input[60] from Q5
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vsub.s32 Q4, Q3, Q2
-// input[252]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release input[180] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[120]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 120)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release input[312] from Q6
-vstrw.u32 Q3, [r0,#(192)]
-// Release input[48] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(240)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-288)]
-// input[120]: Already loaded as Q7
-// input[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vadd.s32 Q7, Q7, Q5
-// Release input[252] from Q5
-// input[240]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -12)]
-vsub.s32 Q4, Q3, Q2
-// input[268]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[372] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[136]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -116)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(480)]
-vadd.s32 Q3, Q3, Q7
-// Release input[120] from Q7
-vstrw.u32 Q3, [r14,#(-48)]
-// Release input[240] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(480)]
-// input[136]: Already loaded as Q6
-// input[268]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[4]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 4)]
-vadd.s32 Q6, Q6, Q5
-// Release input[268] from Q5
-// input[256]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// input[76]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 76)]
-vadd.s32 Q3, Q3, Q2
-// Release input[4] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[328]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 76)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-464)]
-vadd.s32 Q3, Q3, Q6
-// Release input[136] from Q6
-vstrw.u32 Q3, [r14,#(16)]
-// Release input[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(16)]
-// input[328]: Already loaded as Q7
-// input[76]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[196]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release input[76] from Q5
-// input[64]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q2
-// input[172]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[196] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[40]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 40)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(304)]
-vadd.s32 Q3, Q3, Q7
-// Release input[328] from Q7
-vstrw.u32 Q3, [r0,#(256)]
-// Release input[64] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(304)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-224)]
-// input[40]: Already loaded as Q6
-// input[172]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[292]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 40)]
-vadd.s32 Q6, Q6, Q5
-// Release input[172] from Q5
-// input[160]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -92)]
-vsub.s32 Q4, Q3, Q2
-// input[364]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[292] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[232]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -20)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release input[40] from Q6
-vstrw.u32 Q3, [r14,#(-368)]
-// Release input[160] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(160)]
-// input[232]: Already loaded as Q7
-// input[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[100]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 100)]
-vadd.s32 Q7, Q7, Q5
-// Release input[364] from Q5
-// input[352]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 100)]
-vsub.s32 Q4, Q3, Q2
-// input[28]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 28)]
-vadd.s32 Q3, Q3, Q2
-// Release input[100] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[280]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 28)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-80)]
-vadd.s32 Q3, Q3, Q7
-// Release input[232] from Q7
-vstrw.u32 Q3, [r14,#(400)]
-// Release input[352] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(400)]
-// input[280]: Already loaded as Q6
-// input[28]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[148]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -104)]
-vadd.s32 Q6, Q6, Q5
-// Release input[28] from Q5
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// input[220]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[148] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[88]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 88)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(112)]
-vadd.s32 Q3, Q3, Q6
-// Release input[280] from Q6
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-// input[88]: Already loaded as Q7
-// input[220]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[340]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release input[220] from Q5
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vsub.s32 Q4, Q3, Q2
-// input[316]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[340] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[184]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -68)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(352)]
-vadd.s32 Q3, Q3, Q7
-// Release input[88] from Q7
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(352)]
-// input[184]: Already loaded as Q6
-// input[316]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[52]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 52)]
-vadd.s32 Q6, Q6, Q5
-// Release input[316] from Q5
-// input[304]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 52)]
-vsub.s32 Q4, Q3, Q2
-// input[124]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[52] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[376]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 124)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release input[184] from Q6
-vstrw.u32 Q3, [r14,#(208)]
-// Release input[304] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(208)]
-// input[376]: Already loaded as Q7
-// input[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[244]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -8)]
-vadd.s32 Q7, Q7, Q5
-// Release input[124] from Q5
-// input[112]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 112)]
-vsub.s32 Q4, Q3, Q2
-// input[140]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -112)]
-vadd.s32 Q3, Q3, Q2
-// Release input[244] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[8]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 8)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(496)]
-vadd.s32 Q3, Q3, Q7
-// Release input[376] from Q7
-vstrw.u32 Q3, [r0,#(448)]
-// Release input[112] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-32)]
-// input[8]: Already loaded as Q6
-// input[140]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[260]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 8)]
-vadd.s32 Q6, Q6, Q5
-// Release input[140] from Q5
-// input[128]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// input[332]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 80)]
-vadd.s32 Q3, Q3, Q2
-// Release input[260] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[200]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -52)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(32)]
-vadd.s32 Q3, Q3, Q6
-// Release input[8] from Q6
-vstrw.u32 Q3, [r14,#(-496)]
-// Release input[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(32)]
-// input[200]: Already loaded as Q7
-// input[332]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[68]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release input[332] from Q5
-// input[320]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 68)]
-vsub.s32 Q4, Q3, Q2
-// input[44]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 44)]
-vadd.s32 Q3, Q3, Q2
-// Release input[68] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[296]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 44)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-208)]
-vadd.s32 Q3, Q3, Q7
-// Release input[200] from Q7
-vstrw.u32 Q3, [r14,#(272)]
-// Release input[320] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(320)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(272)]
-// input[296]: Already loaded as Q6
-// input[44]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[164]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -88)]
-vadd.s32 Q6, Q6, Q5
-// Release input[44] from Q5
-// input[32]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// input[236]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release input[164] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[104]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 104)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release input[296] from Q6
-vstrw.u32 Q3, [r0,#(128)]
-// Release input[32] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(176)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-352)]
-// input[104]: Already loaded as Q7
-// input[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[356]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 104)]
-vadd.s32 Q7, Q7, Q5
-// Release input[236] from Q5
-// input[224]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -28)]
-vsub.s32 Q4, Q3, Q2
-// input[284]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 32)]
-vadd.s32 Q3, Q3, Q2
-// Release input[356] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[152]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * -100)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(416)]
-vadd.s32 Q3, Q3, Q7
-// Release input[104] from Q7
-vstrw.u32 Q3, [r14,#(-112)]
-// Release input[224] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(416)]
-// input[152]: Already loaded as Q6
-// input[284]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[20]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 20)]
-vadd.s32 Q6, Q6, Q5
-// Release input[284] from Q5
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// input[92]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 92)]
-vadd.s32 Q3, Q3, Q2
-// Release input[20] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[344]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * 92)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-400)]
-vadd.s32 Q3, Q3, Q6
-// Release input[152] from Q6
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(128)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(80)]
-// input[344]: Already loaded as Q7
-// input[92]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[212]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release input[92] from Q5
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vsub.s32 Q4, Q3, Q2
-// input[188]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * -64)]
-vadd.s32 Q3, Q3, Q2
-// Release input[212] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[56]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 56)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(368)]
-vadd.s32 Q3, Q3, Q7
-// Release input[344] from Q7
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r0,#(368)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(-160)]
-// input[56]: Already loaded as Q6
-// input[188]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r5
-// input[308]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 56)]
-vadd.s32 Q6, Q6, Q5
-// Release input[188] from Q5
-// input[176]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -76)]
-vsub.s32 Q4, Q3, Q2
-// input[380]: Load as Q5
-vldrw.u32 Q5, [r12, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release input[308] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q6
-// input[248]: Load as Q7
-vldrw.u32 Q7, [r14, #(4 * -4)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r0,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release input[56] from Q6
-vstrw.u32 Q3, [r14,#(-304)]
-// Release input[176] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r14,#(-256)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r14,#(224)]
-// input[248]: Already loaded as Q7
-// input[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r5
-// input[116]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 116)]
-vadd.s32 Q7, Q7, Q5
-// Release input[380] from Q5
-// input[368]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 116)]
-vsub.s32 Q4, Q3, Q2
-// input[48]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 48)]
-vadd.s32 Q3, Q3, Q2
-// Release input[116] from Q2
-vqrdmulh.s32 Q0, Q0, r4
-vsub.s32 Q2, Q3, Q7
-// input[288]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 36)]
-vmla.s32 Q1, Q0, r11
-vstrw.u32 Q2, [r14,#(-16)]
-vadd.s32 Q3, Q3, Q7
-// Release input[248] from Q7
-vstrw.u32 Q3, [r14,#(464)]
-// Release input[368] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r12,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r0,#(464)]
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[48]: Already loaded as Q5
-vmul.u32 Q0, Q5, r9
-// input[144]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r8
-// input[288]: Already loaded as Q6
-vmla.s32 Q0, Q5, r11
-vmul.u32 Q2, Q1, r9
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r8
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r11
-// input[0]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vmul.u32 Q3, Q5, r5
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r11
-// input[240]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -12)]
-vmul.u32 Q4, Q6, r7
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r6
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r0,#(192)]
-// Release input[48] from Q5
-vmla.s32 Q4, Q6, r11
-vstrw.u32 Q1, [r14,#(-432)]
-// Release input[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r14,#(144)]
-// Release input[288] from Q6
-vadd.s32 Q0, Q0, Q4
-// input[240]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[336]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[96]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 96)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(0)]
-// Release input[0] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[192]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -60)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[304]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 52)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-48)]
-// Release input[240] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(336)]
-// Release input[336] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(384)]
-// Release input[96] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[304]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[16]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[160]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -92)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-240)]
-// Release input[192] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[256]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 4)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[112]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 112)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(208)]
-// Release input[304] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(64)]
-// Release input[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-368)]
-// Release input[160] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[112]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[208]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -44)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[352]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 100)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(16)]
-// Release input[256] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[64]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 64)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[176]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -76)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r0,#(448)]
-// Release input[112] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-176)]
-// Release input[208] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(400)]
-// Release input[352] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[176]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[272]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[32]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 32)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r0,#(256)]
-// Release input[64] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[128]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -124)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[368]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 116)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r14,#(-304)]
-// Release input[176] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(80)]
-// Release input[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r0,#(128)]
-// Release input[32] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[368]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[80]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 80)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[224]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -28)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r14,#(-496)]
-// Release input[128] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[320]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 68)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[312]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 60)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(464)]
-// Release input[368] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(320)]
-// Release input[80] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r14,#(-112)]
-// Release input[224] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[24]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[168]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -84)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(272)]
-// Release input[320] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[264]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 12)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[120]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(240)]
-// Release input[312] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(96)]
-// Release input[24] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-336)]
-// Release input[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[120]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[216]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -36)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[360]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 108)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(48)]
-// Release input[264] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[72]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 72)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[184]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -68)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(480)]
-// Release input[120] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-144)]
-// Release input[216] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(432)]
-// Release input[360] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[280]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 28)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[40]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 40)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(288)]
-// Release input[72] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[136]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -116)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[376]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * 124)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-272)]
-// Release input[184] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(112)]
-// Release input[280] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(160)]
-// Release input[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// input[376]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[88]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 88)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[232]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * -20)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-464)]
-// Release input[136] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[328]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * 76)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[56]: Load as Q2
-vldrw.u32 Q2, [r0, #(4 * 56)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(496)]
-// Release input[376] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r0,#(352)]
-// Release input[88] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r14,#(-80)]
-// Release input[232] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[152]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * -100)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[296]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vmla.s32 Q1, Q2, r11
-vstrw.u32 Q0, [r14,#(304)]
-// Release input[328] from Q0
-vmul.u32 Q0, Q3, r9
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r11
-// input[8]: Load as Q1
-vldrw.u32 Q1, [r0, #(4 * 8)]
-vmul.u32 Q5, Q2, r5
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r4
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r11
-// input[248]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -4)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r0,#(224)]
-// Release input[56] from Q2
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(-400)]
-// Release input[152] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r14,#(176)]
-// Release input[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// input[248]: Already loaded as Q0
-vmul.u32 Q2, Q0, r9
-// input[344]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 92)]
-vqrdmulh.s32 Q0, Q0, r8
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vmla.s32 Q2, Q0, r11
-vstrw.u32 Q1, [r0,#(32)]
-// Release input[8] from Q1
-vmul.u32 Q1, Q3, r9
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r11
-// input[200]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * -52)]
-vmul.u32 Q5, Q0, r5
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r4
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r11
-// input[180]: Load as Q1
-vldrw.u32 Q1, [r14, #(4 * -72)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r14,#(-16)]
-// Release input[248] from Q0
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(368)]
-// Release input[344] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r0,#(416)]
-// Release input[104] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r9, r8, [r10], #+8
-ldrd r7, r6, [r10], #+8
-ldrd r5, r4, [r10], #+8
-// input[180]: Already loaded as Q1
-vmul.u32 Q0, Q1, r9
-// input[276]: Load as Q3
-vldrw.u32 Q3, [r14, #(4 * 24)]
-vqrdmulh.s32 Q1, Q1, r8
-// input[36]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 36)]
-vmla.s32 Q0, Q1, r11
-vstrw.u32 Q2, [r14,#(-208)]
-// Release input[200] from Q2
-vmul.u32 Q2, Q3, r9
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r8
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r11
-// input[132]: Load as Q0
-vldrw.u32 Q0, [r14, #(4 * -120)]
-vmul.u32 Q5, Q1, r5
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r4
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r11
-// input[372]: Load as Q2
-vldrw.u32 Q2, [r14, #(4 * 120)]
-vmul.u32 Q6, Q4, r7
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r6
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r14,#(-288)]
-// Release input[180] from Q1
-vmla.s32 Q6, Q4, r11
-vstrw.u32 Q3, [r14,#(96)]
-// Release input[276] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r0,#(144)]
-// Release input[36] from Q4
-vadd.s32 Q0, Q0, Q6
-// input[372]: Already loaded as Q2
-vmul.u32 Q1, Q2, r9
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r8
-// input[228]: Load as Q4
-vldrw.u32 Q4, [r14, #(4
* -24)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-480)] -// Release input[132] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[324]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 72)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[52]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 52)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(480)] -// Release input[372] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(336)] -// Release input[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-96)] -// Release input[228] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[52]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[148]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -104)] -vqrdmulh.s32 Q0, Q0, r8 -// input[292]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 40)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(288)] -// Release input[324] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[4]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 4)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[244]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -8)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(208)] -// Release input[52] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-416)] -// Release input[148] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(160)] -// Release input[292] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[244]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(16)] -// Release input[4] from Q2 -vmul.u32 Q2, Q3, r9 
-vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[196]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -56)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[308]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 56)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r14,#(-32)] -// Release input[244] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[308]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[20]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r8 -// input[164]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r14,#(-224)] -// Release input[196] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[260]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 8)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[116]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 116)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(224)] -// Release input[308] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(80)] -// Release input[20] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r14,#(-352)] -// Release input[164] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[116]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[212]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r8 -// input[356]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 104)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(32)] -// Release input[260] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[68]: 
Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 68)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r0,#(464)] -// Release input[116] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-160)] -// Release input[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(416)] -// Release input[356] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r9, r8, [r10], #+8 -ldrd r7, r6, [r10], #+8 -ldrd r5, r4, [r10], #+8 -// input[60]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[156]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -96)] -vqrdmulh.s32 Q1, Q1, r8 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 48)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r0,#(272)] -// Release input[68] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[12]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 12)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r11 -// input[252]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 0)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-384)] -// Release input[156] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(192)] -// Release input[300] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[348]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 96)] -vqrdmulh.s32 Q2, Q2, r8 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(48)] -// Release input[12] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[204]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 
-48)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[316]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(0)] -// Release input[252] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(384)] -// Release input[348] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(432)] -// Release input[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[316]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[28]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 28)] -vqrdmulh.s32 Q0, Q0, r8 -// input[172]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -80)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-192)] -// Release input[204] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[268]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 16)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -// input[124]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r14,#(256)] -// Release input[316] from Q0 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r0,#(112)] -// Release input[28] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r14,#(-320)] -// Release input[172] from Q4 -vadd.s32 Q2, Q2, Q6 -// input[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r9 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vqrdmulh.s32 Q1, Q1, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q1, r11 -vstrw.u32 Q2, [r14,#(64)] -// Release input[268] from Q2 -vmul.u32 Q2, Q3, r9 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r11 -// input[76]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 76)] -vmul.u32 Q5, Q1, r5 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r4 -vadd.s32 Q0, Q0, Q2 -vmla.s32 
Q5, Q1, r11 -// input[188]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -64)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r0,#(496)] -// Release input[124] from Q1 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// input[188]: Already loaded as Q2 -vmul.u32 Q1, Q2, r9 -// input[284]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 32)] -vqrdmulh.s32 Q2, Q2, r8 -// input[44]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 44)] -vmla.s32 Q1, Q2, r11 -vstrw.u32 Q0, [r0,#(304)] -// Release input[76] from Q0 -vmul.u32 Q0, Q3, r9 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r11 -// input[140]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vmul.u32 Q5, Q2, r5 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r4 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r11 -// input[380]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * -124)] -vmul.u32 Q6, Q4, r7 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r14,#(-256)] -// Release input[188] from Q2 -vmla.s32 Q6, Q4, r11 -vstrw.u32 Q3, [r14,#(128)] -// Release input[284] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r0,#(176)] -// Release input[44] from Q4 -vadd.s32 Q1, Q1, Q6 -// input[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r9 -// input[92]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 92)] -vqrdmulh.s32 Q0, Q0, r8 -// input[236]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -16)] -vmla.s32 Q2, Q0, r11 -vstrw.u32 Q1, [r14,#(-448)] -// Release input[140] from Q1 -vmul.u32 Q1, Q3, r9 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r8 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r11 -// input[332]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 80)] -vmul.u32 Q5, Q0, r5 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r4 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r11 -vmul.u32 Q1, Q4, r7 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r6 -vadd.s32 Q3, Q3, Q5 
-vstrw.u32 Q0, [r12,#(-496)] -// Release input[380] from Q0 -vmla.s32 Q1, Q4, r11 -vstrw.u32 Q3, [r0,#(368)] -// Release input[92] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r14,#(-64)] -// Release input[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(320)] -// Release input[332] from Q2 -ldrd r9, r8, [r10], #+8 -// input[192]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * -60)] -vmul.u32 Q1, Q0, r9 -// input[0]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r8 -// input[64]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 64)] -vmla.s32 Q1, Q0, r11 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r14,#(-240)] -// Release input[192] from Q0 -vadd.s32 Q2, Q2, Q1 -// input[64]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[256]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[320]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(0)] -// Release input[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(256)] -// Release input[64] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[320]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[128]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r8 -// input[96]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 96)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(16)] -// Release input[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(272)] -// Release input[320] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[96]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[288]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[352]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-496)] -// Release input[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(384)] -// Release input[96] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[352]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[160]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -92)] -vqrdmulh.s32 Q4, Q4, r8 -// input[224]: Load as Q3 
-vldrw.u32 Q3, [r14, #(4 * -28)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(144)] -// Release input[288] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(400)] -// Release input[352] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[224]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[32]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 32)] -vqrdmulh.s32 Q3, Q3, r8 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-368)] -// Release input[160] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-112)] -// Release input[224] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[336]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[144]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r8 -// input[208]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -44)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(128)] -// Release input[32] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(336)] -// Release input[336] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[208]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[16]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[80]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-432)] -// Release input[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-176)] -// Release input[208] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[80]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[272]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[240]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -12)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(64)] -// Release input[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(320)] -// Release input[80] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[240]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[48]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[112]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 112)] -vmla.s32 
Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(80)] -// Release input[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-48)] -// Release input[240] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[112]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[304]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[368]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(192)] -// Release input[48] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(448)] -// Release input[112] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[368]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[176]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vqrdmulh.s32 Q3, Q3, r8 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(208)] -// Release input[304] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(464)] -// Release input[368] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[72]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[264]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[328]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-304)] -// Release input[176] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(288)] -// Release input[72] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[328]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[136]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vqrdmulh.s32 Q3, Q3, r8 -// input[200]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(48)] -// Release input[264] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(304)] -// Release input[328] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[200]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[8]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[360]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-464)] -// Release input[136] 
from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-208)] -// Release input[200] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[360]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[168]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r8 -// input[232]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -20)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(32)] -// Release input[8] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(432)] -// Release input[360] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[232]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[40]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[104]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-336)] -// Release input[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-80)] -// Release input[232] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[104]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[296]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -36)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(160)] -// Release input[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(416)] -// Release input[104] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[216]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[24]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[88]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(176)] -// Release input[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-144)] -// Release input[216] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[88]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[280]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[344]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(96)] -// Release input[24] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, 
[r0,#(352)] -// Release input[88] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[344]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[152]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -100)] -vqrdmulh.s32 Q4, Q4, r8 -// input[120]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(112)] -// Release input[280] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(368)] -// Release input[344] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[120]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[312]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// input[376]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-400)] -// Release input[152] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(480)] -// Release input[120] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[376]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[184]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r8 -// input[248]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -4)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(240)] -// Release input[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(496)] -// Release input[376] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[248]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[56]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 72)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-272)] -// Release input[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-16)] -// Release input[248] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[132]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -120)] -vqrdmulh.s32 Q4, Q4, r8 -// input[196]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -56)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(224)] -// Release input[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(288)] -// Release 
input[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r8 -// input[68]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 68)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-480)] -// Release input[132] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-224)] -// Release input[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[260]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 8)] -vqrdmulh.s32 Q4, Q4, r8 -// input[228]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -24)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(16)] -// Release input[4] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(272)] -// Release input[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[228]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[36]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vqrdmulh.s32 Q3, Q3, r8 -// input[100]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 100)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(32)] -// Release input[260] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-96)] -// Release input[228] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[100]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[292]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r8 -// input[356]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 104)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(144)] -// Release input[36] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(400)] -// Release input[100] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[356]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[164]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vqrdmulh.s32 Q3, Q3, r8 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(160)] -// Release input[292] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(416)] -// Release input[356] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// 
input[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[276]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 24)] -vqrdmulh.s32 Q4, Q4, r8 -// input[340]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 88)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-352)] -// Release input[164] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(336)] -// Release input[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[148]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vqrdmulh.s32 Q3, Q3, r8 -// input[212]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(96)] -// Release input[276] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(352)] -// Release input[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[20]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r8 -// input[372]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 120)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-416)] -// Release input[148] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-160)] -// Release input[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[372]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[180]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vqrdmulh.s32 Q3, Q3, r8 -// input[244]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -8)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(80)] -// Release input[20] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(480)] -// Release input[372] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[244]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[52]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 52)] -vqrdmulh.s32 Q4, Q4, r8 -// input[116]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 116)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(-288)] -// Release input[180] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-32)] -// Release input[244] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[116]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// 
input[308]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r8 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * -48)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(208)] -// Release input[52] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(464)] -// Release input[116] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[204]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[12]: Load as Q2 -vldrw.u32 Q2, [r0, #(4 * 12)] -vqrdmulh.s32 Q4, Q4, r8 -// input[76]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 76)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(224)] -// Release input[308] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(-192)] -// Release input[204] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[76]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[268]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r8 -// input[332]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 80)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r0,#(48)] -// Release input[12] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(304)] -// Release input[76] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[332]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[140]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -112)] -vqrdmulh.s32 Q4, Q4, r8 -// input[108]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 108)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(64)] -// Release input[268] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(320)] -// Release input[332] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[300]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 48)] -vqrdmulh.s32 Q3, Q3, r8 -// input[364]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-448)] -// Release input[140] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r0,#(432)] -// Release input[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[172]: Load as Q2 -vldrw.u32 Q2, [r14, 
#(4 * -80)] -vqrdmulh.s32 Q4, Q4, r8 -// input[236]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -16)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r14,#(192)] -// Release input[300] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(448)] -// Release input[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[44]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r8 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 96)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-320)] -// Release input[172] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-64)] -// Release input[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r9, r8, [r10], #+8 -// input[348]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[156]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * -96)] -vqrdmulh.s32 Q4, Q4, r8 -// input[220]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * -32)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(176)] -// Release input[44] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r14,#(384)] -// Release input[348] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[220]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[28]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 28)] -vqrdmulh.s32 Q3, Q3, r8 -// input[92]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 92)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(-384)] -// Release input[156] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(-128)] -// Release input[220] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[92]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[284]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r8 -// input[252]: Load as Q3 -vldrw.u32 Q3, [r14, #(4 * 0)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(112)] -// Release input[28] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(368)] -// Release input[92] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r9, r8, [r10], #+8 -// input[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[60]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r8 -// 
input[124]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 124)] -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(128)] -// Release input[284] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r14,#(0)] -// Release input[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// input[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r9 -// input[316]: Load as Q2 -vldrw.u32 Q2, [r14, #(4 * 64)] -vqrdmulh.s32 Q4, Q4, r8 -// input[380]: Load as Q3 -vldrw.u32 Q3, [r12, #(4 * -124)] -vmla.s32 Q0, Q4, r11 -vstrw.u32 Q1, [r0,#(240)] -// Release input[60] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r0,#(496)] -// Release input[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// input[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r9 -// input[188]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vqrdmulh.s32 Q3, Q3, r8 -vmla.s32 Q0, Q3, r11 -vstrw.u32 Q2, [r14,#(256)] -// Release input[316] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r12,#(-496)] -// Release input[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-256)] -// Release input[188] from Q1 -.equ modulus_inv, 2228766271 -movw r9, #:lower16:modulus_inv -movt r9, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3150 -// Instruction count: 2196 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop.s b/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop.s deleted file mode 100644 index 43841df..0000000 --- a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop.s +++ /dev/null @@ -1,3388 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the 
Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_twiddles -ntt_384_u32_88299073_4883425_incomplete_good_oop_twiddles: // For base multiplication -.word 75231281 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 3951395343 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 15452769 // zeta^ 64 * 2^31 = 4883425^ 64 * 2^31 = 85764717 * 2^31 -.word 2033538015 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 64 * 2066201025 * 2^31 -.word 19987225 // zeta^ 32 * 2^31 = 4883425^ 32 * 2^31 = 19144749 * 2^31 -.word 1892589863 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 32 * 2066201025 * 2^31 -.word 50503029 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 2741681611 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 81982457 // zeta^ 16 * 2^31 = 4883425^ 16 * 2^31 = 76960665 * 2^31 -.word 2158501959 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 16 * 2066201025 * 2^31 -.word 20023469 // zeta^ 80 * 2^31 = 4883425^ 80 * 2^31 = 41822566 * 2^31 -.word 1552412819 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 80 * 
2066201025 * 2^31 -.word 55876839 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 1939982041 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 43619891 // zeta^112 * 2^31 = 4883425^112 * 2^31 = 44400103 * 2^31 -.word 2850416781 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4883425^112 * 2066201025 * 2^31 -.word 172662323 // zeta^ 8 * 2^31 = 4883425^ 8 * 2^31 = 26094785 * 2^31 -.word 3064389773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 8 * 2066201025 * 2^31 -.word 71853543 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 4036378073 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 23697415 // zeta^ 40 * 2^31 = 4883425^ 40 * 2^31 = 55309930 * 2^31 -.word 443962297 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 40 * 2066201025 * 2^31 -.word 76499159 // zeta^104 * 2^31 = 4883425^104 * 2^31 = 78628712 * 2^31 -.word 2660611817 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4883425^104 * 2066201025 * 2^31 -.word 56990949 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 337656411 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 120013125 // zeta^ 88 * 2^31 = 4883425^ 88 * 2^31 = 20668553 * 2^31 -.word 616164859 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 88 * 2066201025 * 2^31 -.word 28856125 // zeta^ 56 * 2^31 = 4883425^ 56 * 2^31 = 41675533 * 2^31 -.word 561917443 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 56 * 2066201025 * 2^31 -.word 159401217 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 642203967 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 12190033 // zeta^ 4 * 2^31 = 4883425^ 4 * 2^31 = 4883425 * 2^31 -.word 3933894895 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 4 * 2066201025 * 2^31 -.word 108088419 // zeta^ 68 * 2^31 = 4883425^ 68 * 2^31 = 13818672 * 2^31 -.word 273473117 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 68 * 2066201025 * 2^31 
-.word 142353279 // zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 2003400257 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 143392463 // zeta^100 * 2^31 = 4883425^100 * 2^31 = 35160276 * 2^31 -.word 482889457 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4883425^100 * 2066201025 * 2^31 -.word 119167385 // zeta^ 20 * 2^31 = 4883425^ 20 * 2^31 = 52712221 * 2^31 -.word 1897128615 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 20 * 2066201025 * 2^31 -.word 9268541 // zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 1847889923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 80397967 // zeta^ 52 * 2^31 = 4883425^ 52 * 2^31 = 81877099 * 2^31 -.word 3839489841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 52 * 2066201025 * 2^31 -.word 16520015 // zeta^116 * 2^31 = 4883425^116 * 2^31 = 18306165 * 2^31 -.word 838359665 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4883425^116 * 2066201025 * 2^31 -.word 115982427 // zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 3605477477 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 55226367 // zeta^ 76 * 2^31 = 4883425^ 76 * 2^31 = 50302558 * 2^31 -.word 917047745 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 76 * 2066201025 * 2^31 -.word 136968867 // zeta^ 44 * 2^31 = 4883425^ 44 * 2^31 = 63650411 * 2^31 -.word 40189981 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 44 * 2066201025 * 2^31 -.word 68313423 // zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 3720973425 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 117342749 // zeta^ 28 * 2^31 = 4883425^ 28 * 2^31 = 32879858 * 2^31 -.word 212726563 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 28 * 2066201025 * 2^31 -.word 64009947 // zeta^ 92 * 2^31 = 4883425^ 92 * 2^31 = 70872893 * 2^31 -.word 925164005 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 92 * 2066201025 * 2^31 -.word 
55029279 // zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1315460001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.word 99141453 // zeta^124 * 2^31 = 4883425^124 * 2^31 = 15592642 * 2^31 -.word 4156561907 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4883425^124 * 2066201025 * 2^31 -.word 28520561 // zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2377109967 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 101366865 // zeta^192 * 2^31 = 4883425^192 * 2^31 = 88299072 * 2^31 -.word 343571951 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4883425^192 * 2066201025 * 2^31 -.word 118814877 // zeta^160 * 2^31 = 4883425^160 * 2^31 = 5579523 * 2^31 -.word 849091747 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4883425^160 * 2066201025 * 2^31 -.word 156610921 // zeta^224 * 2^31 = 4883425^224 * 2^31 = 69154324 * 2^31 -.word 2402377431 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4883425^224 * 2066201025 * 2^31 -.word 26340085 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 3688878155 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 94615689 // zeta^208 * 2^31 = 4883425^208 * 2^31 = 11338408 * 2^31 -.word 2136465335 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4883425^208 * 2066201025 * 2^31 -.word 76042125 // zeta^176 * 2^31 = 4883425^176 * 2^31 = 22220342 * 2^31 -.word 910434739 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4883425^176 * 2066201025 * 2^31 -.word 120721307 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 2354985253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 164088439 // zeta^136 * 2^31 = 4883425^136 * 2^31 = 32274711 * 2^31 -.word 971988297 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4883425^136 * 2066201025 * 2^31 -.word 3935823 // zeta^200 * 2^31 = 4883425^200 * 2^31 = 62204288 * 2^31 -.word 1230577521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4883425^200 * 2066201025 * 2^31 -.word 
141100817 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 2216649519 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 152900731 // zeta^232 * 2^31 = 4883425^232 * 2^31 = 32989143 * 2^31 -.word 3851004997 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4883425^232 * 2066201025 * 2^31 -.word 151321249 // zeta^152 * 2^31 = 4883425^152 * 2^31 = 11170776 * 2^31 -.word 278508447 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4883425^152 * 2066201025 * 2^31 -.word 119607197 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 3957310883 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 42246019 // zeta^184 * 2^31 = 4883425^184 * 2^31 = 23363129 * 2^31 -.word 80286525 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4883425^184 * 2066201025 * 2^31 -.word 147742021 // zeta^248 * 2^31 = 4883425^248 * 2^31 = 46623540 * 2^31 -.word 3733049851 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4883425^248 * 2066201025 * 2^31 -.word 7599313 // zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 634545519 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 164408113 // zeta^196 * 2^31 = 4883425^196 * 2^31 = 83415648 * 2^31 -.word 361072399 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4883425^196 * 2066201025 * 2^31 -.word 89338257 // zeta^164 * 2^31 = 4883425^164 * 2^31 = 30758081 * 2^31 -.word 2774456495 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4883425^164 * 2066201025 * 2^31 -.word 34244867 // zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2291567037 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 154998375 // zeta^148 * 2^31 = 4883425^148 * 2^31 = 54723088 * 2^31 -.word 4245728601 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4883425^148 * 2066201025 * 2^31 -.word 57430761 // zeta^212 * 2^31 = 4883425^212 * 2^31 = 35586852 * 2^31 -.word 2397838679 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4883425^212 * 2066201025 * 2^31 -.word 
24421121 // zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 1293837119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 96200179 // zeta^244 * 2^31 = 4883425^244 * 2^31 = 6421974 * 2^31 -.word 455477453 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4883425^244 * 2066201025 * 2^31 -.word 27543013 // zeta^140 * 2^31 = 4883425^140 * 2^31 = 22531438 * 2^31 -.word 1606537563 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4883425^140 * 2066201025 * 2^31 -.word 60615719 // zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 689489817 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 19643629 // zeta^172 * 2^31 = 4883425^172 * 2^31 = 5400389 * 2^31 -.word 3680783443 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4883425^172 * 2066201025 * 2^31 -.word 39629279 // zeta^236 * 2^31 = 4883425^236 * 2^31 = 24648662 * 2^31 -.word 4254777313 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4883425^236 * 2066201025 * 2^31 -.word 34966271 // zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 712437441 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 59255397 // zeta^220 * 2^31 = 4883425^220 * 2^31 = 55419215 * 2^31 -.word 4082240731 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4883425^220 * 2066201025 * 2^31 -.word 132411247 // zeta^188 * 2^31 = 4883425^188 * 2^31 = 61321868 * 2^31 -.word 2841101905 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4883425^188 * 2066201025 * 2^31 -.word 121568867 // zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 2979507293 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 161145377 // zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 2261429279 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 148077585 // zeta^320 * 2^31 = 4883425^320 * 2^31 = 2534357 * 2^31 -.word 1917857327 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4883425^320 * 2066201025 * 2^31 -.word 126095117 
// zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1553285683 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 57783269 // zeta^352 * 2^31 = 4883425^352 * 2^31 = 82719550 * 2^31 -.word 3445875547 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4883425^352 * 2066201025 * 2^31 -.word 156574677 // zeta^272 * 2^31 = 4883425^272 * 2^31 = 46476507 * 2^31 -.word 2742554475 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4883425^272 * 2066201025 * 2^31 -.word 150258061 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 606089139 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 132978255 // zeta^304 * 2^31 = 4883425^304 * 2^31 = 43898970 * 2^31 -.word 1444550513 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4883425^304 * 2066201025 * 2^31 -.word 100556021 // zeta^368 * 2^31 = 4883425^368 * 2^31 = 66078731 * 2^31 -.word 3384532555 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4883425^368 * 2066201025 * 2^31 -.word 104744603 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 258589221 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 12509707 // zeta^328 * 2^31 = 4883425^328 * 2^31 = 56024362 * 2^31 -.word 3322978997 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4883425^328 * 2066201025 * 2^31 -.word 100098987 // zeta^296 * 2^31 = 4883425^296 * 2^31 = 9670361 * 2^31 -.word 1634355477 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4883425^296 * 2066201025 * 2^31 -.word 35497329 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 2078317775 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 56585021 // zeta^280 * 2^31 = 4883425^280 * 2^31 = 67630520 * 2^31 -.word 3678802435 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4883425^280 * 2066201025 * 2^31 -.word 25276897 // zeta^344 * 2^31 = 4883425^344 * 2^31 = 77128297 * 2^31 -.word 4016458847 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4883425^344 * 2066201025 * 2^31 -.word 17196929 // 
zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 3652763327 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 134352127 // zeta^376 * 2^31 = 4883425^376 * 2^31 = 64935944 * 2^31 -.word 4214680769 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4883425^376 * 2066201025 * 2^31 -.word 68509727 // zeta^260 * 2^31 = 4883425^260 * 2^31 = 74480401 * 2^31 -.word 4021494177 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4883425^260 * 2066201025 * 2^31 -.word 168998833 // zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 3660421775 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.word 33205683 // zeta^292 * 2^31 = 4883425^292 * 2^31 = 53138797 * 2^31 -.word 3812077837 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4883425^292 * 2066201025 * 2^31 -.word 87259889 // zeta^356 * 2^31 = 4883425^356 * 2^31 = 57540992 * 2^31 -.word 1520510799 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4883425^356 * 2066201025 * 2^31 -.word 167329605 // zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 2447077371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 21599771 // zeta^340 * 2^31 = 4883425^340 * 2^31 = 33575985 * 2^31 -.word 49238693 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4883425^340 * 2066201025 * 2^31 -.word 160078131 // zeta^308 * 2^31 = 4883425^308 * 2^31 = 69992908 * 2^31 -.word 3456607629 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4883425^308 * 2066201025 * 2^31 -.word 152177025 // zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 3001130175 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 121371779 // zeta^268 * 2^31 = 4883425^268 * 2^31 = 37996515 * 2^31 -.word 3377919549 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4883425^268 * 2066201025 * 2^31 -.word 149055133 // zeta^332 * 2^31 = 4883425^332 * 2^31 = 65767635 * 2^31 -.word 2688429731 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4883425^332 * 2066201025 * 2^31 -.word 108284723 // 
zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 573993869 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 156954517 // zeta^364 * 2^31 = 4883425^364 * 2^31 = 82898684 * 2^31 -.word 614183851 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4883425^364 * 2066201025 * 2^31 -.word 112588199 // zeta^284 * 2^31 = 4883425^284 * 2^31 = 17426180 * 2^31 -.word 3369803289 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4883425^284 * 2066201025 * 2^31 -.word 141631875 // zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 3582529853 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 77456693 // zeta^316 * 2^31 = 4883425^316 * 2^31 = 72706431 * 2^31 -.word 138405387 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4883425^316 * 2066201025 * 2^31 -.word 44186899 // zeta^380 * 2^31 = 4883425^380 * 2^31 = 26977205 * 2^31 -.word 1453865389 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4883425^380 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_scale -ntt_384_u32_88299073_4883425_incomplete_good_oop_scale: // Constants for scaling by 1/N -.word 75231281 // 1/96 -.word 3951395343 // 1/96 twisted -.data -roots: -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 
4883425^288 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 29929577 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 9497777 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 
4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // XX: zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // XX: zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // XX: zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 29929577 // XX: zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // XX: zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 9497777 // XX: zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // XX: zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 8935247 // XX: zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 217310286 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 4402195 // XX: zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 107063885 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 69162837 // XX: zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 1682079511 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 24728139 // XX: zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 601402397 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 27771120 // 
XX: zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 675409425 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 19248273 // XX: zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 468128941 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 37993035 // XX: zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 924012208 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 42569847 // XX: zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1035322877 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good_oop, %function -.global ntt_384_u32_88299073_4883425_incomplete_good_oop -ntt_384_u32_88299073_4883425_incomplete_good_oop: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -88299073 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[256]: Load as Q0 -vldrw.u32 Q0, [r12, #(4 * 0)] -// input[128]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 0)] -vsub.s32 Q2, Q0, Q1 -vmul.u32 Q3, Q2, r7 -vadd.s32 Q5, Q0, Q1 -// Release input[256] from Q0 -vqrdmulh.s32 Q4, Q2, r6 -// input[0]: Load as Q0 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s32 Q3, Q4, r9 -vsub.s32 Q4, Q0, Q1 -// Release input[128] from Q1 -// input[4]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 4)] -vadd.s32 Q6, Q4, Q3 -// input[260]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 4)] -vsub.s32 Q4, Q4, Q3 -vstrw.u32 Q6, [r11,#(16)] -vsub.s32 Q4, Q4, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.s32 Q5, Q5, Q0 -// Release input[0] from Q0 
-vstrw.u32 Q5, [r1,#(0)] -// input[4]: Already loaded as Q1 -// input[260]: Already loaded as Q7 -vsub.s32 Q0, Q1, Q7 -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q7 -// Release input[4] from Q1 -vqrdmulh.s32 Q3, Q0, r6 -// input[132]: Load as Q1 -vldrw.u32 Q1, [r14, #(4 * 4)] -vmla.s32 Q2, Q3, r9 -vsub.s32 Q3, Q1, Q7 -// Release input[260] from Q7 -// input[136]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 8)] -vadd.s32 Q6, Q3, Q2 -// input[8]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 8)] -vsub.s32 Q3, Q3, Q2 -vstrw.u32 Q6, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -vadd.s32 Q4, Q4, Q1 -// Release input[132] from Q1 -vstrw.u32 Q4, [r11,#(-480)] -// input[136]: Already loaded as Q5 -// input[8]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[136] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[264]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[8] from Q7 -// input[268]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 12)] -vadd.s32 Q6, Q2, Q1 -// input[140]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 12)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-464)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(32)] -vadd.s32 Q3, Q3, Q4 -// Release input[264] from Q4 -vstrw.u32 Q3, [r11,#(48)] -// input[268]: Already loaded as Q5 -// input[140]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[268] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[12]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 12)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[140] from Q7 -// input[16]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 16)] -vadd.s32 Q6, Q2, Q1 -// input[272]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 16)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-448)] -vadd.s32 Q3, Q3, Q4 -// Release input[12] from Q4 -vstrw.u32 Q3, [r1,#(48)] -// input[16]: Already loaded as Q5 -// input[272]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[16] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[144]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 16)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[272] from Q7 -// input[148]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 20)] -vadd.s32 Q6, Q2, Q1 -// input[20]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 20)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(64)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[144] from Q4 -vstrw.u32 Q3, [r11,#(-432)] -// input[148]: Already loaded as Q5 -// input[20]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[148] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[276]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 20)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[20] from Q7 -// input[280]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 24)] -vadd.s32 Q6, Q2, Q1 -// input[152]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 24)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-416)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(80)] -vadd.s32 Q3, Q3, Q4 -// Release input[276] from Q4 -vstrw.u32 Q3, [r11,#(96)] -// input[280]: Already loaded as Q5 -// input[152]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[280] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[24]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 24)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[152] from Q7 -// input[28]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 28)] -vadd.s32 Q6, Q2, Q1 -// input[284]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 28)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-400)] -vadd.s32 Q3, Q3, Q4 -// Release input[24] from Q4 -vstrw.u32 Q3, [r1,#(96)] -// input[28]: Already loaded as Q5 -// input[284]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[28] from 
Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[156]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 28)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[284] from Q7 -// input[160]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 32)] -vadd.s32 Q6, Q2, Q1 -// input[32]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 32)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(112)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[156] from Q4 -vstrw.u32 Q3, [r11,#(-384)] -// input[160]: Already loaded as Q5 -// input[32]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[160] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[288]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 32)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[32] from Q7 -// input[292]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 36)] -vadd.s32 Q6, Q2, Q1 -// input[164]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 36)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-368)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q4 -// Release input[288] from Q4 -vstrw.u32 Q3, [r11,#(144)] -// input[292]: Already loaded as Q5 -// input[164]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[292] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[36]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 36)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[164] from Q7 -// input[40]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 40)] -vadd.s32 Q6, Q2, Q1 -// input[296]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 40)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q4 -// Release input[36] from Q4 -vstrw.u32 Q3, [r1,#(144)] -// input[40]: Already loaded as Q5 -// input[296]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[40] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[168]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 
40)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[296] from Q7 -// input[172]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 44)] -vadd.s32 Q6, Q2, Q1 -// input[44]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 44)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(160)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[168] from Q4 -vstrw.u32 Q3, [r11,#(-336)] -// input[172]: Already loaded as Q5 -// input[44]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[172] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[300]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[44] from Q7 -// input[304]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 48)] -vadd.s32 Q6, Q2, Q1 -// input[176]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 48)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-320)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q4 -// Release input[300] from Q4 -vstrw.u32 Q3, [r11,#(192)] -// input[304]: Already loaded as Q5 -// input[176]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[304] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[48]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[176] from Q7 -// input[52]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 52)] -vadd.s32 Q6, Q2, Q1 -// input[308]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 52)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q4 -// Release input[48] from Q4 -vstrw.u32 Q3, [r1,#(192)] -// input[52]: Already loaded as Q5 -// input[308]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[52] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[180]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 52)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[308] from Q7 -// 
input[184]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 56)] -vadd.s32 Q6, Q2, Q1 -// input[56]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 56)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(208)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[180] from Q4 -vstrw.u32 Q3, [r11,#(-288)] -// input[184]: Already loaded as Q5 -// input[56]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[184] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[312]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 56)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[56] from Q7 -// input[316]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 60)] -vadd.s32 Q6, Q2, Q1 -// input[188]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 60)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-272)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q4 -// Release input[312] from Q4 -vstrw.u32 Q3, [r11,#(240)] -// input[316]: Already loaded as Q5 -// input[188]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[316] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[60]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 60)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[188] from Q7 -// input[64]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 64)] -vadd.s32 Q6, Q2, Q1 -// input[320]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 64)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q4 -// Release input[60] from Q4 -vstrw.u32 Q3, [r1,#(240)] -// input[64]: Already loaded as Q5 -// input[320]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[64] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[192]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 64)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[320] from Q7 -// input[196]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 68)] -vadd.s32 Q6, Q2, Q1 -// 
input[68]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 68)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(256)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[192] from Q4 -vstrw.u32 Q3, [r11,#(-240)] -// input[196]: Already loaded as Q5 -// input[68]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[196] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[324]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 68)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[68] from Q7 -// input[328]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 72)] -vadd.s32 Q6, Q2, Q1 -// input[200]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 72)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-224)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(272)] -vadd.s32 Q3, Q3, Q4 -// Release input[324] from Q4 -vstrw.u32 Q3, [r11,#(288)] -// input[328]: Already loaded as Q5 -// input[200]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[328] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[72]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 72)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[200] from Q7 -// input[76]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 76)] -vadd.s32 Q6, Q2, Q1 -// input[332]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 76)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-208)] -vadd.s32 Q3, Q3, Q4 -// Release input[72] from Q4 -vstrw.u32 Q3, [r1,#(288)] -// input[76]: Already loaded as Q5 -// input[332]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[76] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[204]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 76)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[332] from Q7 -// input[208]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 80)] -vadd.s32 Q6, Q2, Q1 -// input[80]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 80)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 
Q6, [r1,#(304)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[204] from Q4 -vstrw.u32 Q3, [r11,#(-192)] -// input[208]: Already loaded as Q5 -// input[80]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[208] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[336]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 80)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[80] from Q7 -// input[340]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 84)] -vadd.s32 Q6, Q2, Q1 -// input[212]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 84)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-176)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(320)] -vadd.s32 Q3, Q3, Q4 -// Release input[336] from Q4 -vstrw.u32 Q3, [r11,#(336)] -// input[340]: Already loaded as Q5 -// input[212]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[340] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[84]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 84)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[212] from Q7 -// input[88]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 88)] -vadd.s32 Q6, Q2, Q1 -// input[344]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 88)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-160)] -vadd.s32 Q3, Q3, Q4 -// Release input[84] from Q4 -vstrw.u32 Q3, [r1,#(336)] -// input[88]: Already loaded as Q5 -// input[344]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[88] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[216]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 88)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[344] from Q7 -// input[220]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 92)] -vadd.s32 Q6, Q2, Q1 -// input[92]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 92)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(352)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(368)] -vadd.s32 Q3, Q3, 
Q4 -// Release input[216] from Q4 -vstrw.u32 Q3, [r11,#(-144)] -// input[220]: Already loaded as Q5 -// input[92]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[220] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[348]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 92)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[92] from Q7 -// input[352]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 96)] -vadd.s32 Q6, Q2, Q1 -// input[224]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 96)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(368)] -vadd.s32 Q3, Q3, Q4 -// Release input[348] from Q4 -vstrw.u32 Q3, [r11,#(384)] -// input[352]: Already loaded as Q5 -// input[224]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[352] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[96]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 96)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[224] from Q7 -// input[100]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 100)] -vadd.s32 Q6, Q2, Q1 -// input[356]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 100)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-112)] -vadd.s32 Q3, Q3, Q4 -// Release input[96] from Q4 -vstrw.u32 Q3, [r1,#(384)] -// input[100]: Already loaded as Q5 -// input[356]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[100] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[228]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 100)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[356] from Q7 -// input[232]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 104)] -vadd.s32 Q6, Q2, Q1 -// input[104]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 104)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(400)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[228] from Q4 -vstrw.u32 Q3, [r11,#(-96)] -// 
input[232]: Already loaded as Q5 -// input[104]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[232] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[360]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[104] from Q7 -// input[364]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 108)] -vadd.s32 Q6, Q2, Q1 -// input[236]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 108)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-80)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(416)] -vadd.s32 Q3, Q3, Q4 -// Release input[360] from Q4 -vstrw.u32 Q3, [r11,#(432)] -// input[364]: Already loaded as Q5 -// input[236]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[364] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[108]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[236] from Q7 -// input[112]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 112)] -vadd.s32 Q6, Q2, Q1 -// input[368]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 112)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-64)] -vadd.s32 Q3, Q3, Q4 -// Release input[108] from Q4 -vstrw.u32 Q3, [r1,#(432)] -// input[112]: Already loaded as Q5 -// input[368]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[112] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[240]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 112)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[368] from Q7 -// input[244]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 116)] -vadd.s32 Q6, Q2, Q1 -// input[116]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 116)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r1,#(448)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[240] from Q4 -vstrw.u32 Q3, [r11,#(-48)] -// input[244]: Already loaded as Q5 -// input[116]: Already loaded as Q7 
-vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[244] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[372]: Load as Q4 -vldrw.u32 Q4, [r12, #(4 * 116)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[116] from Q7 -// input[376]: Load as Q5 -vldrw.u32 Q5, [r12, #(4 * 120)] -vadd.s32 Q6, Q2, Q1 -// input[248]: Load as Q7 -vldrw.u32 Q7, [r14, #(4 * 120)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(-32)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r1,#(464)] -vadd.s32 Q3, Q3, Q4 -// Release input[372] from Q4 -vstrw.u32 Q3, [r11,#(480)] -// input[376]: Already loaded as Q5 -// input[248]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[376] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[120]: Load as Q4 -vldrw.u32 Q4, [r0, #(4 * 120)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[248] from Q7 -// input[124]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 124)] -vadd.s32 Q6, Q2, Q1 -// input[380]: Load as Q7 -vldrw.u32 Q7, [r12, #(4 * 124)] -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q6, [r11,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r11,#(-16)] -vadd.s32 Q3, Q3, Q4 -// Release input[120] from Q4 -vstrw.u32 Q3, [r1,#(480)] -// input[124]: Already loaded as Q5 -// input[380]: Already loaded as Q7 -vsub.s32 Q0, Q5, Q7 -vmul.u32 Q1, Q0, r7 -vadd.s32 Q3, Q5, Q7 -// Release input[124] from Q5 -vqrdmulh.s32 Q2, Q0, r6 -// input[252]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 124)] -vmla.s32 Q1, Q2, r9 -vsub.s32 Q2, Q4, Q7 -// Release input[380] from Q7 -vadd.s32 Q5, Q2, Q1 -vsub.s32 Q2, Q2, Q1 -vstrw.u32 Q5, [r1,#(496)] -vsub.s32 Q2, Q2, Q0 -vstrw.u32 Q2, [r10,#(-496)] -vadd.s32 Q3, Q3, Q4 -// Release input[252] from Q4 -vstrw.u32 Q3, [r11,#(0)] -//////////// END OF RADIX 3 ////////////////////////// -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -// output[96]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 96)] -vsub.s32 Q2, Q0, 
Q1 -vmul.u32 Q3, Q2, r3 -// output[192]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -60)] -vadd.s32 Q0, Q0, Q1 -// Release output[96] from Q1 -// output[0]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 0)] -vsub.s32 Q5, Q1, Q4 -// output[228]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -24)] -vadd.s32 Q1, Q1, Q4 -// Release output[192] from Q4 -vqrdmulh.s32 Q2, Q2, r2 -vsub.s32 Q4, Q1, Q0 -// output[36]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 36)] -vmla.s32 Q3, Q2, r9 -vstrw.u32 Q4, [r11,#(144)] -vadd.s32 Q1, Q1, Q0 -// Release output[288] from Q0 -vstrw.u32 Q1, [r1,#(0)] -// Release output[0] from Q1 -vsub.s32 Q2, Q5, Q3 -vstrw.u32 Q2, [r1,#(384)] -vadd.s32 Q5, Q5, Q3 -vstrw.u32 Q5, [r11,#(-240)] -// output[36]: Already loaded as Q7 -// output[228]: Already loaded as Q6 -vsub.s32 Q0, Q7, Q6 -vmul.u32 Q1, Q0, r3 -// output[324]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 72)] -vadd.s32 Q7, Q7, Q6 -// Release output[228] from Q6 -// output[132]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -120)] -vsub.s32 Q4, Q3, Q2 -// output[360]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[324] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[168]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -84)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(144)] -vadd.s32 Q3, Q3, Q7 -// Release output[36] from Q7 -vstrw.u32 Q3, [r11,#(-480)] -// Release output[132] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-96)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(288)] -// output[168]: Already loaded as Q6 -// output[360]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[72]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 72)] -vadd.s32 Q6, Q6, Q5 -// Release output[360] from Q5 -// output[264]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[108]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 108)] -vadd.s32 Q3, Q3, Q2 -// Release output[72] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[300]: Load as Q7 -vldrw.u32 Q7, [r11, 
#(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-336)] -vadd.s32 Q3, Q3, Q6 -// Release output[168] from Q6 -vstrw.u32 Q3, [r11,#(48)] -// Release output[264] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(288)] -// output[300]: Already loaded as Q7 -// output[108]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[204]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -48)] -vadd.s32 Q7, Q7, Q5 -// Release output[108] from Q5 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vsub.s32 Q4, Q3, Q2 -// output[240]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -12)] -vadd.s32 Q3, Q3, Q2 -// Release output[204] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[48]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 48)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(192)] -vadd.s32 Q3, Q3, Q7 -// Release output[300] from Q7 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(432)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-192)] -// output[48]: Already loaded as Q6 -// output[240]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vadd.s32 Q6, Q6, Q5 -// Release output[240] from Q5 -// output[144]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -108)] -vsub.s32 Q4, Q3, Q2 -// output[372]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[336] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[180]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -72)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(192)] -vadd.s32 Q3, Q3, Q6 -// Release output[48] from Q6 -vstrw.u32 Q3, [r11,#(-432)] -// Release output[144] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-48)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(336)] -// output[180]: Already loaded as Q7 -// output[372]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[84]: Load as Q2 
-vldrw.u32 Q2, [r1, #(4 * 84)] -vadd.s32 Q7, Q7, Q5 -// Release output[372] from Q5 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[120]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 120)] -vadd.s32 Q3, Q3, Q2 -// Release output[84] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[312]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-288)] -vadd.s32 Q3, Q3, Q7 -// Release output[180] from Q7 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(336)] -// output[312]: Already loaded as Q6 -// output[120]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[216]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -36)] -vadd.s32 Q6, Q6, Q5 -// Release output[120] from Q5 -// output[24]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 24)] -vsub.s32 Q4, Q3, Q2 -// output[252]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 0)] -vadd.s32 Q3, Q3, Q2 -// Release output[216] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[60]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 60)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(240)] -vadd.s32 Q3, Q3, Q6 -// Release output[312] from Q6 -vstrw.u32 Q3, [r1,#(96)] -// Release output[24] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(480)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-144)] -// output[60]: Already loaded as Q7 -// output[252]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vadd.s32 Q7, Q7, Q5 -// Release output[252] from Q5 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vsub.s32 Q4, Q3, Q2 -// output[352]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[348] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[160]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -92)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, 
[r1,#(240)] -vadd.s32 Q3, Q3, Q7 -// Release output[60] from Q7 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(0)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(384)] -// output[160]: Already loaded as Q6 -// output[352]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vadd.s32 Q6, Q6, Q5 -// Release output[352] from Q5 -// output[256]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[100]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 100)] -vadd.s32 Q3, Q3, Q2 -// Release output[64] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[292]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-368)] -vadd.s32 Q3, Q3, Q6 -// Release output[160] from Q6 -vstrw.u32 Q3, [r11,#(16)] -// Release output[256] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(256)] -// output[292]: Already loaded as Q7 -// output[100]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[196]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -56)] -vadd.s32 Q7, Q7, Q5 -// Release output[100] from Q5 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vsub.s32 Q4, Q3, Q2 -// output[232]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -20)] -vadd.s32 Q3, Q3, Q2 -// Release output[196] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[40]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 40)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(160)] -vadd.s32 Q3, Q3, Q7 -// Release output[292] from Q7 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(400)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-224)] -// output[40]: Already loaded as Q6 -// output[232]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vadd.s32 Q6, Q6, Q5 -// Release 
output[232] from Q5 -// output[136]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -116)] -vsub.s32 Q4, Q3, Q2 -// output[364]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[328] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[172]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -80)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(160)] -vadd.s32 Q3, Q3, Q6 -// Release output[40] from Q6 -vstrw.u32 Q3, [r11,#(-464)] -// Release output[136] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-80)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(304)] -// output[172]: Already loaded as Q7 -// output[364]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[76]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 76)] -vadd.s32 Q7, Q7, Q5 -// Release output[364] from Q5 -// output[268]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[112]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 112)] -vadd.s32 Q3, Q3, Q2 -// Release output[76] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[304]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-320)] -vadd.s32 Q3, Q3, Q7 -// Release output[172] from Q7 -vstrw.u32 Q3, [r11,#(64)] -// Release output[268] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(304)] -// output[304]: Already loaded as Q6 -// output[112]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[208]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -44)] -vadd.s32 Q6, Q6, Q5 -// Release output[112] from Q5 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vsub.s32 Q4, Q3, Q2 -// output[244]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -8)] -vadd.s32 Q3, Q3, Q2 -// Release output[208] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[52]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 52)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(208)] -vadd.s32 Q3, Q3, Q6 -// Release output[304] from Q6 
-vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(448)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-176)] -// output[52]: Already loaded as Q7 -// output[244]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[340]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 88)] -vadd.s32 Q7, Q7, Q5 -// Release output[244] from Q5 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vsub.s32 Q4, Q3, Q2 -// output[376]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[340] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[184]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -68)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(208)] -vadd.s32 Q3, Q3, Q7 -// Release output[52] from Q7 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-32)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(352)] -// output[184]: Already loaded as Q6 -// output[376]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vadd.s32 Q6, Q6, Q5 -// Release output[376] from Q5 -// output[280]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[124]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 124)] -vadd.s32 Q3, Q3, Q2 -// Release output[88] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[316]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-272)] -vadd.s32 Q3, Q3, Q6 -// Release output[184] from Q6 -vstrw.u32 Q3, [r11,#(112)] -// Release output[280] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(352)] -// output[316]: Already loaded as Q7 -// output[124]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[220]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -32)] -vadd.s32 Q7, Q7, Q5 -// Release output[124] from Q5 -// output[28]: Load as Q3 
-vldrw.u32 Q3, [r1, #(4 * 28)] -vsub.s32 Q4, Q3, Q2 -// output[224]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -28)] -vadd.s32 Q3, Q3, Q2 -// Release output[220] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[32]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 32)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(256)] -vadd.s32 Q3, Q3, Q7 -// Release output[316] from Q7 -vstrw.u32 Q3, [r1,#(112)] -// Release output[28] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-128)] -// output[32]: Already loaded as Q6 -// output[224]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vadd.s32 Q6, Q6, Q5 -// Release output[224] from Q5 -// output[128]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -124)] -vsub.s32 Q4, Q3, Q2 -// output[356]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[320] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[164]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -88)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(128)] -vadd.s32 Q3, Q3, Q6 -// Release output[32] from Q6 -vstrw.u32 Q3, [r11,#(-496)] -// Release output[128] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-112)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(272)] -// output[164]: Already loaded as Q7 -// output[356]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vadd.s32 Q7, Q7, Q5 -// Release output[356] from Q5 -// output[260]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[104]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 104)] -vadd.s32 Q3, Q3, Q2 -// Release output[68] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[296]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-352)] -vadd.s32 Q3, Q3, Q7 -// Release output[164] from Q7 -vstrw.u32 Q3, [r11,#(32)] -// Release 
output[260] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(272)] -// output[296]: Already loaded as Q6 -// output[104]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[200]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -52)] -vadd.s32 Q6, Q6, Q5 -// Release output[104] from Q5 -// output[8]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 8)] -vsub.s32 Q4, Q3, Q2 -// output[236]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -16)] -vadd.s32 Q3, Q3, Q2 -// Release output[200] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[44]: Load as Q7 -vldrw.u32 Q7, [r1, #(4 * 44)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(176)] -vadd.s32 Q3, Q3, Q6 -// Release output[296] from Q6 -vstrw.u32 Q3, [r1,#(32)] -// Release output[8] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(416)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-208)] -// output[44]: Already loaded as Q7 -// output[236]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[332]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 80)] -vadd.s32 Q7, Q7, Q5 -// Release output[236] from Q5 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vsub.s32 Q4, Q3, Q2 -// output[368]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[332] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[176]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * -76)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(176)] -vadd.s32 Q3, Q3, Q7 -// Release output[44] from Q7 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(-64)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(320)] -// output[176]: Already loaded as Q6 -// output[368]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vadd.s32 Q6, Q6, Q5 -// Release output[368] from Q5 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vsub.s32 Q4, Q3, 
Q2 -// output[116]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 116)] -vadd.s32 Q3, Q3, Q2 -// Release output[80] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[308]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-304)] -vadd.s32 Q3, Q3, Q6 -// Release output[176] from Q6 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r11,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(320)] -// output[308]: Already loaded as Q7 -// output[116]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[212]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -40)] -vadd.s32 Q7, Q7, Q5 -// Release output[116] from Q5 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vsub.s32 Q4, Q3, Q2 -// output[248]: Load as Q5 -vldrw.u32 Q5, [r11, #(4 * -4)] -vadd.s32 Q3, Q3, Q2 -// Release output[212] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[56]: Load as Q6 -vldrw.u32 Q6, [r1, #(4 * 56)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(224)] -vadd.s32 Q3, Q3, Q7 -// Release output[308] from Q7 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r1,#(464)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(-160)] -// output[56]: Already loaded as Q6 -// output[248]: Already loaded as Q5 -vsub.s32 Q0, Q6, Q5 -vmul.u32 Q1, Q0, r3 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vadd.s32 Q6, Q6, Q5 -// Release output[248] from Q5 -// output[152]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -100)] -vsub.s32 Q4, Q3, Q2 -// output[380]: Load as Q5 -vldrw.u32 Q5, [r10, #(4 * -124)] -vadd.s32 Q3, Q3, Q2 -// Release output[344] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q6 -// output[188]: Load as Q7 -vldrw.u32 Q7, [r11, #(4 * -64)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r1,#(224)] -vadd.s32 Q3, Q3, Q6 -// Release output[56] from Q6 -vstrw.u32 Q3, [r11,#(-400)] -// Release output[152] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, 
[r11,#(-16)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r11,#(368)] -// output[188]: Already loaded as Q7 -// output[380]: Already loaded as Q5 -vsub.s32 Q0, Q7, Q5 -vmul.u32 Q1, Q0, r3 -// output[92]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 92)] -vadd.s32 Q7, Q7, Q5 -// Release output[380] from Q5 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vsub.s32 Q4, Q3, Q2 -// output[24]: Load as Q5 -vldrw.u32 Q5, [r1, #(4 * 24)] -vadd.s32 Q3, Q3, Q2 -// Release output[92] from Q2 -vqrdmulh.s32 Q0, Q0, r2 -vsub.s32 Q2, Q3, Q7 -// output[264]: Load as Q6 -vldrw.u32 Q6, [r11, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vstrw.u32 Q2, [r11,#(-256)] -vadd.s32 Q3, Q3, Q7 -// Release output[188] from Q7 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vsub.s32 Q0, Q4, Q1 -vstrw.u32 Q0, [r10,#(-496)] -vadd.s32 Q4, Q4, Q1 -vstrw.u32 Q4, [r1,#(368)] -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[24]: Already loaded as Q5 -vmul.u32 Q0, Q5, r7 -// output[144]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -108)] -vqrdmulh.s32 Q5, Q5, r6 -// output[264]: Already loaded as Q6 -vmla.s32 Q0, Q5, r9 -vmul.u32 Q2, Q1, r7 -vsub.s32 Q5, Q6, Q0 -vqrdmulh.s32 Q1, Q1, r6 -vadd.s32 Q6, Q6, Q0 -vmla.s32 Q2, Q1, r9 -// output[0]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 0)] -vmul.u32 Q3, Q5, r3 -vsub.s32 Q1, Q0, Q2 -vqrdmulh.s32 Q5, Q5, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q3, Q5, r9 -// output[156]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -96)] -vmul.u32 Q4, Q6, r5 -vsub.s32 Q5, Q1, Q3 -vqrdmulh.s32 Q6, Q6, r4 -vadd.s32 Q1, Q1, Q3 -vstrw.u32 Q5, [r1,#(96)] -// Release output[24] from Q5 -vmla.s32 Q4, Q6, r9 -vstrw.u32 Q1, [r11,#(-432)] -// Release output[144] from Q1 -vsub.s32 Q6, Q0, Q4 -vstrw.u32 Q6, [r11,#(48)] -// Release output[264] from Q6 -vadd.s32 Q0, Q0, Q4 -// output[156]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[276]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 24)] -vqrdmulh.s32 Q2, Q2, r6 -// output[12]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 12)] -vmla.s32 
Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(0)] -// Release output[0] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[132]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -120)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[280]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-384)] -// Release output[156] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(96)] -// Release output[276] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(48)] -// Release output[12] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[280]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[16]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 16)] -vqrdmulh.s32 Q0, Q0, r6 -// output[136]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -116)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-480)] -// Release output[132] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[256]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 4)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[28]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 28)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(112)] -// Release output[280] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(64)] -// Release output[16] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-464)] -// Release output[136] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[28]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vqrdmulh.s32 Q1, Q1, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(16)] -// Release output[256] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 
Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[4]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 4)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[152]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -100)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(112)] -// Release output[28] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[152]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[272]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 20)] -vqrdmulh.s32 Q2, Q2, r6 -// output[8]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 8)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(16)] -// Release output[4] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[128]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -124)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[284]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-400)] -// Release output[152] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(80)] -// Release output[272] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(32)] -// Release output[8] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[284]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[20]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 20)] -vqrdmulh.s32 Q0, Q0, r6 -// output[140]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -112)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-496)] -// Release output[128] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[260]: Load as 
Q2 -vldrw.u32 Q2, [r11, #(4 * 8)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(128)] -// Release output[284] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(80)] -// Release output[20] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-448)] -// Release output[140] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[312]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[48]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 48)] -vqrdmulh.s32 Q1, Q1, r6 -// output[168]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -84)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(32)] -// Release output[260] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[288]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 36)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[60]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 60)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(192)] -// Release output[48] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-336)] -// Release output[168] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[60]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[180]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -72)] -vqrdmulh.s32 Q2, Q2, r6 -// output[300]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(144)] -// Release output[288] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[36]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 
36)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[184]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -68)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(240)] -// Release output[60] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-288)] -// Release output[180] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(192)] -// Release output[300] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[184]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[304]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 52)] -vqrdmulh.s32 Q0, Q0, r6 -// output[40]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 40)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(144)] -// Release output[36] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[160]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -92)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[316]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-272)] -// Release output[184] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(208)] -// Release output[304] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(160)] -// Release output[40] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[316]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vqrdmulh.s32 Q1, Q1, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-368)] -// Release output[160] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[292]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 40)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 
-vmla.s32 Q5, Q1, r9 -// output[56]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 56)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(256)] -// Release output[316] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[56]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[176]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -76)] -vqrdmulh.s32 Q2, Q2, r6 -// output[296]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 44)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(160)] -// Release output[292] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[32]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 32)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[188]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -64)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r1,#(224)] -// Release output[56] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-304)] -// Release output[176] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(176)] -// Release output[296] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[188]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[308]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 56)] -vqrdmulh.s32 Q0, Q0, r6 -// output[44]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 44)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r1,#(128)] -// Release output[32] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[164]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -88)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vmul.u32 Q6, Q4, 
r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(-256)] -// Release output[188] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(224)] -// Release output[308] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r1,#(176)] -// Release output[44] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[216]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[336]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 84)] -vqrdmulh.s32 Q1, Q1, r6 -// output[72]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 72)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-352)] -// Release output[164] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[192]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -60)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[348]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 96)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(336)] -// Release output[336] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(288)] -// Release output[72] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[348]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[84]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 84)] -vqrdmulh.s32 Q2, Q2, r6 -// output[204]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -48)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-240)] -// Release output[192] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[324]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 72)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[88]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 88)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 
-vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(384)] -// Release output[348] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(336)] -// Release output[84] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-192)] -// Release output[204] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[88]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[208]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release 
output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from 
Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 
-vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// 
output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] 
-vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load 
as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, 
#(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] 
-vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// 
Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] 
from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 
-vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release 
output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] 
from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 
Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[248]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -4)] -vqrdmulh.s32 Q3, Q3, r6 -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(496)] -// Release output[376] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r10,#(-496)] -// Release output[380] from Q3 -vadd.s32 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-16)] -// Release output[248] from Q1 -.equ modulus_inv, 2228766271 -movw r14, #:lower16:modulus_inv -movt r14, #:upper16:modulus_inv -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 3355 -// Instruction count: 2397 \ No newline at end of file diff --git a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s b/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s deleted file mode 100644 index d40cc4c..0000000 --- a/tests/intmulntt/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s +++ /dev/null @@ -1,3075 +0,0 @@ - -/// -/// Copyright (c) 2021 Arm Limited -/// SPDX-License-Identifier: MIT -/// -/// Permission is hereby granted, free of charge, to any person obtaining a copy -/// of this software and associated documentation files (the "Software"), to deal -/// in the Software without restriction, including without limitation the rights -/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -/// copies of the Software, and to permit persons to whom the Software is -/// furnished to do so, subject to the following conditions: -/// -/// The above copyright notice and this permission notice 
shall be included in all -/// copies or substantial portions of the Software. -/// -/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -/// SOFTWARE. -/// - - - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.data -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_twiddles -ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_twiddles: // For base multiplication -.word 75231281 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 3951395343 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 15452769 // zeta^ 64 * 2^31 = 4883425^ 64 * 2^31 = 85764717 * 2^31 -.word 2033538015 // zeta^ 64 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 64 * 2066201025 * 2^31 -.word 19987225 // zeta^ 32 * 2^31 = 4883425^ 32 * 2^31 = 19144749 * 2^31 -.word 1892589863 // zeta^ 32 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 32 * 2066201025 * 2^31 -.word 50503029 // zeta^ 96 * 2^31 = 4883425^ 96 * 2^31 = 24724272 * 2^31 -.word 2741681611 // zeta^ 96 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 96 * 2066201025 * 2^31 -.word 81982457 // zeta^ 16 * 2^31 = 4883425^ 16 * 2^31 = 76960665 * 2^31 -.word 2158501959 // zeta^ 16 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 16 * 2066201025 * 2^31 -.word 20023469 // zeta^ 80 * 2^31 = 4883425^ 80 * 2^31 = 41822566 * 2^31 -.word 1552412819 // zeta^ 80 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 80 * 2066201025 * 2^31 -.word 55876839 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 1939982041 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 
2066201025 * 2^31 -.word 43619891 // zeta^112 * 2^31 = 4883425^112 * 2^31 = 44400103 * 2^31 -.word 2850416781 // zeta^112 * f(q^(-1) mod 2^32) * 2^31 = 4883425^112 * 2066201025 * 2^31 -.word 172662323 // zeta^ 8 * 2^31 = 4883425^ 8 * 2^31 = 26094785 * 2^31 -.word 3064389773 // zeta^ 8 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 8 * 2066201025 * 2^31 -.word 71853543 // zeta^ 72 * 2^31 = 4883425^ 72 * 2^31 = 58369496 * 2^31 -.word 4036378073 // zeta^ 72 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 72 * 2066201025 * 2^31 -.word 23697415 // zeta^ 40 * 2^31 = 4883425^ 40 * 2^31 = 55309930 * 2^31 -.word 443962297 // zeta^ 40 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 40 * 2066201025 * 2^31 -.word 76499159 // zeta^104 * 2^31 = 4883425^104 * 2^31 = 78628712 * 2^31 -.word 2660611817 // zeta^104 * f(q^(-1) mod 2^32) * 2^31 = 4883425^104 * 2066201025 * 2^31 -.word 56990949 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 337656411 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 120013125 // zeta^ 88 * 2^31 = 4883425^ 88 * 2^31 = 20668553 * 2^31 -.word 616164859 // zeta^ 88 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 88 * 2066201025 * 2^31 -.word 28856125 // zeta^ 56 * 2^31 = 4883425^ 56 * 2^31 = 41675533 * 2^31 -.word 561917443 // zeta^ 56 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 56 * 2066201025 * 2^31 -.word 159401217 // zeta^120 * 2^31 = 4883425^120 * 2^31 = 65038662 * 2^31 -.word 642203967 // zeta^120 * f(q^(-1) mod 2^32) * 2^31 = 4883425^120 * 2066201025 * 2^31 -.word 12190033 // zeta^ 4 * 2^31 = 4883425^ 4 * 2^31 = 4883425 * 2^31 -.word 3933894895 // zeta^ 4 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 4 * 2066201025 * 2^31 -.word 108088419 // zeta^ 68 * 2^31 = 4883425^ 68 * 2^31 = 13818672 * 2^31 -.word 273473117 // zeta^ 68 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 68 * 2066201025 * 2^31 -.word 142353279 // zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 2003400257 // zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 
-.word 143392463 // zeta^100 * 2^31 = 4883425^100 * 2^31 = 35160276 * 2^31 -.word 482889457 // zeta^100 * f(q^(-1) mod 2^32) * 2^31 = 4883425^100 * 2066201025 * 2^31 -.word 119167385 // zeta^ 20 * 2^31 = 4883425^ 20 * 2^31 = 52712221 * 2^31 -.word 1897128615 // zeta^ 20 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 20 * 2066201025 * 2^31 -.word 9268541 // zeta^ 84 * 2^31 = 4883425^ 84 * 2^31 = 19136236 * 2^31 -.word 1847889923 // zeta^ 84 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 84 * 2066201025 * 2^31 -.word 80397967 // zeta^ 52 * 2^31 = 4883425^ 52 * 2^31 = 81877099 * 2^31 -.word 3839489841 // zeta^ 52 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 52 * 2066201025 * 2^31 -.word 16520015 // zeta^116 * 2^31 = 4883425^116 * 2^31 = 18306165 * 2^31 -.word 838359665 // zeta^116 * f(q^(-1) mod 2^32) * 2^31 = 4883425^116 * 2066201025 * 2^31 -.word 115982427 // zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 3605477477 // zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 2^31 -.word 55226367 // zeta^ 76 * 2^31 = 4883425^ 76 * 2^31 = 50302558 * 2^31 -.word 917047745 // zeta^ 76 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 76 * 2066201025 * 2^31 -.word 136968867 // zeta^ 44 * 2^31 = 4883425^ 44 * 2^31 = 63650411 * 2^31 -.word 40189981 // zeta^ 44 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 44 * 2066201025 * 2^31 -.word 68313423 // zeta^108 * 2^31 = 4883425^108 * 2^31 = 69050800 * 2^31 -.word 3720973425 // zeta^108 * f(q^(-1) mod 2^32) * 2^31 = 4883425^108 * 2066201025 * 2^31 -.word 117342749 // zeta^ 28 * 2^31 = 4883425^ 28 * 2^31 = 32879858 * 2^31 -.word 212726563 // zeta^ 28 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 28 * 2066201025 * 2^31 -.word 64009947 // zeta^ 92 * 2^31 = 4883425^ 92 * 2^31 = 70872893 * 2^31 -.word 925164005 // zeta^ 92 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 92 * 2066201025 * 2^31 -.word 55029279 // zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1315460001 // zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.word 
99141453 // zeta^124 * 2^31 = 4883425^124 * 2^31 = 15592642 * 2^31 -.word 4156561907 // zeta^124 * f(q^(-1) mod 2^32) * 2^31 = 4883425^124 * 2066201025 * 2^31 -.word 28520561 // zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2377109967 // zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 101366865 // zeta^192 * 2^31 = 4883425^192 * 2^31 = 88299072 * 2^31 -.word 343571951 // zeta^192 * f(q^(-1) mod 2^32) * 2^31 = 4883425^192 * 2066201025 * 2^31 -.word 118814877 // zeta^160 * 2^31 = 4883425^160 * 2^31 = 5579523 * 2^31 -.word 849091747 // zeta^160 * f(q^(-1) mod 2^32) * 2^31 = 4883425^160 * 2066201025 * 2^31 -.word 156610921 // zeta^224 * 2^31 = 4883425^224 * 2^31 = 69154324 * 2^31 -.word 2402377431 // zeta^224 * f(q^(-1) mod 2^32) * 2^31 = 4883425^224 * 2066201025 * 2^31 -.word 26340085 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 3688878155 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 94615689 // zeta^208 * 2^31 = 4883425^208 * 2^31 = 11338408 * 2^31 -.word 2136465335 // zeta^208 * f(q^(-1) mod 2^32) * 2^31 = 4883425^208 * 2066201025 * 2^31 -.word 76042125 // zeta^176 * 2^31 = 4883425^176 * 2^31 = 22220342 * 2^31 -.word 910434739 // zeta^176 * f(q^(-1) mod 2^32) * 2^31 = 4883425^176 * 2066201025 * 2^31 -.word 120721307 // zeta^240 * 2^31 = 4883425^240 * 2^31 = 66119312 * 2^31 -.word 2354985253 // zeta^240 * f(q^(-1) mod 2^32) * 2^31 = 4883425^240 * 2066201025 * 2^31 -.word 164088439 // zeta^136 * 2^31 = 4883425^136 * 2^31 = 32274711 * 2^31 -.word 971988297 // zeta^136 * f(q^(-1) mod 2^32) * 2^31 = 4883425^136 * 2066201025 * 2^31 -.word 3935823 // zeta^200 * 2^31 = 4883425^200 * 2^31 = 62204288 * 2^31 -.word 1230577521 // zeta^200 * f(q^(-1) mod 2^32) * 2^31 = 4883425^200 * 2066201025 * 2^31 -.word 141100817 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 2216649519 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 
152900731 // zeta^232 * 2^31 = 4883425^232 * 2^31 = 32989143 * 2^31 -.word 3851004997 // zeta^232 * f(q^(-1) mod 2^32) * 2^31 = 4883425^232 * 2066201025 * 2^31 -.word 151321249 // zeta^152 * 2^31 = 4883425^152 * 2^31 = 11170776 * 2^31 -.word 278508447 // zeta^152 * f(q^(-1) mod 2^32) * 2^31 = 4883425^152 * 2066201025 * 2^31 -.word 119607197 // zeta^216 * 2^31 = 4883425^216 * 2^31 = 78801296 * 2^31 -.word 3957310883 // zeta^216 * f(q^(-1) mod 2^32) * 2^31 = 4883425^216 * 2066201025 * 2^31 -.word 42246019 // zeta^184 * 2^31 = 4883425^184 * 2^31 = 23363129 * 2^31 -.word 80286525 // zeta^184 * f(q^(-1) mod 2^32) * 2^31 = 4883425^184 * 2066201025 * 2^31 -.word 147742021 // zeta^248 * 2^31 = 4883425^248 * 2^31 = 46623540 * 2^31 -.word 3733049851 // zeta^248 * f(q^(-1) mod 2^32) * 2^31 = 4883425^248 * 2066201025 * 2^31 -.word 7599313 // zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 634545519 // zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 164408113 // zeta^196 * 2^31 = 4883425^196 * 2^31 = 83415648 * 2^31 -.word 361072399 // zeta^196 * f(q^(-1) mod 2^32) * 2^31 = 4883425^196 * 2066201025 * 2^31 -.word 89338257 // zeta^164 * 2^31 = 4883425^164 * 2^31 = 30758081 * 2^31 -.word 2774456495 // zeta^164 * f(q^(-1) mod 2^32) * 2^31 = 4883425^164 * 2066201025 * 2^31 -.word 34244867 // zeta^228 * 2^31 = 4883425^228 * 2^31 = 83896878 * 2^31 -.word 2291567037 // zeta^228 * f(q^(-1) mod 2^32) * 2^31 = 4883425^228 * 2066201025 * 2^31 -.word 154998375 // zeta^148 * 2^31 = 4883425^148 * 2^31 = 54723088 * 2^31 -.word 4245728601 // zeta^148 * f(q^(-1) mod 2^32) * 2^31 = 4883425^148 * 2066201025 * 2^31 -.word 57430761 // zeta^212 * 2^31 = 4883425^212 * 2^31 = 35586852 * 2^31 -.word 2397838679 // zeta^212 * f(q^(-1) mod 2^32) * 2^31 = 4883425^212 * 2066201025 * 2^31 -.word 24421121 // zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 1293837119 // zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 96200179 
// zeta^244 * 2^31 = 4883425^244 * 2^31 = 6421974 * 2^31 -.word 455477453 // zeta^244 * f(q^(-1) mod 2^32) * 2^31 = 4883425^244 * 2066201025 * 2^31 -.word 27543013 // zeta^140 * 2^31 = 4883425^140 * 2^31 = 22531438 * 2^31 -.word 1606537563 // zeta^140 * f(q^(-1) mod 2^32) * 2^31 = 4883425^140 * 2066201025 * 2^31 -.word 60615719 // zeta^204 * 2^31 = 4883425^204 * 2^31 = 60527953 * 2^31 -.word 689489817 // zeta^204 * f(q^(-1) mod 2^32) * 2^31 = 4883425^204 * 2066201025 * 2^31 -.word 19643629 // zeta^172 * 2^31 = 4883425^172 * 2^31 = 5400389 * 2^31 -.word 3680783443 // zeta^172 * f(q^(-1) mod 2^32) * 2^31 = 4883425^172 * 2066201025 * 2^31 -.word 39629279 // zeta^236 * 2^31 = 4883425^236 * 2^31 = 24648662 * 2^31 -.word 4254777313 // zeta^236 * f(q^(-1) mod 2^32) * 2^31 = 4883425^236 * 2066201025 * 2^31 -.word 34966271 // zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 712437441 // zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 59255397 // zeta^220 * 2^31 = 4883425^220 * 2^31 = 55419215 * 2^31 -.word 4082240731 // zeta^220 * f(q^(-1) mod 2^32) * 2^31 = 4883425^220 * 2066201025 * 2^31 -.word 132411247 // zeta^188 * 2^31 = 4883425^188 * 2^31 = 61321868 * 2^31 -.word 2841101905 // zeta^188 * f(q^(-1) mod 2^32) * 2^31 = 4883425^188 * 2066201025 * 2^31 -.word 121568867 // zeta^252 * 2^31 = 4883425^252 * 2^31 = 45729226 * 2^31 -.word 2979507293 // zeta^252 * f(q^(-1) mod 2^32) * 2^31 = 4883425^252 * 2066201025 * 2^31 -.word 161145377 // zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 2261429279 // zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 148077585 // zeta^320 * 2^31 = 4883425^320 * 2^31 = 2534357 * 2^31 -.word 1917857327 // zeta^320 * f(q^(-1) mod 2^32) * 2^31 = 4883425^320 * 2066201025 * 2^31 -.word 126095117 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1553285683 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 57783269 // 
zeta^352 * 2^31 = 4883425^352 * 2^31 = 82719550 * 2^31 -.word 3445875547 // zeta^352 * f(q^(-1) mod 2^32) * 2^31 = 4883425^352 * 2066201025 * 2^31 -.word 156574677 // zeta^272 * 2^31 = 4883425^272 * 2^31 = 46476507 * 2^31 -.word 2742554475 // zeta^272 * f(q^(-1) mod 2^32) * 2^31 = 4883425^272 * 2066201025 * 2^31 -.word 150258061 // zeta^336 * 2^31 = 4883425^336 * 2^31 = 35138099 * 2^31 -.word 606089139 // zeta^336 * f(q^(-1) mod 2^32) * 2^31 = 4883425^336 * 2066201025 * 2^31 -.word 132978255 // zeta^304 * 2^31 = 4883425^304 * 2^31 = 43898970 * 2^31 -.word 1444550513 // zeta^304 * f(q^(-1) mod 2^32) * 2^31 = 4883425^304 * 2066201025 * 2^31 -.word 100556021 // zeta^368 * 2^31 = 4883425^368 * 2^31 = 66078731 * 2^31 -.word 3384532555 // zeta^368 * f(q^(-1) mod 2^32) * 2^31 = 4883425^368 * 2066201025 * 2^31 -.word 104744603 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 258589221 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 12509707 // zeta^328 * 2^31 = 4883425^328 * 2^31 = 56024362 * 2^31 -.word 3322978997 // zeta^328 * f(q^(-1) mod 2^32) * 2^31 = 4883425^328 * 2066201025 * 2^31 -.word 100098987 // zeta^296 * 2^31 = 4883425^296 * 2^31 = 9670361 * 2^31 -.word 1634355477 // zeta^296 * f(q^(-1) mod 2^32) * 2^31 = 4883425^296 * 2066201025 * 2^31 -.word 35497329 // zeta^360 * 2^31 = 4883425^360 * 2^31 = 64980291 * 2^31 -.word 2078317775 // zeta^360 * f(q^(-1) mod 2^32) * 2^31 = 4883425^360 * 2066201025 * 2^31 -.word 56585021 // zeta^280 * 2^31 = 4883425^280 * 2^31 = 67630520 * 2^31 -.word 3678802435 // zeta^280 * f(q^(-1) mod 2^32) * 2^31 = 4883425^280 * 2066201025 * 2^31 -.word 25276897 // zeta^344 * 2^31 = 4883425^344 * 2^31 = 77128297 * 2^31 -.word 4016458847 // zeta^344 * f(q^(-1) mod 2^32) * 2^31 = 4883425^344 * 2066201025 * 2^31 -.word 17196929 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 3652763327 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 134352127 // 
zeta^376 * 2^31 = 4883425^376 * 2^31 = 64935944 * 2^31 -.word 4214680769 // zeta^376 * f(q^(-1) mod 2^32) * 2^31 = 4883425^376 * 2066201025 * 2^31 -.word 68509727 // zeta^260 * 2^31 = 4883425^260 * 2^31 = 74480401 * 2^31 -.word 4021494177 // zeta^260 * f(q^(-1) mod 2^32) * 2^31 = 4883425^260 * 2066201025 * 2^31 -.word 168998833 // zeta^324 * 2^31 = 4883425^324 * 2^31 = 79363826 * 2^31 -.word 3660421775 // zeta^324 * f(q^(-1) mod 2^32) * 2^31 = 4883425^324 * 2066201025 * 2^31 -.word 33205683 // zeta^292 * 2^31 = 4883425^292 * 2^31 = 53138797 * 2^31 -.word 3812077837 // zeta^292 * f(q^(-1) mod 2^32) * 2^31 = 4883425^292 * 2066201025 * 2^31 -.word 87259889 // zeta^356 * 2^31 = 4883425^356 * 2^31 = 57540992 * 2^31 -.word 1520510799 // zeta^356 * f(q^(-1) mod 2^32) * 2^31 = 4883425^356 * 2066201025 * 2^31 -.word 167329605 // zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 2447077371 // zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 21599771 // zeta^340 * 2^31 = 4883425^340 * 2^31 = 33575985 * 2^31 -.word 49238693 // zeta^340 * f(q^(-1) mod 2^32) * 2^31 = 4883425^340 * 2066201025 * 2^31 -.word 160078131 // zeta^308 * 2^31 = 4883425^308 * 2^31 = 69992908 * 2^31 -.word 3456607629 // zeta^308 * f(q^(-1) mod 2^32) * 2^31 = 4883425^308 * 2066201025 * 2^31 -.word 152177025 // zeta^372 * 2^31 = 4883425^372 * 2^31 = 63570934 * 2^31 -.word 3001130175 // zeta^372 * f(q^(-1) mod 2^32) * 2^31 = 4883425^372 * 2066201025 * 2^31 -.word 121371779 // zeta^268 * 2^31 = 4883425^268 * 2^31 = 37996515 * 2^31 -.word 3377919549 // zeta^268 * f(q^(-1) mod 2^32) * 2^31 = 4883425^268 * 2066201025 * 2^31 -.word 149055133 // zeta^332 * 2^31 = 4883425^332 * 2^31 = 65767635 * 2^31 -.word 2688429731 // zeta^332 * f(q^(-1) mod 2^32) * 2^31 = 4883425^332 * 2066201025 * 2^31 -.word 108284723 // zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 573993869 // zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 156954517 // 
zeta^364 * 2^31 = 4883425^364 * 2^31 = 82898684 * 2^31 -.word 614183851 // zeta^364 * f(q^(-1) mod 2^32) * 2^31 = 4883425^364 * 2066201025 * 2^31 -.word 112588199 // zeta^284 * 2^31 = 4883425^284 * 2^31 = 17426180 * 2^31 -.word 3369803289 // zeta^284 * f(q^(-1) mod 2^32) * 2^31 = 4883425^284 * 2066201025 * 2^31 -.word 141631875 // zeta^348 * 2^31 = 4883425^348 * 2^31 = 50306038 * 2^31 -.word 3582529853 // zeta^348 * f(q^(-1) mod 2^32) * 2^31 = 4883425^348 * 2066201025 * 2^31 -.word 77456693 // zeta^316 * 2^31 = 4883425^316 * 2^31 = 72706431 * 2^31 -.word 138405387 // zeta^316 * f(q^(-1) mod 2^32) * 2^31 = 4883425^316 * 2066201025 * 2^31 -.word 44186899 // zeta^380 * 2^31 = 4883425^380 * 2^31 = 26977205 * 2^31 -.word 1453865389 // zeta^380 * f(q^(-1) mod 2^32) * 2^31 = 4883425^380 * 2066201025 * 2^31 -// End of twiddles for base multiplication - -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_scale -ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input_scale: // Constants for scaling by 1/N -.word 75231281 // 1/96 -.word 3951395343 // 1/96 twisted -.data -roots: -.word 2534356 /// zeta^256 * 2^31 = 4883425^256 * 2^31 = 2534356 * 2^31 -.word 61636979 /// zeta^256 * f(q^(-1) mod 2^32) * 2^31 = 4883425^256 * 2066201025 * 2^31 -.word 85764716 /// zeta^128 * 2^31 = 4883425^128 * 2^31 = 85764716 * 2^31 -.word 2085846645 /// zeta^128 * f(q^(-1) mod 2^32) * 2^31 = 4883425^128 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 
4883425^ 0 * 2066201025 * 2^31 -.word 1 // zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 // zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 63574801 // zeta^288 * 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 // zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 53160974 // zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 // zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 29929577 // zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 // zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 // zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 22179761 // zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 // zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 9497777 // zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 // zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 // zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 1 // XX: zeta^ 0 * 2^31 = 4883425^ 0 * 2^31 = 1 * 2^31 -.word 24 /// zeta^ 0 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 0 * 2066201025 * 2^31 -.word 63574801 // XX: zeta^288 
* 2^31 = 4883425^288 * 2^31 = 63574801 * 2^31 -.word 1546175299 /// zeta^288 * f(q^(-1) mod 2^32) * 2^31 = 4883425^288 * 2066201025 * 2^31 -.word 53160974 // XX: zeta^144 * 2^31 = 4883425^144 * 2^31 = 53160974 * 2^31 -.word 1292905106 /// zeta^144 * f(q^(-1) mod 2^32) * 2^31 = 4883425^144 * 2066201025 * 2^31 -.word 22179761 // XX: zeta^ 48 * 2^31 = 4883425^ 48 * 2^31 = 22179761 * 2^31 -.word 539424395 /// zeta^ 48 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 48 * 2066201025 * 2^31 -.word 29929577 // XX: zeta^264 * 2^31 = 4883425^264 * 2^31 = 29929577 * 2^31 -.word 727904326 /// zeta^264 * f(q^(-1) mod 2^32) * 2^31 = 4883425^264 * 2066201025 * 2^31 -.word 23318782 // XX: zeta^168 * 2^31 = 4883425^168 * 2^31 = 23318782 * 2^31 -.word 567126034 /// zeta^168 * f(q^(-1) mod 2^32) * 2^31 = 4883425^168 * 2066201025 * 2^31 -.word 9497777 // XX: zeta^ 24 * 2^31 = 4883425^ 24 * 2^31 = 9497777 * 2^31 -.word 230991336 /// zeta^ 24 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 24 * 2066201025 * 2^31 -.word 23260411 // XX: zeta^312 * 2^31 = 4883425^312 * 2^31 = 23260411 * 2^31 -.word 565706418 /// zeta^312 * f(q^(-1) mod 2^32) * 2^31 = 4883425^312 * 2066201025 * 2^31 -.word 8935247 // XX: zeta^132 * 2^31 = 4883425^132 * 2^31 = 8935247 * 2^31 -.word 217310286 /// zeta^132 * f(q^(-1) mod 2^32) * 2^31 = 4883425^132 * 2066201025 * 2^31 -.word 4402195 // XX: zeta^ 36 * 2^31 = 4883425^ 36 * 2^31 = 4402195 * 2^31 -.word 107063885 /// zeta^ 36 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 36 * 2066201025 * 2^31 -.word 69162837 // XX: zeta^276 * 2^31 = 4883425^276 * 2^31 = 69162837 * 2^31 -.word 1682079511 /// zeta^276 * f(q^(-1) mod 2^32) * 2^31 = 4883425^276 * 2066201025 * 2^31 -.word 24728139 // XX: zeta^180 * 2^31 = 4883425^180 * 2^31 = 24728139 * 2^31 -.word 601402397 /// zeta^180 * f(q^(-1) mod 2^32) * 2^31 = 4883425^180 * 2066201025 * 2^31 -.word 27771120 // XX: zeta^ 12 * 2^31 = 4883425^ 12 * 2^31 = 27771120 * 2^31 -.word 675409425 /// zeta^ 12 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 12 * 2066201025 * 
2^31 -.word 19248273 // XX: zeta^300 * 2^31 = 4883425^300 * 2^31 = 19248273 * 2^31 -.word 468128941 /// zeta^300 * f(q^(-1) mod 2^32) * 2^31 = 4883425^300 * 2066201025 * 2^31 -.word 37993035 // XX: zeta^156 * 2^31 = 4883425^156 * 2^31 = 37993035 * 2^31 -.word 924012208 /// zeta^156 * f(q^(-1) mod 2^32) * 2^31 = 4883425^156 * 2066201025 * 2^31 -.word 42569847 // XX: zeta^ 60 * 2^31 = 4883425^ 60 * 2^31 = 42569847 * 2^31 -.word 1035322877 /// zeta^ 60 * f(q^(-1) mod 2^32) * 2^31 = 4883425^ 60 * 2066201025 * 2^31 -.text -.align 4 -roots_addr: .word roots -.syntax unified -.type ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input, %function -.global ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input -ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -// Use r14 as marker for r0 + 512 -add r14, r0, #512 -// Use r12 as marker for r0 + 1024 -add r12, r14, #512 -// Use r11 as marker for r1 + 1008 -add r11, r1, #1008 -// Use r10 as marker for r1 + 2016 -add r10, r11, #1008 -.equ modulus, -88299073 -movw r9, #:lower16:modulus -movt r9, #:upper16:modulus -ldr r8, roots_addr -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -// input[128]: Load as Q0 -vldrw.u32 Q0, [r14, #(4 * 0)] -// input[0]: Load as Q1 -vldrw.u32 Q1, [r0, #(4 * 0)] -vmul.u32 Q2, Q0, r7 -vadd.s32 Q4, Q1, Q0 -// input[132]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 4)] -vqrdmulh.s32 Q3, Q0, r6 -// input[4]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 4)] -vsub.s32 Q5, Q1, Q0 -vmla.s32 Q2, Q3, r9 -vstrw.u32 Q4, [r1,#(0)] -vadd.s32 Q3, Q1, Q2 -vstrw.u32 Q3, [r11,#(-496)] -vsub.s32 Q5, Q5, Q2 -vstrw.u32 Q5, [r11,#(16)] -// Release input[0] from Q1 -// Release input[128] from Q0 -// input[4]: Already loaded as Q7 -// input[132]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[136]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 8)] -vqrdmulh.s32 Q1, Q7, r6 -// input[8]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 
8)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-480)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(16)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(32)] -// Release input[132] from Q6 -// Release input[4] from Q7 -// input[136]: Already loaded as Q4 -// input[8]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[12]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 12)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[140]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 12)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[136] from Q4 -vstrw.u32 Q2, [r11,#(48)] -vsub.s32 Q4, Q1, Q5 -// Release input[8] from Q5 -vstrw.u32 Q4, [r11,#(-464)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(32)] -// input[140]: Already loaded as Q6 -// input[12]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[144]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 16)] -vqrdmulh.s32 Q1, Q6, r6 -// input[16]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 16)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(48)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-448)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release input[12] from Q3 -// Release input[140] from Q6 -// input[16]: Already loaded as Q7 -// input[144]: Already loaded as Q5 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q5, Q7 -// input[148]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 20)] -vqrdmulh.s32 Q1, Q7, r6 -// input[20]: Load as Q6 -vldrw.u32 Q6, [r0, #(4 * 20)] -vsub.s32 Q3, Q5, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-432)] -vadd.s32 Q1, Q5, Q0 -vstrw.u32 Q1, [r1,#(64)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(80)] -// Release input[144] from Q5 -// Release input[16] from Q7 -// input[148]: Already loaded as Q4 -// input[20]: Already loaded as Q6 -vsub.s32 Q0, Q4, Q6 -vmul.u32 Q1, Q0, r7 -// input[24]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 24)] -vadd.s32 Q2, Q4, Q6 -vqrdmulh.s32 Q0, Q0, r6 -// input[152]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 24)] -vmla.s32 Q1, Q0, r9 
-vneg.s32 Q0, Q4 -// Release input[148] from Q4 -vstrw.u32 Q2, [r11,#(96)] -vsub.s32 Q4, Q1, Q6 -// Release input[20] from Q6 -vstrw.u32 Q4, [r11,#(-416)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(80)] -// input[152]: Already loaded as Q5 -// input[24]: Already loaded as Q3 -vmul.u32 Q0, Q5, r7 -vadd.s32 Q2, Q3, Q5 -// input[156]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 28)] -vqrdmulh.s32 Q1, Q5, r6 -// input[28]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 28)] -vsub.s32 Q4, Q3, Q5 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(96)] -vadd.s32 Q1, Q3, Q0 -vstrw.u32 Q1, [r11,#(-400)] -vsub.s32 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(112)] -// Release input[24] from Q3 -// Release input[152] from Q5 -// input[28]: Already loaded as Q7 -// input[156]: Already loaded as Q6 -vmul.u32 Q0, Q7, r7 -vadd.s32 Q2, Q6, Q7 -// input[160]: Load as Q4 -vldrw.u32 Q4, [r14, #(4 * 32)] -vqrdmulh.s32 Q1, Q7, r6 -// input[32]: Load as Q5 -vldrw.u32 Q5, [r0, #(4 * 32)] -vsub.s32 Q3, Q6, Q7 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(-384)] -vadd.s32 Q1, Q6, Q0 -vstrw.u32 Q1, [r1,#(112)] -vsub.s32 Q3, Q3, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release input[156] from Q6 -// Release input[28] from Q7 -// input[160]: Already loaded as Q4 -// input[32]: Already loaded as Q5 -vsub.s32 Q0, Q4, Q5 -vmul.u32 Q1, Q0, r7 -// input[36]: Load as Q3 -vldrw.u32 Q3, [r0, #(4 * 36)] -vadd.s32 Q2, Q4, Q5 -vqrdmulh.s32 Q0, Q0, r6 -// input[164]: Load as Q6 -vldrw.u32 Q6, [r14, #(4 * 36)] -vmla.s32 Q1, Q0, r9 -vneg.s32 Q0, Q4 -// Release input[160] from Q4 -vstrw.u32 Q2, [r11,#(144)] -vsub.s32 Q4, Q1, Q5 -// Release input[32] from Q5 -vstrw.u32 Q4, [r11,#(-368)] -vsub.s32 Q0, Q0, Q1 -vstrw.u32 Q0, [r1,#(128)] -// input[164]: Already loaded as Q6 -// input[36]: Already loaded as Q3 -vmul.u32 Q0, Q6, r7 -vadd.s32 Q2, Q3, Q6 -// input[168]: Load as Q5 -vldrw.u32 Q5, [r14, #(4 * 40)] -vqrdmulh.s32 Q1, Q6, r6 -// input[40]: Load as Q7 -vldrw.u32 Q7, [r0, #(4 * 40)] -vsub.s32 Q4, Q3, Q6 -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(144)] 
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-352)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(160)]
-// Release input[36] from Q3
-// Release input[164] from Q6
-// input[40]: Already loaded as Q7
-// input[168]: Already loaded as Q5
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q5, Q7
-// input[172]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 44)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[44]: Load as Q6
-vldrw.u32 Q6, [r0, #(4 * 44)]
-vsub.s32 Q3, Q5, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-336)]
-vadd.s32 Q1, Q5, Q0
-vstrw.u32 Q1, [r1,#(160)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(176)]
-// Release input[168] from Q5
-// Release input[40] from Q7
-// input[172]: Already loaded as Q4
-// input[44]: Already loaded as Q6
-vsub.s32 Q0, Q4, Q6
-vmul.u32 Q1, Q0, r7
-// input[48]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 48)]
-vadd.s32 Q2, Q4, Q6
-vqrdmulh.s32 Q0, Q0, r6
-// input[176]: Load as Q5
-vldrw.u32 Q5, [r14, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[172] from Q4
-vstrw.u32 Q2, [r11,#(192)]
-vsub.s32 Q4, Q1, Q6
-// Release input[44] from Q6
-vstrw.u32 Q4, [r11,#(-320)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(176)]
-// input[176]: Already loaded as Q5
-// input[48]: Already loaded as Q3
-vmul.u32 Q0, Q5, r7
-vadd.s32 Q2, Q3, Q5
-// input[180]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 52)]
-vqrdmulh.s32 Q1, Q5, r6
-// input[52]: Load as Q7
-vldrw.u32 Q7, [r0, #(4 * 52)]
-vsub.s32 Q4, Q3, Q5
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-304)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(208)]
-// Release input[48] from Q3
-// Release input[176] from Q5
-// input[52]: Already loaded as Q7
-// input[180]: Already loaded as Q6
-vmul.u32 Q0, Q7, r7
-vadd.s32 Q2, Q6, Q7
-// input[184]: Load as Q4
-vldrw.u32 Q4, [r14, #(4 * 56)]
-vqrdmulh.s32 Q1, Q7, r6
-// input[56]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 56)]
-vsub.s32 Q3, Q6, Q7
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-288)]
-vadd.s32 Q1, Q6, Q0
-vstrw.u32 Q1, [r1,#(208)]
-vsub.s32 Q3, Q3, Q0
-vstrw.u32 Q3, [r11,#(224)]
-// Release input[180] from Q6
-// Release input[52] from Q7
-// input[184]: Already loaded as Q4
-// input[56]: Already loaded as Q5
-vsub.s32 Q0, Q4, Q5
-vmul.u32 Q1, Q0, r7
-// input[60]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 60)]
-vadd.s32 Q2, Q4, Q5
-vqrdmulh.s32 Q0, Q0, r6
-// input[188]: Load as Q6
-vldrw.u32 Q6, [r14, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vneg.s32 Q0, Q4
-// Release input[184] from Q4
-vstrw.u32 Q2, [r11,#(240)]
-vsub.s32 Q4, Q1, Q5
-// Release input[56] from Q5
-vstrw.u32 Q4, [r11,#(-272)]
-vsub.s32 Q0, Q0, Q1
-vstrw.u32 Q0, [r1,#(224)]
-// input[188]: Already loaded as Q6
-// input[60]: Already loaded as Q3
-vmul.u32 Q0, Q6, r7
-vadd.s32 Q2, Q3, Q6
-vqrdmulh.s32 Q1, Q6, r6
-// input[64]: Load as Q5
-vldrw.u32 Q5, [r0, #(4 * 64)]
-vsub.s32 Q4, Q3, Q6
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r1,#(240)]
-vadd.s32 Q1, Q3, Q0
-vstrw.u32 Q1, [r11,#(-256)]
-vsub.s32 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(256)]
-// Release input[60] from Q3
-// Release input[188] from Q6
-// input[64]: Already loaded as Q5
-vmul.u32 Q0, Q5, r7
-vneg.s32 Q1, Q5
-// input[68]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 68)]
-vqrdmulh.s32 Q2, Q5, r6
-vstrw.u32 Q5, [r11,#(-240)]
-vmla.s32 Q0, Q2, r9
-vstrw.u32 Q0, [r1,#(256)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(272)]
-// Release input[64] from Q5
-// input[68]: Already loaded as Q3
-vmul.u32 Q0, Q3, r7
-vneg.s32 Q1, Q3
-// input[72]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 72)]
-vqrdmulh.s32 Q2, Q3, r6
-vstrw.u32 Q3, [r11,#(288)]
-vmla.s32 Q0, Q2, r9
-vstrw.u32 Q0, [r1,#(272)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-224)]
-// Release input[68] from Q3
-// input[72]: Already loaded as Q4
-vstrw.u32 Q4, [r1,#(288)]
-vstrw.u32 Q4, [r11,#(304)]
-vstrw.u32 Q4, [r11,#(-208)]
-// Release input[72] from Q4
-// input[76]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 76)]
-vmul.u32 Q1, Q0, r7
-vneg.s32 Q2, Q0
-// input[80]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 80)]
-vqrdmulh.s32 Q3, Q0, r6
-vstrw.u32 Q0, [r11,#(-192)]
-vmla.s32 Q1, Q3, r9
-vstrw.u32 Q1, [r1,#(304)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(320)]
-// Release input[76] from Q0
-// input[80]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-vneg.s32 Q1, Q4
-// input[84]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 84)]
-vqrdmulh.s32 Q2, Q4, r6
-vstrw.u32 Q4, [r11,#(336)]
-vmla.s32 Q0, Q2, r9
-vstrw.u32 Q0, [r1,#(320)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-176)]
-// Release input[80] from Q4
-// input[84]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(336)]
-vstrw.u32 Q3, [r11,#(352)]
-vstrw.u32 Q3, [r11,#(-160)]
-// Release input[84] from Q3
-// input[88]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 88)]
-vmul.u32 Q1, Q0, r7
-vneg.s32 Q2, Q0
-// input[92]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 92)]
-vqrdmulh.s32 Q3, Q0, r6
-vstrw.u32 Q0, [r11,#(-144)]
-vmla.s32 Q1, Q3, r9
-vstrw.u32 Q1, [r1,#(352)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(368)]
-// Release input[88] from Q0
-// input[92]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-vneg.s32 Q1, Q4
-// input[96]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 96)]
-vqrdmulh.s32 Q2, Q4, r6
-vstrw.u32 Q4, [r11,#(384)]
-vmla.s32 Q0, Q2, r9
-vstrw.u32 Q0, [r1,#(368)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-128)]
-// Release input[92] from Q4
-// input[96]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(384)]
-vstrw.u32 Q3, [r11,#(400)]
-vstrw.u32 Q3, [r11,#(-112)]
-// Release input[96] from Q3
-// input[100]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 100)]
-vmul.u32 Q1, Q0, r7
-vneg.s32 Q2, Q0
-// input[104]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 104)]
-vqrdmulh.s32 Q3, Q0, r6
-vstrw.u32 Q0, [r11,#(-96)]
-vmla.s32 Q1, Q3, r9
-vstrw.u32 Q1, [r1,#(400)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(416)]
-// Release input[100] from Q0
-// input[104]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-vneg.s32 Q1, Q4
-// input[108]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 108)]
-vqrdmulh.s32 Q2, Q4, r6
-vstrw.u32 Q4, [r11,#(432)]
-vmla.s32 Q0, Q2, r9
-vstrw.u32 Q0, [r1,#(416)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-80)]
-// Release input[104] from Q4
-// input[108]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(432)]
-vstrw.u32 Q3, [r11,#(448)]
-vstrw.u32 Q3, [r11,#(-64)]
-// Release input[108] from Q3
-// input[112]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 112)]
-vmul.u32 Q1, Q0, r7
-vneg.s32 Q2, Q0
-// input[116]: Load as Q4
-vldrw.u32 Q4, [r0, #(4 * 116)]
-vqrdmulh.s32 Q3, Q0, r6
-vstrw.u32 Q0, [r11,#(-48)]
-vmla.s32 Q1, Q3, r9
-vstrw.u32 Q1, [r1,#(448)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(464)]
-// Release input[112] from Q0
-// input[116]: Already loaded as Q4
-vmul.u32 Q0, Q4, r7
-vneg.s32 Q1, Q4
-// input[120]: Load as Q3
-vldrw.u32 Q3, [r0, #(4 * 120)]
-vqrdmulh.s32 Q2, Q4, r6
-vstrw.u32 Q4, [r11,#(480)]
-vmla.s32 Q0, Q2, r9
-vstrw.u32 Q0, [r1,#(464)]
-vsub.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-32)]
-// Release input[116] from Q4
-// input[120]: Already loaded as Q3
-vstrw.u32 Q3, [r1,#(480)]
-vstrw.u32 Q3, [r11,#(496)]
-vstrw.u32 Q3, [r11,#(-16)]
-// Release input[120] from Q3
-// input[124]: Load as Q0
-vldrw.u32 Q0, [r0, #(4 * 124)]
-vmul.u32 Q1, Q0, r7
-vneg.s32 Q2, Q0
-vqrdmulh.s32 Q3, Q0, r6
-vstrw.u32 Q0, [r11,#(0)]
-vmla.s32 Q1, Q3, r9
-vstrw.u32 Q1, [r1,#(496)]
-vsub.s32 Q2, Q2, Q1
-vstrw.u32 Q2, [r10,#(-496)]
-// Release input[124] from Q0
-//////////// END OF RADIX 3 //////////////////////////
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-// output[96]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 96)]
-vsub.s32 Q2, Q0, Q1
-vmul.u32 Q3, Q2, r3
-// output[192]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -60)]
-vadd.s32 Q0, Q0, Q1
-// Release output[96] from Q1
-// output[0]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 0)]
-vsub.s32 Q5, Q1, Q4
-// output[228]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -24)]
-vadd.s32 Q1, Q1, Q4
-// Release output[192] from Q4
-vqrdmulh.s32 Q2, Q2, r2
-vsub.s32 Q4, Q1, Q0
-// output[36]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 36)]
-vmla.s32 Q3, Q2, r9
-vstrw.u32 Q4, [r11,#(144)]
-vadd.s32 Q1, Q1, Q0
-// Release output[288] from Q0
-vstrw.u32 Q1, [r1,#(0)]
-// Release output[0] from Q1
-vsub.s32 Q2, Q5, Q3
-vstrw.u32 Q2, [r1,#(384)]
-vadd.s32 Q5, Q5, Q3
-vstrw.u32 Q5, [r11,#(-240)]
-// output[36]: Already loaded as Q7
-// output[228]: Already loaded as Q6
-vsub.s32 Q0, Q7, Q6
-vmul.u32 Q1, Q0, r3
-// output[324]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 72)]
-vadd.s32 Q7, Q7, Q6
-// Release output[228] from Q6
-// output[132]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -120)]
-vsub.s32 Q4, Q3, Q2
-// output[360]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release output[324] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[168]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -84)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(144)]
-vadd.s32 Q3, Q3, Q7
-// Release output[36] from Q7
-vstrw.u32 Q3, [r11,#(-480)]
-// Release output[132] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-96)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(288)]
-// output[168]: Already loaded as Q6
-// output[360]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[72]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 72)]
-vadd.s32 Q6, Q6, Q5
-// Release output[360] from Q5
-// output[264]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[108]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 108)]
-vadd.s32 Q3, Q3, Q2
-// Release output[72] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[300]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-336)]
-vadd.s32 Q3, Q3, Q6
-// Release output[168] from Q6
-vstrw.u32 Q3, [r11,#(48)]
-// Release output[264] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(288)]
-// output[300]: Already loaded as Q7
-// output[108]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[204]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -48)]
-vadd.s32 Q7, Q7, Q5
-// Release output[108] from Q5
-// output[12]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 12)]
-vsub.s32 Q4, Q3, Q2
-// output[240]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -12)]
-vadd.s32 Q3, Q3, Q2
-// Release output[204] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[48]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 48)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(192)]
-vadd.s32 Q3, Q3, Q7
-// Release output[300] from Q7
-vstrw.u32 Q3, [r1,#(48)]
-// Release output[12] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(432)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-192)]
-// output[48]: Already loaded as Q6
-// output[240]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[336]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 84)]
-vadd.s32 Q6, Q6, Q5
-// Release output[240] from Q5
-// output[144]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -108)]
-vsub.s32 Q4, Q3, Q2
-// output[372]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[336] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[180]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -72)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(192)]
-vadd.s32 Q3, Q3, Q6
-// Release output[48] from Q6
-vstrw.u32 Q3, [r11,#(-432)]
-// Release output[144] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-48)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(336)]
-// output[180]: Already loaded as Q7
-// output[372]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[84]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 84)]
-vadd.s32 Q7, Q7, Q5
-// Release output[372] from Q5
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// output[120]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 120)]
-vadd.s32 Q3, Q3, Q2
-// Release output[84] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[312]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-288)]
-vadd.s32 Q3, Q3, Q7
-// Release output[180] from Q7
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(336)]
-// output[312]: Already loaded as Q6
-// output[120]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[216]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -36)]
-vadd.s32 Q6, Q6, Q5
-// Release output[120] from Q5
-// output[24]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 24)]
-vsub.s32 Q4, Q3, Q2
-// output[252]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 0)]
-vadd.s32 Q3, Q3, Q2
-// Release output[216] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[60]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 60)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(240)]
-vadd.s32 Q3, Q3, Q6
-// Release output[312] from Q6
-vstrw.u32 Q3, [r1,#(96)]
-// Release output[24] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(480)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-144)]
-// output[60]: Already loaded as Q7
-// output[252]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vadd.s32 Q7, Q7, Q5
-// Release output[252] from Q5
-// output[156]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -96)]
-vsub.s32 Q4, Q3, Q2
-// output[352]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[348] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[160]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -92)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(240)]
-vadd.s32 Q3, Q3, Q7
-// Release output[60] from Q7
-vstrw.u32 Q3, [r11,#(-384)]
-// Release output[156] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(0)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(384)]
-// output[160]: Already loaded as Q6
-// output[352]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[64]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 64)]
-vadd.s32 Q6, Q6, Q5
-// Release output[352] from Q5
-// output[256]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[100]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 100)]
-vadd.s32 Q3, Q3, Q2
-// Release output[64] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[292]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-368)]
-vadd.s32 Q3, Q3, Q6
-// Release output[160] from Q6
-vstrw.u32 Q3, [r11,#(16)]
-// Release output[256] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(256)]
-// output[292]: Already loaded as Q7
-// output[100]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[196]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -56)]
-vadd.s32 Q7, Q7, Q5
-// Release output[100] from Q5
-// output[4]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 4)]
-vsub.s32 Q4, Q3, Q2
-// output[232]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -20)]
-vadd.s32 Q3, Q3, Q2
-// Release output[196] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[40]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 40)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(160)]
-vadd.s32 Q3, Q3, Q7
-// Release output[292] from Q7
-vstrw.u32 Q3, [r1,#(16)]
-// Release output[4] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(400)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-224)]
-// output[40]: Already loaded as Q6
-// output[232]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[328]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 76)]
-vadd.s32 Q6, Q6, Q5
-// Release output[232] from Q5
-// output[136]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -116)]
-vsub.s32 Q4, Q3, Q2
-// output[364]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[328] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[172]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -80)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(160)]
-vadd.s32 Q3, Q3, Q6
-// Release output[40] from Q6
-vstrw.u32 Q3, [r11,#(-464)]
-// Release output[136] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-80)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(304)]
-// output[172]: Already loaded as Q7
-// output[364]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[76]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 76)]
-vadd.s32 Q7, Q7, Q5
-// Release output[364] from Q5
-// output[268]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[112]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 112)]
-vadd.s32 Q3, Q3, Q2
-// Release output[76] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[304]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-320)]
-vadd.s32 Q3, Q3, Q7
-// Release output[172] from Q7
-vstrw.u32 Q3, [r11,#(64)]
-// Release output[268] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(304)]
-// output[304]: Already loaded as Q6
-// output[112]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[208]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -44)]
-vadd.s32 Q6, Q6, Q5
-// Release output[112] from Q5
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vsub.s32 Q4, Q3, Q2
-// output[244]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -8)]
-vadd.s32 Q3, Q3, Q2
-// Release output[208] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[52]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 52)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(208)]
-vadd.s32 Q3, Q3, Q6
-// Release output[304] from Q6
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(448)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-176)]
-// output[52]: Already loaded as Q7
-// output[244]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[340]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 88)]
-vadd.s32 Q7, Q7, Q5
-// Release output[244] from Q5
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vsub.s32 Q4, Q3, Q2
-// output[376]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[340] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[184]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -68)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(208)]
-vadd.s32 Q3, Q3, Q7
-// Release output[52] from Q7
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-32)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(352)]
-// output[184]: Already loaded as Q6
-// output[376]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[88]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 88)]
-vadd.s32 Q6, Q6, Q5
-// Release output[376] from Q5
-// output[280]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[124]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[88] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[316]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-272)]
-vadd.s32 Q3, Q3, Q6
-// Release output[184] from Q6
-vstrw.u32 Q3, [r11,#(112)]
-// Release output[280] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(352)]
-// output[316]: Already loaded as Q7
-// output[124]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[220]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -32)]
-vadd.s32 Q7, Q7, Q5
-// Release output[124] from Q5
-// output[28]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 28)]
-vsub.s32 Q4, Q3, Q2
-// output[224]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -28)]
-vadd.s32 Q3, Q3, Q2
-// Release output[220] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[32]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 32)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[316] from Q7
-vstrw.u32 Q3, [r1,#(112)]
-// Release output[28] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-128)]
-// output[32]: Already loaded as Q6
-// output[224]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[320]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 68)]
-vadd.s32 Q6, Q6, Q5
-// Release output[224] from Q5
-// output[128]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -124)]
-vsub.s32 Q4, Q3, Q2
-// output[356]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[320] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[164]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -88)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(128)]
-vadd.s32 Q3, Q3, Q6
-// Release output[32] from Q6
-vstrw.u32 Q3, [r11,#(-496)]
-// Release output[128] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-112)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(272)]
-// output[164]: Already loaded as Q7
-// output[356]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[68]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 68)]
-vadd.s32 Q7, Q7, Q5
-// Release output[356] from Q5
-// output[260]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[104]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 104)]
-vadd.s32 Q3, Q3, Q2
-// Release output[68] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[296]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-352)]
-vadd.s32 Q3, Q3, Q7
-// Release output[164] from Q7
-vstrw.u32 Q3, [r11,#(32)]
-// Release output[260] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(272)]
-// output[296]: Already loaded as Q6
-// output[104]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[200]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -52)]
-vadd.s32 Q6, Q6, Q5
-// Release output[104] from Q5
-// output[8]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 8)]
-vsub.s32 Q4, Q3, Q2
-// output[236]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -16)]
-vadd.s32 Q3, Q3, Q2
-// Release output[200] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[44]: Load as Q7
-vldrw.u32 Q7, [r1, #(4 * 44)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(176)]
-vadd.s32 Q3, Q3, Q6
-// Release output[296] from Q6
-vstrw.u32 Q3, [r1,#(32)]
-// Release output[8] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(416)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-208)]
-// output[44]: Already loaded as Q7
-// output[236]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[332]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 80)]
-vadd.s32 Q7, Q7, Q5
-// Release output[236] from Q5
-// output[140]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -112)]
-vsub.s32 Q4, Q3, Q2
-// output[368]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[332] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[176]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * -76)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(176)]
-vadd.s32 Q3, Q3, Q7
-// Release output[44] from Q7
-vstrw.u32 Q3, [r11,#(-448)]
-// Release output[140] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-64)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(320)]
-// output[176]: Already loaded as Q6
-// output[368]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[80]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 80)]
-vadd.s32 Q6, Q6, Q5
-// Release output[368] from Q5
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[116]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 116)]
-vadd.s32 Q3, Q3, Q2
-// Release output[80] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[308]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-304)]
-vadd.s32 Q3, Q3, Q6
-// Release output[176] from Q6
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(320)]
-// output[308]: Already loaded as Q7
-// output[116]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[212]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -40)]
-vadd.s32 Q7, Q7, Q5
-// Release output[116] from Q5
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vsub.s32 Q4, Q3, Q2
-// output[248]: Load as Q5
-vldrw.u32 Q5, [r11, #(4 * -4)]
-vadd.s32 Q3, Q3, Q2
-// Release output[212] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[56]: Load as Q6
-vldrw.u32 Q6, [r1, #(4 * 56)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(224)]
-vadd.s32 Q3, Q3, Q7
-// Release output[308] from Q7
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r1,#(464)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(-160)]
-// output[56]: Already loaded as Q6
-// output[248]: Already loaded as Q5
-vsub.s32 Q0, Q6, Q5
-vmul.u32 Q1, Q0, r3
-// output[344]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 92)]
-vadd.s32 Q6, Q6, Q5
-// Release output[248] from Q5
-// output[152]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -100)]
-vsub.s32 Q4, Q3, Q2
-// output[380]: Load as Q5
-vldrw.u32 Q5, [r10, #(4 * -124)]
-vadd.s32 Q3, Q3, Q2
-// Release output[344] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q6
-// output[188]: Load as Q7
-vldrw.u32 Q7, [r11, #(4 * -64)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r1,#(224)]
-vadd.s32 Q3, Q3, Q6
-// Release output[56] from Q6
-vstrw.u32 Q3, [r11,#(-400)]
-// Release output[152] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r11,#(-16)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r11,#(368)]
-// output[188]: Already loaded as Q7
-// output[380]: Already loaded as Q5
-vsub.s32 Q0, Q7, Q5
-vmul.u32 Q1, Q0, r3
-// output[92]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 92)]
-vadd.s32 Q7, Q7, Q5
-// Release output[380] from Q5
-// output[284]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 32)]
-vsub.s32 Q4, Q3, Q2
-// output[24]: Load as Q5
-vldrw.u32 Q5, [r1, #(4 * 24)]
-vadd.s32 Q3, Q3, Q2
-// Release output[92] from Q2
-vqrdmulh.s32 Q0, Q0, r2
-vsub.s32 Q2, Q3, Q7
-// output[264]: Load as Q6
-vldrw.u32 Q6, [r11, #(4 * 12)]
-vmla.s32 Q1, Q0, r9
-vstrw.u32 Q2, [r11,#(-256)]
-vadd.s32 Q3, Q3, Q7
-// Release output[188] from Q7
-vstrw.u32 Q3, [r11,#(128)]
-// Release output[284] from Q3
-vsub.s32 Q0, Q4, Q1
-vstrw.u32 Q0, [r10,#(-496)]
-vadd.s32 Q4, Q4, Q1
-vstrw.u32 Q4, [r1,#(368)]
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[24]: Already loaded as Q5
-vmul.u32 Q0, Q5, r7
-// output[144]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -108)]
-vqrdmulh.s32 Q5, Q5, r6
-// output[264]: Already loaded as Q6
-vmla.s32 Q0, Q5, r9
-vmul.u32 Q2, Q1, r7
-vsub.s32 Q5, Q6, Q0
-vqrdmulh.s32 Q1, Q1, r6
-vadd.s32 Q6, Q6, Q0
-vmla.s32 Q2, Q1, r9
-// output[0]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 0)]
-vmul.u32 Q3, Q5, r3
-vsub.s32 Q1, Q0, Q2
-vqrdmulh.s32 Q5, Q5, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q3, Q5, r9
-// output[156]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -96)]
-vmul.u32 Q4, Q6, r5
-vsub.s32 Q5, Q1, Q3
-vqrdmulh.s32 Q6, Q6, r4
-vadd.s32 Q1, Q1, Q3
-vstrw.u32 Q5, [r1,#(96)]
-// Release output[24] from Q5
-vmla.s32 Q4, Q6, r9
-vstrw.u32 Q1, [r11,#(-432)]
-// Release output[144] from Q1
-vsub.s32 Q6, Q0, Q4
-vstrw.u32 Q6, [r11,#(48)]
-// Release output[264] from Q6
-vadd.s32 Q0, Q0, Q4
-// output[156]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[276]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 24)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[12]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 12)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(0)]
-// Release output[0] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[132]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -120)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[280]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-384)]
-// Release output[156] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(96)]
-// Release output[276] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(48)]
-// Release output[12] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[280]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[16]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 16)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[136]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -116)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-480)]
-// Release output[132] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[256]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 4)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[28]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 28)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(112)]
-// Release output[280] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(64)]
-// Release output[16] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-464)]
-// Release output[136] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[28]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[148]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -104)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[268]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 16)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(16)]
-// Release output[256] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[4]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 4)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[152]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -100)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r1,#(112)]
-// Release output[28] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-416)]
-// Release output[148] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(64)]
-// Release output[268] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[152]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[272]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 20)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[8]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 8)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r1,#(16)]
-// Release output[4] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[128]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -124)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[284]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 32)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(-400)]
-// Release output[152] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(80)]
-// Release output[272] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r1,#(32)]
-// Release output[8] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[284]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[20]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 20)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[140]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -112)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r11,#(-496)]
-// Release output[128] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[260]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 8)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[312]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(128)]
-// Release output[284] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(80)]
-// Release output[20] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r11,#(-448)]
-// Release output[140] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[312]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[48]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 48)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[168]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -84)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(32)]
-// Release output[260] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[288]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 36)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[60]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 60)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(240)]
-// Release output[312] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(192)]
-// Release output[48] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-336)]
-// Release output[168] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[60]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[180]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -72)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[300]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(144)]
-// Release output[288] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[36]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 36)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[184]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -68)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(240)]
-// Release output[60] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-288)]
-// Release output[180] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(192)]
-// Release output[300] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[184]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[304]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 52)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[40]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 40)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(144)]
-// Release output[36] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[160]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -92)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[316]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-272)]
-// Release output[184] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(208)]
-// Release output[304] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(160)]
-// Release output[40] from Q4
-vadd.s32 Q2, Q2, Q6
-// output[316]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[52]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 52)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[172]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -80)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-368)]
-// Release output[160] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[292]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * 40)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[56]: Load as Q2
-vldrw.u32 Q2, [r1, #(4 * 56)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(256)]
-// Release output[316] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(208)]
-// Release output[52] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r11,#(-320)]
-// Release output[172] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[56]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[176]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * -76)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[296]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * 44)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(160)]
-// Release output[292] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[32]: Load as Q1
-vldrw.u32 Q1, [r1, #(4 * 32)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[188]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -64)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r1,#(224)]
-// Release output[56] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(-304)]
-// Release output[176] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(176)]
-// Release output[296] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[188]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[308]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 56)]
-vqrdmulh.s32 Q0, Q0, r6
-// output[44]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 44)]
-vmla.s32 Q2, Q0, r9
-vstrw.u32 Q1, [r1,#(128)]
-// Release output[32] from Q1
-vmul.u32 Q1, Q3, r7
-vsub.s32 Q0, Q4, Q2
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q2
-vmla.s32 Q1, Q3, r9
-// output[164]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * -88)]
-vmul.u32 Q5, Q0, r3
-vsub.s32 Q3, Q2, Q1
-vqrdmulh.s32 Q0, Q0, r2
-vadd.s32 Q2, Q2, Q1
-vmla.s32 Q5, Q0, r9
-// output[216]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -36)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q0, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q0, [r11,#(-256)]
-// Release output[188] from Q0
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(224)]
-// Release output[308] from Q3
-vsub.s32 Q4, Q2, Q6
-vstrw.u32 Q4, [r1,#(176)]
-// Release output[44] from Q4
-vadd.s32 Q2, Q2, Q6
-ldrd r7, r6, [r8], #+8
-ldrd r5, r4, [r8], #+8
-ldrd r3, r2, [r8], #+8
-// output[216]: Already loaded as Q1
-vmul.u32 Q0, Q1, r7
-// output[336]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 84)]
-vqrdmulh.s32 Q1, Q1, r6
-// output[72]: Load as Q4
-vldrw.u32 Q4, [r1, #(4 * 72)]
-vmla.s32 Q0, Q1, r9
-vstrw.u32 Q2, [r11,#(-352)]
-// Release output[164] from Q2
-vmul.u32 Q2, Q3, r7
-vsub.s32 Q1, Q4, Q0
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q0
-vmla.s32 Q2, Q3, r9
-// output[192]: Load as Q0
-vldrw.u32 Q0, [r11, #(4 * -60)]
-vmul.u32 Q5, Q1, r3
-vsub.s32 Q3, Q0, Q2
-vqrdmulh.s32 Q1, Q1, r2
-vadd.s32 Q0, Q0, Q2
-vmla.s32 Q5, Q1, r9
-// output[348]: Load as Q2
-vldrw.u32 Q2, [r11, #(4 * 96)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q1, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q1, [r11,#(-144)]
-// Release output[216] from Q1
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r11,#(336)]
-// Release output[336] from Q3
-vsub.s32 Q4, Q0, Q6
-vstrw.u32 Q4, [r1,#(288)]
-// Release output[72] from Q4
-vadd.s32 Q0, Q0, Q6
-// output[348]: Already loaded as Q2
-vmul.u32 Q1, Q2, r7
-// output[84]: Load as Q3
-vldrw.u32 Q3, [r1, #(4 * 84)]
-vqrdmulh.s32 Q2, Q2, r6
-// output[204]: Load as Q4
-vldrw.u32 Q4, [r11, #(4 * -48)]
-vmla.s32 Q1, Q2, r9
-vstrw.u32 Q0, [r11,#(-240)]
-// Release output[192] from Q0
-vmul.u32 Q0, Q3, r7
-vsub.s32 Q2, Q4, Q1
-vqrdmulh.s32 Q3, Q3, r6
-vadd.s32 Q4, Q4, Q1
-vmla.s32 Q0, Q3, r9
-// output[324]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * 72)]
-vmul.u32 Q5, Q2, r3
-vsub.s32 Q3, Q1, Q0
-vqrdmulh.s32 Q2, Q2, r2
-vadd.s32 Q1, Q1, Q0
-vmla.s32 Q5, Q2, r9
-// output[88]: Load as Q0
-vldrw.u32 Q0, [r1, #(4 * 88)]
-vmul.u32 Q6, Q4, r5
-vsub.s32 Q2, Q3, Q5
-vqrdmulh.s32 Q4, Q4, r4
-vadd.s32 Q3, Q3, Q5
-vstrw.u32 Q2, [r11,#(384)]
-// Release output[348] from Q2
-vmla.s32 Q6, Q4, r9
-vstrw.u32 Q3, [r1,#(336)]
-// Release output[84] from Q3
-vsub.s32 Q4, Q1, Q6
-vstrw.u32 Q4, [r11,#(-192)]
-// Release output[204] from Q4
-vadd.s32 Q1, Q1, Q6
-// output[88]: Already loaded as Q0
-vmul.u32 Q2, Q0, r7
-// output[208]: Load as Q3
-vldrw.u32 Q3, [r11, #(4 * 
-44)] -vqrdmulh.s32 Q0, Q0, r6 -// output[328]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 76)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(288)] -// Release output[324] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[64]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 64)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[220]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -32)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(352)] -// Release output[88] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-176)] -// Release output[208] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(304)] -// Release output[328] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[220]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vqrdmulh.s32 Q1, Q1, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r1,#(256)] -// Release output[64] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[196]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -56)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[344]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r11,#(-128)] -// Release output[220] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(352)] -// Release output[340] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[344]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[80]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 80)] -vqrdmulh.s32 Q2, Q2, r6 -// output[200]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -52)] 
-vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r11,#(-224)] -// Release output[196] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[320]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 68)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[92]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 92)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(368)] -// Release output[344] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(320)] -// Release output[80] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r11,#(-208)] -// Release output[200] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[92]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[212]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -40)] -vqrdmulh.s32 Q0, Q0, r6 -// output[332]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 80)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(272)] -// Release output[320] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[68]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 68)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r1,#(368)] -// Release output[92] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-160)] -// Release output[212] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(320)] -// Release output[332] from Q4 -vadd.s32 Q2, Q2, Q6 -ldrd r7, r6, [r8], #+8 -ldrd r5, r4, [r8], #+8 -ldrd r3, r2, [r8], #+8 -// output[120]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[240]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -12)] -vqrdmulh.s32 Q1, Q1, r6 -// output[360]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 108)] -vmla.s32 Q0, Q1, r9 
-vstrw.u32 Q2, [r1,#(272)] -// Release output[68] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[96]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 96)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[252]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 0)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-48)] -// Release output[240] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(432)] -// Release output[360] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[252]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[372]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 120)] -vqrdmulh.s32 Q2, Q2, r6 -// output[108]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 108)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(384)] -// Release output[96] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[228]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -24)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[376]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(0)] -// Release output[252] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(480)] -// Release output[372] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(432)] -// Release output[108] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[376]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[112]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 112)] -vqrdmulh.s32 Q0, Q0, r6 -// output[232]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -20)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-96)] -// Release output[228] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, 
Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[352]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 100)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -// output[124]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r11,#(496)] -// Release output[376] from Q0 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r1,#(448)] -// Release output[112] from Q3 -vsub.s32 Q4, Q2, Q6 -vstrw.u32 Q4, [r11,#(-80)] -// Release output[232] from Q4 -vadd.s32 Q2, Q2, Q6 -// output[124]: Already loaded as Q1 -vmul.u32 Q0, Q1, r7 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vqrdmulh.s32 Q1, Q1, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q1, r9 -vstrw.u32 Q2, [r11,#(400)] -// Release output[352] from Q2 -vmul.u32 Q2, Q3, r7 -vsub.s32 Q1, Q4, Q0 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q0 -vmla.s32 Q2, Q3, r9 -// output[100]: Load as Q0 -vldrw.u32 Q0, [r1, #(4 * 100)] -vmul.u32 Q5, Q1, r3 -vsub.s32 Q3, Q0, Q2 -vqrdmulh.s32 Q1, Q1, r2 -vadd.s32 Q0, Q0, Q2 -vmla.s32 Q5, Q1, r9 -// output[248]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -4)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q1, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q1, [r1,#(496)] -// Release output[124] from Q1 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vsub.s32 Q4, Q0, Q6 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q0, Q0, Q6 -// output[248]: Already loaded as Q2 -vmul.u32 Q1, Q2, r7 -// output[368]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 116)] -vqrdmulh.s32 Q2, Q2, r6 -// output[104]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 104)] -vmla.s32 Q1, Q2, r9 -vstrw.u32 Q0, [r1,#(400)] -// Release output[100] from Q0 -vmul.u32 Q0, Q3, r7 -vsub.s32 Q2, Q4, Q1 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q1 -vmla.s32 Q0, Q3, r9 -// output[224]: Load 
as Q1 -vldrw.u32 Q1, [r11, #(4 * -28)] -vmul.u32 Q5, Q2, r3 -vsub.s32 Q3, Q1, Q0 -vqrdmulh.s32 Q2, Q2, r2 -vadd.s32 Q1, Q1, Q0 -vmla.s32 Q5, Q2, r9 -// output[380]: Load as Q0 -vldrw.u32 Q0, [r10, #(4 * -124)] -vmul.u32 Q6, Q4, r5 -vsub.s32 Q2, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q2, [r11,#(-16)] -// Release output[248] from Q2 -vmla.s32 Q6, Q4, r9 -vstrw.u32 Q3, [r11,#(464)] -// Release output[368] from Q3 -vsub.s32 Q4, Q1, Q6 -vstrw.u32 Q4, [r1,#(416)] -// Release output[104] from Q4 -vadd.s32 Q1, Q1, Q6 -// output[380]: Already loaded as Q0 -vmul.u32 Q2, Q0, r7 -// output[116]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 116)] -vqrdmulh.s32 Q0, Q0, r6 -// output[236]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -16)] -vmla.s32 Q2, Q0, r9 -vstrw.u32 Q1, [r11,#(-112)] -// Release output[224] from Q1 -vmul.u32 Q1, Q3, r7 -vsub.s32 Q0, Q4, Q2 -vqrdmulh.s32 Q3, Q3, r6 -vadd.s32 Q4, Q4, Q2 -vmla.s32 Q1, Q3, r9 -// output[356]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 104)] -vmul.u32 Q5, Q0, r3 -vsub.s32 Q3, Q2, Q1 -vqrdmulh.s32 Q0, Q0, r2 -vadd.s32 Q2, Q2, Q1 -vmla.s32 Q5, Q0, r9 -vmul.u32 Q1, Q4, r5 -vsub.s32 Q0, Q3, Q5 -vqrdmulh.s32 Q4, Q4, r4 -vadd.s32 Q3, Q3, Q5 -vstrw.u32 Q0, [r10,#(-496)] -// Release output[380] from Q0 -vmla.s32 Q1, Q4, r9 -vstrw.u32 Q3, [r1,#(464)] -// Release output[116] from Q3 -vsub.s32 Q4, Q2, Q1 -vstrw.u32 Q4, [r11,#(-64)] -// Release output[236] from Q4 -vadd.s32 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(416)] -// Release output[356] from Q2 -ldrd r7, r6, [r8], #+8 -// output[132]: Load as Q0 -vldrw.u32 Q0, [r11, #(4 * -120)] -vmul.u32 Q1, Q0, r7 -// output[0]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 0)] -vqrdmulh.s32 Q0, Q0, r6 -// output[4]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 4)] -vmla.s32 Q1, Q0, r9 -vsub.s32 Q0, Q2, Q1 -vstrw.u32 Q0, [r11,#(-480)] -// Release output[132] from Q0 -vadd.s32 Q2, Q2, Q1 -// output[4]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[256]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 4)] -vqrdmulh.s32 Q3, 
Q3, r6 -// output[260]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 8)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(0)] -// Release output[0] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(16)] -// Release output[4] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[260]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[128]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[12]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 12)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(16)] -// Release output[256] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(32)] -// Release output[260] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[12]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[264]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 12)] -vqrdmulh.s32 Q3, Q3, r6 -// output[268]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 16)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-496)] -// Release output[128] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(48)] -// Release output[12] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[268]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[136]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[140]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -112)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(48)] -// Release output[264] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(64)] -// Release output[268] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[140]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[8]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 8)] -vqrdmulh.s32 Q3, Q3, r6 -// output[276]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-464)] -// Release output[136] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-448)] -// Release output[140] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[276]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[144]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -108)] -vqrdmulh.s32 Q4, Q4, r6 -// output[148]: Load 
as Q3 -vldrw.u32 Q3, [r11, #(4 * -104)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(32)] -// Release output[8] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(96)] -// Release output[276] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[148]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[16]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 16)] -vqrdmulh.s32 Q3, Q3, r6 -// output[20]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 20)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-432)] -// Release output[144] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-416)] -// Release output[148] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[20]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[272]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[156]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(64)] -// Release output[16] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(80)] -// Release output[20] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[156]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[24]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 24)] -vqrdmulh.s32 Q3, Q3, r6 -// output[28]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 28)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(80)] -// Release output[272] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-384)] -// Release output[156] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[28]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[280]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[284]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 32)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(96)] -// Release output[24] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(112)] -// Release output[28] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[284]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[152]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[36]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 36)] -vmla.s32 Q0, Q3, r9 
-vstrw.u32 Q2, [r11,#(112)] -// Release output[280] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(128)] -// Release output[284] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[36]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[288]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 36)] -vqrdmulh.s32 Q4, Q4, r6 -// output[292]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 40)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-400)] -// Release output[152] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(144)] -// Release output[36] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[292]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[160]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[164]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -88)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(144)] -// Release output[288] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(160)] -// Release output[292] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[164]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[32]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 32)] -vqrdmulh.s32 Q4, Q4, r6 -// output[300]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-368)] -// Release output[160] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-352)] -// Release output[164] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[300]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[168]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -84)] -vqrdmulh.s32 Q3, Q3, r6 -// output[172]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -80)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(128)] -// Release output[32] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(192)] -// Release output[300] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[172]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[40]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 40)] -vqrdmulh.s32 Q4, Q4, r6 -// output[44]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 44)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, 
[r11,#(-336)] -// Release output[168] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-320)] -// Release output[172] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[44]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[296]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[180]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(160)] -// Release output[40] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(176)] -// Release output[44] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[180]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[48]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 48)] -vqrdmulh.s32 Q4, Q4, r6 -// output[52]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 52)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(176)] -// Release output[296] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-288)] -// Release output[180] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[52]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[304]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[308]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 56)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(192)] -// Release output[48] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(208)] -// Release output[52] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[308]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[176]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[60]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 60)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(208)] -// Release output[304] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(224)] -// Release output[308] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[60]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[312]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 60)] -vqrdmulh.s32 Q3, Q3, r6 -// output[316]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 64)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-304)] -// Release 
output[176] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(240)] -// Release output[60] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[316]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[184]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[188]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -64)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(240)] -// Release output[312] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(256)] -// Release output[316] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[188]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[56]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 56)] -vqrdmulh.s32 Q3, Q3, r6 -// output[324]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 72)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-272)] -// Release output[184] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-256)] -// Release output[188] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[324]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[192]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -60)] -vqrdmulh.s32 Q4, Q4, r6 -// output[196]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -56)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(224)] -// Release output[56] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(288)] -// Release output[324] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[196]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[64]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 64)] -vqrdmulh.s32 Q3, Q3, r6 -// output[68]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 68)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-240)] -// Release output[192] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-224)] -// Release output[196] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[68]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[320]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 68)] -vqrdmulh.s32 Q4, Q4, r6 -// output[204]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -48)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(256)] -// Release output[64] from Q1 -vsub.s32 Q4, Q2, Q0 
-vstrw.u32 Q4, [r1,#(272)] -// Release output[68] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[204]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[72]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 72)] -vqrdmulh.s32 Q3, Q3, r6 -// output[76]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 76)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(272)] -// Release output[320] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-192)] -// Release output[204] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[76]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[328]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 76)] -vqrdmulh.s32 Q4, Q4, r6 -// output[332]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 80)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(288)] -// Release output[72] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(304)] -// Release output[76] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[332]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[200]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -52)] -vqrdmulh.s32 Q3, Q3, r6 -// output[84]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 84)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(304)] -// Release output[328] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(320)] -// Release output[332] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[84]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[336]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 84)] -vqrdmulh.s32 Q4, Q4, r6 -// output[340]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 88)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-208)] -// Release output[200] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(336)] -// Release output[84] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[340]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[208]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -44)] -vqrdmulh.s32 Q3, Q3, r6 -// output[212]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -40)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(336)] -// Release output[336] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(352)] 
-// Release output[340] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[212]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[80]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 80)] -vqrdmulh.s32 Q4, Q4, r6 -// output[348]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 96)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-176)] -// Release output[208] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-160)] -// Release output[212] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[348]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[216]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * -36)] -vqrdmulh.s32 Q3, Q3, r6 -// output[220]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -32)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(320)] -// Release output[80] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(384)] -// Release output[348] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[220]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[88]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 88)] -vqrdmulh.s32 Q4, Q4, r6 -// output[92]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 92)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(-144)] -// Release output[216] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-128)] -// Release output[220] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[92]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[344]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 92)] -vqrdmulh.s32 Q3, Q3, r6 -// output[228]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * -24)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(352)] -// Release output[88] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(368)] -// Release output[92] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd r7, r6, [r8], #+8 -// output[228]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[96]: Load as Q2 -vldrw.u32 Q2, [r1, #(4 * 96)] -vqrdmulh.s32 Q4, Q4, r6 -// output[100]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 100)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(368)] -// Release output[344] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(-96)] -// Release output[228] 
from Q4 -vadd.s32 Q2, Q2, Q0 -// output[100]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[352]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 100)] -vqrdmulh.s32 Q3, Q3, r6 -// output[356]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 104)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r1,#(384)] -// Release output[96] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(400)] -// Release output[100] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[356]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[224]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -28)] -vqrdmulh.s32 Q4, Q4, r6 -// output[108]: Load as Q3 -vldrw.u32 Q3, [r1, #(4 * 108)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(400)] -// Release output[352] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(416)] -// Release output[356] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[108]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[360]: Load as Q1 -vldrw.u32 Q1, [r11, #(4 * 108)] -vqrdmulh.s32 Q3, Q3, r6 -// output[364]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 112)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-112)] -// Release output[224] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r1,#(432)] -// Release output[108] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[364]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[232]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -20)] -vqrdmulh.s32 Q4, Q4, r6 -// output[236]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -16)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r11,#(432)] -// Release output[360] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(448)] -// Release output[364] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[236]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[104]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 104)] -vqrdmulh.s32 Q3, Q3, r6 -// output[372]: Load as Q4 -vldrw.u32 Q4, [r11, #(4 * 120)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-80)] -// Release output[232] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-64)] -// Release output[236] from Q3 -vadd.s32 Q1, Q1, Q0 -ldrd 
r7, r6, [r8], #+8 -// output[372]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[240]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * -12)] -vqrdmulh.s32 Q4, Q4, r6 -// output[244]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * -8)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(416)] -// Release output[104] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r11,#(480)] -// Release output[372] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[244]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[112]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 112)] -vqrdmulh.s32 Q3, Q3, r6 -// output[116]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 116)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(-48)] -// Release output[240] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(-32)] -// Release output[244] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[116]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[368]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 116)] -vqrdmulh.s32 Q4, Q4, r6 -// output[252]: Load as Q3 -vldrw.u32 Q3, [r11, #(4 * 0)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(448)] -// Release output[112] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(464)] -// Release output[116] from Q4 -vadd.s32 Q2, Q2, Q0 -ldrd r7, r6, [r8], #+8 -// output[252]: Already loaded as Q3 -vmul.u32 Q0, Q3, r7 -// output[120]: Load as Q1 -vldrw.u32 Q1, [r1, #(4 * 120)] -vqrdmulh.s32 Q3, Q3, r6 -// output[124]: Load as Q4 -vldrw.u32 Q4, [r1, #(4 * 124)] -vmla.s32 Q0, Q3, r9 -vstrw.u32 Q2, [r11,#(464)] -// Release output[368] from Q2 -vsub.s32 Q3, Q1, Q0 -vstrw.u32 Q3, [r11,#(0)] -// Release output[252] from Q3 -vadd.s32 Q1, Q1, Q0 -// output[124]: Already loaded as Q4 -vmul.u32 Q0, Q4, r7 -// output[376]: Load as Q2 -vldrw.u32 Q2, [r11, #(4 * 124)] -vqrdmulh.s32 Q4, Q4, r6 -// output[380]: Load as Q3 -vldrw.u32 Q3, [r10, #(4 * -124)] -vmla.s32 Q0, Q4, r9 -vstrw.u32 Q1, [r1,#(480)] -// Release output[120] from Q1 -vsub.s32 Q4, Q2, Q0 -vstrw.u32 Q4, [r1,#(496)] -// Release output[124] from Q4 -vadd.s32 Q2, Q2, Q0 -// output[380]: Already 
loaded as Q3
-vmul.u32 Q0, Q3, r7
-// output[248]: Load as Q1
-vldrw.u32 Q1, [r11, #(4 * -4)]
-vqrdmulh.s32 Q3, Q3, r6
-vmla.s32 Q0, Q3, r9
-vstrw.u32 Q2, [r11,#(496)]
-// Release output[376] from Q2
-vsub.s32 Q3, Q1, Q0
-vstrw.u32 Q3, [r10,#(-496)]
-// Release output[380] from Q3
-vadd.s32 Q1, Q1, Q0
-vstrw.u32 Q1, [r11,#(-16)]
-// Release output[248] from Q1
-.equ modulus_inv, 2228766271
-movw r14, #:lower16:modulus_inv
-movt r14, #:upper16:modulus_inv
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 3042
-// Instruction count: 2201
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s b/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s
deleted file mode 100644
index e9a6bb3..0000000
--- a/tests/poly/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s
+++ /dev/null
@@ -1,198 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve, %function
-.global poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve
-poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r14, r1, #1008
-mov r12, #1
-mov r11, #2
-mov r10, #3
-mov r9, #7
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q4, Q0, Q2
-vadd.u16 Q5, Q1, Q3
-vsub.u16 Q6, Q4, Q5
-vmla.s16 Q4, Q0, r10
-vstrw.u32 Q6, [r1,#(384)]
-vmla.s16 Q6, Q5, r11
-vstrw.u32 Q0, [r1,#(0)]
-vmla.s16 Q5, Q1, r10
-vstrw.u32 Q3, [r14,#(-240)]
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(-496)]
-vmla.s16 Q7, Q2, r10
-vmla.s16 Q7, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q4, Q4, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q6, Q4, Q5
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q4, Q4, Q5
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r14,#(-368)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [r1,#(256)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [r1,#(128)]
-vmla.s16 Q5, Q0, r10
-vstrw.u32 Q4, [r1,#(400)]
-vmla.s16 Q4, Q7, r11
-vstrw.u32 Q0, [r1,#(16)]
-vmla.s16 Q7, Q1, r10
-vstrw.u32 Q3, [r14,#(-224)]
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(-480)]
-vmla.s16 Q6, Q2, r10
-vmla.s16 Q6, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q5, Q5, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q4, Q5, Q7
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q5, Q5, Q7
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r14,#(-352)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [r1,#(272)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [r1,#(144)]
-vmla.s16 Q7, Q0, r10
-vstrw.u32 Q5, [r1,#(416)]
-vmla.s16 Q5, Q6, r11
-vstrw.u32 Q0, [r1,#(32)]
-vmla.s16 Q6, Q1, r10
-vstrw.u32 Q3, [r14,#(-208)]
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(-464)]
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q7, Q7, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q5, Q7, Q6
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q7, Q7, Q6
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r14,#(-336)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [r1,#(288)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [r1,#(160)]
-vmla.s16 Q6, Q0, r10
-vstrw.u32 Q7, [r1,#(432)]
-vmla.s16 Q7, Q4, r11
-vstrw.u32 Q0, [r1,#(48)]
-vmla.s16 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-192)]
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(-448)]
-vmla.s16 Q5, Q2, r10
-vmla.s16 Q5, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q6, Q6, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q7, Q6, Q4
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q6, Q6, Q4
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q4, Q0, Q2
-vstrw.u32 Q5, [r14,#(-320)]
-vadd.u16 Q5, Q1, Q3
-vstrw.u32 Q6, [r1,#(304)]
-vsub.u16 Q6, Q4, Q5
-vstrw.u32 Q7, [r1,#(176)]
-vmla.s16 Q4, Q0, r10
-vstrw.u32 Q6, [r1,#(448)]
-vmla.s16 Q6, Q5, r11
-vstrw.u32 Q0, [r1,#(64)]
-vmla.s16 Q5, Q1, r10
-vstrw.u32 Q3, [r14,#(-176)]
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(-432)]
-vmla.s16 Q7, Q2, r10
-vmla.s16 Q7, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q4, Q4, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q6, Q4, Q5
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q4, Q4, Q5
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r14,#(-304)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [r1,#(320)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [r1,#(192)]
-vmla.s16 Q5, Q0, r10
-vstrw.u32 Q4, [r1,#(464)]
-vmla.s16 Q4, Q7, r11
-vstrw.u32 Q0, [r1,#(80)]
-vmla.s16 Q7, Q1, r10
-vstrw.u32 Q3, [r14,#(-160)]
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(-416)]
-vmla.s16 Q6, Q2, r10
-vmla.s16 Q6, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q5, Q5, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q4, Q5, Q7
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q5, Q5, Q7
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r14,#(-288)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [r1,#(336)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [r1,#(208)]
-vmla.s16 Q7, Q0, r10
-vstrw.u32 Q5, [r1,#(480)]
-vmla.s16 Q5, Q6, r11
-vstrw.u32 Q0, [r1,#(96)]
-vmla.s16 Q6, Q1, r10
-vstrw.u32 Q3, [r14,#(-144)]
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(-400)]
-vmla.s16 Q4, Q2, r10
-vmla.s16 Q4, Q3, r9
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q7, Q7, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q5, Q7, Q6
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q7, Q7, Q6
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r14,#(-272)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [r1,#(352)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [r1,#(224)]
-vmla.s16 Q6, Q0, r10
-vstrw.u32 Q7, [r1,#(496)]
-vmla.s16 Q7, Q4, r11
-vstrw.u32 Q0, [r1,#(112)]
-vmla.s16 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(-128)]
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(-384)]
-vmla.s16 Q5, Q2, r10
-vmla.s16 Q5, Q3, r9
-vshl.u16 Q6, Q6, #1
-vstrw.u32 Q5, [r14,#(-256)]
-vsub.u16 Q5, Q6, Q4
-vstrw.u32 Q5, [r1,#(240)]
-vadd.u16 Q6, Q6, Q4
-vstrw.u32 Q6, [r1,#(368)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/poly/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s b/tests/poly/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s
deleted file mode 100644
index e890c41..0000000
--- a/tests/poly/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s
+++ /dev/null
@@ -1,380 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve, %function
-.global poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve
-poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r14, r0, #1008
-mov r12, #0
-mov r11, #0
-mov r10, #0
-mov r9, #21840
-mov r8, #45
-mov r7, #43691
-mov r6, #8
-mov r5, #-30
-mov r4, #4369
-mov r3, #-65
-mov r2, #36409
-vldrw.u32 Q4, [r14, #(4 * -124)]
-vldrw.u32 Q5, [r0, #(4 * 96)]
-vsub.u16 Q5, Q5, Q4
-vshr.u16 Q5, Q5, #1
-vadd.u16 Q4, Q4, Q5
-vldrw.u32 Q6, [r14, #(4 * -92)]
-vmla.s16 Q6, Q4, r3
-vldrw.u32 Q2, [r0, #(4 * 64)]
-vadd.u16 Q6, Q6, Q2
-vldrw.u32 Q1, [r0, #(4 * 32)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q7, [r14, #(4 * -60)]
-vsub.u16 Q2, Q2, Q7
-vldrw.u32 Q0, [r0, #(4 * 0)]
-vsub.u16 Q4, Q4, Q0
-vadd.u16 Q2, Q2, Q2
-vsub.u16 Q4, Q4, Q7
-vadd.u16 Q2, Q2, Q1
-vmla.s16 Q6, Q4, r8
-vshr.u16 Q2, Q2, #3
-vadd.u16 Q1, Q1, Q6
-vsub.u16 Q2, Q2, Q4
-vmul.u16 Q2, Q2, r7
-vshr.u16 Q6, Q6, #1
-vmla.s16 Q2, Q0, r9
-vshlc Q7, r12, #16
-vmla.s16 Q6, Q5, r6
-vsub.u16 Q4, Q4, Q2
-vadd.u16 Q2, Q2, Q7
-vldrw.u32 Q7, [r14,
#(4 * -88)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 100)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -56)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 4)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 104)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -80)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 108)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -48)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 12)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -76)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 112)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -44)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -72)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 116)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -40)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 20)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 120)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -36)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 124)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -32)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 28)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/poly/poly.mk b/tests/poly/poly.mk index d7175ab..a89bf56 100644 --- a/tests/poly/poly.mk +++ b/tests/poly/poly.mk @@ -18,9 +18,10 @@ POLY_ASMS += ../../asm/manual/schoolbook/poly_u16_32.s POLY_ASMS += ../../asm/manual/schoolbook/poly_u16_32_acc.s POLY_ASMS += $(POLY_ASM_DIR)/inv_ntt_u32_33556993_28678040_complete.s POLY_ASMS += $(POLY_ASM_DIR)/ntt_u32_33556993_28678040_complete.s -POLY_ASMS += $(POLY_ASM_DIR)/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s -POLY_ASMS += $(POLY_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s SABER_ASM_DIR = ../../asm/auto/saber POLY_ASMS += $(SABER_ASM_DIR)/inv_ntt_u32_33556993_28678040_incomplete.s POLY_ASMS += $(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete.s -POLY_ASMS += $(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete_double.s \ No newline at end of file +POLY_ASMS += 
$(SABER_ASM_DIR)/ntt_u32_33556993_28678040_incomplete_double.s +TOOM_ASM_DIR=../../asm/auto/poly/toom4 +POLY_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s +POLY_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s \ No newline at end of file diff --git a/tests/sqmag/cmplx_mag_sqr_fx.s b/tests/sqmag/cmplx_mag_sqr_fx.s deleted file mode 100644 index c3f683c..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx.s +++ /dev/null @@ -1,38 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx, %function - .global cmplx_mag_sqr_fx - - .text - .align 4 -cmplx_mag_sqr_fx: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end -start: - // deinterleave real/imag - vld20.32 {qr, qi}, [in] - vld21.32 {qr, qi}, [in]! - // square real/imag - vmulh.s32 qtmp, qr, qr - vmulh.s32 qout, qi, qi - // accumulate & halving - vhadd.s32 qout, qout, qtmp - vstrw.32 qout, [out], #16 - le lr, start -end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr diff --git a/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s b/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s deleted file mode 100644 index b2acc6e..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll1.s +++ /dev/null @@ -1,69 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx_opt_M55_unroll1, %function - .global cmplx_mag_sqr_fx_opt_M55_unroll1 - - .text - .align 4 -cmplx_mag_sqr_fx_opt_M55_unroll1: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end - vld20.32 {q4,q5}, [r1] // *. - // gap // .. - vld21.32 {q4,q5}, [r1]! // .* - - // original source code - // vld20.32 {q4,q5}, [r1] // *. - // vld21.32 {q4,q5}, [r1]! // .* - - sub lr, lr, #1 -.p2align 2 -start: - vmulh.s32 q3, q5, q5 // ...*.. - // gap // ...... - vmulh.s32 q1, q4, q4 // ..*... - vld20.32 {q4,q5}, [r1] // e..... 
- vhadd.s32 q1, q3, q1 // ....*. - vld21.32 {q4,q5}, [r1]! // .e.... - // gap // ...... - vstrw.u32 q1, [r0] , #16 // .....* - - // original source code - // vld20.32 {q0,q1}, [r1] // e......... - // vld21.32 {q0,q1}, [r1]! // ..e....... - // vmulh.s32 q2, q0, q0 // .....*.... - // vmulh.s32 q3, q1, q1 // ....*..... - // vhadd.s32 q3, q3, q2 // .......*.. - // vstrw.u32 q3, [r0] , #16 // .........* - - le lr, start - vmulh.s32 q1, q5, q5 // *... - // gap // .... - vmulh.s32 q5, q4, q4 // .*.. - // gap // .... - vhadd.s32 q5, q1, q5 // ..*. - vstrw.u32 q5, [r0] , #16 // ...* - - // original source code - // vmulh.s32 q3, q5, q5 // *... - // vmulh.s32 q1, q4, q4 // .*.. - // vhadd.s32 q1, q3, q1 // ..*. - // vstrw.u32 q1, [r0] , #16 // ...* - -end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr \ No newline at end of file diff --git a/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s b/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s deleted file mode 100644 index 41d4cca..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll2.s +++ /dev/null @@ -1,94 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx_opt_M55_unroll2, %function - .global cmplx_mag_sqr_fx_opt_M55_unroll2 - - .text - .align 4 -cmplx_mag_sqr_fx_opt_M55_unroll2: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end - vld20.32 {q2,q3}, [r1] // *...... - // gap // ....... - vld21.32 {q2,q3}, [r1]! // .*..... - // gap // ....... - vld20.32 {q4,q5}, [r1] // ..*.... - vmulh.s32 q0, q2, q2 // ...*... - vld21.32 {q4,q5}, [r1]! // ....*.. - // gap // ....... - vmulh.s32 q7, q4, q4 // .....*. - // gap // ....... - vmulh.s32 q4, q3, q3 // ......* - - // original source code - // vld20.32 {q2,q3}, [r1] // *...... - // vld21.32 {q2,q3}, [r1]! // .*..... - // vld20.32 {q4,q5}, [r1] // ..*.... - // vmulh.s32 q0, q2, q2 // ...*... - // vld21.32 {q4,q5}, [r1]! // ....*.. 
- // vmulh.s32 q7, q4, q4 // .....*. - // vmulh.s32 q4, q3, q3 // ......* - - lsr lr, lr, #1 - sub lr, lr, #1 -.p2align 2 -start: - vld20.32 {q2,q3}, [r1] // e........... - vmulh.s32 q6, q5, q5 // .........*.. - vld21.32 {q2,q3}, [r1]! // .e.......... - vhadd.s32 q1, q4, q0 // ....*....... - vld20.32 {q4,q5}, [r1] // ......e..... - vmulh.s32 q0, q2, q2 // ..e......... - vld21.32 {q4,q5}, [r1]! // .......e.... - vhadd.s32 q2, q6, q7 // ..........*. - vstrw.u32 q1, [r0] , #16 // .....*...... - vmulh.s32 q7, q4, q4 // ........e... - vstrw.u32 q2, [r0] , #16 // ...........* - vmulh.s32 q4, q3, q3 // ...e........ - // gap // ............ - - // original source code - // vld20.32 {q0,q1}, [r1] // e...................... - // vld21.32 {q0,q1}, [r1]! // ..e.................... - // vmulh.s32 q2, q0, q0 // .....e................. - // vmulh.s32 q3, q1, q1 // ...........e........... - // vhadd.s32 q3, q3, q2 // ...............*....... - // vstrw.u32 q3, [r0] , #16 // ....................*.. - // vld20.32 {q0,q1}, [r1] // ....e.................. - // vld21.32 {q0,q1}, [r1]! // ......e................ - // vmulh.s32 q2, q0, q0 // .........e............. - // vmulh.s32 q3, q1, q1 // .............*......... - // vhadd.s32 q3, q3, q2 // ...................*... - // vstrw.u32 q3, [r0] , #16 // ......................* - - le lr, start - vhadd.s32 q6, q4, q0 // .*... - vmulh.s32 q2, q5, q5 // *.... - vstrw.u32 q6, [r0] , #16 // ...*. - vhadd.s32 q6, q2, q7 // ..*.. - vstrw.u32 q6, [r0] , #16 // ....* - - // original source code - // vmulh.s32 q6, q5, q5 // .*... - // vhadd.s32 q1, q4, q0 // *.... - // vhadd.s32 q2, q6, q7 // ...*. - // vstrw.u32 q1, [r0] , #16 // ..*.. 
- // vstrw.u32 q2, [r0] , #16 // ....* - -end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr \ No newline at end of file diff --git a/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s b/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s deleted file mode 100644 index d5fc37f..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx_opt_M55_unroll4.s +++ /dev/null @@ -1,140 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx_opt_M55_unroll4, %function - .global cmplx_mag_sqr_fx_opt_M55_unroll4 - - .text - .align 4 -cmplx_mag_sqr_fx_opt_M55_unroll4: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end - vld20.32 {q2,q3}, [r1] // *.............. - // gap // ............... - vld21.32 {q2,q3}, [r1]! // .*............. - // gap // ............... - vld20.32 {q0,q1}, [r1] // ..*............ - vmulh.s32 q2, q2, q2 // ...*........... - vld20.32 {q5,q6}, [r1] // ....*.......... - vmulh.s32 q4, q3, q3 // .....*......... - vld21.32 {q5,q6}, [r1]! // ......*........ - vhadd.s32 q3, q4, q2 // .......*....... - vmulh.s32 q2, q5, q5 // ...........*... - vld20.32 {q4,q5}, [r1] // ........*...... - vmulh.s32 q7, q6, q6 // .........*..... - vld21.32 {q4,q5}, [r1]! // ..........*.... - vhadd.s32 q6, q7, q2 // ..............* - vld21.32 {q0,q1}, [r1]! // ............*.. - vmulh.s32 q5, q5, q5 // .............*. - - // original source code - // vld20.32 {q4,q5}, [r1] // *.............. - // vld21.32 {q4,q5}, [r1]! // .*............. - // vld20.32 {q0,q1}, [r1] // ..*............ - // vmulh.s32 q3, q4, q4 // ...*........... - // vld20.32 {q6,q7}, [r1] // ....*.......... - // vmulh.s32 q4, q5, q5 // .....*......... - // vld21.32 {q6,q7}, [r1]! // ......*........ - // vhadd.s32 q3, q4, q3 // .......*....... - // vld20.32 {q4,q5}, [r1] // .........*..... - // vmulh.s32 q7, q7, q7 // ..........*.... - // vld21.32 {q4,q5}, [r1]! // ...........*... - // vmulh.s32 q6, q6, q6 // ........*...... 
- // vld21.32 {q0,q1}, [r1]! // .............*. - // vmulh.s32 q5, q5, q5 // ..............* - // vhadd.s32 q6, q7, q6 // ............*.. - - lsr lr, lr, #2 - sub lr, lr, #1 -.p2align 2 -start: - vstrw.u32 q3, [r0] , #16 // .....*.................. - vmulh.s32 q4, q4, q4 // ..............*......... - vstrw.u32 q6, [r0] , #16 // ...........*............ - vhadd.s32 q7, q5, q4 // ................*....... - vstrw.u32 q7, [r0] , #16 // .................*...... - // gap // ........................ - vmulh.s32 q6, q0, q0 // ....................*... - vld20.32 {q4,q5}, [r1] // e....................... - vmulh.s32 q2, q1, q1 // .....................*.. - vld21.32 {q4,q5}, [r1]! // .e...................... - vhadd.s32 q2, q2, q6 // ......................*. - vld20.32 {q0,q1}, [r1] // ..................e..... - vmulh.s32 q3, q4, q4 // ..e..................... - vld20.32 {q6,q7}, [r1] // ......e................. - vmulh.s32 q4, q5, q5 // ...e.................... - vld21.32 {q6,q7}, [r1]! // .......e................ - vhadd.s32 q3, q4, q3 // ....e................... - vld20.32 {q4,q5}, [r1] // ............e........... - vmulh.s32 q7, q7, q7 // .........e.............. - vld21.32 {q4,q5}, [r1]! // .............e.......... - vmulh.s32 q6, q6, q6 // ........e............... - vld21.32 {q0,q1}, [r1]! // ...................e.... - vmulh.s32 q5, q5, q5 // ...............e........ - vstrw.u32 q2, [r0] , #16 // .......................* - vhadd.s32 q6, q7, q6 // ..........e............. - - // original source code - // vld20.32 {q0,q1}, [r1] // e........................................ - // vld21.32 {q0,q1}, [r1]! // ..e...................................... - // vmulh.s32 q2, q0, q0 // .....e................................... - // vmulh.s32 q3, q1, q1 // .......e................................. - // vhadd.s32 q3, q3, q2 // .........e............................... - // vstrw.u32 q3, [r0] , #16 // ..................*...................... 
- // vld20.32 {q0,q1}, [r1] // ......e.................................. - // vld21.32 {q0,q1}, [r1]! // ........e................................ - // vmulh.s32 q2, q0, q0 // .............e........................... - // vmulh.s32 q3, q1, q1 // ...........e............................. - // vhadd.s32 q3, q3, q2 // .................e....................... - // vstrw.u32 q3, [r0] , #16 // ....................*.................... - // vld20.32 {q0,q1}, [r1] // ..........e.............................. - // vld21.32 {q0,q1}, [r1]! // ............e............................ - // vmulh.s32 q2, q0, q0 // ...................*..................... - // vmulh.s32 q3, q1, q1 // ...............e......................... - // vhadd.s32 q3, q3, q2 // .....................*................... - // vstrw.u32 q3, [r0] , #16 // ......................*.................. - // vld20.32 {q0,q1}, [r1] // ....e.................................... - // vld21.32 {q0,q1}, [r1]! // ..............e.......................... - // vmulh.s32 q2, q0, q0 // .......................*................. - // vmulh.s32 q3, q1, q1 // .........................*............... - // vhadd.s32 q3, q3, q2 // ...........................*............. - // vstrw.u32 q3, [r0] , #16 // ........................................* - - le lr, start - vstrw.u32 q3, [r0] , #16 // *........ - vmulh.s32 q2, q4, q4 // .*....... - vstrw.u32 q6, [r0] , #16 // ..*...... - vmulh.s32 q4, q0, q0 // .....*... - vhadd.s32 q0, q5, q2 // ...*..... - vmulh.s32 q6, q1, q1 // ......*.. - vstrw.u32 q0, [r0] , #16 // ....*.... - vhadd.s32 q6, q6, q4 // .......*. - vstrw.u32 q6, [r0] , #16 // ........* - - // original source code - // vstrw.u32 q3, [r0] , #16 // *........ - // vmulh.s32 q4, q4, q4 // .*....... - // vstrw.u32 q6, [r0] , #16 // ..*...... - // vhadd.s32 q7, q5, q4 // ....*.... - // vstrw.u32 q7, [r0] , #16 // ......*.. - // vmulh.s32 q6, q0, q0 // ...*..... - // vmulh.s32 q2, q1, q1 // .....*... 
- // vhadd.s32 q2, q2, q6 // .......*. - // vstrw.u32 q2, [r0] , #16 // ........* - -end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr \ No newline at end of file diff --git a/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s b/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s deleted file mode 100644 index 4a051e5..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll1.s +++ /dev/null @@ -1,70 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx_opt_M85_unroll1, %function - .global cmplx_mag_sqr_fx_opt_M85_unroll1 - - .text - .align 4 -cmplx_mag_sqr_fx_opt_M85_unroll1: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end - vld20.32 {q2,q3}, [r1] // *. - // gap // .. - vld21.32 {q2,q3}, [r1]! // .* - - // original source code - // vld20.32 {q2,q3}, [r1] // *. - // vld21.32 {q2,q3}, [r1]! // .* - - sub lr, lr, #1 -.p2align 2 -start: - vmulh.s32 q7, q3, q3 // ...*.. - // gap // ...... - vmulh.s32 q1, q2, q2 // ..*... - vld20.32 {q2,q3}, [r1] // e..... - // gap // ...... - vld21.32 {q2,q3}, [r1]! // .e.... - vhadd.s32 q0, q7, q1 // ....*. - vstrw.u32 q0, [r0] , #16 // .....* - // gap // ...... - - // original source code - // vld20.32 {q0,q1}, [r1] // e......... - // vld21.32 {q0,q1}, [r1]! // .e........ - // vmulh.s32 q2, q0, q0 // .....*.... - // vmulh.s32 q3, q1, q1 // ....*..... - // vhadd.s32 q3, q3, q2 // ........*. - // vstrw.u32 q3, [r0] , #16 // .........* - - le lr, start - vmulh.s32 q3, q3, q3 // *... - // gap // .... - vmulh.s32 q1, q2, q2 // .*.. - // gap // .... - vhadd.s32 q1, q3, q1 // ..*. - vstrw.u32 q1, [r0] , #16 // ...* - - // original source code - // vmulh.s32 q7, q3, q3 // *... - // vmulh.s32 q1, q2, q2 // .*.. - // vhadd.s32 q0, q7, q1 // ..*. 
- // vstrw.u32 q0, [r0] , #16 // ...* - -end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr \ No newline at end of file diff --git a/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s b/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s deleted file mode 100644 index 2dc9336..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll2.s +++ /dev/null @@ -1,93 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx_opt_M85_unroll2, %function - .global cmplx_mag_sqr_fx_opt_M85_unroll2 - - .text - .align 4 -cmplx_mag_sqr_fx_opt_M85_unroll2: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end - vld20.32 {q3,q4}, [r1] // *.... - // gap // ..... - vld21.32 {q3,q4}, [r1]! // .*... - // gap // ..... - vld20.32 {q6,q7}, [r1] // ..*.. - // gap // ..... - vmulh.s32 q5, q3, q3 // ...*. - vld21.32 {q6,q7}, [r1]! // ....* - - // original source code - // vld20.32 {q3,q4}, [r1] // *.... - // vld21.32 {q3,q4}, [r1]! // .*... - // vld20.32 {q6,q7}, [r1] // ..*.. - // vmulh.s32 q5, q3, q3 // ...*. - // vld21.32 {q6,q7}, [r1]! // ....* - - lsr lr, lr, #1 - sub lr, lr, #1 -.p2align 2 -start: - vmulh.s32 q1, q4, q4 // ...*........ - vld20.32 {q3,q4}, [r1] // e........... - vhadd.s32 q5, q1, q5 // ....*....... - vld21.32 {q3,q4}, [r1]! // .e.......... - vmulh.s32 q2, q7, q7 // .........*.. - vstrw.u32 q5, [r0] , #16 // .....*...... - vmulh.s32 q1, q6, q6 // ........*... - vld20.32 {q6,q7}, [r1] // ......e..... - vmulh.s32 q5, q3, q3 // ..e......... - vhadd.s32 q0, q2, q1 // ..........*. - vstrw.u32 q0, [r0] , #16 // ...........* - vld21.32 {q6,q7}, [r1]! // .......e.... - - // original source code - // vld20.32 {q0,q1}, [r1] // e..................... - // vld21.32 {q0,q1}, [r1]! // ..e................... - // vmulh.s32 q2, q0, q0 // .......e.............. - // vmulh.s32 q3, q1, q1 // ...........*.......... - // vhadd.s32 q3, q3, q2 // .............*........ 
- // vstrw.u32 q3, [r0] , #16 // ................*..... - // vld20.32 {q0,q1}, [r1] // ......e............... - // vld21.32 {q0,q1}, [r1]! // ..........e........... - // vmulh.s32 q2, q0, q0 // .................*.... - // vmulh.s32 q3, q1, q1 // ...............*...... - // vhadd.s32 q3, q3, q2 // ....................*. - // vstrw.u32 q3, [r0] , #16 // .....................* - - le lr, start - vmulh.s32 q0, q4, q4 // *...... - // gap // ....... - vmulh.s32 q1, q6, q6 // ....*.. - vhadd.s32 q5, q0, q5 // .*..... - vmulh.s32 q0, q7, q7 // ..*.... - vstrw.u32 q5, [r0] , #16 // ...*... - vhadd.s32 q1, q0, q1 // .....*. - vstrw.u32 q1, [r0] , #16 // ......* - - // original source code - // vmulh.s32 q1, q4, q4 // *...... - // vhadd.s32 q5, q1, q5 // ..*.... - // vmulh.s32 q2, q7, q7 // ...*... - // vstrw.u32 q5, [r0] , #16 // ....*.. - // vmulh.s32 q1, q6, q6 // .*..... - // vhadd.s32 q0, q2, q1 // .....*. - // vstrw.u32 q0, [r0] , #16 // ......* - -end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr \ No newline at end of file diff --git a/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s b/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s deleted file mode 100644 index da0416f..0000000 --- a/tests/sqmag/cmplx_mag_sqr_fx_opt_M85_unroll4.s +++ /dev/null @@ -1,141 +0,0 @@ - .syntax unified - .type cmplx_mag_sqr_fx_opt_M85_unroll4, %function - .global cmplx_mag_sqr_fx_opt_M85_unroll4 - - .text - .align 4 -cmplx_mag_sqr_fx_opt_M85_unroll4: - push {r4-r12,lr} - vpush {d0-d15} - - out .req r0 - in .req r1 - sz .req r2 - - qr .req q0 - qi .req q1 - qtmp .req q2 - qout .req q3 - - lsr lr, sz, #2 - wls lr, lr, end - vld20.32 {q6,q7}, [r1] // *........... - // gap // ............ - vld21.32 {q6,q7}, [r1]! // .*.......... - // gap // ............ - vld20.32 {q2,q3}, [r1] // ..*......... - // gap // ............ - vld21.32 {q2,q3}, [r1]! // ...*........ - vmulh.s32 q4, q6, q6 // ....*....... - vld20.32 {q0,q1}, [r1] // .....*...... - // gap // ............ - vld21.32 {q0,q1}, [r1]! 
// .......*.... - vmulh.s32 q3, q3, q3 // ........*... - vld20.32 {q5,q6}, [r1] // .........*.. - vmulh.s32 q2, q2, q2 // ......*..... - vld21.32 {q5,q6}, [r1]! // ...........* - vmulh.s32 q0, q0, q0 // ..........*. - - // original source code - // vld20.32 {q6,q7}, [r1] // *........... - // vld21.32 {q6,q7}, [r1]! // .*.......... - // vld20.32 {q2,q3}, [r1] // ..*......... - // vld21.32 {q2,q3}, [r1]! // ...*........ - // vmulh.s32 q4, q6, q6 // ....*....... - // vld20.32 {q0,q1}, [r1] // .....*...... - // vmulh.s32 q2, q2, q2 // .........*.. - // vld21.32 {q0,q1}, [r1]! // ......*..... - // vmulh.s32 q3, q3, q3 // .......*.... - // vld20.32 {q5,q6}, [r1] // ........*... - // vmulh.s32 q0, q0, q0 // ...........* - // vld21.32 {q5,q6}, [r1]! // ..........*. - - lsr lr, lr, #2 - sub lr, lr, #1 -.p2align 2 -start: - vmulh.s32 q7, q7, q7 // ...*.................... - vhadd.s32 q2, q3, q2 // ..........*............. - vmulh.s32 q1, q1, q1 // ...............*........ - vhadd.s32 q4, q7, q4 // ....*................... - vstrw.u32 q4, [r0] , #16 // .....*.................. - vmulh.s32 q4, q6, q6 // .....................*.. - vstrw.u32 q2, [r0] , #16 // ...........*............ - vld20.32 {q6,q7}, [r1] // e....................... - vmulh.s32 q3, q5, q5 // ....................*... - vld21.32 {q6,q7}, [r1]! // .e...................... - vhadd.s32 q5, q4, q3 // ......................*. - vld20.32 {q2,q3}, [r1] // ......e................. - vhadd.s32 q1, q1, q0 // ................*....... - vld21.32 {q2,q3}, [r1]! // .......e................ - vmulh.s32 q4, q6, q6 // ..e..................... - vstrw.u32 q1, [r0] , #16 // .................*...... - vld20.32 {q0,q1}, [r1] // ............e........... - vmulh.s32 q2, q2, q2 // ........e............... - vld21.32 {q0,q1}, [r1]! // .............e.......... - vmulh.s32 q3, q3, q3 // .........e.............. - vstrw.u32 q5, [r0] , #16 // .......................* - vld20.32 {q5,q6}, [r1] // ..................e..... 
- vmulh.s32 q0, q0, q0 // ..............e......... - vld21.32 {q5,q6}, [r1]! // ...................e.... - - // original source code - // vld20.32 {q0,q1}, [r1] // e..................................... - // vld21.32 {q0,q1}, [r1]! // ..e................................... - // vmulh.s32 q2, q0, q0 // .......e.............................. - // vmulh.s32 q3, q1, q1 // .................*.................... - // vhadd.s32 q3, q3, q2 // ....................*................. - // vstrw.u32 q3, [r0] , #16 // .....................*................ - // vld20.32 {q0,q1}, [r1] // ....e................................. - // vld21.32 {q0,q1}, [r1]! // ......e............................... - // vmulh.s32 q2, q0, q0 // ..........e........................... - // vmulh.s32 q3, q1, q1 // ............e......................... - // vhadd.s32 q3, q3, q2 // ..................*................... - // vstrw.u32 q3, [r0] , #16 // .......................*.............. - // vld20.32 {q0,q1}, [r1] // .........e............................ - // vld21.32 {q0,q1}, [r1]! // ...........e.......................... - // vmulh.s32 q2, q0, q0 // ...............e...................... - // vmulh.s32 q3, q1, q1 // ...................*.................. - // vhadd.s32 q3, q3, q2 // .............................*........ - // vstrw.u32 q3, [r0] , #16 // ................................*..... - // vld20.32 {q0,q1}, [r1] // ..............e....................... - // vld21.32 {q0,q1}, [r1]! // ................e..................... - // vmulh.s32 q2, q0, q0 // .........................*............ - // vmulh.s32 q3, q1, q1 // ......................*............... - // vhadd.s32 q3, q3, q2 // ...........................*.......... - // vstrw.u32 q3, [r0] , #16 // .....................................* - - le lr, start - vmulh.s32 q7, q7, q7 // *........... - vhadd.s32 q2, q3, q2 // .*.......... - vmulh.s32 q1, q1, q1 // ..*......... - vhadd.s32 q3, q7, q4 // ...*........ 
- vstrw.u32 q3, [r0] , #16 // ....*.......
- vhadd.s32 q3, q1, q0 // .........*..
- vmulh.s32 q5, q5, q5 // .......*....
- vstrw.u32 q2, [r0] , #16 // ......*.....
- vmulh.s32 q6, q6, q6 // .....*......
- vstrw.u32 q3, [r0] , #16 // ..........*.
- vhadd.s32 q6, q6, q5 // ........*...
- vstrw.u32 q6, [r0] , #16 // ...........*
-
- // original source code
- // vmulh.s32 q7, q7, q7 // *...........
- // vhadd.s32 q2, q3, q2 // .*..........
- // vmulh.s32 q1, q1, q1 // ..*.........
- // vhadd.s32 q4, q7, q4 // ...*........
- // vstrw.u32 q4, [r0] , #16 // ....*.......
- // vmulh.s32 q4, q6, q6 // ........*...
- // vstrw.u32 q2, [r0] , #16 // .......*....
- // vmulh.s32 q3, q5, q5 // ......*.....
- // vhadd.s32 q5, q4, q3 // ..........*.
- // vhadd.s32 q1, q1, q0 // .....*......
- // vstrw.u32 q1, [r0] , #16 // .........*..
- // vstrw.u32 q5, [r0] , #16 // ...........*
-
-end:
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
\ No newline at end of file
diff --git a/tests/sqmag/sqmag.mk b/tests/sqmag/sqmag.mk
index 42f81f8..a54a292 100644
--- a/tests/sqmag/sqmag.mk
+++ b/tests/sqmag/sqmag.mk
@@ -11,10 +11,11 @@ SQMAG_PLATFORMS += m85-an555
 SQMAG_SOURCES += main.c
 
 # Assembly sources required for this test
-SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll1.s
-SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll2.s
-SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M55_unroll4.s
-SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M85_unroll1.s
-SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M85_unroll2.s
-SQMAG_ASMS += cmplx_mag_sqr_fx_opt_M85_unroll4.s
-SQMAG_ASMS += cmplx_mag_sqr_fx.s
\ No newline at end of file
+SQMAG_ASM_DIR = ../../asm/manual/sqmag
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx_opt_M55_unroll1.s
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx_opt_M55_unroll2.s
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx_opt_M55_unroll4.s
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx_opt_M85_unroll1.s
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx_opt_M85_unroll2.s
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx_opt_M85_unroll4.s
+SQMAG_ASMS += $(SQMAG_ASM_DIR)/cmplx_mag_sqr_fx.s
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_mul_192_toom3_mve.s b/tests/toom/auto/poly_u16_mul_192_toom3_mve.s
deleted file mode 100644
index 7835385..0000000
--- a/tests/toom/auto/poly_u16_mul_192_toom3_mve.s
+++ /dev/null
@@ -1,708 +0,0 @@
-.syntax unified
-.type poly_u16_mul_64_C, %function
-.global poly_u16_mul_64_C
-.syntax unified
-.type poly_u16_mul_192_toom3_mve, %function
-.global poly_u16_mul_192_toom3_mve
-poly_u16_mul_192_toom3_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-sub sp, sp, #1280
-add sp, sp, #504
-add r14, sp, #1008
-add r1, r1, #504
-add r2, r2, #504
-mov r12, #-2
-mov r11, #3
-vldrw.u32 Q0, [r1, #(4 * -126)]
-vldrw.u32 Q1, [r1, #(4 * -62)]
-vadd.u16 Q2, Q0, Q1
-vldrw.u32 Q3, [r1, #(4 * -94)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [sp,#(-504)]
-vmla.s16 Q2, Q3, r12
-vstrw.u32 Q2, [sp,#(-248)]
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r1, #(4 * -58)]
-vmla.s16 Q2, Q1, r11
-vstrw.u32 Q2, [sp,#(8)]
-vldrw.u32 Q0, [r1, #(4 * -122)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r1, #(4 * -90)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [sp,#(-488)]
-vmla.s16 Q1, Q2, r12
-vstrw.u32 Q1, [sp,#(-232)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r1, #(4 * -54)]
-vmla.s16 Q1, Q4, r11
-vstrw.u32 Q1, [sp,#(24)]
-vldrw.u32 Q0, [r1, #(4 * -118)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r1, #(4 * -86)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [sp,#(-472)]
-vmla.s16 Q1, Q2, r12
-vstrw.u32 Q1, [sp,#(-216)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r1, #(4 * -50)]
-vmla.s16 Q1, Q3, r11
-vstrw.u32 Q1, [sp,#(40)]
-vldrw.u32 Q0, [r1, #(4 * -114)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r1, #(4 * -82)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [sp,#(-456)]
-vmla.s16 Q1, Q2, r12
-vstrw.u32 Q1, [sp,#(-200)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r1, #(4 * -46)]
-vmla.s16 Q1, Q4, r11
-vstrw.u32 Q1, [sp,#(56)]
-vldrw.u32 Q0, [r1, #(4 * -110)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r1, #(4 * -78)]
-vadd.u16
Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-440)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-184)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r1, #(4 * -42)] -vmla.s16 Q1, Q3, r11 -vstrw.u32 Q1, [sp,#(72)] -vldrw.u32 Q0, [r1, #(4 * -106)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * -74)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-424)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-168)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r1, #(4 * -38)] -vmla.s16 Q1, Q4, r11 -vstrw.u32 Q1, [sp,#(88)] -vldrw.u32 Q0, [r1, #(4 * -102)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * -70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-408)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-152)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r1, #(4 * -34)] -vmla.s16 Q1, Q3, r11 -vstrw.u32 Q1, [sp,#(104)] -vldrw.u32 Q0, [r1, #(4 * -98)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * -66)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-392)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-136)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r2, #(4 * -62)] -vmla.s16 Q1, Q4, r11 -vstrw.u32 Q1, [sp,#(120)] -vldrw.u32 Q0, [r2, #(4 * -126)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * -94)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-376)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-120)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r2, #(4 * -58)] -vmla.s16 Q1, Q3, r11 -vstrw.u32 Q1, [sp,#(136)] -vldrw.u32 Q0, [r2, #(4 * -122)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * -90)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-360)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-104)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r2, #(4 * -54)] -vmla.s16 Q1, Q4, r11 -vstrw.u32 Q1, [sp,#(152)] -vldrw.u32 Q0, [r2, #(4 * -118)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * -86)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-344)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-88)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r2, #(4 * -50)] -vmla.s16 Q1, Q3, r11 -vstrw.u32 Q1, [sp,#(168)] -vldrw.u32 Q0, [r2, #(4 * -114)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * -82)] 
-vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-328)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-72)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r2, #(4 * -46)] -vmla.s16 Q1, Q4, r11 -vstrw.u32 Q1, [sp,#(184)] -vldrw.u32 Q0, [r2, #(4 * -110)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * -78)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-312)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-56)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r2, #(4 * -42)] -vmla.s16 Q1, Q3, r11 -vstrw.u32 Q1, [sp,#(200)] -vldrw.u32 Q0, [r2, #(4 * -106)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * -74)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-296)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-40)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r2, #(4 * -38)] -vmla.s16 Q1, Q4, r11 -vstrw.u32 Q1, [sp,#(216)] -vldrw.u32 Q0, [r2, #(4 * -102)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * -70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-280)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-24)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r2, #(4 * -34)] -vmla.s16 Q1, Q3, r11 -vstrw.u32 Q1, [sp,#(232)] -vldrw.u32 Q0, [r2, #(4 * -98)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * -66)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-264)] -vmla.s16 Q1, Q2, r12 -vstrw.u32 Q1, [sp,#(-8)] -vsub.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q4, r11 -vstrw.u32 Q1, [sp,#(248)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #256 -add r10, r2, #256 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(512) -add r2, sp, #(640) -add r0, sp, #(768) -bl poly_u16_mul_64_C -add r1, sp, #(256) -add r2, sp, #(384) -add r0, sp, #(512) -bl poly_u16_mul_64_C -add r1, sp, #(0) -add r2, sp, #(128) -add r0, sp, #(256) -bl poly_u16_mul_64_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_64_C -add r1, r11, #(128) -add r2, r10, #(128) -add r0, sp, #(1024) -bl poly_u16_mul_64_C -add sp, sp, #504 -add r14, sp, #1008 -mov r12, #43691 -mov r11, #2 -mov r10, #-1 -vldrw.u32 Q0, [sp, #(4 * -62)] -vldrw.u32 Q1, [sp, #(4 * 66)] -vsub.u16 
Q1, Q1, Q0 -vmul.u16 Q1, Q1, r12 -vldrw.u32 Q2, [sp, #(4 * 2)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q3, [sp, #(4 * -126)] -vmla.s16 Q2, Q3, r10 -vshr.u16 Q0, Q0, #1 -vldrw.u32 Q4, [sp, #(4 * 70)] -vsub.u16 Q1, Q2, Q1 -vldrw.u32 Q5, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vldrw.u32 Q6, [sp, #(4 * -58)] -vshr.u16 Q1, Q1, #1 -vmla.s16 Q1, Q5, r11 -vstrw.u32 Q1, [sp,#(264)] -vsub.u16 Q0, Q0, Q1 -vstrw.u32 Q0, [sp,#(-248)] -vsub.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [sp,#(8)] -vsub.u16 Q4, Q4, Q6 -vmul.u16 Q4, Q4, r12 -vldrw.u32 Q0, [sp, #(4 * 6)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -122)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 74)] -vsub.u16 Q4, Q0, Q4 -vldrw.u32 Q3, [r14, #(4 * -118)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -54)] -vshr.u16 Q4, Q4, #1 -vmla.s16 Q4, Q3, r11 -vstrw.u32 Q4, [sp,#(280)] -vsub.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-232)] -vsub.u16 Q0, Q0, Q3 -vstrw.u32 Q0, [sp,#(24)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 10)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -118)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 78)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -114)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -50)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r11 -vstrw.u32 Q2, [sp,#(296)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [sp,#(-216)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(40)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [sp, #(4 * 14)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -114)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 82)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r11 -vstrw.u32 Q3, [sp,#(312)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [sp,#(-200)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(56)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 18)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 
Q1, [sp, #(4 * -110)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 86)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -106)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -42)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r11 -vstrw.u32 Q2, [sp,#(328)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [sp,#(-184)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(72)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [sp, #(4 * 22)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -106)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 90)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -102)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r11 -vstrw.u32 Q3, [sp,#(344)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [sp,#(-168)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(88)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 26)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -102)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 94)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -98)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -34)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r11 -vstrw.u32 Q2, [sp,#(360)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [sp,#(-152)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(104)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [sp, #(4 * 30)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -98)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 98)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -94)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r11 -vstrw.u32 Q3, [sp,#(376)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [sp,#(-136)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(120)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 34)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -94)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 102)] 
-vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -90)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -26)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [sp, #(4 * -62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-248)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-488)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [sp, #(4 * 2)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(8)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 66)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(264)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [sp, #(4 * 38)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -90)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 106)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -86)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [sp, #(4 * -58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-232)] -vmla.s16 Q3, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-472)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [sp, #(4 * 6)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [sp,#(24)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 70)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(280)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 42)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -86)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 110)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -82)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -18)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [sp, #(4 * -54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-216)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-456)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [sp, #(4 * 10)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(40)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 74)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(296)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 
Q0, [sp, #(4 * 46)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -82)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 114)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -78)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [sp, #(4 * -50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q3, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-440)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [sp, #(4 * 14)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [sp,#(56)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 78)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(312)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 50)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -78)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 118)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -74)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -10)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [sp, #(4 * -46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-184)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-424)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [sp, #(4 * 18)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(72)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 82)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(328)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [sp, #(4 * 54)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -74)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [sp, #(4 * 122)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -70)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [sp, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [sp, #(4 * -42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-168)] -vmla.s16 Q3, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-408)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [sp, #(4 * 22)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [sp,#(88)] 
-vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(344)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [sp, #(4 * 58)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -70)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [sp, #(4 * 126)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -66)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [sp, #(4 * -2)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [sp, #(4 * -38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-152)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-392)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [sp, #(4 * 26)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(104)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [sp, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(360)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [sp, #(4 * 62)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -66)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q2, [r14, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q4, [sp, #(4 * -34)] -vadd.u16 Q4, Q4, Q1 -vstrw.u32 Q4, [sp,#(-136)] -vmla.s16 Q3, Q2, r11 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-376)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [sp, #(4 * 30)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [sp,#(120)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q1, [sp, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(376)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, 
[r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -add sp, sp, #1280 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_mul_256_toom4_mve.s b/tests/toom/auto/poly_u16_mul_256_toom4_mve.s deleted file mode 100644 index 5b5271e..0000000 --- a/tests/toom/auto/poly_u16_mul_256_toom4_mve.s +++ /dev/null @@ -1,1287 +0,0 @@ -.syntax unified -.type poly_u16_mul_64_C, %function -.global poly_u16_mul_64_C -.syntax unified -.type poly_u16_mul_256_toom4_mve, %function -.global poly_u16_mul_256_toom4_mve -poly_u16_mul_256_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #1792 -add sp, sp, #504 -add r14, sp, #1008 -add r1, r1, #504 -add r2, r2, #504 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -94)] -vldrw.u32 Q2, [r1, #(4 * -62)] -vldrw.u32 Q3, [r1, #(4 * -30)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(264)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -90)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * -58)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * -26)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-488)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-248)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(280)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -86)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * -54)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * -22)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-472)] -vadd.u16 Q6, Q1, Q3 
-vstrw.u32 Q5, [sp,#(-232)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(296)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -82)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * -50)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * -18)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-456)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-216)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(312)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -78)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * -46)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #(4 * -14)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-440)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-200)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(328)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -74)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * -42)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * -10)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-424)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-184)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(344)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -70)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * -38)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 
Q3, [r1, #(4 * -6)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-408)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-168)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(360)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r1, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -66)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * -34)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * -2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-392)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-152)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(376)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -94)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * -62)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * -30)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-376)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-136)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(392)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -90)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * -58)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * -26)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-360)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-120)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(408)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 
* -86)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * -54)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * -22)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-344)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-104)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(424)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -82)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * -50)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * -18)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-328)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-88)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(440)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -78)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * -46)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * -14)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-312)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(-72)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(456)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -74)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * -42)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * -10)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-296)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-56)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(472)] -vmla.s16 Q6, Q2, r10 
-vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -70)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * -38)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * -6)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-280)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-40)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(488)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -66)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * -34)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * -2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-264)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-24)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [sp,#(248)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(504)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-248)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-264)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-8)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #256 -add r10, r2, #256 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(1024) -add r2, sp, #(1152) -add r0, sp, #(1280) -bl poly_u16_mul_64_C -add r1, sp, #(768) -add r2, sp, #(896) -add r0, sp, #(1024) -bl poly_u16_mul_64_C -add r1, sp, #(512) -add r2, sp, #(640) -add r0, sp, #(768) -bl poly_u16_mul_64_C -add r1, sp, #(256) -add r2, sp, #(384) -add r0, sp, #(512) -bl poly_u16_mul_64_C -add r1, sp, #(0) -add r2, sp, #(128) -add r0, sp, #(256) -bl poly_u16_mul_64_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_64_C -add r1, r11, #(128) -add r2, r10, #(128) -add r0, sp, #(1536) -bl poly_u16_mul_64_C -add sp, sp, #504 -add r14, sp, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov 
r9, #43691 -mov r8, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vldrw.u32 Q1, [sp, #(4 * 2)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -62)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -122)] -vldrw.u32 Q4, [sp, #(4 * 66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [sp, #(4 * 70)] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [sp, #(4 * 6)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [sp,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [sp,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -118)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 74)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, 
Q0, r5 -vstrw.u32 Q0, [sp,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -114)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 78)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 82)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, 
Q1 -vldrw.u32 Q0, [sp, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 86)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 90)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -98)] -vsub.u16 Q3, 
Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 94)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [sp,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 98)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [sp,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [sp,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 38)] -vsub.u16 
Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -62)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-248)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 102)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 66)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -58)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 2)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(8)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 6)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -58)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-232)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 106)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 70)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 42)] 
-vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 6)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(24)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 10)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -54)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-216)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 110)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 74)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -114)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 10)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(40)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 14)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 
-vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -50)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 114)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 14)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(56)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 18)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -46)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-184)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 118)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q3 
-vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 18)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(72)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 22)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -42)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [sp,#(-168)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [sp, #(4 * 122)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 22)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(88)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 26)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -38)] -vadd.u16 
Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-152)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [sp, #(4 * 126)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [sp, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * 62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q2, [sp, #(4 * 26)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(104)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 30)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #(4 * -34)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(-136)] -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q3, [sp, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r14,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vldrw.u32 Q1, [sp, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(120)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 34)] -vadd.u16 Q0, 
Q0, Q6 -vstrw.u32 Q0, [r14,#(136)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, 
[r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #128 -add sp, sp, #1792 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_mul_512_toom4_mve.s b/tests/toom/auto/poly_u16_mul_512_toom4_mve.s deleted file mode 100644 index 87219a7..0000000 --- a/tests/toom/auto/poly_u16_mul_512_toom4_mve.s +++ /dev/null @@ -1,2501 +0,0 @@ -.syntax unified -.type poly_u16_mul_128_C, %function -.global poly_u16_mul_128_C -.syntax unified -.type 
poly_u16_mul_512_toom4_mve, %function -.global poly_u16_mul_512_toom4_mve -poly_u16_mul_512_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #3584 -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r1, r1, #504 -add r10, r1, #1008 -add r2, r2, #504 -add r9, r2, #1008 -mov r8, #1 -mov r7, #2 -mov r6, #3 -mov r5, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -62)] -vldrw.u32 Q2, [r1, #(4 * 2)] -vldrw.u32 Q3, [r1, #(4 * 66)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-488)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(24)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -58)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 6)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * 70)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-472)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(8)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-472)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(40)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -54)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 10)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * 74)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-456)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-456)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(56)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -50)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 14)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * 78)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, 
[r12,#(-440)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(40)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -46)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 18)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #(4 * 82)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-424)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-424)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(88)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -42)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 22)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * 86)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-408)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(72)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-408)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(104)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -38)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 26)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * 90)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-392)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(120)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -34)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 30)] -vadd.u16 Q7, 
Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * 94)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(104)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -30)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 34)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #(4 * 98)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-360)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-360)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(152)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -90)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -26)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 38)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * 102)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-344)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(136)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-344)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -22)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 42)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * 106)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-328)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-328)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, 
#(4 * -18)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 46)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * 110)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(168)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -14)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 50)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r1, #(4 * 114)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-296)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-296)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 54)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * 118)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-280)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(200)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 58)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r1, #(4 * 122)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-264)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, 
Q3, r5 -vldrw.u32 Q0, [r1, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 62)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r1, #(4 * 126)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(232)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-248)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -62)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 2)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * 66)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-232)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -58)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 6)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * 70)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-216)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(264)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-216)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -54)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 10)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * 74)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r6 -vstrw.u32 Q5, [r14,#(-200)] -vmla.s16 Q5, Q6, r7 -vmla.s16 Q6, Q1, r6 
-vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q4, Q2, r6 -vmla.s16 Q4, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -50)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 14)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * 78)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-184)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(296)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r6 -vstrw.u32 Q7, [r14,#(-184)] -vmla.s16 Q7, Q4, r7 -vmla.s16 Q4, Q1, r6 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q5, Q2, r6 -vmla.s16 Q5, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -46)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 18)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r2, #(4 * 82)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-200)] -vmla.s16 Q4, Q0, r6 -vstrw.u32 Q6, [r14,#(-168)] -vmla.s16 Q6, Q5, r7 -vmla.s16 Q5, Q1, r6 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q7, Q2, r6 -vmla.s16 Q7, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -42)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 22)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r2, #(4 * 86)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-152)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r6 -vstrw.u32 Q4, [r14,#(-152)] -vmla.s16 Q4, Q7, r7 -vmla.s16 Q7, Q1, r6 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q6, Q2, r6 -vmla.s16 Q6, Q3, r5 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -38)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 26)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * 90)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 
Q7, Q0, r6
-vstrw.u32 Q5, [r14,#(-136)]
-vmla.s16 Q5, Q6, r7
-vmla.s16 Q6, Q1, r6
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(376)]
-vmla.s16 Q4, Q2, r6
-vmla.s16 Q4, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -98)]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r2, #(4 * -34)]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r2, #(4 * 30)]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r2, #(4 * 94)]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r12,#(-120)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [sp,#(360)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(-152)]
-vmla.s16 Q6, Q0, r6
-vstrw.u32 Q7, [r14,#(-120)]
-vmla.s16 Q7, Q4, r7
-vmla.s16 Q4, Q1, r6
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(392)]
-vmla.s16 Q5, Q2, r6
-vmla.s16 Q5, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -94)]
-vshl.u16 Q6, Q6, #1
-vldrw.u32 Q1, [r2, #(4 * -30)]
-vsub.u16 Q7, Q6, Q4
-vldrw.u32 Q2, [r2, #(4 * 34)]
-vadd.u16 Q6, Q6, Q4
-vldrw.u32 Q3, [r2, #(4 * 98)]
-vadd.u16 Q4, Q0, Q2
-vstrw.u32 Q5, [r12,#(-104)]
-vadd.u16 Q5, Q1, Q3
-vstrw.u32 Q6, [sp,#(376)]
-vsub.u16 Q6, Q4, Q5
-vstrw.u32 Q7, [sp,#(-136)]
-vmla.s16 Q4, Q0, r6
-vstrw.u32 Q6, [r14,#(-104)]
-vmla.s16 Q6, Q5, r7
-vmla.s16 Q5, Q1, r6
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(408)]
-vmla.s16 Q7, Q2, r6
-vmla.s16 Q7, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -90)]
-vshl.u16 Q4, Q4, #1
-vldrw.u32 Q1, [r2, #(4 * -26)]
-vsub.u16 Q6, Q4, Q5
-vldrw.u32 Q2, [r2, #(4 * 38)]
-vadd.u16 Q4, Q4, Q5
-vldrw.u32 Q3, [r2, #(4 * 102)]
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r12,#(-88)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [sp,#(392)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [sp,#(-120)]
-vmla.s16 Q5, Q0, r6
-vstrw.u32 Q4, [r14,#(-88)]
-vmla.s16 Q4, Q7, r7
-vmla.s16 Q7, Q1, r6
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(424)]
-vmla.s16 Q6, Q2, r6
-vmla.s16 Q6, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -86)]
-vshl.u16 Q5, Q5, #1
-vldrw.u32 Q1, [r2, #(4 * -22)]
-vsub.u16 Q4, Q5, Q7
-vldrw.u32 Q2, [r2, #(4 * 42)]
-vadd.u16 Q5, Q5, Q7
-vldrw.u32 Q3, [r2, #(4 * 106)]
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r12,#(-72)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [sp,#(408)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [sp,#(-104)]
-vmla.s16 Q7, Q0, r6
-vstrw.u32 Q5, [r14,#(-72)]
-vmla.s16 Q5, Q6, r7
-vmla.s16 Q6, Q1, r6
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(440)]
-vmla.s16 Q4, Q2, r6
-vmla.s16 Q4, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -82)]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r2, #(4 * -18)]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r2, #(4 * 46)]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r2, #(4 * 110)]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r12,#(-56)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [sp,#(424)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(-88)]
-vmla.s16 Q6, Q0, r6
-vstrw.u32 Q7, [r14,#(-56)]
-vmla.s16 Q7, Q4, r7
-vmla.s16 Q4, Q1, r6
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(456)]
-vmla.s16 Q5, Q2, r6
-vmla.s16 Q5, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -78)]
-vshl.u16 Q6, Q6, #1
-vldrw.u32 Q1, [r2, #(4 * -14)]
-vsub.u16 Q7, Q6, Q4
-vldrw.u32 Q2, [r2, #(4 * 50)]
-vadd.u16 Q6, Q6, Q4
-vldrw.u32 Q3, [r2, #(4 * 114)]
-vadd.u16 Q4, Q0, Q2
-vstrw.u32 Q5, [r12,#(-40)]
-vadd.u16 Q5, Q1, Q3
-vstrw.u32 Q6, [sp,#(440)]
-vsub.u16 Q6, Q4, Q5
-vstrw.u32 Q7, [sp,#(-72)]
-vmla.s16 Q4, Q0, r6
-vstrw.u32 Q6, [r14,#(-40)]
-vmla.s16 Q6, Q5, r7
-vmla.s16 Q5, Q1, r6
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(472)]
-vmla.s16 Q7, Q2, r6
-vmla.s16 Q7, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -74)]
-vshl.u16 Q4, Q4, #1
-vldrw.u32 Q1, [r2, #(4 * -10)]
-vsub.u16 Q6, Q4, Q5
-vldrw.u32 Q2, [r2, #(4 * 54)]
-vadd.u16 Q4, Q4, Q5
-vldrw.u32 Q3, [r2, #(4 * 118)]
-vadd.u16 Q5, Q0, Q2
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q7, Q1, Q3
-vstrw.u32 Q4, [sp,#(456)]
-vsub.u16 Q4, Q5, Q7
-vstrw.u32 Q6, [sp,#(-56)]
-vmla.s16 Q5, Q0, r6
-vstrw.u32 Q4, [r14,#(-24)]
-vmla.s16 Q4, Q7, r7
-vmla.s16 Q7, Q1, r6
-vadd.u16 Q6, Q4, Q1
-vstrw.u32 Q4, [r14,#(488)]
-vmla.s16 Q6, Q2, r6
-vmla.s16 Q6, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -70)]
-vshl.u16 Q5, Q5, #1
-vldrw.u32 Q1, [r2, #(4 * -6)]
-vsub.u16 Q4, Q5, Q7
-vldrw.u32 Q2, [r2, #(4 * 58)]
-vadd.u16 Q5, Q5, Q7
-vldrw.u32 Q3, [r2, #(4 * 122)]
-vadd.u16 Q7, Q0, Q2
-vstrw.u32 Q6, [r12,#(-8)]
-vadd.u16 Q6, Q1, Q3
-vstrw.u32 Q5, [sp,#(472)]
-vsub.u16 Q5, Q7, Q6
-vstrw.u32 Q4, [sp,#(-40)]
-vmla.s16 Q7, Q0, r6
-vstrw.u32 Q5, [r14,#(-8)]
-vmla.s16 Q5, Q6, r7
-vmla.s16 Q6, Q1, r6
-vadd.u16 Q4, Q5, Q1
-vstrw.u32 Q5, [r14,#(504)]
-vmla.s16 Q4, Q2, r6
-vmla.s16 Q4, Q3, r5
-vldrw.u32 Q0, [r2, #(4 * -66)]
-vshl.u16 Q7, Q7, #1
-vldrw.u32 Q1, [r2, #(4 * -2)]
-vsub.u16 Q5, Q7, Q6
-vldrw.u32 Q2, [r2, #(4 * 62)]
-vadd.u16 Q7, Q7, Q6
-vldrw.u32 Q3, [r2, #(4 * 126)]
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r12,#(8)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [sp,#(488)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [sp,#(-24)]
-vmla.s16 Q6, Q0, r6
-vstrw.u32 Q7, [r14,#(8)]
-vmla.s16 Q7, Q4, r7
-vmla.s16 Q4, Q1, r6
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r12,#(-488)]
-vmla.s16 Q5, Q2, r6
-vmla.s16 Q5, Q3, r5
-vshl.u16 Q6, Q6, #1
-vstrw.u32 Q5, [r12,#(24)]
-vsub.u16 Q5, Q6, Q4
-vstrw.u32 Q5, [sp,#(-8)]
-vadd.u16 Q6, Q6, Q4
-vstrw.u32 Q6, [sp,#(504)]
-sub sp, sp, #504
-sub r1, r1, #504
-sub r2, r2, #504
-add r11, r1, #512
-add r10, r2, #512
-mov r9, r1
-mov r8, r2
-mov r7, r0
-add r1, sp, #(2048)
-add r2, sp, #(2304)
-add r0, sp, #(2560)
-bl poly_u16_mul_128_C
-add r1, sp, #(1536)
-add r2, sp, #(1792)
-add r0, sp, #(2048)
-bl poly_u16_mul_128_C
-add r1, sp, #(1024)
-add r2, sp, #(1280)
-add r0, sp, #(1536)
-bl poly_u16_mul_128_C
-add r1, sp, #(512)
-add r2, sp, #(768)
-add r0, sp, #(1024)
-bl poly_u16_mul_128_C
-add r1, sp, #(0)
-add r2, sp, #(256)
-add r0, sp, #(512)
-bl poly_u16_mul_128_C
-add r1, r9, #(0)
-add r2, r8, #(0)
-add r0, sp, #(0)
-bl poly_u16_mul_128_C
-add r1, r11, #(256)
-add r2, r10, #(256)
-add r0, sp, #(3072)
-bl poly_u16_mul_128_C
-add sp, sp, #504
-add r14, sp, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-mov r10, #-64
-mov r9, #45
-mov r8, #-8
-mov r6, #43691
-mov r5, #16
-mov r4, #30
-mov r3, #61167
-mov r2, #-65
-mov r1, #36409
-mov r0, #1
-vldrw.u32 Q0, [r12, #(4 * 10)]
-vldrw.u32 Q1, [r14, #(4 * -122)]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [sp, #(4 * 2)]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r12, #(4 * -118)]
-vldrw.u32 Q4, [r14, #(4 * 6)]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [sp, #(4 * -126)]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r11, #(4 * -114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r2
-vsub.u16 Q3, Q3, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r12, #(4 * 14)]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r8
-vldrw.u32 Q5, [r14, #(4 * 10)]
-vmla.s16 Q0, Q3, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-488)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r5
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r12,#(-472)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r1
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r14, #(4 * -118)]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r4
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r14,#(24)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r3
-vstrw.u32 Q2, [sp,#(8)]
-vsub.u16 Q0, Q0, Q2
-vstrw.u32 Q0, [r12,#(40)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -114)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -122)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 18)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 14)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-472)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -114)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(24)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(56)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -110)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -118)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 22)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 18)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-456)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -110)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(40)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(72)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -106)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -114)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 26)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 22)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-440)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -106)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(56)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(88)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -102)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -110)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 30)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 26)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-424)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -102)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(72)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(104)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -98)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -106)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 34)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 30)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-408)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -98)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(88)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(120)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -94)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -102)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 38)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 34)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-392)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -94)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(104)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(136)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -90)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -98)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 42)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 38)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-376)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -90)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(120)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(152)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -86)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -94)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 46)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 42)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-360)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -86)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(136)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -82)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -90)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 50)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 46)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-344)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -82)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(152)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(184)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -78)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -86)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 54)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 50)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-328)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -78)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(168)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(200)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -74)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -82)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 58)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 54)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-312)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-296)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -74)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(184)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(216)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -70)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -78)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 62)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 58)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-296)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-280)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -70)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(216)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(200)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(232)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -66)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -74)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -62)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 66)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 62)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-280)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-264)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -66)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(216)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -62)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -70)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -58)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 70)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 66)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-264)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-248)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -62)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r14,#(248)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(232)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r12,#(264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -58)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -66)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -54)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 74)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 70)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vstrw.u32 Q1, [r14,#(-248)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r12,#(-232)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -58)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r14,#(264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vstrw.u32 Q0, [sp,#(248)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r12,#(280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -62)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -50)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 78)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 2)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(8)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 74)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(40)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -54)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -118)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-472)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -122)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-488)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 70)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -50)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -58)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -46)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 82)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 6)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(24)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 78)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(40)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(56)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -50)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -114)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-456)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -118)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-472)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -110)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-440)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 74)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -46)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -54)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -42)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 86)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 10)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(40)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 82)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(56)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(72)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -46)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -110)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-440)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -114)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -106)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-424)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 78)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -42)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -50)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -38)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 90)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 14)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(56)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 86)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(72)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 22)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(88)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -42)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -106)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-424)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -110)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -102)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-408)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 82)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -38)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -46)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -34)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 94)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 18)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(72)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 90)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(88)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 26)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(104)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -38)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -102)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-408)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -106)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -98)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-392)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -34)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -42)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -30)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 98)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 22)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(88)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 94)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(104)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 30)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(120)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -34)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -98)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-392)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -102)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -94)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-376)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -30)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -38)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -26)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 102)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 26)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(104)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 98)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(120)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 34)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(136)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -30)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -94)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -98)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -90)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-360)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 94)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -26)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -34)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -22)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 106)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 30)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(120)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 102)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(136)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 38)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(152)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -26)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -90)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -94)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -22)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -30)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -18)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 110)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 34)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(136)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 106)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 42)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(168)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -22)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -86)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -90)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-360)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -18)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -26)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -14)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 114)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 38)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(152)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 110)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 46)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(184)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -18)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -82)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -86)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-344)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -14)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -10)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 118)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 42)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(168)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 114)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 50)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(200)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -14)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -78)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -82)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-328)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -10)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * -6)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r12, #(4 * 122)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 46)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(184)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 118)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 54)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(216)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -10)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -74)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-296)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -78)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-312)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -6)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * -2)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r12, #(4 * 126)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 50)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(200)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r14, #(4 * 122)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 58)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(232)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -6)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r4
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r12, #(4 * -70)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r12,#(-280)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-296)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r11, #(4 * -66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r11,#(-264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [sp, #(4 * 118)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * -2)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r11, #(4 * 2)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r10
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r2
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #(4 * -122)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 54)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(216)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q3, [r14, #(4 * 126)]
-vmla.s16 Q6, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r6
-vldrw.u32 Q7, [r14, #(4 * 58)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r14,#(232)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r5
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r12, #(4 * 62)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r12,#(248)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r1
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * -2)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r4
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r12, #(4 * -66)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r12,#(-264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r3
-vldrw.u32 Q2, [r14, #(4 * -70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(-280)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -62)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [sp, #(4 * 122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r12, #(4 * 2)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r11, #(4 * 6)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r10
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r2
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r0
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r11, #(4 * -118)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 58)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(232)]
-vmla.s16 Q1, Q2, r8
-vldrw.u32 Q5, [r12, #(4 * -122)]
-vmla.s16 Q4, Q2, r9
-vshr.u16 Q1, Q1, #3
-vmul.u16
Q1, Q1, r6 -vldrw.u32 Q7, [r14, #(4 * 62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 66)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r1 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 2)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r4 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -62)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q2, [r14, #(4 * -66)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-264)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -2)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r2 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r0 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #(4 * 62)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(248)] -vmla.s16 Q1, Q2, r8 -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r6 -vldrw.u32 Q3, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r5 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * 70)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r12,#(280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r1 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r4 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r3 -vldrw.u32 Q1, [r14, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r14,#(-248)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -54)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-216)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, 
[r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 
-vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 
-vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #256 -add sp, sp, #3584 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_mul_64_toom4_mve.s b/tests/toom/auto/poly_u16_mul_64_toom4_mve.s deleted file mode 100644 index 1e57d8e..0000000 --- a/tests/toom/auto/poly_u16_mul_64_toom4_mve.s +++ /dev/null @@ -1,379 +0,0 @@ -.syntax unified -.type poly_u16_mul_16_C, %function -.global poly_u16_mul_16_C -.syntax unified -.type poly_u16_mul_64_toom4_mve, %function -.global poly_u16_mul_64_toom4_mve -poly_u16_mul_64_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #448 -add sp, sp, #504 -add r1, r1, #504 -add r2, r2, #504 -mov r14, #1 -mov r12, #2 -mov r11, #3 -mov r10, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -118)] -vldrw.u32 Q2, [r1, #(4 * -110)] -vldrw.u32 Q3, [r1, #(4 * -102)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r11 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q6, Q5, r12 -vmla.s16 Q5, Q1, r11 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q7, Q2, r11 -vmla.s16 Q7, Q3, r10 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -114)] -vsub.u16 Q6, 
Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * -106)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r1, #(4 * -98)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [sp,#(-248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(-440)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r11 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q4, Q7, r12 -vmla.s16 Q7, Q1, r11 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q6, Q2, r11 -vmla.s16 Q6, Q3, r10 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -118)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * -110)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r2, #(4 * -102)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [sp,#(-232)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(-424)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r11 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q5, Q6, r12 -vmla.s16 Q6, Q1, r11 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q4, Q2, r11 -vmla.s16 Q4, Q3, r10 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -114)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * -106)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r2, #(4 * -98)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [sp,#(-216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(-408)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r11 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q7, Q4, r12 -vmla.s16 Q4, Q1, r11 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q5, Q2, r11 -vmla.s16 Q5, Q3, r10 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [sp,#(-200)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(-456)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [sp,#(-392)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #64 -add r10, r2, #64 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(256) -add r2, sp, #(288) -add r0, sp, #(320) -bl poly_u16_mul_16_C -add r1, sp, #(192) -add r2, sp, #(224) -add r0, sp, #(256) -bl poly_u16_mul_16_C -add r1, sp, #(128) -add r2, sp, #(160) -add r0, sp, #(192) -bl 
poly_u16_mul_16_C -add r1, sp, #(64) -add r2, sp, #(96) -add r0, sp, #(128) -bl poly_u16_mul_16_C -add r1, sp, #(0) -add r2, sp, #(32) -add r0, sp, #(64) -bl poly_u16_mul_16_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_16_C -add r1, r11, #(32) -add r2, r10, #(32) -add r0, sp, #(384) -bl poly_u16_mul_16_C -add sp, sp, #504 -mov r14, #-64 -mov r12, #45 -mov r11, #-8 -mov r10, #43691 -mov r9, #16 -mov r8, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [sp, #(4 * -46)] -vldrw.u32 Q1, [sp, #(4 * -94)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -110)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [sp, #(4 * -62)] -vldrw.u32 Q4, [sp, #(4 * -78)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [sp, #(4 * -30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [sp, #(4 * -42)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r11 -vldrw.u32 Q5, [sp, #(4 * -74)] -vmla.s16 Q0, Q3, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [sp,#(-376)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r9 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(-248)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [sp, #(4 * -90)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r8 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [sp,#(-312)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [sp,#(-440)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [sp,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [sp, #(4 * -26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [sp, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [sp, 
#(4 * -70)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [sp,#(-360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [sp,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * -86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [sp,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [sp,#(-424)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [sp,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * -102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [sp, #(4 * -22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [sp, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [sp, #(4 * -110)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [sp,#(-440)] -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [sp, #(4 * -66)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vldrw.u32 Q7, [sp, #(4 * -78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [sp,#(-312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [sp, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [sp,#(-184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [sp, #(4 * -82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [sp, #(4 * -62)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [sp,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [sp, #(4 * -94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [sp,#(-376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [sp, #(4 * -30)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [sp,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * -98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * -50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [sp, #(4 * 
-18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [sp, #(4 * -106)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q1, Q2, r11 -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vldrw.u32 Q3, [sp, #(4 * -74)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [sp,#(-296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [sp, #(4 * -42)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [sp,#(-168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [sp, #(4 * -58)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [sp,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q1, [sp, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [sp,#(-360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [sp, #(4 * -26)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [sp,#(-104)] -sub sp, sp, #504 -add r14, sp, #0 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #32 -add sp, sp, #448 -vpop 
{d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_mul_768_toom3_mve.s b/tests/toom/auto/poly_u16_mul_768_toom3_mve.s deleted file mode 100644 index af3e296..0000000 --- a/tests/toom/auto/poly_u16_mul_768_toom3_mve.s +++ /dev/null @@ -1,2662 +0,0 @@ -.syntax unified -.type poly_u16_mul_256_C, %function -.global poly_u16_mul_256_C -.syntax unified -.type poly_u16_mul_768_toom3_mve, %function -.global poly_u16_mul_768_toom3_mve -poly_u16_mul_768_toom3_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #5120 -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -add r1, r1, #504 -add r8, r1, #1008 -add r2, r2, #504 -add r7, r2, #1008 -mov r6, #-2 -mov r5, #3 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r8, #(4 * -122)] -vadd.u16 Q2, Q0, Q1 -vldrw.u32 Q3, [r1, #(4 * 2)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [sp,#(-504)] -vmla.s16 Q2, Q3, r6 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r8, #(4 * -118)] -vmla.s16 Q2, Q1, r5 -vstrw.u32 Q2, [r12,#(-472)] -vldrw.u32 Q0, [r1, #(4 * -122)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 6)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-488)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-472)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -114)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-456)] -vldrw.u32 Q0, [r1, #(4 * -118)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 10)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-472)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-456)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -110)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-440)] -vldrw.u32 Q0, [r1, #(4 * -114)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 14)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-456)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-440)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -106)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-424)] -vldrw.u32 Q0, [r1, #(4 * -110)] 
-vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 18)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-440)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-424)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -102)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-408)] -vldrw.u32 Q0, [r1, #(4 * -106)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 22)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-424)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-408)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -98)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-392)] -vldrw.u32 Q0, [r1, #(4 * -102)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 26)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-408)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-392)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -94)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-376)] -vldrw.u32 Q0, [r1, #(4 * -98)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 30)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-392)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-376)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -90)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-360)] -vldrw.u32 Q0, [r1, #(4 * -94)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 34)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-376)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-360)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -86)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-344)] -vldrw.u32 Q0, [r1, #(4 * -90)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 38)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-360)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-344)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -82)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-328)] -vldrw.u32 Q0, [r1, #(4 * -86)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 42)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-344)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-328)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -78)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-312)] -vldrw.u32 Q0, [r1, #(4 * 
-82)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 46)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-328)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-312)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -74)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-296)] -vldrw.u32 Q0, [r1, #(4 * -78)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 50)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-312)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-296)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -70)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-280)] -vldrw.u32 Q0, [r1, #(4 * -74)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 54)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-296)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-280)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -66)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-264)] -vldrw.u32 Q0, [r1, #(4 * -70)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 58)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-280)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-264)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -62)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-248)] -vldrw.u32 Q0, [r1, #(4 * -66)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 62)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-264)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-248)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -58)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-232)] -vldrw.u32 Q0, [r1, #(4 * -62)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 66)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-248)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-232)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -54)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-216)] -vldrw.u32 Q0, [r1, #(4 * -58)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-232)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-216)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -50)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-200)] -vldrw.u32 Q0, [r1, #(4 
* -54)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 74)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-216)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-200)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -46)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-184)] -vldrw.u32 Q0, [r1, #(4 * -50)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 78)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-200)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-184)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -42)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-168)] -vldrw.u32 Q0, [r1, #(4 * -46)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 82)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-184)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-168)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -38)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-152)] -vldrw.u32 Q0, [r1, #(4 * -42)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 86)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-168)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-152)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -34)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-136)] -vldrw.u32 Q0, [r1, #(4 * -38)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 90)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-152)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-136)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -30)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-120)] -vldrw.u32 Q0, [r1, #(4 * -34)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 94)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-136)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-120)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -26)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-104)] -vldrw.u32 Q0, [r1, #(4 * -30)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 98)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-120)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-104)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -22)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-88)] -vldrw.u32 Q0, [r1, 
#(4 * -26)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 102)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-104)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-88)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -18)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-72)] -vldrw.u32 Q0, [r1, #(4 * -22)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 106)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-88)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-72)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -14)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-56)] -vldrw.u32 Q0, [r1, #(4 * -18)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 110)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-72)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-56)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -10)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-40)] -vldrw.u32 Q0, [r1, #(4 * -14)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 114)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-56)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-40)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * -6)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(-24)] -vldrw.u32 Q0, [r1, #(4 * -10)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 118)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-40)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-24)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r8, #(4 * -2)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(-8)] -vldrw.u32 Q0, [r1, #(4 * -6)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r1, #(4 * 122)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-24)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(-8)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r8, #(4 * 2)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(8)] -vldrw.u32 Q0, [r1, #(4 * -2)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r1, #(4 * 126)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(-8)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(8)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -122)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(24)] -vldrw.u32 Q0, [r2, #(4 * -126)] -vadd.u16 
Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 2)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(8)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(24)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -118)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(40)] -vldrw.u32 Q0, [r2, #(4 * -122)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 6)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(24)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(40)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -114)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(56)] -vldrw.u32 Q0, [r2, #(4 * -118)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 10)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(40)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(56)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -110)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(72)] -vldrw.u32 Q0, [r2, #(4 * -114)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 14)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(56)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(72)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -106)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(88)] -vldrw.u32 Q0, [r2, #(4 * -110)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 18)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(72)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(88)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -102)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(104)] -vldrw.u32 Q0, [r2, #(4 * -106)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 22)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(88)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(104)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -98)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(120)] -vldrw.u32 Q0, [r2, #(4 * -102)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 26)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(104)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(120)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -94)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(136)] -vldrw.u32 Q0, [r2, #(4 * -98)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, 
[r2, #(4 * 30)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(120)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(136)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -90)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(152)] -vldrw.u32 Q0, [r2, #(4 * -94)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 34)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(136)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(152)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -86)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(168)] -vldrw.u32 Q0, [r2, #(4 * -90)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 38)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(152)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(168)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -82)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(184)] -vldrw.u32 Q0, [r2, #(4 * -86)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 42)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(168)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(184)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -78)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(200)] -vldrw.u32 Q0, [r2, #(4 * -82)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 46)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(184)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(200)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -74)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(216)] -vldrw.u32 Q0, [r2, #(4 * -78)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 50)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(200)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(216)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -70)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(232)] -vldrw.u32 Q0, [r2, #(4 * -74)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 54)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(216)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(232)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -66)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(248)] -vldrw.u32 Q0, [r2, #(4 * -70)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 58)] 
-vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(232)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(248)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -62)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(264)] -vldrw.u32 Q0, [r2, #(4 * -66)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 62)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(248)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(264)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -58)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(280)] -vldrw.u32 Q0, [r2, #(4 * -62)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 66)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(264)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(280)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -54)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(296)] -vldrw.u32 Q0, [r2, #(4 * -58)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(280)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(296)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -50)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(312)] -vldrw.u32 Q0, [r2, #(4 * -54)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 74)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(296)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(312)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -46)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(328)] -vldrw.u32 Q0, [r2, #(4 * -50)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 78)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(312)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(328)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -42)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(344)] -vldrw.u32 Q0, [r2, #(4 * -46)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 82)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(328)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(344)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -38)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(360)] -vldrw.u32 Q0, [r2, #(4 * -42)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 86)] -vadd.u16 Q1, Q1, 
Q2 -vstrw.u32 Q1, [sp,#(344)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(360)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -34)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(376)] -vldrw.u32 Q0, [r2, #(4 * -38)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 90)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(360)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(376)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -30)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(392)] -vldrw.u32 Q0, [r2, #(4 * -34)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 94)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(376)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(392)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -26)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(408)] -vldrw.u32 Q0, [r2, #(4 * -30)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 98)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(392)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(408)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -22)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(424)] -vldrw.u32 Q0, [r2, #(4 * -26)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 102)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(408)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(424)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -18)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(440)] -vldrw.u32 Q0, [r2, #(4 * -22)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 106)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(424)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(440)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -14)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(456)] -vldrw.u32 Q0, [r2, #(4 * -18)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 110)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(440)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(456)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -10)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(472)] -vldrw.u32 Q0, [r2, #(4 * -14)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 114)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 
Q1, [sp,#(456)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(472)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * -6)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r12,#(488)] -vldrw.u32 Q0, [r2, #(4 * -10)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 118)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(472)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(488)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r7, #(4 * -2)] -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r12,#(504)] -vldrw.u32 Q0, [r2, #(4 * -6)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r2, #(4 * 122)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(488)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r14,#(504)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r7, #(4 * 2)] -vmla.s16 Q1, Q3, r5 -vstrw.u32 Q1, [r11,#(-488)] -vldrw.u32 Q0, [r2, #(4 * -2)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r2, #(4 * 126)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [sp,#(504)] -vmla.s16 Q1, Q2, r6 -vstrw.u32 Q1, [r12,#(-488)] -vsub.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q4, r5 -vstrw.u32 Q1, [r11,#(-472)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #1024 -add r10, r2, #1024 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(2048) -add r2, sp, #(2560) -add r0, sp, #(3072) -bl poly_u16_mul_256_C -add r1, sp, #(1024) -add r2, sp, #(1536) -add r0, sp, #(2048) -bl poly_u16_mul_256_C -add r1, sp, #(0) -add r2, sp, #(512) -add r0, sp, #(1024) -bl poly_u16_mul_256_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_256_C -add r1, r11, #(512) -add r2, r10, #(512) -add r0, sp, #(4096) -bl poly_u16_mul_256_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #43691 -mov r6, #2 -mov r5, #-1 -vldrw.u32 Q0, [r14, #(4 * -122)] -vldrw.u32 Q1, [r11, #(4 * -114)] -vsub.u16 Q1, Q1, Q0 -vmul.u16 Q1, Q1, r8 -vldrw.u32 Q2, [r12, #(4 * -118)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q3, [sp, #(4 * -126)] -vmla.s16 Q2, Q3, r5 -vshr.u16 Q0, Q0, #1 -vldrw.u32 Q4, [r11, #(4 * -110)] -vsub.u16 Q1, Q2, Q1 -vldrw.u32 
Q5, [r10, #(4 * -110)] -vadd.u16 Q2, Q2, Q0 -vldrw.u32 Q6, [r14, #(4 * -118)] -vshr.u16 Q1, Q1, #1 -vmla.s16 Q1, Q5, r6 -vstrw.u32 Q1, [r11,#(-456)] -vsub.u16 Q0, Q0, Q1 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-472)] -vsub.u16 Q4, Q4, Q6 -vmul.u16 Q4, Q4, r8 -vldrw.u32 Q0, [r12, #(4 * -114)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -122)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -106)] -vsub.u16 Q4, Q0, Q4 -vldrw.u32 Q3, [r10, #(4 * -106)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -114)] -vshr.u16 Q4, Q4, #1 -vmla.s16 Q4, Q3, r6 -vstrw.u32 Q4, [r11,#(-440)] -vsub.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-472)] -vsub.u16 Q0, Q0, Q3 -vstrw.u32 Q0, [r12,#(-456)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -110)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -118)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -102)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -102)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -110)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-424)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-456)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-440)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -106)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -114)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -98)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -98)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -106)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-408)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-424)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -102)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -110)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -94)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -94)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 
Q6, [r14, #(4 * -102)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-392)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-424)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-408)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -98)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -106)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -90)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -90)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -98)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-376)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-408)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-392)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -94)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -102)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -86)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -94)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-360)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-392)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-376)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -90)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -98)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -82)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -82)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -90)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-344)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-360)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -86)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -94)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -78)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -86)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 
-vstrw.u32 Q2, [r11,#(-328)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-360)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-344)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -82)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -90)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -74)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -74)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -82)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-312)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-344)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-328)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -78)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -86)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -70)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -70)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -78)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-296)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-328)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-312)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -74)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -82)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -66)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -66)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -74)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-280)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-296)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -70)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -78)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -62)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -70)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-264)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, 
[r14,#(-296)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-280)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -66)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -74)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -58)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -58)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -66)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-248)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-280)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-264)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -62)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -70)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -54)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -62)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-232)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-264)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-248)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -58)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -66)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -50)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -50)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -58)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-216)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-232)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -54)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -62)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -46)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -46)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-200)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-232)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-216)] 
-vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -50)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -58)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -42)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -50)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-184)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-216)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-200)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -46)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -54)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -38)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -38)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-168)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-200)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-184)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -42)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -50)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -34)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -42)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-152)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-168)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -38)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -46)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -30)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -30)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-136)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-168)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-152)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 
-34)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -42)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -26)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -34)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-120)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-152)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-136)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -30)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -38)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -22)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -22)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-104)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-136)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-120)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -26)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -34)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -18)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -26)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-88)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-104)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -22)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -30)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -14)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -14)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-72)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-104)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-88)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -18)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -26)] -vmla.s16 Q0, 
Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -10)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -10)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -18)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-56)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-88)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-72)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -14)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -22)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -6)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -6)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-40)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-72)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-56)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -10)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -18)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -2)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -2)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -10)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(-24)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-40)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -6)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -14)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 2)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 2)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(-8)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-40)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-24)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -2)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -10)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 6)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, 
[r10, #(4 * 6)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -2)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(8)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-24)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-8)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 2)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * -6)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 10)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 10)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r6 -vstrw.u32 Q2, [r11,#(24)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-8)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(8)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 6)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * -2)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 14)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 14)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 6)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r6 -vstrw.u32 Q3, [r11,#(40)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(8)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(24)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 10)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 2)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 18)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 18)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 10)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -122)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q2, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -110)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-440)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -118)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-472)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-456)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 14)] -vsub.u16 Q6, Q6, Q0 
-vldrw.u32 Q1, [sp, #(4 * 6)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 22)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 22)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -118)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q3, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -106)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-424)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-456)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -110)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-440)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 18)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 10)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 26)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 26)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 18)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -114)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q2, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -102)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-408)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -110)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-440)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -106)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-424)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 22)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * 14)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 30)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 30)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -110)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q3, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -98)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-392)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -106)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-424)] -vsub.u16 Q0, Q0, Q4 
-vldrw.u32 Q1, [r11, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-408)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 26)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 18)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 34)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 34)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 26)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -106)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q2, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -94)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-376)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -102)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-408)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-392)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 30)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * 22)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 38)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 38)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -102)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q3, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -90)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-360)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -98)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-392)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-376)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 34)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 26)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 42)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 42)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 34)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -98)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q2, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -86)] 
-vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-344)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -94)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-376)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-360)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 38)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * 30)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 46)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 46)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -94)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q3, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-328)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -90)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-360)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-344)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 42)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 34)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 50)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 50)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 42)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-360)] -vmla.s16 Q2, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-312)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -86)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-344)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-328)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 46)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * 38)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 54)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 54)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, 
#(4 * 46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-344)] -vmla.s16 Q3, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-296)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -82)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-328)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-312)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 50)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 42)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 58)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 58)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 50)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q2, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-280)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -78)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-312)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-296)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 54)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [sp, #(4 * 46)] -vmla.s16 Q0, Q1, r5 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 62)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 62)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q3, Q4, r6 -vldrw.u32 Q1, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-264)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -74)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-296)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-280)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 58)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [sp, #(4 * 50)] 
[... several thousand collapsed deleted lines of auto-generated MVE butterfly and copy-out assembly omitted; in the original patch each `-`-prefixed instruction is its own deleted line ...]
-add r14, #512
-add sp, sp, #5120
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_mul_768_toom4_mve.s b/tests/toom/auto/poly_u16_mul_768_toom4_mve.s
deleted file mode 100644
index ae6d8a7..0000000
--- a/tests/toom/auto/poly_u16_mul_768_toom4_mve.s
+++ /dev/null
@@ -1,3759 +0,0 @@
-.syntax unified
-.type poly_u16_mul_192_C, %function
-.global poly_u16_mul_192_C
-.syntax unified
-.type poly_u16_mul_768_toom4_mve, %function
-.global poly_u16_mul_768_toom4_mve
-poly_u16_mul_768_toom4_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-sub sp, sp, #5376
-add sp, sp, #504
-add r14, sp, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-add r1, r1, #504
-add r8, r1, #1008
-add r2, r2, #504
-add r7, r2, #1008
-mov r6, #1
-mov r5, #2
-mov r4, #3
-mov r3, #7
[... remaining ~3700 collapsed deleted lines of the auto-generated Toom-4 evaluation assembly omitted ...]
[r7, #(4 * -74)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-24)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(232)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -70)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-8)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-296)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -66)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(264)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -62)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(24)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-264)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 2)] -vsub.u16 Q7, 
Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -58)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-8)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(296)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -90)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 6)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -54)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(56)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-232)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-456)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 10)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -50)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(328)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 14)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -46)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(88)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-200)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-424)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * 
-78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 18)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -42)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(360)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 22)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -38)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(120)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-168)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-392)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 26)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 122)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -34)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(392)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 30)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 126)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -30)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(152)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-136)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-360)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, 
[r12,#(408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -62)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 34)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -26)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(168)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(424)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -58)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 38)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -22)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(184)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-104)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-328)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -54)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 42)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -18)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(456)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -50)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 46)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -14)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-72)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, 
[r12,#(-296)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -46)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 50)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -10)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(488)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -42)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 54)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -6)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-40)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-264)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(504)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -38)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 58)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -2)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-488)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -34)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 62)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(280)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-8)] -vsub.u16 
Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-232)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r11,#(296)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(248)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(8)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #768 -add r10, r2, #768 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(3072) -add r2, sp, #(3456) -add r0, sp, #(3840) -bl poly_u16_mul_192_C -add r1, sp, #(2304) -add r2, sp, #(2688) -add r0, sp, #(3072) -bl poly_u16_mul_192_C -add r1, sp, #(1536) -add r2, sp, #(1920) -add r0, sp, #(2304) -bl poly_u16_mul_192_C -add r1, sp, #(768) -add r2, sp, #(1152) -add r0, sp, #(1536) -bl poly_u16_mul_192_C -add r1, sp, #(0) -add r2, sp, #(384) -add r0, sp, #(768) -bl poly_u16_mul_192_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_192_C -add r1, r11, #(384) -add r2, r10, #(384) -add r0, sp, #(4608) -bl poly_u16_mul_192_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #-64 -mov r6, #45 -mov r5, #-8 -mov r4, #43691 -mov r3, #16 -mov r2, #30 -mov r1, #61167 -mov r0, #-65 -vldrw.u32 Q0, [r11, #(4 * 78)] -vldrw.u32 Q1, [r14, #(4 * 6)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * 66)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #(4 * -114)] -vldrw.u32 Q4, [r12, #(4 * -54)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #(4 * 18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r0 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r11, #(4 * 82)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r5 -vldrw.u32 Q5, [r12, #(4 * -50)] -vmla.s16 Q0, Q3, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(24)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 
Q0, Q4, r3 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-456)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * 10)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r2 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-216)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r1 -vstrw.u32 Q2, [sp,#(264)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r11,#(312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -46)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -42)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-424)] 
-vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -38)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -34)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 
-vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 26)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -30)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 30)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -26)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 34)] -vadd.u16 
Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -22)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -18)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 42)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, 
[r12,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -14)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -10)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 
-vstrw.u32 Q0, [sp,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * -6)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * -2)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, 
[r11,#(504)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 2)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -58)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 6)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 126)] 
-vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -54)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 10)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -50)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 14)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -46)] 
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -58)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -102)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 18)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(296)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-424)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -54)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -98)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 22)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(312)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-408)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -50)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -94)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 26)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(328)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-392)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -46)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -90)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 30)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(344)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-376)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -42)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -86)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 34)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(360)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-360)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -38)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -82)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 38)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(376)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-344)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -94)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -22)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -34)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -78)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 42)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r14,#(392)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-328)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -18)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -30)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -74)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 66)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(264)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 46)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 78)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -114)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-456)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 6)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(24)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 18)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(72)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -14)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -26)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -70)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 70)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(280)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 50)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 82)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -110)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-440)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 10)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(40)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 22)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(88)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -82)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 122)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -66)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 74)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(296)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 54)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -106)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-424)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 14)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(56)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 26)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(104)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -78)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 126)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -62)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 78)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(312)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 58)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -102)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-408)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 18)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(72)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 30)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(120)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -74)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -122)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -58)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(328)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 62)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -98)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-392)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 22)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(88)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 34)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(136)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -70)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -54)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 66)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-136)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 98)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 126)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -94)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 26)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(104)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 38)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(152)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -50)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 70)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-120)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 102)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -90)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 30)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(120)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 42)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -2)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -46)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 74)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-104)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 106)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -86)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 34)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(136)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 46)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(184)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 2)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -42)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 78)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-88)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 110)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -82)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 38)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 50)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(200)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 6)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -38)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 82)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-72)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 114)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -78)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 42)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 54)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(216)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 10)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -34)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 86)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-56)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 118)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -74)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-296)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 46)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(184)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 58)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(232)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 14)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -30)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 90)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-40)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 122)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(488)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -70)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-280)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 50)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(200)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 62)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 18)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -26)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 94)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 126)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(504)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -66)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 54)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(216)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 34)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 22)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -22)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 98)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-8)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -122)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-488)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -62)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-248)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 58)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(232)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 38)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 26)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -18)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 102)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(8)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -118)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -58)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 62)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(248)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 42)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 30)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 126)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(504)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 106)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(24)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -114)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -54)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-216)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 66)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(264)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 46)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 34)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -10)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 110)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(40)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -110)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(280)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 50)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 38)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -6)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 114)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(56)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -106)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(296)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 42)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -2)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -102)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 78)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(312)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 90)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(360)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 58)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 46)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -62)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 2)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 122)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(88)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -98)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 82)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(328)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 94)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(376)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 62)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 50)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -58)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 6)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 126)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(104)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 86)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(344)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 98)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(392)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 66)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 54)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -54)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 10)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -122)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(120)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 90)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(360)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 102)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(408)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 70)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 58)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -50)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(136)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 94)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(376)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 106)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(424)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 74)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 62)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -46)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q4, [r14, #(4 * -94)]
-vadd.u16 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-376)]
-vmla.s16 Q1, Q2, r5
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q3, [r12, #(4 * 38)]
-vadd.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r12,#(152)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r10, #(4 * -82)]
-vadd.u16 Q3, Q3, Q2
-vstrw.u32 Q3, [r10,#(-328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q1, [r11, #(4 * -22)]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r11,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(392)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 110)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(440)]
-sub sp, sp, #504
-add r14, sp, #0
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #384
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #384
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7],
#+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, 
[r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #384 -add sp, sp, #5376 -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_mul_832_toom4_mve.s b/tests/toom/auto/poly_u16_mul_832_toom4_mve.s deleted file mode 100644 index 2a95384..0000000 --- a/tests/toom/auto/poly_u16_mul_832_toom4_mve.s +++ /dev/null @@ -1,4065 +0,0 @@ -.syntax unified -.type poly_u16_mul_208_C, %function -.global poly_u16_mul_208_C -.syntax unified -.type poly_u16_mul_832_toom4_mve, %function -.global poly_u16_mul_832_toom4_mve -poly_u16_mul_832_toom4_mve: -push {r4-r11,lr} -vpush {d8-d15} -sub sp, sp, #5824 -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -add r1, r1, #504 -add r8, r1, #1008 -add r2, r2, #504 -add r7, r2, #1008 -mov r6, #1 -mov r5, #2 -mov r4, #3 -mov r3, #7 -vldrw.u32 Q0, [r1, #(4 * -126)] -vldrw.u32 Q1, [r1, #(4 * -22)] -vldrw.u32 Q2, [r1, #(4 * 82)] -vldrw.u32 Q3, [r8, #(4 * -66)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-24)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -122)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -18)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 
-62)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-200)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(328)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-504)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(168)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-8)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -118)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * -14)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -58)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-184)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-488)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(8)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -114)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * -10)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -54)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-168)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(360)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-472)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(200)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -110)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * -6)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -50)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-152)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-456)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(40)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -106)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * -2)] -vsub.u16 Q6, Q4, Q5 
-vldrw.u32 Q2, [r1, #(4 * 102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -46)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-136)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(392)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-440)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(232)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(56)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -102)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 2)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -42)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-120)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-424)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(72)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -98)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 6)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -38)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-104)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(424)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-408)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(264)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -94)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 10)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r1, #(4 * 114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -34)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-88)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-392)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(104)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -90)] 
-vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 14)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r1, #(4 * 118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -30)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-72)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [sp,#(456)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-376)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(296)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(120)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -86)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 18)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r1, #(4 * 122)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -26)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-56)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [sp,#(472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-360)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(136)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -82)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 22)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r1, #(4 * 126)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -22)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-40)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [sp,#(488)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-344)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(328)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -78)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 26)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -18)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-24)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [sp,#(504)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-328)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(168)] 
-vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -74)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 30)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -118)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * -14)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-8)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-488)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-312)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(360)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(184)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -70)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 34)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * -10)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(8)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-472)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-296)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(200)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -66)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 38)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -110)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * -6)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(24)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-456)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-280)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(392)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -62)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 42)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * -2)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(40)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-264)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(408)] -vmla.s16 Q6, 
Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(232)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -58)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 46)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -102)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 2)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(56)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-424)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-248)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(424)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -54)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 50)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * 6)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(72)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-408)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-232)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(264)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -50)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 54)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -94)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * 10)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(88)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-392)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-216)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r14,#(456)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -46)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 58)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -90)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * 14)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(104)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, 
[sp,#(-200)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r14,#(472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(296)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -42)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 62)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -86)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 18)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(120)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-360)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-184)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r14,#(488)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -38)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r1, #(4 * 66)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r8, #(4 * -82)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r8, #(4 * 22)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(136)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-344)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-168)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r14,#(504)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(328)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -34)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r1, #(4 * 70)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r8, #(4 * -78)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r8, #(4 * 26)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(152)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-328)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-152)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-488)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -30)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r1, #(4 * 74)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r8, #(4 * -74)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r8, #(4 * 30)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(168)] 
-vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-136)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-472)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(360)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r1, #(4 * -26)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r1, #(4 * 78)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r8, #(4 * -70)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r8, #(4 * 34)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(184)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-296)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-120)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-456)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -126)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -22)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 82)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -66)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(200)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-280)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-104)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-440)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(392)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -122)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -18)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 86)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -62)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(216)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-264)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-88)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-424)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -118)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * -14)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 90)] -vadd.u16 Q6, 
Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -58)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(232)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-72)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-408)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(424)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -114)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * -10)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 94)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -54)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(248)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-232)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(-56)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-392)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -110)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * -6)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 98)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -50)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(264)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-216)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(-40)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-376)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(456)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -106)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * -2)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 102)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -46)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(280)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-200)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(-24)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-360)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -102)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, 
[r2, #(4 * 2)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 106)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -42)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(296)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(-8)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-344)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(488)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -98)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 6)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 110)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -38)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(312)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-168)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(8)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-328)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(504)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -94)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 10)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r2, #(4 * 114)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -34)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-152)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(24)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-312)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-488)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -90)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 14)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r2, #(4 * 118)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -30)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(344)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-136)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(40)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-296)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-472)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, 
Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -86)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 18)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r2, #(4 * 122)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -26)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(360)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(56)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-280)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-456)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -82)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 22)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r2, #(4 * 126)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -22)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(376)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-104)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(72)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-264)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-440)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -78)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 26)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -122)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -18)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-88)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(88)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-248)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-424)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -74)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 30)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -118)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * -14)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-72)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(104)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-232)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 
-vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-408)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -70)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 34)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -114)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * -10)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(424)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(120)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-216)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-392)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -66)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 38)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -110)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * -6)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(440)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-40)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(136)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-200)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-376)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -62)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 42)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -106)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * -2)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-24)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(152)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-184)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-360)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -58)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 46)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -102)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 2)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-8)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(168)] -vmla.s16 
Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-168)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-344)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -54)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 50)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -98)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * 6)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(488)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(8)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(184)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-152)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-328)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -50)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 54)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -94)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * 10)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(504)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(24)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(200)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-136)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-312)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -46)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 58)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -90)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * 14)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r10,#(-488)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(40)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(216)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-120)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-296)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -42)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 62)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -86)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 18)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r10,#(-472)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 
Q7, [r14,#(56)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(232)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-104)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-280)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -38)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r2, #(4 * 66)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r7, #(4 * -82)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r7, #(4 * 22)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r10,#(-456)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(72)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [sp,#(248)] -vmla.s16 Q4, Q0, r4 -vstrw.u32 Q6, [r12,#(-88)] -vmla.s16 Q6, Q5, r5 -vmla.s16 Q5, Q1, r4 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r11,#(-264)] -vmla.s16 Q7, Q2, r4 -vmla.s16 Q7, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -34)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r2, #(4 * 70)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r7, #(4 * -78)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r7, #(4 * 26)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r10,#(-440)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(88)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [sp,#(264)] -vmla.s16 Q5, Q0, r4 -vstrw.u32 Q4, [r12,#(-72)] -vmla.s16 Q4, Q7, r5 -vmla.s16 Q7, Q1, r4 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r11,#(-248)] -vmla.s16 Q6, Q2, r4 -vmla.s16 Q6, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -30)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r2, #(4 * 74)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r7, #(4 * -74)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r7, #(4 * 30)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r10,#(-424)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(104)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [sp,#(280)] -vmla.s16 Q7, Q0, r4 -vstrw.u32 Q5, [r12,#(-56)] -vmla.s16 Q5, Q6, r5 -vmla.s16 Q6, Q1, r4 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r11,#(-232)] -vmla.s16 Q4, Q2, r4 -vmla.s16 Q4, Q3, r3 -vldrw.u32 Q0, [r2, #(4 * -26)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r2, #(4 * 78)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r7, #(4 * -70)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r7, #(4 * 34)] 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r10,#(-408)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(120)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [sp,#(296)] -vmla.s16 Q6, Q0, r4 -vstrw.u32 Q7, [r12,#(-40)] -vmla.s16 Q7, Q4, r5 -vmla.s16 Q4, Q1, r4 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r11,#(-216)] -vmla.s16 Q5, Q2, r4 -vmla.s16 Q5, Q3, r3 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r10,#(-392)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [sp,#(312)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(136)] -sub sp, sp, #504 -sub r1, r1, #504 -sub r2, r2, #504 -add r11, r1, #832 -add r10, r2, #832 -mov r9, r1 -mov r8, r2 -mov r7, r0 -add r1, sp, #(3328) -add r2, sp, #(3744) -add r0, sp, #(4160) -bl poly_u16_mul_208_C -add r1, sp, #(2496) -add r2, sp, #(2912) -add r0, sp, #(3328) -bl poly_u16_mul_208_C -add r1, sp, #(1664) -add r2, sp, #(2080) -add r0, sp, #(2496) -bl poly_u16_mul_208_C -add r1, sp, #(832) -add r2, sp, #(1248) -add r0, sp, #(1664) -bl poly_u16_mul_208_C -add r1, sp, #(0) -add r2, sp, #(416) -add r0, sp, #(832) -bl poly_u16_mul_208_C -add r1, r9, #(0) -add r2, r8, #(0) -add r0, sp, #(0) -bl poly_u16_mul_208_C -add r1, r11, #(416) -add r2, r10, #(416) -add r0, sp, #(4992) -bl poly_u16_mul_208_C -add sp, sp, #504 -add r14, sp, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #-64 -mov r6, #45 -mov r5, #-8 -mov r4, #43691 -mov r3, #16 -mov r2, #30 -mov r1, #61167 -mov r0, #-65 -vldrw.u32 Q0, [r10, #(4 * -94)] -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [sp, #(4 * 82)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r11, #(4 * -50)] -vldrw.u32 Q4, [r12, #(4 * -6)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [sp, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r10, #(4 * 114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r0 -vsub.u16 Q3, Q3, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r10, #(4 * -90)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r5 -vldrw.u32 Q5, [r12, #(4 * -2)] -vmla.s16 
Q0, Q3, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r3 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r11,#(-200)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * 42)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r2 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r12,#(-24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r1 -vstrw.u32 Q2, [sp,#(328)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r10,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -46)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 2)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 6)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 
Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -38)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 126)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 10)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 14)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 
-vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 18)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 22)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 
Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 26)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 30)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 
Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 34)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 78)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [sp, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -10)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 38)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 82)] -vadd.u16 Q3, Q3, Q4 
-vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [sp, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 42)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [sp,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 46)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 90)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(168)] 
-vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 50)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 94)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 6)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 54)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(376)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 98)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, 
[r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 10)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 58)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(392)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 102)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 62)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(408)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 106)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, 
[r10,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 66)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(424)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 110)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 70)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(440)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 114)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 
-94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 74)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 118)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 78)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 122)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 34)] 
-vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 82)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 126)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 86)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r14,#(504)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -122)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -34)] -vshr.u16 
Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r0 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q3, [r12, #(4 * 90)] -vmla.s16 Q6, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r12,#(-488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -118)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r2 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-8)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [sp, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r0 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 6)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r5 -vldrw.u32 Q5, [r12, #(4 * 94)] -vmla.s16 Q4, Q2, r6 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r4 -vstrw.u32 Q1, [r12,#(-472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r3 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r2 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r1 -vstrw.u32 Q0, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [sp, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -38)] -vsub.u16 Q1, Q1, Q4 
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 10)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 98)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vstrw.u32 Q1, [r12,#(-456)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(200)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vstrw.u32 Q0, [r14,#(-280)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(24)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -34)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(328)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 102)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 38)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 58)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -30)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 18)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 106)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * -2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-8)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 42)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 118)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(472)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 62)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -26)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 22)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 110)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(8)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 46)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(184)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 122)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(488)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 66)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -22)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 26)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 114)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(24)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -82)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 50)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(200)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 126)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(504)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 70)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -18)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 30)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(40)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -78)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 54)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(216)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -122)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-488)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 74)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * -2)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -14)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 34)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r12, #(4 * 122)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(56)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -74)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-296)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 58)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(232)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -118)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-472)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 78)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 2)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -10)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 38)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r12, #(4 * 126)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -70)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-280)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 62)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(248)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -114)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-456)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 82)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 6)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -6)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 42)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -122)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(88)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -66)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-264)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -22)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 66)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(264)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -110)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-440)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 86)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 10)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -2)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 46)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -118)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(104)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -62)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-248)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -18)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(280)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -106)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-424)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 90)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 14)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 2)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 50)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -114)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(120)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -58)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-232)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -14)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(296)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -102)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-408)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 94)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 18)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 6)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 54)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [sp,#(488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -110)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(136)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -54)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-216)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 78)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(312)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -98)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-392)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 98)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 22)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 10)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 58)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [sp, #(4 * 126)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [sp,#(504)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -106)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(152)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -50)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-200)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 82)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(328)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -94)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-376)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 102)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 26)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 14)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 62)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-488)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -102)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(168)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -46)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-184)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 86)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(344)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -90)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-360)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 106)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 30)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 18)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 66)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-472)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -98)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(184)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -42)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-168)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -54)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 90)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(360)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 110)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 34)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 22)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 70)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-456)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -94)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(200)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -38)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-152)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -50)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 94)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(376)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 114)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 38)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 26)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 74)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-440)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -90)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(216)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -34)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-136)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -46)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 98)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(392)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 118)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 42)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 30)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 78)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-424)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -86)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 58)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(232)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -30)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-120)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -42)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 102)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(408)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 122)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 46)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 34)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 82)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-408)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -82)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 62)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(248)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -26)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-104)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -38)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 106)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(424)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 126)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 50)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 38)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 86)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-392)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -78)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 66)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(264)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -22)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-88)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -34)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 110)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(440)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -122)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 54)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 42)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 90)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-376)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -74)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 70)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(280)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -18)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-72)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -30)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 114)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(456)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -62)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -118)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 58)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 46)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 94)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-360)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -70)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 74)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(296)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -14)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-56)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -26)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 118)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(472)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -58)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-232)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -114)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 62)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 50)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 98)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-344)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -66)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 78)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(312)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -10)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-40)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -22)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 34)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 122)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(488)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -54)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-216)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -110)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 66)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 54)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 102)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-328)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -62)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 82)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(328)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -6)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-24)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -18)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 38)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r14, #(4 * 126)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(504)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -50)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-200)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -106)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 70)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 58)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 106)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -78)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-312)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q3, [r11, #(4 * -58)]
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 86)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(344)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -2)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-8)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -14)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * 42)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r12, #(4 * -122)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-488)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -46)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-184)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * 30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -102)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [sp, #(4 * 74)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * 62)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r0
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 110)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -74)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-296)]
-vmla.s16 Q1, Q2, r5
-vldrw.u32 Q5, [r11, #(4 * -54)]
-vmla.s16 Q4, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q7, [r12, #(4 * 90)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(360)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * 2)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(8)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -10)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r2
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * 46)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q2, [r12, #(4 * -118)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r12,#(-472)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r9, #(4 * -42)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r9,#(-168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r10, #(4 * -98)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [sp, #(4 * 78)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * 66)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r0
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q4, [r14, #(4 * -70)]
-vadd.u16 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-280)]
-vmla.s16 Q1, Q2, r5
-vmla.s16 Q6, Q2, r6
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r4
-vldrw.u32 Q3, [r12, #(4 * 94)]
-vadd.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r12,#(376)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r3
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r10, #(4 * 6)]
-vadd.u16 Q3, Q3, Q2
-vstrw.u32 Q3, [r10,#(24)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r2
-vneg.s16 Q5, Q5
-vldrw.u32 Q1, [r11, #(4 * 50)]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r11,#(200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r1
-vldrw.u32 Q1, [r12, #(4 * -114)]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r12,#(-456)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -38)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-152)]
-sub sp, sp, #504
-add r14, sp, #0
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #416
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #416
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #416
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #416
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #416
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -add r14, #416 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], 
#+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 -vldrh.u16 Q0, [r14], #+16 -vstrh.u16 Q0, [r7], #+16 
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-vldrh.u16 Q0, [r14], #+16
-vstrh.u16 Q0, [r7], #+16
-add r14, #416
-add sp, sp, #5824
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom3_fwd_192.s b/tests/toom/auto/poly_u16_toom3_fwd_192.s
deleted file mode 100644
index 33845d5..0000000
--- a/tests/toom/auto/poly_u16_toom3_fwd_192.s
+++ /dev/null
@@ -1,100 +0,0 @@
-.syntax unified
-.type poly_u16_toom3_fwd_192_mve, %function
-.global poly_u16_toom3_fwd_192_mve
-poly_u16_toom3_fwd_192_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-mov r14, #-2
-mov r12, #3
-vldrw.u32 Q0, [r0, #(4 * -126)]
-vldrw.u32 Q1, [r0, #(4 * -62)]
-vadd.u16 Q2, Q0, Q1
-vldrw.u32 Q3, [r0, #(4 * -94)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r0,#(-376)]
-vmla.s16 Q2, Q3, r14
-vstrw.u32 Q2, [r0,#(-120)]
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r0, #(4 * -58)]
-vmla.s16 Q2, Q1, r12
-vstrw.u32 Q2, [r0,#(8)]
-vldrw.u32 Q0, [r0, #(4 * -122)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * -90)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-360)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-104)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r0, #(4 * -54)]
-vmla.s16 Q1, Q4, r12
-vstrw.u32 Q1, [r0,#(24)]
-vldrw.u32 Q0, [r0, #(4 * -118)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * -86)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-344)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-88)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r0, #(4 * -50)]
-vmla.s16 Q1, Q3, r12
-vstrw.u32 Q1, [r0,#(40)]
-vldrw.u32 Q0, [r0, #(4 * -114)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * -82)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-328)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-72)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r0, #(4 * -46)]
-vmla.s16 Q1, Q4, r12
-vstrw.u32 Q1, [r0,#(56)]
-vldrw.u32 Q0, [r0, #(4 * -110)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * -78)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-312)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-56)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r0, #(4 * -42)]
-vmla.s16 Q1, Q3, r12
-vstrw.u32 Q1, [r0,#(72)]
-vldrw.u32 Q0, [r0, #(4 * -106)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * -74)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-296)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-40)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r0, #(4 * -38)]
-vmla.s16 Q1, Q4, r12
-vstrw.u32 Q1, [r0,#(88)]
-vldrw.u32 Q0, [r0, #(4 * -102)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * -70)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-280)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-24)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r0, #(4 * -34)]
-vmla.s16 Q1, Q3, r12
-vstrw.u32 Q1, [r0,#(104)]
-vldrw.u32 Q0, [r0, #(4 * -98)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * -66)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(-264)]
-vmla.s16 Q1, Q2, r14
-vstrw.u32 Q1, [r0,#(-8)]
-vsub.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q4, r12
-vstrw.u32 Q1, [r0,#(120)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom3_fwd_768.s b/tests/toom/auto/poly_u16_toom3_fwd_768.s
deleted file mode 100644
index fd6f62f..0000000
--- a/tests/toom/auto/poly_u16_toom3_fwd_768.s
+++ /dev/null
@@ -1,366 +0,0 @@
-.syntax unified
-.type poly_u16_toom3_fwd_768_mve, %function
-.global poly_u16_toom3_fwd_768_mve
-poly_u16_toom3_fwd_768_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-mov r11, #-2
-mov r10, #3
-vldrw.u32 Q0, [r0, #(4 * -126)]
-vldrw.u32 Q1, [r14, #(4 * -122)]
-vadd.u16 Q2, Q0, Q1
-vldrw.u32 Q3, [r0, #(4 * 2)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r0,#(8)]
-vmla.s16 Q2, Q3, r11
-vstrw.u32 Q2, [r14,#(24)]
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r14, #(4 * -118)]
-vmla.s16 Q2, Q1, r10
-vstrw.u32 Q2, [r12,#(-472)]
-vldrw.u32 Q0, [r0, #(4 * -122)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 6)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(24)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(40)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -114)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-456)]
-vldrw.u32 Q0, [r0, #(4 * -118)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 10)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(40)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(56)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -110)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-440)]
-vldrw.u32 Q0, [r0, #(4 * -114)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 14)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(56)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(72)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -106)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-424)]
-vldrw.u32 Q0, [r0, #(4 * -110)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 18)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(72)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(88)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -102)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-408)]
-vldrw.u32 Q0, [r0, #(4 * -106)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 22)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(88)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(104)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -98)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-392)]
-vldrw.u32 Q0, [r0, #(4 * -102)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 26)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(104)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(120)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -94)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-376)]
-vldrw.u32 Q0, [r0, #(4 * -98)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 30)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(120)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(136)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -90)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-360)]
-vldrw.u32 Q0, [r0, #(4 * -94)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 34)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(136)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(152)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -86)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-344)] -vldrw.u32 Q0, [r0, #(4 * -90)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 38)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(152)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(168)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -82)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-328)] -vldrw.u32 Q0, [r0, #(4 * -86)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 42)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(168)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(184)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r14, #(4 * -78)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-312)] -vldrw.u32 Q0, [r0, #(4 * -82)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 46)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(184)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(200)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -74)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-296)] -vldrw.u32 Q0, [r0, #(4 * -78)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 50)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(200)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(216)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r14, #(4 * -70)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-280)] -vldrw.u32 Q0, [r0, #(4 * -74)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 54)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(216)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(232)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -66)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-264)] -vldrw.u32 Q0, [r0, #(4 * -70)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 58)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(232)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(248)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r14, #(4 * -62)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-248)] -vldrw.u32 Q0, [r0, #(4 * -66)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 62)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(248)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(264)] -vsub.u16 
Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -58)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-232)] -vldrw.u32 Q0, [r0, #(4 * -62)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 66)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(264)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(280)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r14, #(4 * -54)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-216)] -vldrw.u32 Q0, [r0, #(4 * -58)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(280)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(296)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -50)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-200)] -vldrw.u32 Q0, [r0, #(4 * -54)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 74)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(296)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(312)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r14, #(4 * -46)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-184)] -vldrw.u32 Q0, [r0, #(4 * -50)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 78)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(312)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(328)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -42)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-168)] -vldrw.u32 Q0, [r0, #(4 * -46)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 82)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(328)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(344)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q4, [r14, #(4 * -38)] -vmla.s16 Q1, Q3, r10 -vstrw.u32 Q1, [r12,#(-152)] -vldrw.u32 Q0, [r0, #(4 * -42)] -vadd.u16 Q1, Q0, Q4 -vldrw.u32 Q2, [r0, #(4 * 86)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(344)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(360)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q3, [r14, #(4 * -34)] -vmla.s16 Q1, Q4, r10 -vstrw.u32 Q1, [r12,#(-136)] -vldrw.u32 Q0, [r0, #(4 * -38)] -vadd.u16 Q1, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 90)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r0,#(360)] -vmla.s16 Q1, Q2, r11 -vstrw.u32 Q1, [r14,#(376)] 
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -30)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-120)]
-vldrw.u32 Q0, [r0, #(4 * -34)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 94)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(376)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(392)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -26)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-104)]
-vldrw.u32 Q0, [r0, #(4 * -30)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 98)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(392)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(408)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -22)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-88)]
-vldrw.u32 Q0, [r0, #(4 * -26)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 102)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(408)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(424)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -18)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-72)]
-vldrw.u32 Q0, [r0, #(4 * -22)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 106)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(424)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(440)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -14)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-56)]
-vldrw.u32 Q0, [r0, #(4 * -18)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 110)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(440)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(456)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -10)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-40)]
-vldrw.u32 Q0, [r0, #(4 * -14)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 114)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(456)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(472)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * -6)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(-24)]
-vldrw.u32 Q0, [r0, #(4 * -10)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 118)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(472)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(488)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q3, [r14, #(4 * -2)]
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(-8)]
-vldrw.u32 Q0, [r0, #(4 * -6)]
-vadd.u16 Q1, Q0, Q3
-vldrw.u32 Q2, [r0, #(4 * 122)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(488)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r14,#(504)]
-vsub.u16 Q1, Q1, Q2
-vldrw.u32 Q4, [r14, #(4 * 2)]
-vmla.s16 Q1, Q3, r10
-vstrw.u32 Q1, [r12,#(8)]
-vldrw.u32 Q0, [r0, #(4 * -2)]
-vadd.u16 Q1, Q0, Q4
-vldrw.u32 Q2, [r0, #(4 * 126)]
-vadd.u16 Q1, Q1, Q2
-vstrw.u32 Q1, [r0,#(504)]
-vmla.s16 Q1, Q2, r11
-vstrw.u32 Q1, [r12,#(-488)]
-vsub.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q4, r10
-vstrw.u32 Q1, [r12,#(24)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom3_inv_full_192.s b/tests/toom/auto/poly_u16_toom3_inv_full_192.s
deleted file mode 100644
index 396254d..0000000
--- a/tests/toom/auto/poly_u16_toom3_inv_full_192.s
+++ /dev/null
@@ -1,390 +0,0 @@
-.syntax unified
-.type poly_u16_toom3_inv_192_mve, %function
-.global poly_u16_toom3_inv_192_mve
-poly_u16_toom3_inv_192_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-add r14, r0, #1008
-mov r12, #43691
-mov r11, #2
-mov r10, #-1
-vldrw.u32 Q0, [r0, #(4 * -62)]
-vldrw.u32 Q1, [r0, #(4 * 66)]
-vsub.u16 Q1, Q1, Q0
-vmul.u16 Q1, Q1, r12
-vldrw.u32 Q2, [r0, #(4 * 2)]
-vsub.u16 Q0, Q0, Q2
-vldrw.u32 Q3, [r0, #(4 * -126)]
-vmla.s16 Q2, Q3, r10
-vshr.u16 Q0, Q0, #1
-vldrw.u32 Q4, [r0, #(4 * 70)]
-vsub.u16 Q1, Q2, Q1
-vldrw.u32 Q5, [r14, #(4 * -122)]
-vadd.u16 Q2, Q2, Q0
-vldrw.u32 Q6, [r0, #(4 * -58)]
-vshr.u16 Q1, Q1, #1
-vmla.s16 Q1, Q5, r11
-vstrw.u32 Q1, [r0,#(264)]
-vsub.u16 Q0, Q0, Q1
-vstrw.u32 Q0, [r0,#(-248)]
-vsub.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r0,#(8)]
-vsub.u16 Q4, Q4, Q6
-vmul.u16 Q4, Q4, r12
-vldrw.u32 Q0, [r0, #(4 * 6)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q1, [r0, #(4 * -122)]
-vmla.s16 Q0, Q1, r10
-vshr.u16 Q6, Q6, #1
-vldrw.u32 Q2, [r0, #(4 * 74)]
-vsub.u16 Q4, Q0, Q4
-vldrw.u32 Q3, [r14, #(4 * -118)]
-vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q4, Q4, #1 -vmla.s16 Q4, Q3, r11 -vstrw.u32 Q4, [r0,#(280)] -vsub.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-232)] -vsub.u16 Q0, Q0, Q3 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -118)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 78)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -114)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -50)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r11 -vstrw.u32 Q2, [r0,#(296)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(-216)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -114)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * 82)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r11 -vstrw.u32 Q3, [r0,#(312)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(-200)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -110)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 86)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -106)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -42)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r11 -vstrw.u32 Q2, [r0,#(328)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(-184)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -106)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * 90)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -102)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r11 
-vstrw.u32 Q3, [r0,#(344)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(-168)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -102)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 94)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -98)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -34)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r11 -vstrw.u32 Q2, [r0,#(360)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(-152)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -98)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * 98)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -94)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r11 -vstrw.u32 Q3, [r0,#(376)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(-136)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -94)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 102)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -90)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -26)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r0, #(4 * -62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-248)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-488)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r0, #(4 * 2)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r0,#(8)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 66)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(264)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -90)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * 106)] -vsub.u16 
Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -86)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r0, #(4 * -58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-232)] -vmla.s16 Q3, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-472)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r0, #(4 * 6)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r0,#(24)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 70)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(280)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -86)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 110)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -82)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -18)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r0, #(4 * -54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-216)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-456)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r0, #(4 * 10)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r0,#(40)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 74)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(296)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -82)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * 114)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -78)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r0, #(4 * -50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-200)] -vmla.s16 Q3, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-440)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r0, #(4 * 14)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r0,#(56)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 78)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(312)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, 
#(4 * 50)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -78)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 118)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -74)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -10)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r0, #(4 * -46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-184)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-424)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r0, #(4 * 18)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r0,#(72)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 82)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(328)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -74)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * 122)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r14, #(4 * -70)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r0, #(4 * -42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-168)] -vmla.s16 Q3, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-408)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r0, #(4 * 22)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r0,#(88)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(344)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r12 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -70)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * 126)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r14, #(4 * -66)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -2)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r0, #(4 * -38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(-152)] -vmla.s16 Q2, Q4, r11 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r14,#(-392)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r0, #(4 * 26)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r0,#(104)] -vsub.u16 Q0, 
Q0, Q4 -vldrw.u32 Q1, [r0, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(360)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r12 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -66)] -vmla.s16 Q0, Q1, r10 -vshr.u16 Q6, Q6, #1 -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q2, [r14, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q4, [r0, #(4 * -34)] -vadd.u16 Q4, Q4, Q1 -vstrw.u32 Q4, [r0,#(-136)] -vmla.s16 Q3, Q2, r11 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r14,#(-376)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r0,#(120)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q1, [r0, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(376)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom3_inv_full_768.s b/tests/toom/auto/poly_u16_toom3_inv_full_768.s deleted file mode 100644 index 5dd6946..0000000 --- a/tests/toom/auto/poly_u16_toom3_inv_full_768.s +++ /dev/null @@ -1,1522 +0,0 @@ -.syntax unified -.type poly_u16_toom3_inv_768_mve, %function -.global poly_u16_toom3_inv_768_mve -poly_u16_toom3_inv_768_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -add r10, r11, #1008 -add r9, r10, #1008 -mov r8, #43691 -mov r7, #2 -mov r6, #-1 -vldrw.u32 Q0, [r14, #(4 * -122)] -vldrw.u32 Q1, [r11, #(4 * -114)] -vsub.u16 Q1, Q1, Q0 -vmul.u16 Q1, Q1, r8 -vldrw.u32 Q2, [r12, #(4 * -118)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q3, [r0, #(4 * -126)] -vmla.s16 Q2, Q3, r6 -vshr.u16 Q0, Q0, #1 -vldrw.u32 Q4, [r11, #(4 * -110)] -vsub.u16 Q1, Q2, Q1 -vldrw.u32 Q5, [r10, #(4 * -110)] -vadd.u16 Q2, Q2, Q0 -vldrw.u32 Q6, [r14, #(4 * -118)] -vshr.u16 Q1, Q1, #1 -vmla.s16 Q1, Q5, r7 -vstrw.u32 Q1, [r11,#(-456)] -vsub.u16 Q0, Q0, Q1 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-472)] -vsub.u16 Q4, Q4, Q6 -vmul.u16 Q4, Q4, r8 -vldrw.u32 Q0, [r12, #(4 * -114)] 
-vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -122)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -106)] -vsub.u16 Q4, Q0, Q4 -vldrw.u32 Q3, [r10, #(4 * -106)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -114)] -vshr.u16 Q4, Q4, #1 -vmla.s16 Q4, Q3, r7 -vstrw.u32 Q4, [r11,#(-440)] -vsub.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-472)] -vsub.u16 Q0, Q0, Q3 -vstrw.u32 Q0, [r12,#(-456)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -110)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -118)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -102)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -102)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -110)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-424)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-456)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-440)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -106)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -114)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -98)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -98)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -106)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-408)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-440)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-424)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -102)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -110)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -94)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -94)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -102)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-392)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-424)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-408)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -98)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -106)] 
-vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -90)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -90)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -98)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-376)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-408)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-392)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -94)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -102)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -86)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -94)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-360)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-392)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-376)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -90)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -98)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -82)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -82)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -90)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-344)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-376)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-360)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -86)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -94)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -78)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -86)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-328)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-360)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-344)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -82)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -90)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 
-74)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -74)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -82)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-312)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-344)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-328)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -78)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -86)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -70)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -70)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -78)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-296)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-328)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-312)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -74)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -82)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -66)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -66)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -74)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-280)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-312)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-296)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -70)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -78)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -62)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -70)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-264)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-296)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-280)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -66)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -74)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -58)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -58)] -vadd.u16 
Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -66)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-248)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-280)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-264)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -62)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -70)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -54)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -62)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-232)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-264)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-248)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -58)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -66)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -50)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -50)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -58)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-216)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-248)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-232)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -54)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -62)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -46)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -46)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-200)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-232)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-216)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -50)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -58)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -42)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -50)] -vshr.u16 Q3, Q3, #1 
-vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-184)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-216)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-200)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -46)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -54)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -38)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -38)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-168)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-200)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-184)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -42)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -50)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -34)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -42)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-152)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-184)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-168)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -38)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -46)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -30)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -30)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-136)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-168)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-152)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -34)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -42)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -26)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -34)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-120)] -vsub.u16 Q6, Q6, Q3 
-vstrw.u32 Q6, [r14,#(-152)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-136)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -30)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -38)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -22)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -22)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-104)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-136)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-120)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -26)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -34)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -18)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -26)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-88)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-120)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-104)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -22)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -30)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -14)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -14)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-72)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-104)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-88)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -18)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -26)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -10)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -10)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -18)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-56)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-88)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, 
[r12,#(-72)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -14)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -22)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * -6)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * -6)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-40)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-72)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-56)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -10)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -18)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * -2)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * -2)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -10)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(-24)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-56)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-40)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * -6)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -14)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 2)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 2)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(-8)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-40)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-24)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * -2)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -10)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 6)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 6)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * -2)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(8)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(-24)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(-8)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 2)] -vsub.u16 Q5, Q5, 
Q0 -vldrw.u32 Q1, [r0, #(4 * -6)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 10)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 10)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r7 -vstrw.u32 Q2, [r11,#(24)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r14,#(-8)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(8)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 6)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -2)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 14)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 14)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 6)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r7 -vstrw.u32 Q3, [r11,#(40)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r14,#(8)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r12,#(24)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 10)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 2)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 18)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 18)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 10)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -122)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -110)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-440)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -118)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-472)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-456)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 14)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 6)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 22)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 22)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -118)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q3, Q4, 
r7 -vldrw.u32 Q1, [r10, #(4 * -106)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-424)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-456)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -110)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-440)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 18)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 10)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 26)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 26)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 18)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -114)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -102)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-408)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -110)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-440)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -106)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-424)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 22)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 14)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 30)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 30)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -110)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -98)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-392)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -106)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-424)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-408)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 26)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 18)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 34)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 
34)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 26)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -106)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -94)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-376)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -102)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-408)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-392)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 30)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 22)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 38)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 38)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -102)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -90)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-360)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -98)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-392)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-376)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 34)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 26)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 42)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 42)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 34)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -98)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -86)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-344)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -94)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-376)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-360)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 38)] 
-vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 30)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 46)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 46)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -94)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-328)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -90)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-360)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-344)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 42)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 34)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 50)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 50)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 42)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-360)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-312)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -86)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-344)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-328)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 46)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 38)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 54)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 54)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-344)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-296)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -82)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-328)] -vsub.u16 
Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-312)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 50)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 42)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 58)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 58)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 50)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -70)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-280)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -78)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-312)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-296)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 54)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 46)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 62)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 62)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-264)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -74)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-296)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-280)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 58)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 50)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 66)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 66)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 58)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -74)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-296)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * 
-62)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-248)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -70)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-280)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-264)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 62)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 54)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 70)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 70)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -70)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-280)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -58)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-232)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -66)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-264)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-248)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 66)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 58)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 74)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 74)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 66)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -66)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-264)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-216)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -62)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-248)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-232)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 70)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 62)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 78)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 78)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, 
[r14, #(4 * 70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-248)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -50)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-200)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-232)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-216)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 74)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 66)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 82)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 82)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 74)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-232)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -46)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-184)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -54)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-216)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-200)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 78)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 70)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 86)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 86)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-216)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -42)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-168)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -50)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-200)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-184)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 82)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 
74)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 90)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 90)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 82)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-200)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -38)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-152)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -46)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-184)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-168)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 86)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 78)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 94)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 94)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-184)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -34)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-136)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -42)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-168)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-152)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 90)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 82)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 98)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 98)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 90)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-168)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -30)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-120)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -38)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-152)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -34)] 
-vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-136)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 94)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 86)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 102)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 102)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-152)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -26)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-104)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -34)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-136)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-120)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 98)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 90)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 106)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 106)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 98)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-136)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -22)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-88)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -30)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-120)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-104)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 102)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 94)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 110)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 110)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-120)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -18)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 
Q1, [r10,#(-72)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -26)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-104)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-88)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 106)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 98)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 114)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 114)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 106)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-104)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -14)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-56)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -22)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-88)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-72)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 110)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 102)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 118)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 118)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-88)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -10)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-40)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -18)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-72)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-56)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 114)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 106)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r11, #(4 * 122)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r10, #(4 * 122)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 114)] -vshr.u16 Q2, Q2, 
#1 -vldrw.u32 Q7, [r14, #(4 * -18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-72)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -6)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(-24)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -14)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-56)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-40)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 118)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 110)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r11, #(4 * 126)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r10, #(4 * 126)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-56)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * -2)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(-8)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -10)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-40)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-24)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r12, #(4 * 122)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 114)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r10, #(4 * -122)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r9, #(4 * -122)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r14, #(4 * 122)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-40)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * 2)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(8)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * -6)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(-24)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(-8)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r12, #(4 * 126)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 118)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 
-vldrw.u32 Q2, [r10, #(4 * -118)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r9, #(4 * -118)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r14, #(4 * 126)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q7, [r14, #(4 * -6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-24)] -vmla.s16 Q3, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * 6)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(24)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * -2)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(-8)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(8)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r8 -vldrw.u32 Q0, [r11, #(4 * -122)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * 122)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r10, #(4 * -114)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r9, #(4 * -114)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r12, #(4 * -122)] -vshr.u16 Q2, Q2, #1 -vldrw.u32 Q7, [r14, #(4 * -2)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(-8)] -vmla.s16 Q2, Q4, r7 -vldrw.u32 Q1, [r10, #(4 * 10)] -vadd.u16 Q1, Q1, Q2 -vstrw.u32 Q1, [r10,#(40)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q1, [r12, #(4 * 2)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r12,#(8)] -vsub.u16 Q0, Q0, Q4 -vldrw.u32 Q1, [r11, #(4 * 6)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(24)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r8 -vldrw.u32 Q0, [r11, #(4 * -118)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * 126)] -vmla.s16 Q0, Q1, r6 -vshr.u16 Q6, Q6, #1 -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q2, [r9, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q4, [r14, #(4 * 2)] -vadd.u16 Q4, Q4, Q1 -vstrw.u32 Q4, [r14,#(8)] -vmla.s16 Q3, Q2, r7 -vldrw.u32 Q1, [r10, #(4 * 14)] -vadd.u16 Q1, Q1, Q3 -vstrw.u32 Q1, [r10,#(56)] -vsub.u16 Q6, Q6, Q3 -vldrw.u32 Q1, [r12, #(4 * 6)] -vadd.u16 Q1, Q1, Q6 -vstrw.u32 Q1, [r12,#(24)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q1, [r11, #(4 * 10)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r11,#(40)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git 
a/tests/toom/auto/poly_u16_toom3_inv_half_192.s b/tests/toom/auto/poly_u16_toom3_inv_half_192.s deleted file mode 100644 index 70a7f09..0000000 --- a/tests/toom/auto/poly_u16_toom3_inv_half_192.s +++ /dev/null @@ -1,165 +0,0 @@ -.syntax unified -.type poly_u16_toom3_inv_half_192_mve, %function -.global poly_u16_toom3_inv_half_192_mve -poly_u16_toom3_inv_half_192_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -mov r14, #43691 -mov r12, #2 -mov r11, #-1 -vldrw.u32 Q0, [r0, #(4 * -94)] -vldrw.u32 Q1, [r0, #(4 * -30)] -vsub.u16 Q1, Q1, Q0 -vmul.u16 Q1, Q1, r14 -vldrw.u32 Q2, [r0, #(4 * -62)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q3, [r0, #(4 * -126)] -vmla.s16 Q2, Q3, r11 -vshr.u16 Q0, Q0, #1 -vldrw.u32 Q4, [r0, #(4 * -26)] -vsub.u16 Q1, Q2, Q1 -vldrw.u32 Q5, [r0, #(4 * 2)] -vadd.u16 Q2, Q2, Q0 -vldrw.u32 Q6, [r0, #(4 * -90)] -vshr.u16 Q1, Q1, #1 -vmla.s16 Q1, Q5, r12 -vstrw.u32 Q1, [r0,#(-120)] -vsub.u16 Q0, Q0, Q1 -vstrw.u32 Q0, [r0,#(-376)] -vsub.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q4, Q4, Q6 -vmul.u16 Q4, Q4, r14 -vldrw.u32 Q0, [r0, #(4 * -58)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -122)] -vmla.s16 Q0, Q1, r11 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * -22)] -vsub.u16 Q4, Q0, Q4 -vldrw.u32 Q3, [r0, #(4 * 6)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q4, Q4, #1 -vmla.s16 Q4, Q3, r12 -vstrw.u32 Q4, [r0,#(-104)] -vsub.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-360)] -vsub.u16 Q0, Q0, Q3 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r14 -vldrw.u32 Q0, [r0, #(4 * -54)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -118)] -vmla.s16 Q0, Q1, r11 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * -18)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r0, #(4 * 10)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -82)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r12 -vstrw.u32 Q2, [r0,#(-88)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(-344)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, 
r14 -vldrw.u32 Q0, [r0, #(4 * -50)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -114)] -vmla.s16 Q0, Q1, r11 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * -14)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r0, #(4 * 14)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r12 -vstrw.u32 Q3, [r0,#(-72)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(-328)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r14 -vldrw.u32 Q0, [r0, #(4 * -46)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -110)] -vmla.s16 Q0, Q1, r11 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * -10)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r0, #(4 * 18)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -74)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r12 -vstrw.u32 Q2, [r0,#(-56)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(-312)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(-184)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r14 -vldrw.u32 Q0, [r0, #(4 * -42)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -106)] -vmla.s16 Q0, Q1, r11 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r0, #(4 * -6)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r0, #(4 * 22)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r12 -vstrw.u32 Q3, [r0,#(-40)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(-296)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(-168)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r14 -vldrw.u32 Q0, [r0, #(4 * -38)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -102)] -vmla.s16 Q0, Q1, r11 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r0, #(4 * -2)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r0, #(4 * 26)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * -66)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r12 -vstrw.u32 Q2, [r0,#(-24)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(-280)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r14 -vldrw.u32 Q0, [r0, #(4 * -34)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -98)] 
-vmla.s16 Q0, Q1, r11 -vshr.u16 Q6, Q6, #1 -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q2, [r0, #(4 * 30)] -vadd.u16 Q0, Q0, Q6 -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q2, r12 -vstrw.u32 Q3, [r0,#(-8)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(-264)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r0,#(-136)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom3_inv_half_768.s b/tests/toom/auto/poly_u16_toom3_inv_half_768.s deleted file mode 100644 index 0e953ea..0000000 --- a/tests/toom/auto/poly_u16_toom3_inv_half_768.s +++ /dev/null @@ -1,623 +0,0 @@ -.syntax unified -.type poly_u16_toom3_inv_half_768_mve, %function -.global poly_u16_toom3_inv_half_768_mve -poly_u16_toom3_inv_half_768_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #43691 -mov r10, #2 -mov r9, #-1 -vldrw.u32 Q0, [r0, #(4 * 2)] -vldrw.u32 Q1, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, Q0 -vmul.u16 Q1, Q1, r11 -vldrw.u32 Q2, [r14, #(4 * -122)] -vsub.u16 Q0, Q0, Q2 -vldrw.u32 Q3, [r0, #(4 * -126)] -vmla.s16 Q2, Q3, r9 -vshr.u16 Q0, Q0, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q2, Q1 -vldrw.u32 Q5, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vldrw.u32 Q6, [r0, #(4 * 6)] -vshr.u16 Q1, Q1, #1 -vmla.s16 Q1, Q5, r10 -vstrw.u32 Q1, [r14,#(24)] -vsub.u16 Q0, Q0, Q1 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q4, Q4, Q6 -vmul.u16 Q4, Q4, r11 -vldrw.u32 Q0, [r14, #(4 * -118)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -122)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 14)] -vsub.u16 Q4, Q0, Q4 -vldrw.u32 Q3, [r12, #(4 * -114)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 10)] -vshr.u16 Q4, Q4, #1 -vmla.s16 Q4, Q3, r10 -vstrw.u32 Q4, [r14,#(40)] -vsub.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(24)] -vsub.u16 Q0, Q0, Q3 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -114)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, 
[r0, #(4 * -118)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 18)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -110)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 14)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(56)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(40)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-456)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -110)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -114)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 22)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -106)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 18)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(72)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(56)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -106)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -110)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 26)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -102)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 22)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(88)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(72)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -102)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -106)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 30)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -98)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 26)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(104)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(88)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -98)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -102)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 
* 34)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -94)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 30)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(120)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(104)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -94)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -98)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 38)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -90)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 34)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(136)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(120)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -90)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -94)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 42)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -86)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 38)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(152)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(136)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-360)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -86)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -90)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 46)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -82)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 42)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(168)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(152)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-344)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -82)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -86)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 50)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -78)] -vadd.u16 Q0, Q0, Q5 
-vldrw.u32 Q6, [r0, #(4 * 46)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(168)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-328)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -78)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -82)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 54)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -74)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 50)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(200)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(184)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-312)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -74)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -78)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 58)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -70)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 54)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(200)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-296)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -70)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -74)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 62)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -66)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 58)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(232)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(216)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-280)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -66)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -70)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 66)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -62)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 62)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 
Q2, [r14,#(248)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(232)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-264)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -62)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -66)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 70)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -58)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 66)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(264)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(248)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-248)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -58)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -62)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 74)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -54)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 70)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(280)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(264)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-232)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -54)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -58)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 78)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -50)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 74)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(296)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(280)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-216)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -50)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -54)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 82)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -46)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 78)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(296)] -vsub.u16 Q0, Q0, Q4 
-vstrw.u32 Q0, [r14,#(-200)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -46)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -50)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 86)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -42)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 82)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(328)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(312)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-184)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -42)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -46)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 90)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -38)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 86)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(328)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-168)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -38)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -42)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 94)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -34)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 90)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(360)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(344)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-152)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -34)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -38)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 98)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -30)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 94)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(360)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-136)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, 
[r14, #(4 * -30)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -34)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 102)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -26)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 98)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(392)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(376)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-120)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -26)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -30)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 106)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -22)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 102)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(408)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(392)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-104)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -22)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -26)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 110)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -18)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 106)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(424)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(408)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-88)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -18)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -22)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 114)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -14)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 110)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(440)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(424)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-72)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -14)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -18)] -vmla.s16 
Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 118)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -10)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 114)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(456)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(440)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-56)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -10)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -14)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r14, #(4 * 122)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * -6)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 118)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(472)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(456)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-40)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * -6)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -10)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vldrw.u32 Q2, [r14, #(4 * 126)] -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q4, [r12, #(4 * -2)] -vadd.u16 Q0, Q0, Q6 -vldrw.u32 Q5, [r0, #(4 * 122)] -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q4, r10 -vstrw.u32 Q3, [r14,#(488)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(472)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-24)] -vsub.u16 Q2, Q2, Q5 -vmul.u16 Q2, Q2, r11 -vldrw.u32 Q0, [r14, #(4 * -2)] -vsub.u16 Q5, Q5, Q0 -vldrw.u32 Q1, [r0, #(4 * -6)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q3, [r12, #(4 * -122)] -vsub.u16 Q2, Q0, Q2 -vldrw.u32 Q4, [r12, #(4 * 2)] -vadd.u16 Q0, Q0, Q5 -vldrw.u32 Q6, [r0, #(4 * 126)] -vshr.u16 Q2, Q2, #1 -vmla.s16 Q2, Q4, r10 -vstrw.u32 Q2, [r14,#(504)] -vsub.u16 Q5, Q5, Q2 -vstrw.u32 Q5, [r0,#(488)] -vsub.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(-8)] -vsub.u16 Q3, Q3, Q6 -vmul.u16 Q3, Q3, r11 -vldrw.u32 Q0, [r14, #(4 * 2)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q1, [r0, #(4 * -2)] -vmla.s16 Q0, Q1, r9 -vshr.u16 Q6, Q6, #1 -vsub.u16 Q3, Q0, Q3 -vldrw.u32 Q2, [r12, #(4 * 6)] -vadd.u16 
Q0, Q0, Q6 -vshr.u16 Q3, Q3, #1 -vmla.s16 Q3, Q2, r10 -vstrw.u32 Q3, [r12,#(-488)] -vsub.u16 Q6, Q6, Q3 -vstrw.u32 Q6, [r0,#(504)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(8)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256.s b/tests/toom/auto/poly_u16_toom4_fwd_256.s deleted file mode 100644 index 862a5b9..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_256.s +++ /dev/null @@ -1,182 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_256_mve, %function -.global poly_u16_toom4_fwd_256_mve -poly_u16_toom4_fwd_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-240)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-224)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q5, Q6, 
r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-208)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-192)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r11 -vmla.s16 Q5, Q1, r10 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-304)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-176)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r11 -vmla.s16 Q7, Q1, r10 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-288)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-160)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(336)] -vsub.u16 Q5, Q7, Q6 
-vstrw.u32 Q4, [r0,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r11 -vmla.s16 Q6, Q1, r10 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-272)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-144)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r11 -vmla.s16 Q4, Q1, r10 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-256)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-128)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(368)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_bottom.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_bottom.s deleted file mode 100644 index cdce8ae..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_bottom.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_bottom_256_mve, %function -.global poly_u16_toom4_fwd_dual_bottom_256_mve -poly_u16_toom4_fwd_dual_bottom_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #-384 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] 
-vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(-48)] -vmla.s16 
Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, 
[r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s deleted file mode 100644 index 500ab41..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x1_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(144)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-240)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-48)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(384)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(400)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(176)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-208)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-16)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(416)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-192)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(0)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(432)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(240)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(208)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(16)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(448)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(256)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(464)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(272)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(240)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-144)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2
-vstrw.u32 Q4, [r14,#(48)]
-vadd.u16 Q4, Q1, Q3
-vstrw.u32 Q7, [r1,#(480)]
-vsub.u16 Q7, Q6, Q4
-vstrw.u32 Q5, [r1,#(288)]
-vmla.s16 Q6, Q0, r10
-vstrw.u32 Q7, [r14,#(-320)]
-vmla.s16 Q7, Q4, r11
-vstrw.u32 Q0, [r1,#(112)]
-vmla.s16 Q4, Q1, r10
-vstrw.u32 Q3, [r14,#(256)]
-vadd.u16 Q5, Q7, Q1
-vstrw.u32 Q7, [r14,#(-128)]
-vmla.s16 Q5, Q2, r10
-vmla.s16 Q5, Q3, r9
-vshl.u16 Q6, Q6, #1
-vstrw.u32 Q5, [r14,#(64)]
-vsub.u16 Q5, Q6, Q4
-vstrw.u32 Q5, [r1,#(304)]
-vadd.u16 Q6, Q6, Q4
-vstrw.u32 Q6, [r1,#(496)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s
deleted file mode 100644
index ab5d324..0000000
--- a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s
+++ /dev/null
@@ -1,199 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve, %function
-.global poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve
-poly_u16_toom4_fwd_dual_packed_limbs_karatsuba_x2_oop_256_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r14, r1, #1008
-add r12, r14, #1008
-mov r11, #1
-mov r10, #2
-mov r9, #3
-mov r8, #7
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q4, Q0, Q2
-vadd.u16 Q5, Q1, Q3
-vsub.u16 Q6, Q4, Q5
-vmla.s16 Q4, Q0, r9
-vstrw.u32 Q6, [r14,#(-144)]
-vmla.s16 Q6, Q5, r10
-vstrw.u32 Q0, [r1,#(0)]
-vmla.s16 Q5, Q1, r9
-vstrw.u32 Q3, [r12,#(-288)]
-vadd.u16 Q7, Q6, Q1
-vstrw.u32 Q6, [r14,#(144)]
-vmla.s16 Q7, Q2, r9
-vmla.s16 Q7, Q3, r8
-vld40.u16 {Q0, Q1, Q2, Q3}, [r0]
-vshl.u16 Q4, Q4, #1
-vld41.u16 {Q0, Q1, Q2, Q3}, [r0]
-vsub.u16 Q6, Q4, Q5
-vld42.u16 {Q0, Q1, Q2, Q3}, [r0]
-vadd.u16 Q4, Q4, Q5
-vld43.u16 {Q0, Q1, Q2, Q3}, [r0]!
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(432)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(288)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-128)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r12,#(-272)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(304)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-112)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r12,#(-256)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(464)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(320)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-96)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r12,#(-240)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(336)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-48)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r12,#(-192)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(240)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-336)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-32)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r12,#(-176)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(256)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-320)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-16)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(128)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r12,#(-160)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(272)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-304)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(0)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(144)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r12,#(-144)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(288)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-432)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(432)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-288)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s deleted file mode 100644 index e9a6bb3..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve -poly_u16_toom4_fwd_dual_packed_limbs_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s deleted file mode 100644 index 9a2599d..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_packed_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_packed_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_packed_oop_256_mve -poly_u16_toom4_fwd_dual_packed_oop_256_mve: -push {r4-r11,lr} -vpush {d0-d15} -add r14, r1, #1008 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d0-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_top.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_top.s deleted file mode 100644 index 27ca088..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_top.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_top_256_mve, %function -.global poly_u16_toom4_fwd_dual_top_256_mve -poly_u16_toom4_fwd_dual_top_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #512 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, 
r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r0,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r0,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] 
-vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r0,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r0,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r0,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r0,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r0,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r0,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r0], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r0,#(-32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_top_oop.s b/tests/toom/auto/poly_u16_toom4_fwd_256_dual_top_oop.s deleted file mode 100644 index a076a12..0000000 --- 
a/tests/toom/auto/poly_u16_toom4_fwd_256_dual_top_oop.s +++ /dev/null @@ -1,198 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_dual_top_oop_256_mve, %function -.global poly_u16_toom4_fwd_dual_top_oop_256_mve -poly_u16_toom4_fwd_dual_top_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #512 -mov r12, #1 -mov r11, #2 -mov r10, #3 -mov r9, #7 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q6, Q6, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q7, Q6, Q4 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q6, Q6, Q4 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-32)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(-32)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(-48)] -vmla.s16 Q4, Q0, r10 -vstrw.u32 Q6, [r1,#(48)] -vmla.s16 Q6, Q5, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q5, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14], #48 -vmla.s16 Q7, Q2, r10 -vmla.s16 Q7, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q4, Q4, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q6, Q4, Q5 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q4, Q4, Q5 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-32)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(-32)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(-48)] -vmla.s16 Q5, Q0, r10 -vstrw.u32 Q4, [r1,#(48)] -vmla.s16 Q4, Q7, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q7, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14], #48 -vmla.s16 Q6, Q2, r10 -vmla.s16 Q6, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q5, Q5, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q4, Q5, Q7 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q5, Q5, Q7 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(-48)] -vmla.s16 Q7, Q0, r10 -vstrw.u32 Q5, [r1,#(48)] -vmla.s16 Q5, Q6, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q6, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14], #48 -vmla.s16 Q4, Q2, r10 -vmla.s16 Q4, Q3, r9 -vld40.u16 {Q0, Q1, Q2, Q3}, [r0] -vshl.u16 Q7, Q7, #1 -vld41.u16 {Q0, Q1, Q2, Q3}, [r0] -vsub.u16 Q5, Q7, Q6 -vld42.u16 {Q0, Q1, Q2, Q3}, [r0] -vadd.u16 Q7, Q7, Q6 -vld43.u16 {Q0, Q1, Q2, Q3}, [r0]! 
-vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-32)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(-32)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vmla.s16 Q6, Q0, r10 -vstrw.u32 Q7, [r1,#(48)] -vmla.s16 Q7, Q4, r11 -vstrw.u32 Q0, [r1], #64 -vmla.s16 Q4, Q1, r10 -vstrw.u32 Q3, [r14,#(32)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14], #48 -vmla.s16 Q5, Q2, r10 -vmla.s16 Q5, Q3, r9 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(-48)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(-32)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_512.s b/tests/toom/auto/poly_u16_toom4_fwd_512.s deleted file mode 100644 index c2b43fe..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_512.s +++ /dev/null @@ -1,351 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_512_mve, %function -.global poly_u16_toom4_fwd_512_mve -poly_u16_toom4_fwd_512_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 64)] -vldrw.u32 Q2, [r14, #(4 * -124)] -vldrw.u32 Q3, [r14, #(4 * -60)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(16)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(272)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 68)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -120)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -56)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-496)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(256)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(32)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(288)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 8)] 
-vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 72)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -116)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -52)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-480)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(272)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(48)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(304)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 76)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -112)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * -48)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-464)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(288)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(64)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(320)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 80)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -108)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * -44)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-432)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-448)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(304)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(80)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(336)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 84)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -104)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -40)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-416)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(320)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(96)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 
Q4, [r14,#(352)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 88)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -100)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -36)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-400)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(336)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(112)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(368)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 92)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -96)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * -32)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-384)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(352)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(128)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(384)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 32)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 96)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -92)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * -28)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-368)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(368)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(400)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 36)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -88)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -24)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-352)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-368)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(384)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 
Q4, [r14,#(160)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(416)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 40)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -84)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -20)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-336)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-352)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(400)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(432)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 44)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -80)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * -16)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-320)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-336)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(416)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(448)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 48)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -76)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * -12)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(-304)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-320)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(432)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(208)] -vmla.s16 Q6, Q5, r10 -vmla.s16 Q5, Q1, r9 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(464)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 52)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -72)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * -8)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-288)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, 
[r14,#(-304)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(448)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(224)] -vmla.s16 Q4, Q7, r10 -vmla.s16 Q7, Q1, r9 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(480)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 56)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -68)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * -4)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-272)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-288)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(464)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(240)] -vmla.s16 Q5, Q6, r10 -vmla.s16 Q6, Q1, r9 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(496)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 60)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -64)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 0)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-256)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-272)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(480)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(256)] -vmla.s16 Q7, Q4, r10 -vmla.s16 Q4, Q1, r9 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-496)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-240)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r0,#(496)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-256)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_768.s b/tests/toom/auto/poly_u16_toom4_fwd_768.s deleted file mode 100644 index 03ea09d..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_768.s +++ /dev/null @@ -1,520 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_768_mve, %function -.global poly_u16_toom4_fwd_768_mve -poly_u16_toom4_fwd_768_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 
-vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 96)] -vldrw.u32 Q2, [r14, #(4 * -60)] -vldrw.u32 Q3, [r14, #(4 * 36)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-480)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-96)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -56)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 40)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(288)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-240)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(384)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-464)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-80)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 104)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -52)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 44)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(304)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-224)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(400)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-448)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(-64)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -48)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 48)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(320)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-208)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(416)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-432)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(-48)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, 
[r0, #(4 * 112)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -44)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 52)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(336)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-192)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(432)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-416)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(-32)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 56)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(352)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-176)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(448)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-400)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(-16)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 60)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(368)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(464)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-384)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(0)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 64)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(384)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-144)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(480)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-368)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(16)] -vmla.s16 Q5, Q2, r8 
-vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -28)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 68)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(400)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(496)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-352)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(32)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 36)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 72)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(416)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-112)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-336)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(48)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 40)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -20)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 76)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(432)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-320)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(64)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 44)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 80)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-80)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-304)] -vmla.s16 Q7, Q4, 
r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(80)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 48)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -12)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 84)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r12,#(464)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-288)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(96)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 52)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 88)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-48)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-272)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(112)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 56)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -4)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 92)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(496)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-256)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(128)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 60)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 0)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 96)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-16)] -vsub.u16 Q7, Q6, Q4 
-vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-240)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(144)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 4)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 100)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(0)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-224)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(160)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 68)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 104)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-464)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(16)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-208)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(176)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 72)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 12)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 108)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-192)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(192)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 76)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 112)] -vadd.u16 Q6, Q0, Q2 
-vstrw.u32 Q4, [r11,#(-432)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(48)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-176)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(208)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 80)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 20)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 116)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-416)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-160)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(224)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 84)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 120)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-400)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(80)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-304)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-144)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(240)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 88)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 28)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 124)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-384)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-288)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-128)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(256)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 92)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, 
[r14, #(4 * 32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #(4 * -124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-368)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(112)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-272)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-112)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(272)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r11,#(-352)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r14,#(-256)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(128)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_832.s b/tests/toom/auto/poly_u16_toom4_fwd_832.s deleted file mode 100644 index dc90468..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_832.s +++ /dev/null @@ -1,562 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_832_mve, %function -.global poly_u16_toom4_fwd_832_mve -poly_u16_toom4_fwd_832_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 104)] -vldrw.u32 Q2, [r14, #(4 * -44)] -vldrw.u32 Q3, [r14, #(4 * 60)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-352)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(64)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 108)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -40)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 64)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-176)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(416)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-336)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(80)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, 
Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 112)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -36)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 68)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(496)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(432)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-320)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(96)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 72)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-496)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-144)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r0,#(448)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-304)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(112)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 120)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -28)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 76)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r0,#(464)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-288)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(128)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 124)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 80)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-464)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-112)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r0,#(480)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-272)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, 
r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(144)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -124)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -20)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 84)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r0,#(496)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-256)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(160)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -120)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * -16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 88)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-432)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-80)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-496)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-240)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(176)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -116)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * -12)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 92)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-416)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-480)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-224)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(192)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 36)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -112)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * -8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 96)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-400)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-48)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, 
[r14,#(-464)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-208)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(208)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 40)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -108)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * -4)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 100)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-384)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-448)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-192)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(224)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 44)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -104)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 0)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 104)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-368)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-16)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-432)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-176)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(240)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 48)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -100)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 4)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 108)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-352)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(0)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-416)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-160)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(256)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 52)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -96)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 8)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r14, #(4 * 112)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, 
[r11,#(-336)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(16)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-400)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-144)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(272)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 56)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -92)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 12)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r14, #(4 * 116)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-320)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(32)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-384)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-128)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(288)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 60)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -88)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 16)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r14, #(4 * 120)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-304)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(48)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-368)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-112)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(304)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -84)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 20)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r14, #(4 * 124)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-288)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(64)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-352)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-96)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(320)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 68)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -80)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 
24)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #(4 * -124)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-272)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(80)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-336)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-80)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(336)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 72)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -76)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 28)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r12, #(4 * -120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-256)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(96)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-320)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(-64)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(352)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 76)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -72)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 32)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #(4 * -116)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-240)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(112)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-304)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(-48)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(368)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 80)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -68)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 36)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r12, #(4 * -112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-224)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(128)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-288)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(-32)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r12,#(384)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 84)] 
-vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -64)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 40)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #(4 * -108)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-208)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(144)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-272)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(-16)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(400)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 88)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r14, #(4 * -60)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r14, #(4 * 44)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r12, #(4 * -104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r11,#(-192)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(160)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r14,#(-256)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r12,#(0)] -vmla.s16 Q5, Q6, r9 -vmla.s16 Q6, Q1, r8 -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r12,#(416)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 92)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r14, #(4 * -56)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r14, #(4 * 48)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r12, #(4 * -100)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r11,#(-176)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(176)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r14,#(-240)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r12,#(16)] -vmla.s16 Q7, Q4, r9 -vmla.s16 Q4, Q1, r8 -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r12,#(432)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 96)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r14, #(4 * -52)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r14, #(4 * 52)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r12, #(4 * -96)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r11,#(-160)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(192)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r14,#(-224)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r12,#(32)] -vmla.s16 Q6, Q5, r9 -vmla.s16 Q5, Q1, r8 -vadd.u16 Q7, Q6, Q1 
-vstrw.u32 Q6, [r12,#(448)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 100)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r14, #(4 * -48)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r14, #(4 * 56)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r12, #(4 * -92)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r11,#(-144)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(208)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r14,#(-208)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r12,#(48)] -vmla.s16 Q4, Q7, r9 -vmla.s16 Q7, Q1, r8 -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r12,#(464)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vshl.u16 Q5, Q5, #1 -vstrw.u32 Q6, [r11,#(-128)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q4, [r14,#(-192)] -vadd.u16 Q5, Q5, Q7 -vstrw.u32 Q5, [r14,#(224)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s b/tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s deleted file mode 100644 index ab0d57f..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve, %function -.global poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve -poly_u16_toom4_fwd_karatsuba_x1_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r0, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(144)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-240)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] 
-vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-48)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(384)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-224)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(400)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(176)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-208)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-16)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(416)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-192)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(0)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(432)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(240)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r14,#(-368)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, 
[r14,#(208)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-176)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(16)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(448)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(256)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r14,#(-352)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-160)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(32)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(464)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(272)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r14,#(-336)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(240)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-144)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(48)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(480)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(288)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r14,#(-320)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(256)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-128)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(64)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(304)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(496)] -vpop {d8-d15} -pop {r4-r11,lr} -bx 
lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s b/tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s deleted file mode 100644 index dfd6a5d..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s +++ /dev/null @@ -1,200 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve, %function -.global poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve -poly_u16_toom4_fwd_karatsuba_x2_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r14, #1008 -add r11, r0, #1008 -mov r10, #1 -mov r9, #2 -mov r8, #3 -mov r7, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r14,#(-144)] -vmla.s16 Q6, Q5, r9 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r8 -vstrw.u32 Q3, [r12,#(-288)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(144)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(432)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-432)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(288)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r14,#(-128)] -vmla.s16 Q4, Q7, r9 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r8 -vstrw.u32 Q3, [r12,#(-272)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(160)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(448)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-416)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(304)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, 
[r14,#(-112)] -vmla.s16 Q5, Q6, r9 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r8 -vstrw.u32 Q3, [r12,#(-256)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(176)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(464)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-400)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(320)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r14,#(-96)] -vmla.s16 Q7, Q4, r9 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r8 -vstrw.u32 Q3, [r12,#(-240)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(192)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(480)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r14,#(-384)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(336)] -vmla.s16 Q4, Q0, r8 -vstrw.u32 Q6, [r14,#(-48)] -vmla.s16 Q6, Q5, r9 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q5, Q1, r8 -vstrw.u32 Q3, [r12,#(-192)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(240)] -vmla.s16 Q7, Q2, r8 -vmla.s16 Q7, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r12,#(-480)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r14,#(-336)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(384)] -vmla.s16 Q5, Q0, r8 -vstrw.u32 Q4, [r14,#(-32)] -vmla.s16 Q4, Q7, r9 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q7, Q1, r8 -vstrw.u32 Q3, [r12,#(-176)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(256)] -vmla.s16 Q6, Q2, r8 -vmla.s16 Q6, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 
56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r12,#(-464)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r14,#(-320)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q7, Q0, r8 -vstrw.u32 Q5, [r14,#(-16)] -vmla.s16 Q5, Q6, r9 -vstrw.u32 Q0, [r1,#(128)] -vmla.s16 Q6, Q1, r8 -vstrw.u32 Q3, [r12,#(-160)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(272)] -vmla.s16 Q4, Q2, r8 -vmla.s16 Q4, Q3, r7 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r12,#(-448)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r14,#(-304)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q6, Q0, r8 -vstrw.u32 Q7, [r14,#(0)] -vmla.s16 Q7, Q4, r9 -vstrw.u32 Q0, [r1,#(144)] -vmla.s16 Q4, Q1, r8 -vstrw.u32 Q3, [r12,#(-144)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(288)] -vmla.s16 Q5, Q2, r8 -vmla.s16 Q5, Q3, r7 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r12,#(-432)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(432)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r14,#(-288)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_fwd_oop_256.s b/tests/toom/auto/poly_u16_toom4_fwd_oop_256.s deleted file mode 100644 index 294c21e..0000000 --- a/tests/toom/auto/poly_u16_toom4_fwd_oop_256.s +++ /dev/null @@ -1,199 +0,0 @@ -.syntax unified -.type poly_u16_toom4_fwd_oop_256_mve, %function -.global poly_u16_toom4_fwd_oop_256_mve -poly_u16_toom4_fwd_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r1, #1008 -add r12, r0, #1008 -mov r11, #1 -mov r10, #2 -mov r9, #3 -mov r8, #7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vldrw.u32 Q1, [r0, #(4 * 32)] -vldrw.u32 Q2, [r0, #(4 * 64)] -vldrw.u32 Q3, [r0, #(4 * 96)] -vadd.u16 Q4, Q0, Q2 -vadd.u16 Q5, Q1, Q3 -vsub.u16 Q6, Q4, Q5 -vmla.s16 Q4, Q0, r9 -vstrw.u32 
Q6, [r1,#(384)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(0)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(-240)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-496)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 4)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 100)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-368)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(256)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(128)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r1,#(400)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(16)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(-224)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-480)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 8)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 104)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-352)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(272)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(144)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r1,#(416)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(32)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(-208)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-464)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 12)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 108)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-336)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(288)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(160)] -vmla.s16 Q6, Q0, r9 -vstrw.u32 Q7, [r1,#(432)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(48)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(-192)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-448)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 16)] -vshl.u16 Q6, Q6, #1 -vldrw.u32 Q1, [r0, #(4 * 
48)] -vsub.u16 Q7, Q6, Q4 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q4 -vldrw.u32 Q3, [r0, #(4 * 112)] -vadd.u16 Q4, Q0, Q2 -vstrw.u32 Q5, [r14,#(-320)] -vadd.u16 Q5, Q1, Q3 -vstrw.u32 Q6, [r1,#(304)] -vsub.u16 Q6, Q4, Q5 -vstrw.u32 Q7, [r1,#(176)] -vmla.s16 Q4, Q0, r9 -vstrw.u32 Q6, [r1,#(448)] -vmla.s16 Q6, Q5, r10 -vstrw.u32 Q0, [r1,#(64)] -vmla.s16 Q5, Q1, r9 -vstrw.u32 Q3, [r14,#(-176)] -vadd.u16 Q7, Q6, Q1 -vstrw.u32 Q6, [r14,#(-432)] -vmla.s16 Q7, Q2, r9 -vmla.s16 Q7, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 20)] -vshl.u16 Q4, Q4, #1 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q6, Q4, Q5 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q3, [r0, #(4 * 116)] -vadd.u16 Q5, Q0, Q2 -vstrw.u32 Q7, [r14,#(-304)] -vadd.u16 Q7, Q1, Q3 -vstrw.u32 Q4, [r1,#(320)] -vsub.u16 Q4, Q5, Q7 -vstrw.u32 Q6, [r1,#(192)] -vmla.s16 Q5, Q0, r9 -vstrw.u32 Q4, [r1,#(464)] -vmla.s16 Q4, Q7, r10 -vstrw.u32 Q0, [r1,#(80)] -vmla.s16 Q7, Q1, r9 -vstrw.u32 Q3, [r14,#(-160)] -vadd.u16 Q6, Q4, Q1 -vstrw.u32 Q4, [r14,#(-416)] -vmla.s16 Q6, Q2, r9 -vmla.s16 Q6, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 24)] -vshl.u16 Q5, Q5, #1 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q4, Q5, Q7 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q5, Q5, Q7 -vldrw.u32 Q3, [r0, #(4 * 120)] -vadd.u16 Q7, Q0, Q2 -vstrw.u32 Q6, [r14,#(-288)] -vadd.u16 Q6, Q1, Q3 -vstrw.u32 Q5, [r1,#(336)] -vsub.u16 Q5, Q7, Q6 -vstrw.u32 Q4, [r1,#(208)] -vmla.s16 Q7, Q0, r9 -vstrw.u32 Q5, [r1,#(480)] -vmla.s16 Q5, Q6, r10 -vstrw.u32 Q0, [r1,#(96)] -vmla.s16 Q6, Q1, r9 -vstrw.u32 Q3, [r14,#(-144)] -vadd.u16 Q4, Q5, Q1 -vstrw.u32 Q5, [r14,#(-400)] -vmla.s16 Q4, Q2, r9 -vmla.s16 Q4, Q3, r8 -vldrw.u32 Q0, [r0, #(4 * 28)] -vshl.u16 Q7, Q7, #1 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q5, Q7, Q6 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q6 -vldrw.u32 Q3, [r0, #(4 * 124)] -vadd.u16 Q6, Q0, Q2 -vstrw.u32 Q4, [r14,#(-272)] -vadd.u16 Q4, Q1, Q3 -vstrw.u32 Q7, [r1,#(352)] -vsub.u16 Q7, Q6, Q4 -vstrw.u32 Q5, [r1,#(224)] -vmla.s16 Q6, Q0, r9 
-vstrw.u32 Q7, [r1,#(496)] -vmla.s16 Q7, Q4, r10 -vstrw.u32 Q0, [r1,#(112)] -vmla.s16 Q4, Q1, r9 -vstrw.u32 Q3, [r14,#(-128)] -vadd.u16 Q5, Q7, Q1 -vstrw.u32 Q7, [r14,#(-384)] -vmla.s16 Q5, Q2, r9 -vmla.s16 Q5, Q3, r8 -vshl.u16 Q6, Q6, #1 -vstrw.u32 Q5, [r14,#(-256)] -vsub.u16 Q5, Q6, Q4 -vstrw.u32 Q5, [r1,#(240)] -vadd.u16 Q6, Q6, Q4 -vstrw.u32 Q6, [r1,#(368)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_dual_bottom_256.s b/tests/toom/auto/poly_u16_toom4_inv_dual_bottom_256.s deleted file mode 100644 index d6b99ea..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_dual_bottom_256.s +++ /dev/null @@ -1,381 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_bottom_256_mve, %function -.global poly_u16_toom4_inv_dual_bottom_256_mve -poly_u16_toom4_inv_dual_bottom_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -mov r14, #0 -mov r12, #0 -mov r11, #0 -mov r10, #21840 -mov r9, #45 -mov r8, #43691 -mov r7, #8 -mov r6, #-30 -mov r5, #4369 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -mov r1, #-1 -vldrw.u32 Q4, [r0, #(4 * -96)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r2 -vldrw.u32 Q6, [r0, #(4 * -92)] -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -88)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -80)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -84)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 
-vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -92)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -84)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -88)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -96)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -88)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -92)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -100)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -92)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -96)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -104)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -96)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -108)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -100)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -104)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q7, [r0, #(4 * -112)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q7, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -104)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -108)] -vshr.u16 Q1, Q1, #2 -vmla.s16 Q6, Q1, r1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vmla.s16 Q1, Q2, r1 -vldrw.u32 Q6, [r0, #(4 * -116)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vmla.s16 Q4, Q0, r1 -vadd.u16 Q2, Q2, Q2 -vmla.s16 Q4, Q6, r1 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vshr.u16 Q1, Q1, #2 -vmla.s16 Q7, Q1, r1 -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vst43.u16 {Q0,Q1,Q2,Q3}, [r0] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r11 -vmov.u16 Q0[1], r12 -vmov.u16 Q0[2], r14 -vldrw.u32 Q1, [r0, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s b/tests/toom/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s deleted file mode 100644 index 26e8f23..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_dual_bottom_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_bottom_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_bottom_oop_256_mve -poly_u16_toom4_inv_dual_bottom_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -mov r14, #0 -mov r12, #0 -mov r11, #0 -mov r10, #21840 -mov r9, #45 -mov r8, #43691 -mov r7, #8 -mov r6, #-30 -mov r5, #4369 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q4, [r0, #(4 * -96)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r2 -vldrw.u32 Q6, [r0, #(4 * -92)] -vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] 
-vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -88)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -80)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -84)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 24)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 20)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -76)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 16)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -68)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -72)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 44)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -64)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -56)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -60)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 60)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -52)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 48)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -44)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -48)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 76)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 68)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -40)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -32)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -36)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 92)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 84)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -28)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 80)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vldrw.u32 Q6, [r0, #(4 * -20)] -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -24)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 108)] -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r0, #(4 * -16)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 96)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q7, r14, #16 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r2 -vldrw.u32 Q7, [r0, #(4 * -8)] -vmla.s16 Q5, Q6, r3 -vshlc Q4, r11, #16 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q4, [r0, #(4 * -12)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 124)] -vshlc Q6, r12, #16 -vmla.s16 Q1, Q6, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vmla.s16 Q4, Q5, r2 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r4 -vldrw.u32 Q2, [r0, #(4 * 120)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r0, #(4 * -4)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 112)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r9 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r2 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r8 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r10 -vshlc Q6, r14, #16 -vmla.s16 Q7, Q5, r7 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r2 -vmla.s16 Q5, Q7, r3 -vshlc Q4, r11, #16 -vmul.u16 Q7, Q7, r3 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r6 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r5 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r12, #16 -vmla.s16 Q1, Q7, r2 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r11 -vmov.u16 Q0[1], r12 -vmov.u16 Q0[2], r14 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s b/tests/toom/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s deleted file mode 100644 index e890c41..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve -poly_u16_toom4_inv_dual_packed_limbs_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -vldrw.u32 Q4, [r14, #(4 * -124)] -vldrw.u32 Q5, [r0, #(4 * 96)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q6, [r14, #(4 * -92)] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 64)] 
-vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 32)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -60)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -88)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -120)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 100)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 68)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -56)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 4)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -84)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -116)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 104)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 
{Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 40)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -52)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 8)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -80)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 108)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 76)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 44)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -48)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 12)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -76)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -108)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 112)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 80)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 48)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -44)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 16)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -72)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -104)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 116)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 84)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -40)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 20)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -68)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 120)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 56)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -36)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 24)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -64)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -96)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 124)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 92)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 60)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -32)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 28)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_dual_top_256.s b/tests/toom/auto/poly_u16_toom4_inv_dual_top_256.s deleted file mode 100644 index e87718e..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_dual_top_256.s +++ /dev/null @@ -1,381 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_top_256_mve, %function -.global poly_u16_toom4_inv_dual_top_256_mve -poly_u16_toom4_inv_dual_top_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -mov r1, #1 -vldrw.u32 Q4, [r14, #(4 * -124)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vmla.s16 Q4, Q5, r1 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 
-vldrw.u32 Q7, [r14, #(4 * -116)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -108)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -104)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #(4 * -96)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -92)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -80)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #(4 * -72)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -68)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -60)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -56)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vldrw.u32 Q6, [r14, #(4 * -48)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -44)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q6, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q7, r1 -vldrw.u32 Q7, [r14, #(4 * -36)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vmla.s16 Q1, Q6, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vmla.s16 Q4, Q5, r1 -vst43.u16 {Q0,Q1,Q2,Q3}, [r0]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -32)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vmla.s16 Q1, Q7, r1 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vmla.s16 Q2, Q6, r1 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vmla.s16 Q1, Q7, r1 -vst40.u16 {Q0,Q1,Q2,Q3}, [r0] -vst41.u16 {Q0,Q1,Q2,Q3}, [r0] -vst42.u16 {Q0,Q1,Q2,Q3}, [r0] -vst43.u16 {Q0,Q1,Q2,Q3}, [r0] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r0, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_dual_top_oop_256.s b/tests/toom/auto/poly_u16_toom4_inv_dual_top_oop_256.s deleted file mode 100644 index c69fe71..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_dual_top_oop_256.s +++ /dev/null @@ -1,380 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_dual_top_oop_256_mve, %function -.global poly_u16_toom4_inv_dual_top_oop_256_mve -poly_u16_toom4_inv_dual_top_oop_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r14, r0, #1008 -mov r12, #0 -mov r11, #0 -mov r10, #0 -mov r9, #21840 -mov r8, #45 -mov r7, #43691 -mov r6, #8 -mov r5, #-30 -mov r4, #4369 -mov r3, #-65 -mov r2, #36409 -vldrw.u32 Q4, [r14, #(4 * -124)] -vldrw.u32 Q5, [r0, #(4 * 12)] -vsub.u16 Q5, Q5, Q4 -vshr.u16 Q5, Q5, #1 -vadd.u16 Q4, Q4, Q5 -vldrw.u32 Q6, [r14, #(4 * -120)] -vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 8)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 4)] -vsub.u16 
Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -116)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 0)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -108)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -112)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 28)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! -vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 24)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 20)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -104)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 16)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -96)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -100)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 44)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 40)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 36)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -92)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 32)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -84)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -88)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 60)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 56)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 52)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -80)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 48)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -72)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -76)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 76)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 72)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 68)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -68)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 64)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -60)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -64)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 92)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 88)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 84)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -56)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 80)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vldrw.u32 Q6, [r14, #(4 * -48)] -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -52)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vldrw.u32 Q5, [r0, #(4 * 108)] -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q6, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 104)] -vadd.u16 Q6, Q6, Q2 -vldrw.u32 Q1, [r0, #(4 * 100)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q7, [r14, #(4 * -44)] -vsub.u16 Q2, Q2, Q7 -vldrw.u32 Q0, [r0, #(4 * 96)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q7 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q6, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q6 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q6, Q6, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q7, r12, #16 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q7 -vldrw.u32 Q7, [r14, #(4 * -36)] -vmla.s16 Q5, Q6, r2 -vshlc Q4, r10, #16 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q6, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vldrw.u32 Q4, [r14, #(4 * -40)] -vshr.u16 Q1, Q1, #2 -vsub.u16 Q6, Q6, Q1 -vldrw.u32 Q5, [r0, #(4 * 124)] -vshlc Q6, r11, #16 -vadd.u16 Q1, Q1, Q6 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vsub.u16 Q5, Q5, Q4 -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vshr.u16 Q5, Q5, #1 -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vadd.u16 Q4, Q4, Q5 -vst43.u16 {Q0,Q1,Q2,Q3}, [r1]! 
-vmla.s16 Q7, Q4, r3 -vldrw.u32 Q2, [r0, #(4 * 120)] -vadd.u16 Q7, Q7, Q2 -vldrw.u32 Q1, [r0, #(4 * 116)] -vsub.u16 Q1, Q1, Q2 -vldrw.u32 Q6, [r14, #(4 * -32)] -vsub.u16 Q2, Q2, Q6 -vldrw.u32 Q0, [r0, #(4 * 112)] -vsub.u16 Q4, Q4, Q0 -vadd.u16 Q2, Q2, Q2 -vsub.u16 Q4, Q4, Q6 -vadd.u16 Q2, Q2, Q1 -vmla.s16 Q7, Q4, r8 -vshr.u16 Q2, Q2, #3 -vadd.u16 Q1, Q1, Q7 -vsub.u16 Q2, Q2, Q4 -vmul.u16 Q2, Q2, r7 -vshr.u16 Q7, Q7, #1 -vmla.s16 Q2, Q0, r9 -vshlc Q6, r12, #16 -vmla.s16 Q7, Q5, r6 -vsub.u16 Q4, Q4, Q2 -vadd.u16 Q2, Q2, Q6 -vmla.s16 Q5, Q7, r2 -vshlc Q4, r10, #16 -vmul.u16 Q7, Q7, r2 -vneg.s16 Q3, Q5 -vmla.s16 Q1, Q7, r5 -vadd.u16 Q0, Q0, Q4 -vmul.u16 Q1, Q1, r4 -vshr.u16 Q1, Q1, #2 -vsub.u16 Q7, Q7, Q1 -vshlc Q7, r11, #16 -vadd.u16 Q1, Q1, Q7 -vst40.u16 {Q0,Q1,Q2,Q3}, [r1] -vst41.u16 {Q0,Q1,Q2,Q3}, [r1] -vst42.u16 {Q0,Q1,Q2,Q3}, [r1] -vst43.u16 {Q0,Q1,Q2,Q3}, [r1] -vmov.u16 Q0, #0 -vmov.u16 Q0[0], r10 -vmov.u16 Q0[1], r11 -vmov.u16 Q0[2], r12 -vldrw.u32 Q1, [r1, #-448]! -vsub.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r1] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_full_256.s b/tests/toom/auto/poly_u16_toom4_inv_full_256.s deleted file mode 100644 index dfbbd76..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_full_256.s +++ /dev/null @@ -1,765 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_256_mve, %function -.global poly_u16_toom4_inv_256_mve -poly_u16_toom4_inv_256_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r7, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vldrw.u32 Q1, [r0, #(4 * 2)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -62)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -122)] -vldrw.u32 Q4, [r0, #(4 * 66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, 
Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [r0, #(4 * 70)] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 6)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r7 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -118)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 74)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -114)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, 
r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 78)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 82)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 
-vldrw.u32 Q5, [r0, #(4 * 86)]
-vmla.s16 Q4, Q2, r11
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r9
-vstrw.u32 Q1, [r0,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r8
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-424)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r0, #(4 * 22)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r7
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r0,#(328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r6
-vstrw.u32 Q0, [r0,#(-184)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r14,#(-168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r14, #(4 * -102)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -106)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r14, #(4 * 26)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r12
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r5
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r3
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r14, #(4 * -34)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r10
-vldrw.u32 Q3, [r0, #(4 * 90)]
-vmla.s16 Q6, Q2, r11
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r9
-vstrw.u32 Q1, [r0,#(88)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r8
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r14,#(-408)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r0, #(4 * 26)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r7
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r0,#(344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r6
-vstrw.u32 Q0, [r0,#(-168)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r14,#(-152)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r0, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r14, #(4 * -98)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -102)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r14, #(4 * 30)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r12
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r5
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r3
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r14, #(4 * -30)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r10
-vldrw.u32 Q5, [r0, #(4 * 94)]
-vmla.s16 Q4, Q2, r11
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r9
-vstrw.u32 Q1, [r0,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 98)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -62)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-248)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 102)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 
* 66)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -58)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 2)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 6)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -58)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-232)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 106)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 70)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 6)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 10)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -22)] 
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r14, #(4 * -82)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -86)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r14, #(4 * 46)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r12
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r5
-vsub.u16 Q2, Q2, Q6
-vmla.s16 Q1, Q1, r3
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r14, #(4 * -14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * -54)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(-216)]
-vmla.s16 Q1, Q2, r10
-vldrw.u32 Q5, [r0, #(4 * 110)]
-vmla.s16 Q4, Q2, r11
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r9
-vldrw.u32 Q7, [r0, #(4 * 74)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r0,#(296)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r8
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r14, #(4 * -50)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r14,#(-200)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r4
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r0, #(4 * 46)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r7
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r14, #(4 * -114)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r14,#(-456)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r6
-vldrw.u32 Q2, [r0, #(4 * 10)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r0,#(40)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r14, #(4 * 14)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r14,#(56)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r14, #(4 * -78)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -82)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r14, #(4 * 50)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r12
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r5
-vsub.u16 Q2, Q2, Q4
-vmla.s16 Q1, Q1, r3
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r14, #(4 * -10)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * -50)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(-200)]
-vmla.s16 Q1, Q2, r10
-vldrw.u32 Q3, [r0, #(4 * 114)]
-vmla.s16 Q6, Q2, r11
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r9
-vldrw.u32 Q7, [r0, #(4 * 78)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r0,#(312)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r8
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r14, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 14)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 18)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -46)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-184)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 118)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 18)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 22)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 
Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -42)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(-168)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 122)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 22)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 26)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * -38)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(-152)] -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 126)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q7, [r0, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r14, #(4 * -34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r14,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, 
r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q2, [r0, #(4 * 26)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r14, #(4 * 30)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r14,#(120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r0, #(4 * -34)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(-136)] -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vldrw.u32 Q3, [r0, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r14,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r14,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vstrw.u32 Q1, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r14, #(4 * 34)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r14,#(136)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_full_512.s b/tests/toom/auto/poly_u16_toom4_inv_full_512.s deleted file mode 100644 index 466055f..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_full_512.s +++ /dev/null @@ -1,1511 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_512_mve, %function -.global 
poly_u16_toom4_inv_512_mve -poly_u16_toom4_inv_512_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -add r11, r12, #1008 -mov r10, #-64 -mov r9, #45 -mov r8, #-8 -mov r7, #43691 -mov r6, #16 -mov r5, #30 -mov r4, #61167 -mov r3, #-65 -mov r2, #36409 -mov r1, #1 -vldrw.u32 Q0, [r12, #(4 * 10)] -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 2)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * -118)] -vldrw.u32 Q4, [r14, #(4 * 6)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r11, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r3 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #(4 * 14)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r8 -vldrw.u32 Q5, [r14, #(4 * 10)] -vmla.s16 Q0, Q3, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r6 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(-472)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r2 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r5 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(24)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r4 -vstrw.u32 Q2, [r0,#(8)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 14)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 18)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 22)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 
-106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 26)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 30)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 
Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 34)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 38)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, 
[r12,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 42)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 46)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, 
[r12, #(4 * -78)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 50)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 54)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 
Q6, [r11, #(4 * -66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 58)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -70)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 62)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -66)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, 
r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 66)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 70)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vstrw.u32 Q1, [r14,#(-248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 78)] 
-vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 2)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(8)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 74)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 10)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -114)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 82)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 6)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(24)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 78)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 14)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -114)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 
Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 10)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(40)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 82)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 18)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -110)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -114)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -106)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 14)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(56)] -vmla.s16 
Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 86)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 22)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -106)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -110)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -102)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 18)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(72)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 90)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 26)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -102)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -106)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-424)] -vsub.u16 
Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -98)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 22)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(88)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 94)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 30)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -98)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -102)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -94)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 26)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(104)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 98)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, 
r7 -vldrw.u32 Q7, [r14, #(4 * 30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -94)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -98)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -90)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 30)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(120)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 102)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -90)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -86)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-344)] -vadd.u16 
Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 34)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(136)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 106)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -86)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -90)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -82)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 38)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(152)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 110)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(168)] -vadd.u16 Q0, Q0, 
Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -82)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -86)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -78)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 42)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(168)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 114)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(200)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -78)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -82)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -74)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -10)] 
-vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -18)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * -6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 46)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(184)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 118)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(216)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -74)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -78)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -70)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -6)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * -2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 50)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(200)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r14, #(4 * 122)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 58)] -vadd.u16 Q7, Q7, Q2 
-vstrw.u32 Q7, [r12,#(232)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -6)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -70)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -74)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -66)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -10)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 54)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(216)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q3, [r14, #(4 * 126)] -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 62)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -2)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r12, #(4 * -66)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r12,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -70)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r11, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r11,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r11, #(4 * 6)] 
-vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r10 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r3 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 58)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(232)] -vmla.s16 Q1, Q2, r8 -vldrw.u32 Q5, [r12, #(4 * -122)] -vmla.s16 Q4, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q7, [r14, #(4 * 62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r12, #(4 * 66)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r12,#(264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r2 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 2)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r5 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r12, #(4 * -62)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r12,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r4 -vldrw.u32 Q2, [r14, #(4 * -66)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(-264)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r11, #(4 * -58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r11,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * 6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -2)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r11, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r10 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r3 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r0, #(4 * 62)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r0,#(248)] -vmla.s16 Q1, Q2, r8 -vmla.s16 Q6, Q2, r9 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r7 -vldrw.u32 Q3, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r6 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r12, #(4 * 70)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r12,#(280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r2 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r5 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r12, #(4 * 
-58)]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r12,#(-232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r4
-vldrw.u32 Q1, [r14, #(4 * -62)]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(-248)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r11, #(4 * -54)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r11,#(-216)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom4_inv_full_768.s b/tests/toom/auto/poly_u16_toom4_inv_full_768.s
deleted file mode 100644
index e3e8e89..0000000
--- a/tests/toom/auto/poly_u16_toom4_inv_full_768.s
+++ /dev/null
@@ -1,2303 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_768_mve, %function
-.global poly_u16_toom4_inv_768_mve
-poly_u16_toom4_inv_768_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-mov r8, #-64
-mov r7, #45
-mov r6, #-8
-mov r5, #43691
-mov r4, #16
-mov r3, #30
-mov r2, #61167
-mov r1, #-65
-vldrw.u32 Q0, [r11, #(4 * 78)]
-vldrw.u32 Q1, [r14, #(4 * 6)]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #(4 * 66)]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r11, #(4 * -114)]
-vldrw.u32 Q4, [r12, #(4 * -54)]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #(4 * -126)]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r10, #(4 * 18)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r1
-vsub.u16 Q3, Q3, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r11, #(4 * 82)]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r6
-vldrw.u32 Q5, [r12, #(4 * -50)]
-vmla.s16 Q0, Q3, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(24)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r4
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-456)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r5
-vmul.u16 Q0, Q0, r5
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r14, #(4 * 10)]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r3
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r12,#(-216)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r2
-vstrw.u32 Q2,
[r0,#(264)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r11,#(312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -46)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(40)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -42)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(56)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(344)] 
-vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -38)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(72)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -34)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(88)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 26)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 86)] -vsub.u16 
Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -30)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(104)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 30)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -26)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(120)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 34)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -86)] -vsub.u16 Q5, Q5, 
Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -22)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -18)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 42)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 
-vldrw.u32 Q4, [r10, #(4 * 54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r11, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * -14)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r11,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r11, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * -10)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r11,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 
Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r11, #(4 * 126)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * -6)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-280)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 54)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r11,#(488)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r0, #(4 * 114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -66)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -78)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -122)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * -2)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-264)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(-24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r11,#(504)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * 118)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -62)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -74)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -118)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 2)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(232)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-248)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(-8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-488)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r0, #(4 * 122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -58)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -70)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -114)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 6)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(248)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-232)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(8)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-472)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * 126)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -54)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -66)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -110)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 10)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(264)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-216)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(24)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r0,#(504)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-456)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -122)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -62)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -106)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 14)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(280)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(40)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-488)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-440)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -118)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -58)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -102)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 18)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(296)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(56)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-472)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-424)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -114)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -54)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -98)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 22)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(312)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(72)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-456)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-408)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -110)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -50)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -94)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 26)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(328)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-440)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-392)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -106)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -46)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -90)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 30)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(344)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-424)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-376)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -102)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -42)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -86)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 34)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(360)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-408)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-360)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -98)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -38)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -82)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 38)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(376)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vstrw.u32 Q3, [r12,#(136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-392)]
-vsub.u16 Q4, Q4, Q0
-vstrw.u32 Q4, [r10,#(-344)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -94)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -22)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -34)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -78)]
-vadd.u16 Q1, Q1, Q0
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 42)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(392)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vstrw.u32 Q2, [r11,#(-88)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vstrw.u32 Q5, [r12,#(152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vstrw.u32 Q0, [r14,#(-376)]
-vsub.u16 Q6, Q6, Q0
-vstrw.u32 Q6, [r10,#(-328)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -18)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -30)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -74)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 66)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(264)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 46)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -54)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-216)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 78)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(312)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -114)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-456)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 6)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(24)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 18)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(72)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -14)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -26)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -70)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 70)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(280)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 50)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -50)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-200)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 82)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -110)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-440)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 10)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(40)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 22)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(88)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -82)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -10)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -22)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r10, #(4 * 122)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -66)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 74)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(296)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 54)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -46)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-184)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -106)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-424)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 14)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(56)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 26)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(104)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -78)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -6)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -18)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 126)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -62)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 78)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(312)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 58)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -42)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-168)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -102)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-408)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 18)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(72)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 30)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(120)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -74)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -2)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -14)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -122)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -58)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 82)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(328)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 62)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -38)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-152)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -98)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-392)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 22)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(88)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 34)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(136)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -70)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 2)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -10)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -54)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 86)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(344)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 66)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-136)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 98)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r14, #(4 * 126)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -94)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-376)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 26)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(104)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 38)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(152)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -66)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 6)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * -6)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -50)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 90)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(360)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 70)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-120)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 102)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -122)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -90)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-360)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 30)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(120)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 42)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(168)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -62)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 10)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -2)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -110)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -46)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 94)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(376)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 74)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-104)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 106)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -118)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -86)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-344)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 34)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(136)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 46)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(184)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -58)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 14)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 2)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -106)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -42)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(392)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 78)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-88)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 110)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -114)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -82)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-328)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 38)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(152)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 50)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(200)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -54)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 18)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 6)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -102)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -38)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(408)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 82)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-72)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 114)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -110)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -78)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-312)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 42)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(168)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 54)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(216)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -50)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 22)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 10)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -98)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -34)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(424)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 86)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-56)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 118)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -106)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -74)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-296)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 46)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(184)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 58)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(232)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -46)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 26)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 14)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -94)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -30)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(440)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 90)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-40)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 122)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(488)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -102)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -70)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-280)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 50)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(200)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 62)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(248)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -42)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 30)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 18)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -90)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -26)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(456)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 94)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-24)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r11, #(4 * 126)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r11,#(504)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -98)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -66)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-264)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 54)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(216)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 66)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(264)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -38)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 34)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 22)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -86)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -22)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(472)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 98)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * -2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(-8)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -122)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-488)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -94)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -62)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-248)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 58)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(232)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 70)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(280)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -34)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 38)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 26)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -82)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -18)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r0,#(488)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 102)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 2)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(8)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -118)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-472)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -90)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -58)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-232)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 62)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(248)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 74)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(296)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -30)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 42)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 30)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -78)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r0, #(4 * 126)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r0,#(504)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 106)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 6)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(24)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -114)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-456)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -86)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -54)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-216)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 66)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(264)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 78)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(312)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -26)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 46)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 34)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -74)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -10)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -122)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-488)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 110)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 10)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(40)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -110)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-440)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -82)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -50)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-200)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 70)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(280)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 82)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(328)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -22)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 50)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 38)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -70)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * -6)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -118)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-472)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 114)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 14)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(56)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -106)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-424)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -78)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-184)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 74)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(296)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 86)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(344)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -18)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 54)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 42)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * -2)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -114)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-456)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 118)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 18)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(72)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -102)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-408)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -74)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -42)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-168)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 78)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(312)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 90)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(360)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -14)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 58)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 46)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -62)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 2)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -110)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-440)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r12, #(4 * 122)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 22)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(88)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -98)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-392)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -70)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -38)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-152)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 82)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(328)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 94)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(376)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -10)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 62)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 50)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -58)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 6)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -106)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-424)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r12, #(4 * 126)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 26)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(104)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -94)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-376)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -66)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -34)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-136)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 86)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(344)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 98)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(392)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * -6)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 66)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 54)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -54)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vldrw.u32 Q4, [r10, #(4 * 10)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -102)]
-vadd.u16 Q7, Q7, Q3
-vstrw.u32 Q7, [r14,#(-408)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q3, [r11, #(4 * -122)]
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 30)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(120)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -90)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-360)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -62)]
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q2, [r11, #(4 * -30)]
-vadd.u16 Q2, Q2, Q5
-vstrw.u32 Q2, [r11,#(-120)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 90)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(360)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 102)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(408)]
-vadd.u16 Q4, Q4, Q1
-vldrw.u32 Q0, [r14, #(4 * -2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 70)]
-vsub.u16 Q3, Q3, Q2
-vldrw.u32 Q5, [r0, #(4 * 58)]
-vshr.u16 Q3, Q3, #1
-vldrw.u32 Q6, [r9, #(4 * -50)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q2, Q2, Q3
-vmla.s16 Q4, Q2, r1
-vsub.u16 Q2, Q2, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q5
-vldrw.u32 Q6, [r10, #(4 * 14)]
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q7, [r14, #(4 * -98)]
-vadd.u16 Q7, Q7, Q5
-vstrw.u32 Q7, [r14,#(-392)]
-vmla.s16 Q1, Q2, r6
-vldrw.u32 Q5, [r11, #(4 * -118)]
-vmla.s16 Q4, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q7, [r12, #(4 * 34)]
-vadd.u16 Q7, Q7, Q1
-vstrw.u32 Q7, [r12,#(136)]
-vadd.u16 Q0, Q0, Q4
-vmla.s16 Q4, Q3, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q7, [r10, #(4 * -86)]
-vadd.u16 Q7, Q7, Q2
-vstrw.u32 Q7, [r10,#(-344)]
-vshr.u16 Q4, Q4, #1
-vmul.u16 Q4, Q4, r5
-vmul.u16 Q4, Q4, r5
-vneg.s16 Q0, Q0
-vldrw.u32 Q1, [r12, #(4 * -58)]
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q4, r3
-vneg.s16 Q3, Q3
-vldrw.u32 Q2, [r11, #(4 * -26)]
-vadd.u16 Q2, Q2, Q3
-vstrw.u32 Q2, [r11,#(-104)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q2, [r14, #(4 * 94)]
-vadd.u16 Q2, Q2, Q0
-vstrw.u32 Q2, [r14,#(376)]
-vsub.u16 Q4, Q4, Q0
-vldrw.u32 Q0, [r10, #(4 * 106)]
-vadd.u16 Q0, Q0, Q4
-vstrw.u32 Q0, [r10,#(424)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r14, #(4 * 2)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * 74)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * 62)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r9, #(4 * -46)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1
-vsub.u16 Q2, Q2, Q4
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q2, Q2, Q3
-vadd.u16 Q1, Q1, Q0
-vldrw.u32 Q4, [r14, #(4 * -94)]
-vadd.u16 Q4, Q4, Q3
-vstrw.u32 Q4, [r14,#(-376)]
-vmla.s16 Q1, Q2, r6
-vmla.s16 Q6, Q2, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vldrw.u32 Q3, [r12, #(4 * 38)]
-vadd.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r12,#(152)]
-vadd.u16 Q0, Q0, Q6
-vmla.s16 Q6, Q5, r4
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r10, #(4 * -82)]
-vadd.u16 Q3, Q3, Q2
-vstrw.u32 Q3, [r10,#(-328)]
-vshr.u16 Q6, Q6, #1
-vmul.u16 Q6, Q6, r5
-vmul.u16 Q6, Q6, r5
-vneg.s16 Q0, Q0
-vadd.u16 Q5, Q5, Q6
-vmla.s16 Q0, Q6, r3
-vneg.s16 Q5, Q5
-vldrw.u32 Q1, [r11, #(4 * -22)]
-vadd.u16 Q1, Q1, Q5
-vstrw.u32 Q1, [r11,#(-88)]
-vshr.u16 Q0, Q0, #2
-vmul.u16 Q0, Q0, r2
-vldrw.u32 Q1, [r14, #(4 * 98)]
-vadd.u16 Q1, Q1, Q0
-vstrw.u32 Q1, [r14,#(392)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r10, #(4 * 110)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r10,#(440)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom4_inv_full_832.s b/tests/toom/auto/poly_u16_toom4_inv_full_832.s
deleted file mode 100644
index 5963703..0000000
--- a/tests/toom/auto/poly_u16_toom4_inv_full_832.s
+++ /dev/null
@@ -1,2493 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_832_mve, %function
-.global poly_u16_toom4_inv_832_mve
-poly_u16_toom4_inv_832_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-add r14, r0, #1008
-add r12, r14, #1008
-add r11, r12, #1008
-add r10, r11, #1008
-add r9, r10, #1008
-mov r8, #-64
-mov r7, #45
-mov r6, #-8
-mov r5, #43691
-mov r4, #16
-mov r3, #30
-mov r2, #61167
-mov r1, #-65
-vldrw.u32 Q0, [r10, #(4 * -94)]
-vldrw.u32 Q1, [r14, #(4 * 38)]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #(4 * 82)]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r11, #(4 * -50)]
-vldrw.u32 Q4, [r12, #(4 * -6)]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #(4 * -126)]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r10, #(4 * 114)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r8
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r1
-vsub.u16 Q3, Q3, Q6
-vadd.u16 Q1, Q1, Q1
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r10, #(4 * -90)]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r6
-vldrw.u32 Q5, [r12, #(4 * -2)]
-vmla.s16 Q0, Q3, r7
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r5
-vstrw.u32 Q1, [r14,#(152)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r4
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r11,#(-200)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r5
-vmul.u16 Q0, Q0, r5
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r14, #(4 * 42)]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r3
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r12,#(-24)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r2
-vstrw.u32 Q2, [r0,#(328)]
-vsub.u16 Q0, Q0, Q2
-vstrw.u32 Q0, [r10,#(-376)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * 86)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r11, #(4 * -46)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -122)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r10, #(4 * 118)]
-vsub.u16 Q1, Q1, Q4
-vmla.s16 Q1, Q3, r8
-vadd.u16 Q2, Q2, Q5
-vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 2)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -42)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r10, #(4 * 122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 6)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -38)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r10, #(4 * 126)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 
-vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 10)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -34)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -122)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 14)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -118)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -70)] 
-vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 18)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -114)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 22)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(248)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -110)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, 
#(4 * 26)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(264)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -106)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 30)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(280)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -102)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 34)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 
Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(296)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 78)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -10)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -98)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 38)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(312)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 126)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -6)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 42)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(328)] -vadd.u16 Q0, Q0, Q6 
-vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r0,#(504)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -122)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * -2)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 46)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(344)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 90)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-488)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -118)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 2)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 50)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(360)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r11,#(8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 94)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-472)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -114)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 6)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 54)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(376)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 98)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-456)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -110)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 10)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 58)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(392)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 
Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 102)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-440)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -106)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 62)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(408)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 106)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-424)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -102)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 66)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(424)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 
110)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-408)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -98)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 70)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(440)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 114)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-392)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -94)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 74)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 118)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, 
Q5 -vstrw.u32 Q5, [r12,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-376)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -90)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 78)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 122)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-360)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 82)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * 126)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 
Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-344)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(-40)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 86)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r14,#(504)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -122)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-328)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(-24)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 90)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r12,#(-488)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -118)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-312)] -vsub.u16 Q6, Q6, Q0 
-vstrw.u32 Q6, [r10,#(-8)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -42)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 6)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 94)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r12,#(-472)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r12,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r10,#(8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 98)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vstrw.u32 Q1, [r12,#(-456)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r11,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -110)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r12,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vstrw.u32 Q0, [r14,#(-280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r10,#(24)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, 
[r14, #(4 * -66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -22)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 14)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 82)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 102)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-24)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -94)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -106)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -50)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 38)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(152)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 114)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -18)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 18)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 86)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 106)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * -2)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(-8)] -vadd.u16 Q0, Q0, Q6 
-vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -90)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -102)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -46)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 42)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(168)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 118)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(472)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -14)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 22)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 90)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 110)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 2)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(8)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -86)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -98)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -42)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 46)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(184)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r10, #(4 * 122)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r10,#(488)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r11, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -10)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 26)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 94)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(376)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 114)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 6)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -82)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -94)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -38)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 50)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(200)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r10, #(4 * 126)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r10,#(504)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -6)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 30)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 98)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 118)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 10)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 
-vldrw.u32 Q7, [r10, #(4 * -78)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -90)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -34)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 54)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(216)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -122)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-488)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -2)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 34)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 102)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r12, #(4 * 122)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 14)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -74)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -86)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -30)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 58)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(232)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -118)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-472)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 78)] -vsub.u16 
Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 2)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 38)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 106)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r12, #(4 * 126)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 18)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -70)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -82)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -26)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 62)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(248)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -114)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-456)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 6)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * -6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 42)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 110)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -122)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 22)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -66)] -vadd.u16 Q7, Q7, 
Q2 -vstrw.u32 Q7, [r10,#(-264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -78)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -22)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 66)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(264)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -110)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-440)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 10)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * -2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 46)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 114)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -118)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 26)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -62)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -74)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -18)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 70)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(280)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -106)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-424)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 14)] -vshr.u16 
Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 50)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 118)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -114)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 30)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -58)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -70)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -14)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 74)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(296)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -102)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-408)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 18)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 54)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 122)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r0,#(488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -110)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 34)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -54)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-216)] -vshr.u16 Q4, Q4, 
#1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -66)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -10)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 78)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(312)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -98)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-392)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 22)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 58)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r0, #(4 * 126)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r0,#(504)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -106)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 38)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -50)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -62)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * -6)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 82)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(328)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -94)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-376)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 26)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 14)] -vsub.u16 
Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 62)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -122)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-488)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -102)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 42)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(168)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -46)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -58)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * -2)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 86)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(344)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -90)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-360)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 30)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 66)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -118)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-472)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -98)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 46)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(184)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -42)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 
-vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 2)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 90)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(360)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -86)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-344)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 34)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 70)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -114)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-456)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -94)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 50)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(200)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -38)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 6)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 94)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(376)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -82)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-328)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 38)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, 
Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 74)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -110)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-440)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -90)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 54)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(216)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -34)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 10)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 98)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(392)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -78)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-312)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 42)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 78)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -106)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-424)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -86)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 58)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(232)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -30)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -42)] 
-vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 14)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 102)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(408)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -74)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-296)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 122)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 46)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 82)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -102)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-408)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -82)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 62)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -26)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 18)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 106)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(424)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -70)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-280)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r11, #(4 * 126)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 50)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 
-vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 86)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -98)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-392)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -78)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 66)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(264)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -22)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -34)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 22)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 110)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(440)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -66)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-264)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -122)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 54)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 90)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -94)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-376)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -74)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 70)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -18)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -30)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 
-vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 26)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 114)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(456)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -62)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-248)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 58)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 94)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -90)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-360)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -70)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 74)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -14)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -26)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 30)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -58)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 62)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 
-vldrw.u32 Q4, [r10, #(4 * 98)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -86)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-344)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -66)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 78)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -10)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -22)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 34)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -54)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 66)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 102)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -82)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-328)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -62)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 82)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -6)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-24)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -18)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 38)] 
-vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r14, #(4 * 126)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r14,#(504)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -50)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 70)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r10, #(4 * 106)] -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -78)] -vadd.u16 Q7, Q7, Q3 -vstrw.u32 Q7, [r14,#(-312)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q3, [r11, #(4 * -58)] -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 86)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * -2)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(-8)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -14)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q2, [r11, #(4 * 42)] -vadd.u16 Q2, Q2, Q5 -vstrw.u32 Q2, [r11,#(168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r12, #(4 * -122)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-488)] -vsub.u16 Q6, Q6, Q0 -vldrw.u32 Q0, [r9, #(4 * -46)] -vadd.u16 Q0, Q0, Q6 -vstrw.u32 Q0, [r9,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r14, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * 74)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r9, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r8 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r1 -vsub.u16 Q2, Q2, Q6 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r10, #(4 * 110)] -vadd.u16 Q1, 
Q1, Q0 -vldrw.u32 Q7, [r14, #(4 * -74)] -vadd.u16 Q7, Q7, Q5 -vstrw.u32 Q7, [r14,#(-296)] -vmla.s16 Q1, Q2, r6 -vldrw.u32 Q5, [r11, #(4 * -54)] -vmla.s16 Q4, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q7, [r12, #(4 * 90)] -vadd.u16 Q7, Q7, Q1 -vstrw.u32 Q7, [r12,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q7, [r10, #(4 * 2)] -vadd.u16 Q7, Q7, Q2 -vstrw.u32 Q7, [r10,#(8)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r12, #(4 * -10)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r3 -vneg.s16 Q3, Q3 -vldrw.u32 Q2, [r11, #(4 * 46)] -vadd.u16 Q2, Q2, Q3 -vstrw.u32 Q2, [r11,#(184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q2, [r12, #(4 * -118)] -vadd.u16 Q2, Q2, Q0 -vstrw.u32 Q2, [r12,#(-472)] -vsub.u16 Q4, Q4, Q0 -vldrw.u32 Q0, [r9, #(4 * -42)] -vadd.u16 Q0, Q0, Q4 -vstrw.u32 Q0, [r9,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r14, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r10, #(4 * -98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * 78)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r9, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r8 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r1 -vsub.u16 Q2, Q2, Q4 -vadd.u16 Q1, Q1, Q1 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vldrw.u32 Q4, [r14, #(4 * -70)] -vadd.u16 Q4, Q4, Q3 -vstrw.u32 Q4, [r14,#(-280)] -vmla.s16 Q1, Q2, r6 -vmla.s16 Q6, Q2, r7 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r5 -vldrw.u32 Q3, [r12, #(4 * 94)] -vadd.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r12,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r4 -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r10, #(4 * 6)] -vadd.u16 Q3, Q3, Q2 -vstrw.u32 Q3, [r10,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r3 -vneg.s16 Q5, Q5 -vldrw.u32 Q1, [r11, #(4 * 50)] -vadd.u16 Q1, Q1, Q5 -vstrw.u32 Q1, [r11,#(200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r2 -vldrw.u32 Q1, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 
-vstrw.u32 Q1, [r12,#(-456)]
-vsub.u16 Q6, Q6, Q0
-vldrw.u32 Q0, [r9, #(4 * -38)]
-vadd.u16 Q0, Q0, Q6
-vstrw.u32 Q0, [r9,#(-152)]
-vpop {d8-d15}
-pop {r4-r11,lr}
-bx lr
\ No newline at end of file
diff --git a/tests/toom/auto/poly_u16_toom4_inv_half_256.s b/tests/toom/auto/poly_u16_toom4_inv_half_256.s
deleted file mode 100644
index 7b0cf8c..0000000
--- a/tests/toom/auto/poly_u16_toom4_inv_half_256.s
+++ /dev/null
@@ -1,340 +0,0 @@
-.syntax unified
-.type poly_u16_toom4_inv_half_256_mve, %function
-.global poly_u16_toom4_inv_half_256_mve
-poly_u16_toom4_inv_half_256_mve:
-push {r4-r11,lr}
-vpush {d8-d15}
-add r0, r0, #504
-mov r14, #-64
-mov r12, #45
-mov r11, #-8
-mov r10, #43691
-mov r9, #16
-mov r8, #30
-mov r7, #61167
-mov r6, #-65
-mov r5, #36409
-mov r4, #1
-vldrw.u32 Q0, [r0, #(4 * 34)]
-vldrw.u32 Q1, [r0, #(4 * -62)]
-vadd.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #(4 * -94)]
-vsub.u16 Q2, Q2, Q1
-vldrw.u32 Q3, [r0, #(4 * 2)]
-vldrw.u32 Q4, [r0, #(4 * -30)]
-vsub.u16 Q4, Q4, Q3
-vldrw.u32 Q5, [r0, #(4 * -126)]
-vshr.u16 Q4, Q4, #1
-vldrw.u32 Q6, [r0, #(4 * 66)]
-vsub.u16 Q1, Q1, Q6
-vmla.s16 Q1, Q5, r14
-vadd.u16 Q3, Q3, Q4
-vmla.s16 Q0, Q3, r6
-vsub.u16 Q3, Q3, Q6
-vmla.s16 Q1, Q1, r4
-vsub.u16 Q3, Q3, Q5
-vldrw.u32 Q6, [r0, #(4 * 38)]
-vadd.u16 Q1, Q1, Q2
-vmla.s16 Q1, Q3, r11
-vldrw.u32 Q5, [r0, #(4 * -26)]
-vmla.s16 Q0, Q3, r12
-vshr.u16 Q1, Q1, #3
-vmul.u16 Q1, Q1, r10
-vstrw.u32 Q1, [r0,#(-248)]
-vadd.u16 Q2, Q2, Q0
-vmla.s16 Q0, Q4, r9
-vsub.u16 Q3, Q3, Q1
-vstrw.u32 Q3, [r0,#(8)]
-vshr.u16 Q0, Q0, #1
-vmul.u16 Q0, Q0, r5
-vneg.s16 Q2, Q2
-vldrw.u32 Q1, [r0, #(4 * -58)]
-vadd.u16 Q4, Q4, Q0
-vmla.s16 Q2, Q0, r8
-vneg.s16 Q4, Q4
-vstrw.u32 Q4, [r0,#(-120)]
-vshr.u16 Q2, Q2, #2
-vmul.u16 Q2, Q2, r7
-vstrw.u32 Q2, [r0,#(-376)]
-vsub.u16 Q0, Q0, Q2
-vstrw.u32 Q0, [r0,#(136)]
-vadd.u16 Q6, Q6, Q1
-vldrw.u32 Q0, [r0, #(4 * -90)]
-vsub.u16 Q0, Q0, Q1
-vldrw.u32 Q2, [r0, #(4 * 6)]
-vsub.u16 Q5, Q5, Q2
-vldrw.u32 Q3, [r0, #(4 * -122)]
-vshr.u16 Q5, Q5, #1
-vldrw.u32 Q4, [r0, #(4 * 70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #(4 * 42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #(4 * -22)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-232)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(24)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -54)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-360)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -86)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 10)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #(4 * 74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #(4 * 46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #(4 * -18)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-216)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(40)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -50)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-344)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -82)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 14)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #(4 * 78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, 
r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #(4 * 50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #(4 * -14)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-200)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(56)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -46)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-328)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 18)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #(4 * 82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #(4 * 54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #(4 * -10)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-184)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(72)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -42)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-312)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 22)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #(4 * 86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r0, #(4 * 58)] -vadd.u16 Q1, Q1, 
Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q3, [r0, #(4 * -6)] -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-168)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(88)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -38)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-296)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 26)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r0, #(4 * 90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r14 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r6 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r0, #(4 * 62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vldrw.u32 Q5, [r0, #(4 * -2)] -vmla.s16 Q4, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-152)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(104)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r5 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * -34)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r8 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-280)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r0,#(232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * 30)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r0, #(4 * 94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r14 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r6 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r4 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r11 -vmla.s16 Q6, Q2, r12 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r10 -vstrw.u32 Q1, [r0,#(-136)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 
Q6, Q5, r9 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r0,#(120)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r5 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r8 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r7 -vstrw.u32 Q0, [r0,#(-264)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r0,#(248)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_half_512.s b/tests/toom/auto/poly_u16_toom4_inv_half_512.s deleted file mode 100644 index df3c8ab..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_half_512.s +++ /dev/null @@ -1,661 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_512_mve, %function -.global poly_u16_toom4_inv_half_512_mve -poly_u16_toom4_inv_half_512_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -mov r12, #-64 -mov r11, #45 -mov r10, #-8 -mov r9, #43691 -mov r8, #16 -mov r7, #30 -mov r6, #61167 -mov r5, #-65 -mov r4, #36409 -mov r3, #1 -vldrw.u32 Q0, [r14, #(4 * -58)] -vldrw.u32 Q1, [r0, #(4 * 2)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -62)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * -122)] -vldrw.u32 Q4, [r0, #(4 * 66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r14, #(4 * 6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r5 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * -54)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r10 -vldrw.u32 Q5, [r0, #(4 * 70)] -vmla.s16 Q0, Q3, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(8)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r8 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(-488)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r4 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 6)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r7 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r0,#(264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r6 -vstrw.u32 Q2, [r0,#(-248)] -vsub.u16 Q0, Q0, Q2 
-vstrw.u32 Q0, [r14,#(-232)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -118)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 74)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(24)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-472)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 10)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-232)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-216)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -114)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 78)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(40)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-456)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 14)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-216)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-200)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -50)] -vsub.u16 Q0, Q0, Q1 
-vldrw.u32 Q2, [r14, #(4 * -110)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 82)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(56)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-440)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 18)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-200)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-184)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -106)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 86)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(72)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-424)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 22)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-184)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-168)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -102)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, 
Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 90)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(88)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-408)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 26)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-168)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-152)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -98)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 94)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(104)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-392)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 30)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(360)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-152)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-136)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -94)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 
-vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 98)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(120)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-376)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 34)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(376)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-136)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-120)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -90)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 102)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(136)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-360)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 38)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(392)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-120)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-104)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -86)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 42)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, 
#(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 106)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(152)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-344)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 42)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(408)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-104)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-88)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -82)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 110)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(168)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-328)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 46)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(424)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-88)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-72)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -78)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 114)] -vmla.s16 Q6, Q2, 
r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(184)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-312)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 50)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(440)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-56)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -74)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * -6)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 118)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(200)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-296)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 54)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(456)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-40)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -70)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 58)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * -2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q3, [r0, #(4 * 122)] -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(216)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, 
Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-280)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 58)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r0,#(472)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(-24)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -66)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r14, #(4 * 62)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r12 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r5 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 2)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vldrw.u32 Q5, [r0, #(4 * 126)] -vmla.s16 Q4, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(232)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-264)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r4 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 62)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r7 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r0,#(488)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(-8)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * -62)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r14, #(4 * 66)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r12 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r5 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r3 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r10 -vmla.s16 Q6, Q2, r11 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r9 -vstrw.u32 Q1, [r0,#(248)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r8 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(-248)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r4 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r7 -vneg.s16 Q5, Q5 
-vstrw.u32 Q5, [r0,#(504)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r6 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(8)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_half_768.s b/tests/toom/auto/poly_u16_toom4_inv_half_768.s deleted file mode 100644 index 164402e..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_half_768.s +++ /dev/null @@ -1,982 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_768_mve, %function -.global poly_u16_toom4_inv_half_768_mve -poly_u16_toom4_inv_half_768_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #-64 -mov r10, #45 -mov r9, #-8 -mov r8, #43691 -mov r7, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r14, #(4 * 102)] -vldrw.u32 Q1, [r0, #(4 * 66)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -30)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * 6)] -vldrw.u32 Q4, [r14, #(4 * -90)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r12, #(4 * -54)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r14, #(4 * 106)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r9 -vldrw.u32 Q5, [r14, #(4 * -86)] -vmla.s16 Q0, Q3, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(264)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r7 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(24)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 70)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(-360)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [r0,#(-120)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r14,#(408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 10)] -vsub.u16 Q5, Q5, Q2 
-vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -50)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * 110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -82)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(40)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-344)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-104)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 14)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -46)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -78)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(56)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-328)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-88)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 18)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -42)] -vsub.u16 Q1, Q1, 
Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * 118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -74)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(72)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-312)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 22)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -38)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r14, #(4 * 122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -70)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(88)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-296)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 26)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -34)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, 
r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r14, #(4 * 126)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -66)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(104)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-280)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r14,#(488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 30)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -30)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -122)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -62)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(120)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-264)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r14,#(504)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 34)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -26)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -118)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 
-vldrw.u32 Q3, [r14, #(4 * -58)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(136)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-488)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 38)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -22)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -114)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -54)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(152)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-472)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -18)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -110)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -50)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 
Q1, [r0,#(408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(168)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-456)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -14)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -106)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -46)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -10)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -42)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, 
[r14,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-184)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * -6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -38)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -34)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 122)] 
-vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -30)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 126)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -26)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(504)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-120)] -vshr.u16 Q0, Q0, 
#2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -22)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -18)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, 
[r12,#(-328)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 78)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -14)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -10)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 
Q2, [r14, #(4 * 86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -6)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -2)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 
-vldrw.u32 Q6, [r12, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 2)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-232)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/auto/poly_u16_toom4_inv_half_832.s b/tests/toom/auto/poly_u16_toom4_inv_half_832.s deleted file mode 100644 index 3181d69..0000000 --- a/tests/toom/auto/poly_u16_toom4_inv_half_832.s +++ /dev/null @@ -1,1062 +0,0 @@ -.syntax unified -.type poly_u16_toom4_inv_half_832_mve, %function -.global 
poly_u16_toom4_inv_half_832_mve -poly_u16_toom4_inv_half_832_mve: -push {r4-r11,lr} -vpush {d8-d15} -add r0, r0, #504 -add r14, r0, #1008 -add r12, r14, #1008 -mov r11, #-64 -mov r10, #45 -mov r9, #-8 -mov r8, #43691 -mov r7, #16 -mov r6, #30 -mov r5, #61167 -mov r4, #-65 -mov r3, #36409 -mov r2, #1 -vldrw.u32 Q0, [r12, #(4 * -110)] -vldrw.u32 Q1, [r0, #(4 * 82)] -vadd.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r0, #(4 * -22)] -vsub.u16 Q2, Q2, Q1 -vldrw.u32 Q3, [r14, #(4 * 38)] -vldrw.u32 Q4, [r14, #(4 * -66)] -vsub.u16 Q4, Q4, Q3 -vldrw.u32 Q5, [r0, #(4 * -126)] -vshr.u16 Q4, Q4, #1 -vldrw.u32 Q6, [r12, #(4 * -6)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q3, r4 -vsub.u16 Q3, Q3, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q3, Q3, Q5 -vldrw.u32 Q6, [r12, #(4 * -106)] -vadd.u16 Q1, Q1, Q2 -vmla.s16 Q1, Q3, r9 -vldrw.u32 Q5, [r14, #(4 * -62)] -vmla.s16 Q0, Q3, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(328)] -vadd.u16 Q2, Q2, Q0 -vmla.s16 Q0, Q4, r7 -vsub.u16 Q3, Q3, Q1 -vstrw.u32 Q3, [r14,#(152)] -vshr.u16 Q0, Q0, #1 -vmul.u16 Q0, Q0, r3 -vneg.s16 Q2, Q2 -vldrw.u32 Q1, [r0, #(4 * 86)] -vadd.u16 Q4, Q4, Q0 -vmla.s16 Q2, Q0, r6 -vneg.s16 Q4, Q4 -vstrw.u32 Q4, [r14,#(-264)] -vshr.u16 Q2, Q2, #2 -vmul.u16 Q2, Q2, r5 -vstrw.u32 Q2, [r0,#(-88)] -vsub.u16 Q0, Q0, Q2 -vstrw.u32 Q0, [r12,#(-440)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 42)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -122)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * -2)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -102)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -58)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(168)] 
-vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-248)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-72)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-424)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 46)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -118)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 2)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -98)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -54)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(184)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-232)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-56)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-408)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 50)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -114)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 6)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -94)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -50)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(200)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 98)] -vadd.u16 Q5, 
Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-216)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-40)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-392)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * -6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 54)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -110)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 10)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -90)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -46)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(216)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-200)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-24)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-376)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * -2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 58)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -106)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 14)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -86)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -42)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(232)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-184)] -vshr.u16 Q0, Q0, #2 
-vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(-8)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-360)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 2)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 62)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -102)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 18)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -82)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -38)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(248)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-168)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(8)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-344)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 6)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 66)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -98)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 22)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -78)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -34)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(264)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-152)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(24)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-328)] -vadd.u16 
Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 10)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 70)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -94)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 26)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -74)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -30)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(280)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(40)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-312)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 14)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 74)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -90)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 30)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -70)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -26)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(296)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 122)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(56)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-296)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 18)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 78)] -vsub.u16 
Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -86)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 34)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -66)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -22)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(312)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r0, #(4 * 126)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(72)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-280)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 22)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 82)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -82)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 38)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -62)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -18)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r0,#(504)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(328)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -122)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(88)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-264)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 26)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 86)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -78)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 42)] -vsub.u16 
Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -58)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -14)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-488)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(344)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -118)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(104)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-248)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 30)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 90)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -74)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 46)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -54)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -10)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-472)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(360)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -114)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(120)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-232)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 34)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 94)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -70)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 50)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 
-vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -50)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * -6)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-456)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(376)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -110)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(136)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-216)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 38)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 98)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -66)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 54)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -46)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * -2)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-440)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(392)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -106)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(-24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(152)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-200)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 42)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 102)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -62)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 58)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -42)] -vadd.u16 Q1, Q1, Q0 
-vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 2)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-424)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(408)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -102)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(-8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(168)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-184)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 46)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 106)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -58)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 62)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -38)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 6)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-408)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(424)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -98)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(8)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(184)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-168)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 50)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 110)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -54)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 66)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -34)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 10)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 
Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-392)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(440)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -94)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(24)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(200)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-152)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 54)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -50)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 70)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -30)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 14)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-376)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -90)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(40)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(216)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-136)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 58)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -46)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 74)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -26)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 18)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-360)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 
-vstrw.u32 Q2, [r14,#(472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -86)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(56)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(232)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-120)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 62)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 122)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -42)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 78)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -22)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 22)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-344)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -82)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(72)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(248)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-104)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 66)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r14, #(4 * 126)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -38)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 82)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -18)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 26)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-328)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r14,#(504)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, 
[r14, #(4 * -78)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, [r14,#(88)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(264)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-88)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 70)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -122)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -34)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 86)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vldrw.u32 Q4, [r12, #(4 * -14)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q3, [r14, #(4 * 30)] -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-312)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-488)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -74)] -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(104)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(280)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-72)] -vadd.u16 Q4, Q4, Q1 -vldrw.u32 Q0, [r0, #(4 * 74)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -118)] -vsub.u16 Q3, Q3, Q2 -vldrw.u32 Q5, [r0, #(4 * -30)] -vshr.u16 Q3, Q3, #1 -vldrw.u32 Q6, [r12, #(4 * 90)] -vsub.u16 Q1, Q1, Q6 -vmla.s16 Q1, Q5, r11 -vadd.u16 Q2, Q2, Q3 -vmla.s16 Q4, Q2, r4 -vsub.u16 Q2, Q2, Q6 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q5 -vldrw.u32 Q6, [r12, #(4 * -10)] -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vldrw.u32 Q5, [r14, #(4 * 34)] -vmla.s16 Q4, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-296)] -vadd.u16 Q0, Q0, Q4 -vmla.s16 Q4, Q3, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-472)] -vshr.u16 Q4, Q4, #1 -vmul.u16 Q4, Q4, r3 -vneg.s16 Q0, Q0 -vldrw.u32 Q1, [r14, #(4 * -70)] -vadd.u16 Q3, Q3, Q4 -vmla.s16 Q0, Q4, r6 -vneg.s16 Q3, Q3 -vstrw.u32 Q3, 
[r14,#(120)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(296)] -vsub.u16 Q4, Q4, Q0 -vstrw.u32 Q4, [r12,#(-56)] -vadd.u16 Q6, Q6, Q1 -vldrw.u32 Q0, [r0, #(4 * 78)] -vsub.u16 Q0, Q0, Q1 -vldrw.u32 Q2, [r12, #(4 * -114)] -vsub.u16 Q5, Q5, Q2 -vldrw.u32 Q3, [r0, #(4 * -26)] -vshr.u16 Q5, Q5, #1 -vldrw.u32 Q4, [r12, #(4 * 94)] -vsub.u16 Q1, Q1, Q4 -vmla.s16 Q1, Q3, r11 -vadd.u16 Q2, Q2, Q5 -vmla.s16 Q6, Q2, r4 -vsub.u16 Q2, Q2, Q4 -vmla.s16 Q1, Q1, r2 -vsub.u16 Q2, Q2, Q3 -vadd.u16 Q1, Q1, Q0 -vmla.s16 Q1, Q2, r9 -vmla.s16 Q6, Q2, r10 -vshr.u16 Q1, Q1, #3 -vmul.u16 Q1, Q1, r8 -vstrw.u32 Q1, [r14,#(-280)] -vadd.u16 Q0, Q0, Q6 -vmla.s16 Q6, Q5, r7 -vsub.u16 Q2, Q2, Q1 -vstrw.u32 Q2, [r12,#(-456)] -vshr.u16 Q6, Q6, #1 -vmul.u16 Q6, Q6, r3 -vneg.s16 Q0, Q0 -vadd.u16 Q5, Q5, Q6 -vmla.s16 Q0, Q6, r6 -vneg.s16 Q5, Q5 -vstrw.u32 Q5, [r14,#(136)] -vshr.u16 Q0, Q0, #2 -vmul.u16 Q0, Q0, r5 -vstrw.u32 Q0, [r0,#(312)] -vsub.u16 Q6, Q6, Q0 -vstrw.u32 Q6, [r12,#(-40)] -vpop {d8-d15} -pop {r4-r11,lr} -bx lr \ No newline at end of file diff --git a/tests/toom/toom.mk b/tests/toom/toom.mk index 776d6b1..34fc2f6 100644 --- a/tests/toom/toom.mk +++ b/tests/toom/toom.mk @@ -11,45 +11,22 @@ TOOM_PLATFORMS += m85-an555 TOOM_SOURCES += main.c # Assembly sources required for this test -# TODO: not all these are required; delete the other ones? 
-# TODO: should move those to the asm dir -TOOM_ASMS += auto/poly_u16_mul_64_toom4_mve.s -TOOM_ASMS += auto/poly_u16_mul_192_toom3_mve.s -TOOM_ASMS += auto/poly_u16_mul_256_toom4_mve.s -TOOM_ASMS += auto/poly_u16_mul_512_toom4_mve.s -TOOM_ASMS += auto/poly_u16_mul_768_toom3_mve.s -TOOM_ASMS += auto/poly_u16_mul_768_toom4_mve.s -TOOM_ASMS += auto/poly_u16_mul_832_toom4_mve.s -TOOM_ASMS += auto/poly_u16_toom3_fwd_192.s -TOOM_ASMS += auto/poly_u16_toom3_fwd_768.s -TOOM_ASMS += auto/poly_u16_toom3_inv_full_192.s -TOOM_ASMS += auto/poly_u16_toom3_inv_full_768.s -TOOM_ASMS += auto/poly_u16_toom3_inv_half_192.s -TOOM_ASMS += auto/poly_u16_toom3_inv_half_768.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_bottom.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_packed_oop.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_top_oop.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256_dual_top.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_256.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_512.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_768.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_832.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s -TOOM_ASMS += auto/poly_u16_toom4_fwd_oop_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_dual_bottom_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_dual_bottom_oop_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_dual_top_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_dual_top_oop_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_full_256.s -TOOM_ASMS += auto/poly_u16_toom4_inv_full_512.s -TOOM_ASMS += auto/poly_u16_toom4_inv_full_768.s -TOOM_ASMS += auto/poly_u16_toom4_inv_full_832.s -TOOM_ASMS += auto/poly_u16_toom4_inv_half_256.s 
-TOOM_ASMS += auto/poly_u16_toom4_inv_half_512.s -TOOM_ASMS += auto/poly_u16_toom4_inv_half_768.s -TOOM_ASMS += auto/poly_u16_toom4_inv_half_832.s \ No newline at end of file +TOOM_ASM_DIR=../../asm/auto/poly/toom4 +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_mul_256_toom4_mve.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_bottom.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x1_oop.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_karatsuba_x2_oop.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_packed_limbs_oop.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_top_oop.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256_dual_top.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_karatsuba_x1_oop_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_karatsuba_x2_oop_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_fwd_oop_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_dual_bottom_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_dual_bottom_oop_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_dual_packed_limbs_oop_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_dual_top_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_dual_top_oop_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_full_256.s +TOOM_ASMS += $(TOOM_ASM_DIR)/poly_u16_toom4_inv_half_256.s From e208dfeede889b31c0b32f3762e3862c21296c69 Mon Sep 17 00:00:00 2001 From: "Matthias J. 
Kannwischer" Date: Thu, 18 Jul 2024 15:31:12 +0800 Subject: [PATCH 25/32] add missing unpack test --- Makefile | 1 + tests/unpack/auto/unpack_22_2048_96.s | 261 ------------- tests/unpack/auto/unpack_22_4096_192.s | 495 ------------------------ tests/unpack/auto/unpack_22_4224_192.s | 505 ------------------------- tests/unpack/main.c | 2 +- tests/unpack/unpack.mk | 17 + 6 files changed, 19 insertions(+), 1262 deletions(-) delete mode 100644 tests/unpack/auto/unpack_22_2048_96.s delete mode 100644 tests/unpack/auto/unpack_22_4096_192.s delete mode 100644 tests/unpack/auto/unpack_22_4224_192.s create mode 100644 tests/unpack/unpack.mk diff --git a/Makefile b/Makefile index 41fdbdf..c8dea1b 100644 --- a/Makefile +++ b/Makefile @@ -26,6 +26,7 @@ include tests/poly/poly.mk include tests/sqmag/sqmag.mk include tests/toom/toom.mk include tests/transpose/transpose.mk +include tests/unpack/unpack.mk testname = $(shell echo $(1) | tr '[a-z]' '[A-Z]' | tr '-' '_') testdir = $(addprefix $(2),tests/$(1)/) diff --git a/tests/unpack/auto/unpack_22_2048_96.s b/tests/unpack/auto/unpack_22_2048_96.s deleted file mode 100644 index 05b0fea..0000000 --- a/tests/unpack/auto/unpack_22_2048_96.s +++ /dev/null @@ -1,261 +0,0 @@ - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.syntax unified -.type unpack_22_2048_96, %function -.global unpack_22_2048_96 -unpack_22_2048_96: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -add r1, r1, #256 -add r0, r0, #384 -movs.n r2, #0 -movs.n r3, #0 -// Overhead 2 -strd r2, r3, [r0,#-8]! -vldrw.u32 Q0, [r1, #-16]! -vshlc Q0, r3, #2 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #16 -str r2, [r0, #-4]! -vshlc Q1, r4, #6 -vldrw.u32 Q0, [r1, #-16]! 
-vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #6 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #12 -str r3, [r0, #-4]! -vshlc Q0, r2, #10 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #10 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #8 -str r4, [r0, #-4]! -vshlc Q1, r3, #14 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #14 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #4 -str r2, [r0, #-4]! -vshlc Q0, r4, #18 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #18 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #18 -str r2, [r0, #-4]! -vldrw.u32 Q0, [r1, #-16]! -vshlc Q0, r4, #4 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #4 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #14 -str r3, [r0, #-4]! -vshlc Q1, r2, #8 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #8 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #10 -str r4, [r0, #-4]! 
-vshlc Q0, r3, #12 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #12 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #6 -str r2, [r0, #-4]! -vshlc Q1, r4, #16 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #16 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #2 -str r3, [r0, #-4]! -vshlc Q0, r2, #20 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #20 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #20 -str r3, [r0, #-4]! -vshlc Q1, r2, #2 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #2 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #16 -str r4, [r0, #-4]! -vshlc Q0, r3, #6 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #6 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #12 -str r2, [r0, #-4]! -vshlc Q1, r4, #10 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #10 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #8 -str r3, [r0, #-4]! -vshlc Q0, r2, #14 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #14 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! 
-vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #4 -str r4, [r0, #-4]! -vshlc Q1, r3, #18 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #18 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -str r2, [r0, #-4]! -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 253 -// Instruction count: 248 \ No newline at end of file diff --git a/tests/unpack/auto/unpack_22_4096_192.s b/tests/unpack/auto/unpack_22_4096_192.s deleted file mode 100644 index 3fcef96..0000000 --- a/tests/unpack/auto/unpack_22_4096_192.s +++ /dev/null @@ -1,495 +0,0 @@ - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.syntax unified -.type unpack_22_4096_192, %function -.global unpack_22_4096_192 -unpack_22_4096_192: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -add r1, r1, #512 -add r0, r0, #768 -movs.n r2, #0 -movs.n r3, #0 -// Overhead 5 -strd r2, r3, [r0,#-8]! -strd r2, r3, [r0,#-8]! -str r2, [r0,#-4]! -vldrw.u32 Q0, [r1, #-16]! -vshlc Q0, r3, #4 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #14 -str r2, [r0, #-4]! -vshlc Q1, r4, #8 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #8 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #10 -str r3, [r0, #-4]! -vshlc Q0, r2, #12 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #12 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! 
-vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #6 -str r4, [r0, #-4]! -vshlc Q1, r3, #16 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #16 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #2 -str r2, [r0, #-4]! -vshlc Q0, r4, #20 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #20 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #20 -str r2, [r0, #-4]! -vshlc Q1, r4, #2 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #2 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #16 -str r3, [r0, #-4]! -vshlc Q0, r2, #6 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #6 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #12 -str r4, [r0, #-4]! -vshlc Q1, r3, #10 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #10 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #8 -str r2, [r0, #-4]! -vshlc Q0, r4, #14 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #14 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #4 -str r3, [r0, #-4]! -vshlc Q1, r2, #18 -vldrw.u32 Q0, [r1, #-16]! 
-vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #18 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #18 -str r3, [r0, #-4]! -vldrw.u32 Q1, [r1, #-16]! -vshlc Q1, r2, #4 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #4 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #14 -str r4, [r0, #-4]! -vshlc Q0, r3, #8 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #8 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #10 -str r2, [r0, #-4]! -vshlc Q1, r4, #12 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #12 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #6 -str r3, [r0, #-4]! -vshlc Q0, r2, #16 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #16 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #2 -str r4, [r0, #-4]! -vshlc Q1, r3, #20 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #20 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #20 -str r4, [r0, #-4]! 
-vshlc Q0, r3, #2 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #2 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #16 -str r2, [r0, #-4]! -vshlc Q1, r4, #6 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #6 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #12 -str r3, [r0, #-4]! -vshlc Q0, r2, #10 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #10 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #8 -str r4, [r0, #-4]! -vshlc Q1, r3, #14 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #14 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #4 -str r2, [r0, #-4]! -vshlc Q0, r4, #18 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #18 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #18 -str r2, [r0, #-4]! -vldrw.u32 Q0, [r1, #-16]! -vshlc Q0, r4, #4 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #4 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! 
-vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #14 -str r3, [r0, #-4]! -vshlc Q1, r2, #8 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #8 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #10 -str r4, [r0, #-4]! -vshlc Q0, r3, #12 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #12 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #6 -str r2, [r0, #-4]! -vshlc Q1, r4, #16 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #16 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #2 -str r3, [r0, #-4]! -vshlc Q0, r2, #20 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #20 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #20 -str r3, [r0, #-4]! -vshlc Q1, r2, #2 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #2 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #16 -str r4, [r0, #-4]! -vshlc Q0, r3, #6 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #6 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #12 -str r2, [r0, #-4]! -vshlc Q1, r4, #10 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #10 -str r4, [r0, #-4]! 
-vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #8 -str r3, [r0, #-4]! -vshlc Q0, r2, #14 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #14 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #4 -str r4, [r0, #-4]! -vshlc Q1, r3, #18 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #18 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -str r2, [r0, #-4]! -// Restore MVE vector registers -vpop {d8-d15} -// Restore GPRs -pop {r4-r11,lr} -bx lr - -// Line count: 487 -// Instruction count: 482 \ No newline at end of file diff --git a/tests/unpack/auto/unpack_22_4224_192.s b/tests/unpack/auto/unpack_22_4224_192.s deleted file mode 100644 index a3592dc..0000000 --- a/tests/unpack/auto/unpack_22_4224_192.s +++ /dev/null @@ -1,505 +0,0 @@ - -/// -/// This assembly code has been auto-generated. -/// Don't modify it directly. -/// - -.syntax unified -.type unpack_22_4224_192, %function -.global unpack_22_4224_192 -unpack_22_4224_192: -// Save GPRs -push {r4-r11,lr} -// Save MVE vector registers -vpush {d8-d15} -add r1, r1, #528 -add r0, r0, #768 -movs.n r2, #0 -movs.n r3, #0 -// Overhead 0 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q0, r3, #22 -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #18 -str r4, [r0, #-4]! -vldrw.u32 Q1, [r1, #-16]! -vshlc Q1, r3, #4 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #4 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! 
-vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #14 -str r2, [r0, #-4]! -vshlc Q0, r4, #8 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #8 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #10 -str r3, [r0, #-4]! -vshlc Q1, r2, #12 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #12 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #6 -str r4, [r0, #-4]! -vshlc Q0, r3, #16 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #16 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #2 -str r2, [r0, #-4]! -vshlc Q1, r4, #20 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #20 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #20 -str r2, [r0, #-4]! -vshlc Q0, r4, #2 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #2 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #16 -str r3, [r0, #-4]! -vshlc Q1, r2, #6 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #6 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #12 -str r4, [r0, #-4]! -vshlc Q0, r3, #10 -vldrw.u32 Q1, [r1, #-16]! 
-vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #10 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #8 -str r2, [r0, #-4]! -vshlc Q1, r4, #14 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #14 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #4 -str r3, [r0, #-4]! -vshlc Q0, r2, #18 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #18 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #18 -str r3, [r0, #-4]! -vldrw.u32 Q0, [r1, #-16]! -vshlc Q0, r2, #4 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #4 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #14 -str r4, [r0, #-4]! -vshlc Q1, r3, #8 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #8 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #10 -str r2, [r0, #-4]! -vshlc Q0, r4, #12 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #12 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #6 -str r3, [r0, #-4]! 
-vshlc Q1, r2, #16 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #16 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #2 -str r4, [r0, #-4]! -vshlc Q0, r3, #20 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #20 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #20 -str r4, [r0, #-4]! -vshlc Q1, r3, #2 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #2 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #16 -str r2, [r0, #-4]! -vshlc Q0, r4, #6 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #6 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #12 -str r3, [r0, #-4]! -vshlc Q1, r2, #10 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #10 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #8 -str r4, [r0, #-4]! -vshlc Q0, r3, #14 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #14 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #4 -str r2, [r0, #-4]! -vshlc Q1, r4, #18 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #18 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! 
-vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #18 -str r2, [r0, #-4]! -vldrw.u32 Q1, [r1, #-16]! -vshlc Q1, r4, #4 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r2, #22 -eor r4, r4, r3, LSL #4 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #14 -str r3, [r0, #-4]! -vshlc Q0, r2, #8 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #8 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #10 -str r4, [r0, #-4]! -vshlc Q1, r3, #12 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #12 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #6 -str r2, [r0, #-4]! -vshlc Q0, r4, #16 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #16 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #2 -str r3, [r0, #-4]! -vshlc Q1, r2, #20 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #20 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #20 -str r3, [r0, #-4]! -vshlc Q0, r2, #2 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r3, #22 -eor r2, r2, r4, LSL #2 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! 
-vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #16 -str r4, [r0, #-4]! -vshlc Q1, r3, #6 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r4, #22 -eor r3, r3, r2, LSL #6 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #12 -str r2, [r0, #-4]! -vshlc Q0, r4, #10 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r2, #22 -eor r4, r4, r3, LSL #10 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #8 -str r3, [r0, #-4]! -vshlc Q1, r2, #14 -vldrw.u32 Q0, [r1, #-16]! -vshlc Q1, r3, #22 -eor r2, r2, r4, LSL #14 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #22 -str r4, [r0, #-4]! -vshlc Q1, r3, #22 -str r2, [r0, #-4]! -vshlc Q1, r4, #22 -str r3, [r0, #-4]! -vshlc Q1, r2, #4 -str r4, [r0, #-4]! -vshlc Q0, r3, #18 -vldrw.u32 Q1, [r1, #-16]! -vshlc Q0, r4, #22 -eor r3, r3, r2, LSL #18 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -vshlc Q0, r3, #22 -str r2, [r0, #-4]! -vshlc Q0, r4, #22 -str r3, [r0, #-4]! -vshlc Q0, r2, #22 -str r4, [r0, #-4]! -str r2, [r0, #-4]! 
-// Restore MVE vector registers
-vpop {d8-d15}
-// Restore GPRs
-pop {r4-r11,lr}
-bx lr
-
-// Line count: 497
-// Instruction count: 492
\ No newline at end of file
diff --git a/tests/unpack/main.c b/tests/unpack/main.c
index 4af6a62..0a9c196 100644
--- a/tests/unpack/main.c
+++ b/tests/unpack/main.c
@@ -98,6 +98,6 @@ int main(void)
     ret = test_unpack();
     if( ret != 0 )
         return( 1 );
-
+
     debug_printf( "ALL GOOD!\n" );
     return( 0 );
 }
diff --git a/tests/unpack/unpack.mk b/tests/unpack/unpack.mk
new file mode 100644
index 0000000..9a76b4f
--- /dev/null
+++ b/tests/unpack/unpack.mk
@@ -0,0 +1,17 @@
+# Test name - needs to match the directory name
+TESTS += unpack
+
+# All further variables must be prefixed with the capitalized test name
+
+# Platforms this test should run on (matching the directory name in envs/)
+UNPACK_PLATFORMS += m55-an547
+UNPACK_PLATFORMS += m85-an555
+
+# C sources required for this test
+UNPACK_SOURCES += main.c
+
+# Assembly sources required for this test
+UNPACK_ASM_DIR=../../asm/auto/unpack/
+UNPACK_ASMS += $(UNPACK_ASM_DIR)unpack_22_2048_96.s
+UNPACK_ASMS += $(UNPACK_ASM_DIR)unpack_22_4096_192.s
+UNPACK_ASMS += $(UNPACK_ASM_DIR)unpack_22_4224_192.s

From 90e486dfaa40f8515096f94c01b20de2e1345b81 Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Thu, 18 Jul 2024 15:36:26 +0800
Subject: [PATCH 26/32] separate build artifact by target elf to work around
 collisions

---
 envs/m55-an547/Makefile | 2 +-
 envs/m85-an555/Makefile | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/envs/m55-an547/Makefile b/envs/m55-an547/Makefile
index 38fb522..7ddbf87 100644
--- a/envs/m55-an547/Makefile
+++ b/envs/m55-an547/Makefile
@@ -4,7 +4,7 @@ CC = arm-none-eabi-gcc
 LD := $(CC)
 
 SRC_DIR=./src
-BUILD_DIR=./build
+BUILD_DIR=./build/$(TARGET)
 
 COMMON_INC=../common/inc/
 ENV_INC=./inc/
diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile
index 838eed4..4a1607d 100644
--- a/envs/m85-an555/Makefile
+++ b/envs/m85-an555/Makefile
@@ -4,7 +4,7 @@ CC = arm-none-eabi-gcc
 LD := $(CC)
 
 SRC_DIR=./src
-BUILD_DIR=./build
+BUILD_DIR=./build/$(TARGET)
 
 COMMON_INC=../common/inc/
 ENV_INC=./inc/

From d3d51b16107654343af9f80542686a9eb66e55fd Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Thu, 18 Jul 2024 15:43:29 +0800
Subject: [PATCH 27/32] remove more duplicate assembly

---
 Makefile | 2 +-
 asm/manual/fx_fft/base_symbolic.s | 93 +-
 tests/flt-fft/base_concrete.s.old | 73 -
 tests/flt-fft/base_ref.s | 73 -
 .../floatingpoint_radix4_fft_opt_M55.s | 194 -
 .../floatingpoint_radix4_fft_opt_M85.s | 190 -
 tests/flt-fft/flt-fft.mk | 7 +-
 tests/intmulntt/intmulntt.mk | 2 +-
 tests/intmulntt/montgomery.s | 3647 -----------------
 tests/montgomery/montgomery.mk | 3 +-
 tests/montgomery/montgomery.s | 3640 ----------------
 tests/ntt-1024/montgomery.s | 3647 -----------------
 tests/ntt-1024/ntt-1024.mk | 2 +-
 tests/ntt-192/montgomery.s | 3647 -----------------
 tests/ntt-192/ntt-192.mk | 2 +-
 tests/ntt-384/montgomery.s | 3647 -----------------
 tests/ntt-384/ntt-384.mk | 2 +-
 tests/ntt-768/montgomery.s | 3647 -----------------
 tests/ntt-768/ntt-768.mk | 2 +-
 tests/poly/montgomery.s | 3640 ----------------
 tests/poly/poly.mk | 2 +-
 21 files changed, 13 insertions(+), 26149 deletions(-)
 mode change 100644 =>
 120000 asm/manual/fx_fft/base_symbolic.s
 delete mode 100644 tests/flt-fft/base_concrete.s.old
 delete mode 100644 tests/flt-fft/base_ref.s
 delete mode 100644 tests/flt-fft/floatingpoint_radix4_fft_opt_M55.s
 delete mode 100644 tests/flt-fft/floatingpoint_radix4_fft_opt_M85.s
 delete mode 100644 tests/intmulntt/montgomery.s
 delete mode 100644 tests/montgomery/montgomery.s
 delete mode 100644 tests/ntt-1024/montgomery.s
 delete mode 100644 tests/ntt-192/montgomery.s
 delete mode 100644 tests/ntt-384/montgomery.s
 delete mode 100644 tests/ntt-768/montgomery.s
 delete mode 100644 tests/poly/montgomery.s

diff --git a/Makefile b/Makefile
index c8dea1b..710618e 100644
--- a/Makefile
+++ b/Makefile
@@ -9,7 +9,7 @@ include tests/fx-fft/fx-fft.mk
 include tests/helloworld/helloworld.mk
 include tests/intmulntt/intmulntt.mk
 include tests/karatsuba/karatsuba.mk
-#include tests/montgomery/montgomery.mk
+# include tests/montgomery/montgomery.mk
 include tests/ntt-192/ntt-192.mk
 include tests/ntt-256/ntt-256.mk
 include tests/ntt-384/ntt-384.mk
diff --git a/asm/manual/fx_fft/base_symbolic.s b/asm/manual/fx_fft/base_symbolic.s
deleted file mode 100644
index 23aeb67..0000000
--- a/asm/manual/fx_fft/base_symbolic.s
+++ /dev/null
@@ -1,92 +0,0 @@
- .syntax unified
- .type fixedpoint_radix4_fft_symbolic, %function
- .global fixedpoint_radix4_fft_symbolic
-
-
- inA .req r0
- pW0 .req r1 // Use the same twiddle data for TESTING ONLY
- sz .req r2
-
- inB .req r3
- inC .req r4
- inD .req r5
-
- pW1 .req r6
- pW2 .req r7
- pW3 .req r8
-
-.macro load_data
- vldrw.s32 qA, [inA]
- vldrw.s32 qB, [inB]
- vldrw.s32 qC, [inC]
- vldrw.s32 qD, [inD]
-.endm
-
-.macro load_twiddles
- vldrw.s32 qTw1, [pW1], #16
- vldrw.s32 qTw2, [pW2], #16
- vldrw.s32 qTw3, [pW3], #16
-.endm
-
-.macro store_data
- vstrw.32 qA, [inA], #16
- vstrw.32 qB, [inB], #16
- vstrw.32 qC, [inC], #16
- vstrw.32 qD, [inD], #16
-.endm
-
-.macro cmul_fx out, in0, in1
- vqdmlsdh.s32 \out, \in0, \in1
- vqdmladhx.s32 \out, \in0, \in1
-.endm
-
.text - .align 4 -fixedpoint_radix4_fft_symbolic: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pW0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.p2align 2 -fixedpoint_radix4_fft_loop_start: - vldrw.s32 q1, [r0] - vldrw.s32 q5, [r3] - vldrw.s32 q4, [r4] - vldrw.s32 q6, [r5] - vhadd.s32 q0, q1, q4 - vhadd.s32 q7, q5, q6 - vhsub.s32 q2, q1, q4 - vhsub.s32 q6, q5, q6 - vhadd.s32 q1, q0, q7 - vhsub.s32 q4, q0, q7 - vldrw.s32 q0, [r7] , #16 - vqdmlsdh.s32 q7, q0, q4 - vqdmladhx.s32 q7, q0, q4 - vhcadd.s32 q0, q2, q6, #270 - vldrw.s32 q5, [r6] , #16 - vqdmlsdh.s32 q4, q5, q0 - vqdmladhx.s32 q4, q5, q0 - vhcadd.s32 q5, q2, q6, #90 - vldrw.s32 q0, [r8] , #16 - vqdmlsdh.s32 q6, q0, q5 - vqdmladhx.s32 q6, q0, q5 - vstrw.u32 q1, [r0] , #16 - vstrw.u32 q7, [r3] , #16 - vstrw.u32 q4, [r4] , #16 - vstrw.u32 q6, [r5] , #16 - le lr, fixedpoint_radix4_fft_loop_start - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/asm/manual/fx_fft/base_symbolic.s b/asm/manual/fx_fft/base_symbolic.s new file mode 120000 index 0000000..46f13d6 --- /dev/null +++ b/asm/manual/fx_fft/base_symbolic.s @@ -0,0 +1 @@ +../../../slothy/examples/opt/fx_r4_fft/base_symbolic.s \ No newline at end of file diff --git a/tests/flt-fft/base_concrete.s.old b/tests/flt-fft/base_concrete.s.old deleted file mode 100644 index c124cf7..0000000 --- a/tests/flt-fft/base_concrete.s.old +++ /dev/null @@ -1,73 +0,0 @@ - .syntax unified - .type floatingpoint_radix4_fft_base, %function - .global floatingpoint_radix4_fft_base - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing symbolic registers, but not - // yet reordering. 
- - .text - .align 4 -floatingpoint_radix4_fft_base: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pw0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.p2align 2 -flt_radix4_fft_loop_start: - vldrw.32 q6, [r0] - vldrw.32 q0, [r4] - vadd.f32 q4, q6, q0 - vldrw.32 q7, [r3] - vsub.f32 q2, q6, q0 - vldrw.32 q0, [r5] - vadd.f32 q6, q7, q0 - vsub.f32 q3, q7, q0 - vadd.f32 q0, q4, q6 - vstrw.u32 q0, [r0] , #16 - vsub.f32 q6, q4, q6 - vldrw.32 q0, [r6] , #16 - vcmul.f32 q1, q0, q6, #0 - vcmla.f32 q1, q0, q6, #270 - vstrw.u32 q1, [r3] , #16 - vcadd.f32 q6, q2, q3, #270 - vldrw.32 q0, [r7] , #16 - vcmul.f32 q1, q0, q6, #0 - vcmla.f32 q1, q0, q6, #270 - vstrw.u32 q1, [r4] , #16 - vcadd.f32 q7, q2, q3, #90 - vldrw.32 q6, [r8] , #16 - vcmul.f32 q0, q6, q7, #0 - vcmla.f32 q0, q6, q7, #270 - vstrw.u32 q0, [r5] , #16 - le lr, flt_radix4_fft_loop_start - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/flt-fft/base_ref.s b/tests/flt-fft/base_ref.s deleted file mode 100644 index 9940c18..0000000 --- a/tests/flt-fft/base_ref.s +++ /dev/null @@ -1,73 +0,0 @@ - .syntax unified - .type floatingpoint_radix4_fft_base, %function - .global floatingpoint_radix4_fft_base - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing ref registers, but not - // yet reordering. 
- - .text - .align 4 -floatingpoint_radix4_fft_base: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pw0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.p2align 2 -flt_radix4_fft_loop_start: - vldrw.32 q1, [r0] - vldrw.32 q0, [r4] - vadd.f32 q3, q1, q0 - vldrw.32 q5, [r3] - vsub.f32 q2, q1, q0 - vldrw.32 q0, [r5] - vadd.f32 q1, q5, q0 - vsub.f32 q6, q5, q0 - vadd.f32 q0, q3, q1 - vstrw.u32 q0, [r0] , #16 - vsub.f32 q1, q3, q1 - vldrw.32 q0, [r6] , #16 - vcmul.f32 q4, q0, q1, #0 - vcmla.f32 q4, q0, q1, #270 - vstrw.u32 q4, [r3] , #16 - vcadd.f32 q1, q2, q6, #270 - vldrw.32 q0, [r7] , #16 - vcmul.f32 q4, q0, q1, #0 - vcmla.f32 q4, q0, q1, #270 - vstrw.u32 q4, [r4] , #16 - vcadd.f32 q5, q2, q6, #90 - vldrw.32 q1, [r8] , #16 - vcmul.f32 q0, q1, q5, #0 - vcmla.f32 q0, q1, q5, #270 - vstrw.u32 q0, [r5] , #16 - le lr, flt_radix4_fft_loop_start - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/flt-fft/floatingpoint_radix4_fft_opt_M55.s b/tests/flt-fft/floatingpoint_radix4_fft_opt_M55.s deleted file mode 100644 index 2126d24..0000000 --- a/tests/flt-fft/floatingpoint_radix4_fft_opt_M55.s +++ /dev/null @@ -1,194 +0,0 @@ - .syntax unified - .type floatingpoint_radix4_fft_opt_M55, %function - .global floatingpoint_radix4_fft_opt_M55 - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing symbolic registers, but not - // yet reordering. 
- - .text - .align 4 -floatingpoint_radix4_fft_opt_M55: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pW0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.macro load_data - vldrw.32 qA, [inA] - vldrw.32 qB, [inB] - vldrw.32 qC, [inC] - vldrw.32 qD, [inD] -.endm - -.macro load_twiddles - vldrw.s32 qTw1, [pW1], #16 - vldrw.s32 qTw2, [pW2], #16 - vldrw.s32 qTw3, [pW3], #16 -.endm - -.macro store_data - vstrw.32 qA, [inA], #16 - vstrw.32 qB, [inB], #16 - vstrw.32 qC, [inC], #16 - vstrw.32 qD, [inD], #16 -.endm - -.macro cmul_flt out, in0, in1 - vcmul.f32 \out, \in0, \in1, #0 - vcmla.f32 \out, \in0, \in1, #270 -.endm - - vldrw.32 q2, [r5] // *.. - // gap // ... - vldrw.32 q6, [r3] // .*. - // gap // ... - vldrw.32 q1, [r4] // ..* - - // original source code - // vldrw.32 q2, [r5] // *.. - // vldrw.32 q6, [r3] // .*. - // vldrw.32 q1, [r4] // ..* - - sub lr, lr, #1 -.p2align 2 -flt_radix4_fft_loop_start: - vadd.f32 q0, q6, q2 // ........*................ - // gap // ......................... - vsub.f32 q6, q6, q2 // ..........*.............. - vldrw.32 q4, [r0] // *........................ - vadd.f32 q5, q4, q1 // .......*................. - vldrw.s32 q3, [r6] , #16 // ....*.................... - vadd.f32 q2, q5, q0 // ...........*............. - vstrw.u32 q2, [r0] , #16 // .....................*... - vsub.f32 q7, q5, q0 // ............*............ - // gap // ......................... - vcmul.f32 q0, q3, q7, #0 // ...............*......... - vldrw.32 q2, [r5, #16] // ...e..................... - vcmla.f32 q0, q3, q7, #270 // ................*........ - vstrw.u32 q0, [r3] , #16 // ......................*.. - vsub.f32 q1, q4, q1 // .........*............... - // gap // ......................... - vcadd.f32 q5, q1, q6, #90 // ..............*.......... - vldrw.s32 q0, [r8] , #16 // ......*.................. - vcadd.f32 q7, q1, q6, #270 // .............*........... 
- vldrw.s32 q4, [r7] , #16 // .....*................... - vcmul.f32 q3, q4, q7, #0 // .................*....... - vldrw.32 q6, [r3] // .e....................... - vcmla.f32 q3, q4, q7, #270 // ..................*...... - vldrw.32 q1, [r4, #16] // ..e...................... - vcmul.f32 q7, q0, q5, #0 // ...................*..... - vstrw.u32 q3, [r4] , #16 // .......................*. - vcmla.f32 q7, q0, q5, #270 // ....................*.... - vstrw.u32 q7, [r5] , #16 // ........................* - - // original source code - // vldrw.32 qA, [r0] // ................|.*...................... - // vldrw.32 qB, [r3] // .........e......|.................e...... - // vldrw.32 qC, [r4] // ...........e....|...................e.... - // vldrw.32 qD, [r5] // e...............|........e............... - // vldrw.s32 qTw1, [r6], #16 // ................|...*.................... - // vldrw.s32 qTw2, [r7], #16 // .......*........|...............*........ - // vldrw.s32 qTw3, [r8], #16 // .....*..........|.............*.......... - // vadd.f32 qSm0, qA, qC // ................|..*..................... - // vadd.f32 qSm1, qB, qD // ................*........................ - // vsub.f32 qDf0, qA, qC // ...*............|...........*............ - // vsub.f32 qDf1, qB, qD // ................|*....................... - // vadd.f32 qA, qSm0, qSm1 // ................|....*................... - // vsub.f32 qBp, qSm0, qSm1 // ................|......*................. - // vcadd.f32 qCp, qDf0, qDf1, #270 // ......*.........|..............*......... - // vcadd.f32 qDp, qDf0, qDf1, #90 // ....*...........|............*........... - // vcmul.f32 qB, qTw1, qBp, #0 // ................|.......*................ - // vcmla.f32 qB, qTw1, qBp, #270 // .*..............|.........*.............. - // vcmul.f32 qC, qTw2, qCp, #0 // ........*.......|................*....... - // vcmla.f32 qC, qTw2, qCp, #270 // ..........*.....|..................*..... 
- // vcmul.f32 qD, qTw3, qDp, #0 // ............*...|....................*... - // vcmla.f32 qD, qTw3, qDp, #270 // ..............*.|......................*. - // vstrw.32 qA, [r0], #16 // ................|.....*.................. - // vstrw.32 qB, [r3], #16 // ..*.............|..........*............. - // vstrw.32 qC, [r4], #16 // .............*..|.....................*.. - // vstrw.32 qD, [r5], #16 // ...............*|.......................* - - le lr, flt_radix4_fft_loop_start - vadd.f32 q5, q6, q2 // *..................... - vldrw.32 q3, [r0] // ..*................... - vsub.f32 q0, q6, q2 // .*.................... - vldrw.s32 q4, [r6] , #16 // ....*................. - vadd.f32 q7, q3, q1 // ...*.................. - vldrw.s32 q2, [r8] , #16 // .............*........ - vsub.f32 q3, q3, q1 // ...........*.......... - vldrw.s32 q1, [r7] , #16 // ...............*...... - vadd.f32 q6, q7, q5 // .....*................ - vstrw.u32 q6, [r0] , #16 // ......*............... - vsub.f32 q5, q7, q5 // .......*.............. - // gap // ...................... - vcadd.f32 q7, q3, q0, #90 // ............*......... - // gap // ...................... - vcadd.f32 q6, q3, q0, #270 // ..............*....... - // gap // ...................... - vcmul.f32 q3, q4, q5, #0 // ........*............. - // gap // ...................... - vcmla.f32 q3, q4, q5, #270 // .........*............ - vstrw.u32 q3, [r3] , #16 // ..........*........... - vcmul.f32 q5, q2, q7, #0 // ..................*... - // gap // ...................... - vcmla.f32 q5, q2, q7, #270 // ....................*. - vstrw.u32 q5, [r5] , #16 // .....................* - vcmul.f32 q5, q1, q6, #0 // ................*..... - // gap // ...................... - vcmla.f32 q5, q1, q6, #270 // .................*.... - vstrw.u32 q5, [r4] , #16 // ...................*.. - - // original source code - // vadd.f32 q0, q6, q2 // *..................... - // vsub.f32 q6, q6, q2 // ..*................... 
- // vldrw.32 q4, [r0] // .*.................... - // vadd.f32 q5, q4, q1 // ....*................. - // vldrw.s32 q3, [r6] , #16 // ...*.................. - // vadd.f32 q2, q5, q0 // ........*............. - // vstrw.u32 q2, [r0] , #16 // .........*............ - // vsub.f32 q7, q5, q0 // ..........*........... - // vcmul.f32 q0, q3, q7, #0 // .............*........ - // vcmla.f32 q0, q3, q7, #270 // ..............*....... - // vstrw.u32 q0, [r3] , #16 // ...............*...... - // vsub.f32 q1, q4, q1 // ......*............... - // vcadd.f32 q5, q1, q6, #90 // ...........*.......... - // vldrw.s32 q0, [r8] , #16 // .....*................ - // vcadd.f32 q7, q1, q6, #270 // ............*......... - // vldrw.s32 q4, [r7] , #16 // .......*.............. - // vcmul.f32 q3, q4, q7, #0 // ...................*.. - // vcmla.f32 q3, q4, q7, #270 // ....................*. - // vcmul.f32 q7, q0, q5, #0 // ................*..... - // vstrw.u32 q3, [r4] , #16 // .....................* - // vcmla.f32 q7, q0, q5, #270 // .................*.... - // vstrw.u32 q7, [r5] , #16 // ..................*... - - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/flt-fft/floatingpoint_radix4_fft_opt_M85.s b/tests/flt-fft/floatingpoint_radix4_fft_opt_M85.s deleted file mode 100644 index da4b39a..0000000 --- a/tests/flt-fft/floatingpoint_radix4_fft_opt_M85.s +++ /dev/null @@ -1,190 +0,0 @@ - .syntax unified - .type floatingpoint_radix4_fft_opt_M85, %function - .global floatingpoint_radix4_fft_opt_M85 - - - inA .req r0 - pW0 .req r1 // Use the same twiddle data for TESTING ONLY - sz .req r2 - - inB .req r3 - inC .req r4 - inD .req r5 - - pW1 .req r6 - pW2 .req r7 - pW3 .req r8 - - // NOTE: - // We deliberately leave some aliases undefined - // SLOTHY will fill them in as part of a 'dry-run' - // merely concretizing symbolic registers, but not - // yet reordering. 
- - .text - .align 4 -floatingpoint_radix4_fft_opt_M85: - push {r4-r12,lr} - vpush {d0-d15} - - add inB, inA, sz - add inC, inB, sz - add inD, inC, sz - - add pW1, pW0, sz - add pW2, pW1, sz - add pW3, pW2, sz - - lsr lr, sz, #4 - wls lr, lr, end - -.macro load_data - vldrw.32 qA, [inA] - vldrw.32 qB, [inB] - vldrw.32 qC, [inC] - vldrw.32 qD, [inD] -.endm - -.macro load_twiddles - vldrw.s32 qTw1, [pW1], #16 - vldrw.s32 qTw2, [pW2], #16 - vldrw.s32 qTw3, [pW3], #16 -.endm - -.macro store_data - vstrw.32 qA, [inA], #16 - vstrw.32 qB, [inB], #16 - vstrw.32 qC, [inC], #16 - vstrw.32 qD, [inD], #16 -.endm - -.macro cmul_flt out, in0, in1 - vcmul.f32 \out, \in0, \in1, #0 - vcmla.f32 \out, \in0, \in1, #270 -.endm - - vldrw.32 q5, [r5] // *. - // gap // .. - vldrw.32 q6, [r3] // .* - - // original source code - // vldrw.32 q5, [r5] // *. - // vldrw.32 q6, [r3] // .* - - sub lr, lr, #1 -.p2align 2 -flt_radix4_fft_loop_start: - vsub.f32 q4, q6, q5 // ..........*.............. - vldrw.32 q1, [r0] // *........................ - vadd.f32 q0, q6, q5 // ........*................ - vldrw.32 q6, [r4] // ..*...................... - vsub.f32 q3, q1, q6 // .........*............... - vldrw.s32 q7, [r8] , #16 // ......*.................. - vcadd.f32 q2, q3, q4, #90 // ..............*.......... - // gap // ......................... - vcadd.f32 q5, q3, q4, #270 // .............*........... - vcmul.f32 q3, q7, q2, #0 // ...................*..... - vadd.f32 q6, q1, q6 // .......*................. - vldrw.s32 q1, [r6] , #16 // ....*.................... - vcmla.f32 q3, q7, q2, #270 // ....................*.... - vldrw.s32 q7, [r7] , #16 // .....*................... - vstrw.u32 q3, [r5] , #16 // ........................* - // gap // ......................... - vsub.f32 q2, q6, q0 // ............*............ - vcmul.f32 q3, q7, q5, #0 // .................*....... - vadd.f32 q6, q6, q0 // ...........*............. - vcmul.f32 q0, q1, q2, #0 // ...............*......... 
- vstrw.u32 q6, [r0] , #16 // .....................*... - vcmla.f32 q3, q7, q5, #270 // ..................*...... - vldrw.32 q5, [r5] // ...e..................... - vcmla.f32 q0, q1, q2, #270 // ................*........ - vstrw.u32 q3, [r4] , #16 // .......................*. - vldrw.32 q6, [r3, #16] // .e....................... - vstrw.u32 q0, [r3] , #16 // ......................*.. - - // original source code - // vldrw.32 qA, [r0] // .....|*....................... - // vldrw.32 qB, [r3] // ...e.|......................e. - // vldrw.32 qC, [r4] // .....|..*..................... - // vldrw.32 qD, [r5] // e....|...................e.... - // vldrw.s32 qTw1, [r6], #16 // .....|.........*.............. - // vldrw.s32 qTw2, [r7], #16 // .....|...........*............ - // vldrw.s32 qTw3, [r8], #16 // .....|....*................... - // vadd.f32 qSm0, qA, qC // .....|........*............... - // vadd.f32 qSm1, qB, qD // .....|.*...................... - // vsub.f32 qDf0, qA, qC // .....|...*.................... - // vsub.f32 qDf1, qB, qD // .....*........................ - // vadd.f32 qA, qSm0, qSm1 // .....|...............*........ - // vsub.f32 qBp, qSm0, qSm1 // .....|.............*.......... - // vcadd.f32 qCp, qDf0, qDf1, #270 // .....|......*................. - // vcadd.f32 qDp, qDf0, qDf1, #90 // .....|.....*.................. - // vcmul.f32 qB, qTw1, qBp, #0 // .....|................*....... - // vcmla.f32 qB, qTw1, qBp, #270 // .*...|....................*... - // vcmul.f32 qC, qTw2, qCp, #0 // .....|..............*......... - // vcmla.f32 qC, qTw2, qCp, #270 // .....|..................*..... - // vcmul.f32 qD, qTw3, qDp, #0 // .....|.......*................ - // vcmla.f32 qD, qTw3, qDp, #270 // .....|..........*............. - // vstrw.32 qA, [r0], #16 // .....|.................*...... - // vstrw.32 qB, [r3], #16 // ....*|.......................* - // vstrw.32 qC, [r4], #16 // ..*..|.....................*.. 
- // vstrw.32 qD, [r5], #16 // .....|............*........... - - le lr, flt_radix4_fft_loop_start - vadd.f32 q7, q6, q5 // ..*.................... - vldrw.32 q0, [r0] // .*..................... - vsub.f32 q3, q6, q5 // *...................... - vldrw.32 q2, [r4] // ...*................... - vsub.f32 q1, q0, q2 // ....*.................. - vldrw.s32 q4, [r7] , #16 // ............*.......... - vcadd.f32 q5, q1, q3, #270 // .......*............... - // gap // ....................... - vcadd.f32 q6, q1, q3, #90 // ......*................ - vcmul.f32 q3, q4, q5, #0 // ...............*....... - vadd.f32 q1, q0, q2 // .........*............. - vldrw.s32 q2, [r8] , #16 // .....*................. - vcmla.f32 q3, q4, q5, #270 // ...................*... - vldrw.s32 q4, [r6] , #16 // ..........*............ - vstrw.u32 q3, [r4] , #16 // .....................*. - // gap // ....................... - vsub.f32 q5, q1, q7 // ..............*........ - vcmul.f32 q0, q2, q6, #0 // ........*.............. - vadd.f32 q3, q1, q7 // ................*...... - vcmul.f32 q1, q4, q5, #0 // .................*..... - vstrw.u32 q3, [r0] , #16 // ..................*.... - vcmla.f32 q0, q2, q6, #270 // ...........*........... - // gap // ....................... - vstrw.u32 q0, [r5] , #16 // .............*......... - vcmla.f32 q1, q4, q5, #270 // ....................*.. - // gap // ....................... - vstrw.u32 q1, [r3] , #16 // ......................* - - // original source code - // vsub.f32 q4, q6, q5 // ..*.................... - // vldrw.32 q1, [r0] // .*..................... - // vadd.f32 q0, q6, q5 // *...................... - // vldrw.32 q6, [r4] // ...*................... - // vsub.f32 q3, q1, q6 // ....*.................. - // vldrw.s32 q7, [r8] , #16 // ..........*............ - // vcadd.f32 q2, q3, q4, #90 // .......*............... - // vcadd.f32 q5, q3, q4, #270 // ......*................ - // vcmul.f32 q3, q7, q2, #0 // ...............*....... 
- // vadd.f32 q6, q1, q6 // .........*............. - // vldrw.s32 q1, [r6] , #16 // ............*.......... - // vcmla.f32 q3, q7, q2, #270 // ...................*... - // vldrw.s32 q7, [r7] , #16 // .....*................. - // vstrw.u32 q3, [r5] , #16 // ....................*.. - // vsub.f32 q2, q6, q0 // ..............*........ - // vcmul.f32 q3, q7, q5, #0 // ........*.............. - // vadd.f32 q6, q6, q0 // ................*...... - // vcmul.f32 q0, q1, q2, #0 // .................*..... - // vstrw.u32 q6, [r0] , #16 // ..................*.... - // vcmla.f32 q3, q7, q5, #270 // ...........*........... - // vcmla.f32 q0, q1, q2, #270 // .....................*. - // vstrw.u32 q3, [r4] , #16 // .............*......... - // vstrw.u32 q0, [r3] , #16 // ......................* - - -end: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr \ No newline at end of file diff --git a/tests/flt-fft/flt-fft.mk b/tests/flt-fft/flt-fft.mk index 838868d..7cf2af1 100644 --- a/tests/flt-fft/flt-fft.mk +++ b/tests/flt-fft/flt-fft.mk @@ -11,7 +11,8 @@ FLT_FFT_PLATFORMS += m85-an555 FLT_FFT_SOURCES += main.c # Assembly sources required for this test -FLT_FFT_ASMS += base_ref.s -FLT_FFT_ASMS += floatingpoint_radix4_fft_opt_M55.s -FLT_FFT_ASMS += floatingpoint_radix4_fft_opt_M85.s +FLT_FFT_DIR = ../../asm/manual/flt_fft +FLT_FFT_ASMS += $(FLT_FFT_DIR)/base_ref.s +FLT_FFT_ASMS += $(FLT_FFT_DIR)/floatingpoint_radix4_fft_opt_M55.s +FLT_FFT_ASMS += $(FLT_FFT_DIR)/floatingpoint_radix4_fft_opt_M85.s diff --git a/tests/intmulntt/intmulntt.mk b/tests/intmulntt/intmulntt.mk index 5624209..e46cedd 100644 --- a/tests/intmulntt/intmulntt.mk +++ b/tests/intmulntt/intmulntt.mk @@ -12,7 +12,7 @@ INTMULNTT_SOURCES += main.c # Assembly sources required for this test INTMULNTT_ASMS += ../../asm/manual/crt/crt.s -INTMULNTT_ASMS += montgomery.s +INTMULNTT_ASMS += ../../asm/manual/montgomery/montgomery.s INTMULNTT_ASM_DIR = ../../asm/auto/ntt_384 INTMULNTT_ASMS += 
$(INTMULNTT_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_bitrev.s INTMULNTT_ASMS += $(INTMULNTT_ASM_DIR)/ntt_384_u32_88299073_4883425_incomplete_good_oop_half_input.s diff --git a/tests/intmulntt/montgomery.s b/tests/intmulntt/montgomery.s deleted file mode 100644 index 196b8a6..0000000 --- a/tests/intmulntt/montgomery.s +++ /dev/null @@ -1,3647 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - */ - -#include "montgomery_const.h" - - .syntax unified - -.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_acc_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_acc_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 /* vmulh requires vector operand */ - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - dst_vect .req q5 // Overlapping with res_lo - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start: - - vldrw.s32 dst_vect, [dst] - vadd.s32 
res_hi, res_hi, dst_vect - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi_old, res_hi_old, l_b0 - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi, res_hi, l_b0 - vstrw.s32 res_hi, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq res_lo - .unreq res_hi - - .unreq dst_vect - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq loop_cnt - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - -.type twisted_cyclic_mul_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_deg4_u32_mve_alt -.align 4 
-twisted_cyclic_mul_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_deg4_u32_mve_alt_loop_start: - - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 
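The sequence that finishes each block in these kernels — `vmul.u32` by `mod_q_inv`, `vmulh.s32` by `mod_q_vect`, then `vsub.s32` — is a 32-bit Montgomery reduction of the 64-bit `vmlaldav` accumulator pairs (`res*_lo`, `res*_hi`). A minimal Python sketch of that reduction; the Dilithium modulus is used purely as an illustrative choice of odd 32-bit prime:

```python
Q = 8380417                  # illustrative odd modulus (the Dilithium prime)
QINV = pow(Q, -1, 1 << 32)   # q^-1 mod 2^32, playing the role of mod_q_inv

def montgomery_reduce(t: int) -> int:
    """Return r with r * 2^32 == t (mod Q), mirroring the
    vmul.u32 / vmulh.s32 / vsub.s32 tail of each kernel."""
    lo = t & 0xFFFFFFFF                    # low accumulator half (res_lo lane)
    hi = t >> 32                           # signed high half (res_hi lane)
    m = (lo * QINV) & 0xFFFFFFFF           # vmul.u32  res_lo, res_lo, mod_q_inv
    m_signed = m - (1 << 32) if m >= (1 << 31) else m
    mq_hi = (m_signed * Q) >> 32           # vmulh.s32 res_lo, res_lo, mod_q_vect
    return hi - mq_hi                      # vsub.s32  res_hi, res_hi, res_lo
```

Since `t - m_signed*Q` is an exact multiple of 2^32, the returned value is congruent to `t * 2^-32 mod Q`, i.e. the kernels produce results in the Montgomery domain.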
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_mve_expand, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - twiddle .req r4 - twiddle_twisted .req r5 - - q_off_rev .req q0 - q_in .req q1 - tmp .req q3 - res .req q2 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - mod_q .req r3 - - consts .req r4 - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts - vldrb.u32 q_off_rev, 
[consts] - .unreq consts - - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - - mov loop_cnt, #(VECTOR_LENGTH/4-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - add.w src, src, #+16 - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - le loop_cnt, 1b -2: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in - .unreq tmp - .unreq res - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q6 - tmp .req q2 - resA .req q4 - resB .req q5 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - - consts .req r7 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(10*4 + 8*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 resB, q_in0, twiddle - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - add.w src, src, #+16 - 
vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function -.global 
twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett: - push {r4-r11,lr} - vpush {d8-d11} - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - consts .req r7 - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - loop_cnt .req r14 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q5 - tmp .req q2 - resA .req q3 - resB .req q4 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(9*4 + 2*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.s32 resB, q_in0, twiddle - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - .align 2 - wls loop_cnt, loop_cnt, 2 -1: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, 
twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d8-d11} - pop {r4-r11,pc} - - .unreq loop_cnt - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_simple, %function -.global twisted_cyclic_mul_deg4_u32_mve_simple -.align 4 -twisted_cyclic_mul_deg4_u32_mve_simple: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - wls loop_cnt, loop_cnt, 2 -1: - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vldrw.u32 l_b2, [in_B, #(-32 + 4 )] - vldrw.u32 l_b1, [in_B, #(-32 + 8 )] - vldrw.u32 l_b0, [in_B, #(-32 + 12)] - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - 
vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - le loop_cnt, 1b -2: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - - -.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_mve -twisted_cyclic_mul_deg4_u32_add_sub_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q4 //q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi0 .req q6 - res_hi1 .req q1 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b0, 
[in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - // From this point onwards, l_b3 and l_b2 are never used - // at the same time. Use the same register for them - .unreq l_b3 - l_b3 .req l_b2 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - .align 2 - - - wls loop_cnt, loop_cnt, 2 -1: - - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, 
res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - // Add/sub with result from previous iteration - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 // Currently unused - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 
res_lo, res_lo, mod_q_vect - vsub.s32 res_hi1, res_hi1, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - vsub.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi0 - .unreq res_hi1 - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve -twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr: - .byte 3*4 - .byte 2*4 - .byte 1*4 - .byte 0*4 -twisted_cyclic_mul_deg4_u32_add_sub_rev_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - q_rev .req q3 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q4 //q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi0 .req q6 - res_hi1 .req q1 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp .req r5 - adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr - vldrb.u32 q_rev, [tmp] - vadd.u32 q_rev, q_rev, in_A - .unreq tmp - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vldrw.u32 l_a, [q_rev] - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 
res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - // From this point onwards, l_b3 and l_b2 are never used - // at the same time. Use the same register for them - .unreq l_b3 - l_b3 .req l_b2 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - .align 2 - - - wls loop_cnt, loop_cnt, 2 -1: - - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - // Add/sub with result from previous iteration - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 // Currently unused - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi1, res_hi1, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - vsub.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi0 - .unreq res_hi1 - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve -.align 4 -twisted_cyclic_mul_deg4_u32_add_sub_split_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - res_old .req q5 // Overlaps with res_lo deliberately - - in_A .req r0 - in_B .req r1 - dst .req r2 - dst_h .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - 
- add dst_h, dst, #(4*VECTOR_LENGTH/2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 16)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vstrw.s32 res_hi, [sp] - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vldrw.s32 res_old, [sp] - tmp .req q1 // == l_b3 (currently unused) - vadd.s32 tmp, 
res_old, res_hi - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst_h], #+16 - .unreq tmp - - wls loop_cnt, loop_cnt, 2 -1: - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vstrw.s32 res_hi, [sp] - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - - // Add/sub with result from previous iteration - vldrw.s32 res_old, [sp] - tmp .req q1 // == l_b3 (currently unused) - vadd.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst], #16 - vsub.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst_h], #16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, 
res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - /* Defer storing of last result */ - .unreq res_old - res_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi, res_hi, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_old, res_hi - vstrw.32 mod_q_vect, [dst], #16 - vsub.s32 mod_q_vect, res_old, res_hi - vstrw.32 mod_q_vect, [dst_h], #16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - .unreq res_old - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - - -.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function -.global twisted_cyclic_mul_deg4_u32_long_mve_v1 -.align 4 -twisted_cyclic_mul_deg4_u32_long_mve_v1: - push {r4-r11,lr} - vpush {d0-d9} - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r6 - res1_hi .req r7 - res2_lo .req r8 - res2_hi .req r9 - res0_lo .req r10 - res0_hi 
.req r11 - - wls loop_cnt, loop_cnt, 2 -1: - - vldrw.u32 l_a, [in_A], #+16 /* (a0, a1, a2, a3) */ - - vldrw.u32 l_b3, [in_B], #+32 /* (b3, b2, b1, b0) */ - vldrw.u32 l_b0, [in_B,#(-32+3*4)] /* (b0, zeta*b3, zeta*b2, zeta*b1) */ - vldrw.u32 l_b1, [in_B,#(-32+2*4)] /* (b1, b0, zeta*b3, zeta*b2) */ - vldrw.u32 l_b2, [in_B,#(-32+1*4)] /* (b2, b1, b0, zeta*b3) */ - - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - - //strd res0_lo, res1_lo, [dst], #8 - //strd res2_lo, res3_lo, [dst], #8 - //strd res0_hi, res1_hi, [dst], #8 - //strd res2_hi, res3_hi, [dst], #8 - - strd res0_lo, res0_hi, [dst], #8 - strd res1_lo, res1_hi, [dst], #8 - strd res2_lo, res2_hi, [dst], #8 - strd res3_lo, res3_hi, [dst], #8 - - le loop_cnt, 1b -2: - - vpop {d0-d9} - pop {r4-r11,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res0_lo - .unreq res0_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res3_lo - .unreq res3_hi - -.type twisted_cyclic_mul_deg4_u32_mve, %function -.global twisted_cyclic_mul_deg4_u32_mve -twisted_cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - /* Preparation -- amortizes when looping */ - - mod_q .req r12 - mod_q_inv .req r14 - mod_q_vect .req q4 /* vmulh requires vector operand */ - - ldrd mod_q, mod_q_inv, [r2] - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - tw1 .req r10 - tw2 .req r11 - tw3 .req r12 - - l_a .req q0 - l_b .req q1 - - res_lo .req q2 - res_hi .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* Input A */ - vldrw.u32 l_b, [in_B], #+16 - vmov tw1, tw3, l_b[3], l_b[1] - vldrw.u32 l_a, [in_A], #+16 - - /* Assume b-input is already reversed */ - - /* Extract second half of twisted b into GPRs */ - - vmov.s32 tw2, l_b[2] - - res3_lo .req r4 - res3_hi .req r5 - res2_lo .req r6 - res2_hi .req r7 - 
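The operand comments above — `(b3, b2, b1, b0)`, `(b0, zeta*b3, zeta*b2, zeta*b1)`, and so on — describe a length-4 twisted cyclic product, i.e. a(x)·b(x) mod (x^4 - zeta), with the b operand stored reversed and its wrap-around entries pre-multiplied by zeta. A plain-Python reference for the underlying operation (the concrete parameter values below are illustrative only):

```python
def twisted_cyclic_mul_deg4(a, b, zeta, q):
    """Schoolbook a(x)*b(x) mod (x^4 - zeta, q): products that wrap
    past degree 3 pick up a factor of zeta, exactly as encoded in the
    precomputed, zeta-twisted b layout."""
    c = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            if i + j < 4:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - 4] += zeta * a[i] * b[j]
    return [x % q for x in c]
```

Each c_k here corresponds to one of the four `vmlaldav` dot products: c3 uses b as loaded, while c2, c1, c0 use the layouts with one, two, and three zeta-twisted entries shifted in.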
- /* TODO: - * For twisted multiplication, add Montgomery multiplication here. - * Adds 3 instructions. */ - - /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */ - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b - - /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */ - vshlc l_b, tw3, #32 - .unreq tw3 - - /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */ - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b - - /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */ - vshlc l_b, tw2, #32 - .unreq tw2 - - res1_lo .req r8 - res1_hi .req r9 - - /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */ - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b - - /* Move low and high results into result vector */ - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - - res0_lo .req r8 - res0_hi .req r9 - - /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */ - vshlc l_b, tw1, #32 - .unreq tw1 - - /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */ - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b - - /* PRELOAD FOR NEXT ITERATION? 
*/ - - /* Move low results into result vector */ - vmov res_lo[2], res_lo[0], res2_lo, res0_lo - - /* Montgomery 1 */ - vmul.u32 res_lo, res_lo, mod_q_inv - /* Move high results into result vector */ - vmov res_hi[2], res_hi[0], res2_hi, res0_hi - /* Montgomery 2 */ - vmulh.s32 res_lo, res_lo, mod_q_vect - /* Montgomery 3 */ - vsub.s32 res_hi, res_hi, res_lo - - /* Store results */ - vstrw.s32 res_hi, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq mod_q_inv - .unreq mod_q_vect - -.type cyclic_mul_deg4_u32_mve, %function -.global cyclic_mul_deg4_u32_mve -cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #0x0F0F - vmsr p0, r10 - - mod_q .req r10 - mod_q_inv .req r9 - - ldr mod_q, [r2,#0] /* Modulus */ - ldr mod_q_inv, [r2,#4] - - l_a0 .req q1 - l_a1 .req q2 - l_b0 .req q3 - l_b1 .req q4 - - r_a0 .req q0 - r_a1 .req q1 - r_b0 .req q2 - r_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */ - vld20.u32 {l_a0,l_a1}, [in_A] - vld21.u32 {l_a0,l_a1}, [in_A]! - - /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */ - vld20.u32 {l_b0,l_b1}, [in_B] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Compute product in two vectors q4, q5 */ - - /* Can use q6, q7 for temporary data; need at least - * one temporary vector per subproduct. */ - - /* - * Ballpark estimates: - * - 4 = 2x2 VLD2x to load current polynomials - * - 2 = 2x VST2x to store result - * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form - * - 16 = 4x4 Vector Multiplications, 4 per subproduct - * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction - * In fact, use VSUB for first time each target vector is - * used, and VHSUB for the second time. - * - 2 = 2x VCADD for interpolation of result -- - * Note that we don't need to do this in every - * subproduct. - * - * Total: 32 instructions - * - * Pretty promising... 
if it pipelines well and we have enough - * vector registers. - */ - - /* Transform input into evaluated form */ - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 - - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 - - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 - - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 - - /* Subproduct 1: a0*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1 - * - Temporary allocations: 1 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - /* - * OPTIMIZATION: - * - * - We have two free vectors at this point -- - * could use this for a late store of the results - * of a previous iteration, residing in {q6, q7}. - * - * - Perform a late evaluation of r_a0, r_b1 here. - * - */ - - dst1 .req q5 - tmp .req q4 - - vmul.u32 tmp, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - vqdmulh.s32 dst1, r_a0, r_b1 /* Initialize dst1 with high part */ - vmul.u32 tmp, tmp, r_b1 /* Twisted low product */ - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq tmp - - /* Subproduct 2: a1*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1 - * - Temporary allocations: 2 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - tmp0 .req q6 - tmp1 .req q4 - - vqdmulh.s32 tmp1, r_a1, r_b0 /* Write high-product into temporary */ - vmul.u32 tmp0, q1, mod_q_inv /* Twist one factor using temporary tmp */ - vmul.u32 tmp0, tmp0, r_b0 /* Twisted low product */ - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst1, tmp1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. 
- * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 - .unreq tmp1 - - /* Finalize dst1 */ - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 - - /* Subproduct 3: a1*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1_final - * - Temporary allocations: 0 - * - Final allocations: a0, b0, dst1_final, dst0 - */ - - dst0 .req q4 - - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */ - vmul.u32 r_a1, r_a1, r_b1 /* Twisted low product */ - - .unreq r_b1 - - vqdmulh.s32 r_a1, r_a1, mod_q /* High product */ - vsub.s32 dst0, r_a1, dst0 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq r_a1 - - vpst - vnegt.s32 dst0, dst0 - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, b0, dst1_final, dst0 - * - Temporary allocations: 1 - * - Final allocations: dst1_final, dst0 - */ - - tmp .req q5 - - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - vmul.u32 r_a0, r_a0, r_b0 /* Twisted low product */ - - .unreq r_b0 - - vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */ - vqdmlah.s32 dst0, r_a0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst0, tmp, dst0 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp - - /* Finalize dst0 */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
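The recurring `vmul` (low product twisted by q^-1), `vqdmulh` (high product) and `vqdmlah`/`vsub` correction steps implement signed Montgomery multiplication. The MVE sequences work with doubled high halves and halve late (`vhsub`); the scalar sketch below (helper name ours) shows the plain identity those steps realize, without the doubling/halving detail.

```c
#include <stdint.h>

/* r = a*b*R^{-1} (mod q), R = 2^32, q odd, qinv = q^{-1} mod 2^32. */
static int32_t montmul_ref(int32_t a, int32_t b, int32_t q, uint32_t qinv)
{
    int64_t prod = (int64_t)a * b;
    /* Low product twisted by q^{-1}: makes prod - t*q divisible by 2^32. */
    int32_t t = (int32_t)((uint32_t)prod * qinv);
    /* Subtract the correction and divide exactly by R. */
    return (int32_t)((prod - (int64_t)t * q) >> 32);
}
```

`qinv` can be obtained by Newton iteration on q (five doubling steps cover 32 bits); the result satisfies `r * 2^32 ≡ a*b (mod q)`.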
- .unreq dst0_final - .unreq dst1_final - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq r_a0 - -.type cyclic_mul_deg4_u32_alt_mve, %function -.global cyclic_mul_deg4_u32_alt_mve -cyclic_mul_deg4_u32_alt_mve: - push {r4-r12,lr} - vpush {d0-d15} - - l_a0 .req q0 - l_a1 .req q1 - l_b0 .req q2 - l_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - cnt .req r4 - - dst0_last_final .req q6 - dst1_last_final .req q7 - - mod_q .req r10 - mod_q_inv .req r9 - pred_helper .req r8 - - vld20.u32 {l_a0,l_a1}, [in_A] - mov pred_helper, #0x0F0F - vld21.u32 {l_a0,l_a1}, [in_A]! - vmsr p0, pred_helper - - vld20.u32 {l_b0,l_b1}, [in_B] - ldr mod_q_inv, [r2,#4] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), - * dst0 (q3) - */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - ldr mod_q, [r2,#0] - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Montgomery twist */ - mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of - * loop counter */ - vqdmulh.s32 tmp1, tmp1, mod_q /* Montgomery high product fix */ - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initial high product */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initial high product */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Fix high product */ - /* Defer halving for later */ - /* Store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Montgomery low product twist */ - - vpst 
- vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* Montgomery high product fix */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Montgomery twist */ - - /* Subproduct 3: a0*b1 */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp // q1 - - /* - * Vector register allocation state: - * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * - Temporary allocations: 1 (q5) - * - Final allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - vmul.u32 tmp0, tmp0, r_b1 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q5 - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. 
*/ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - - /* LOAD r_a1 into q5 here..., - * freeing up q1 as a temporary */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */ - vmul.u32 tmp0, r_a0, r_b0 /* Twisted low product */ - /* Can overwrite rb0 now */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - - vmul.u32 tmp0, tmp0, mod_q_inv - - l_b0 .req q2 - l_b1 .req q3 - /* Preload for next iteration */ - vld20.u32 {l_b0,l_b1}, [in_B] - - vqdmlah.s32 dst0, tmp0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q1 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp // q4 - .unreq dst0_old - - vld21.u32 {l_b0,l_b1}, [in_B]! 
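The `vcadd #90` / `vcadd #270` steps used throughout these cyclic routines can be modeled per lane pair. This is a sketch of the 2-point layer only, not the full degree-4 algorithm, and the function names are ours: `vcadd #90` of a vector with itself maps each adjacent pair (x, y) to (x - y, x + y), i.e. it evaluates x + y·Z at Z = -1 and Z = +1, while `vcadd #270` applied the same way inverts that butterfly up to a factor of 2 -- the factor absorbed by the deferred-halving (`vhsub`) corrections.

```c
#include <stdint.h>

/* Model of vcadd.i32 q, a, a, #90 on one lane pair: evaluation at -1/+1. */
static void butterfly_fwd(int32_t p[2])
{
    int32_t x = p[0], y = p[1];
    p[0] = x - y;   /* evaluation at Z = -1 */
    p[1] = x + y;   /* evaluation at Z = +1 */
}

/* Model of vcadd.s32 q, a, a, #270: inverse butterfly, scaled by 2. */
static void butterfly_inv2(int32_t p[2])
{
    int32_t u = p[0], v = p[1];
    p[0] = u + v;   /* = 2*x */
    p[1] = v - u;   /* = 2*y */
}
```

Multiplying the evaluated forms pointwise and then applying the scaled inverse is what turns four subproducts plus two `vcadd #270` steps into a cyclic product.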
- - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - nop - wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end - -cyclic_mul_deg4_u32_alt_mve_loop_start: - - nop - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3) - */ - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Twisted low product */ - - vst20.s32 {dst0_last_final,dst1_last_final}, [dst] - - vqdmulh.s32 tmp1, tmp1, mod_q /* High product */ - - vst21.s32 {dst0_last_final,dst1_last_final}, [dst]! - .unreq dst0_last_final // q6 - .unreq dst1_last_final // q7 - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initialize dst1 with high part */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Twisted low product */ - - vpst - vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of 
result */ - .unreq tmp // q1 - - /* Subproduct 3: a0*b1 - * - * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) - * T: 1 (q5) - * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1) - */ - - tmp1 .req q0 - vmul.u32 tmp1, tmp0, r_b1 - - - vqdmlah.s32 dst1, tmp1, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp1 // q0 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vmul.u32 tmp, tmp0, r_b0 /* Twisted low product */ - .unreq tmp0 - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - vqdmulh.s32 tmp0, r_a0, r_b0 /* Write high-product into temporary */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - vqdmlah.s32 dst0, tmp, mod_q /* High product, accumulate onto tmp, - * 
which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp // q4 - - /* Preload for next iteration */ - l_b0 .req q2 - l_b1 .req q3 - vld20.u32 {l_b0,l_b1}, [in_B] - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp0, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 // q1 - .unreq dst0_old - - /* Preload for next iteration */ - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - le lr, cyclic_mul_deg4_u32_alt_mve_loop_start - -cyclic_mul_deg4_u32_alt_mve_loop_end: - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a0 - .unreq l_b0 - .unreq l_b1 - .unreq r_a1 - - .unreq cnt - -.type montgomery_pt_u32_odd_mve, %function -.global montgomery_pt_u32_odd_mve -montgomery_pt_u32_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 4) - - ldr mod_q, [in_B], #+4 /* Modulus */ - ldr mod_q_inv, [in_B], #+4 /* Inverse */ - - wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end - -montgomery_pt_u32_odd_mve_loop_start: - - vldrw.s32 l_a, [in_A], #+16 - vmul.u32 l_at, l_a, mod_q_inv - vldrw.s32 l_b, [in_B], #+16 - vqrdmulh.s32 tmp0, l_a, l_b - vmul.u32 tmp1, l_at, l_b - vqrdmlah.s32 tmp0, tmp1, mod_q - vstrw.s32 tmp0, [dst], #+16 - - le lr, montgomery_pt_u32_odd_mve_loop_start - -montgomery_pt_u32_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_u32_mve, %function -.global montgomery_pt_u32_mve -.align 4 -montgomery_pt_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, 
montgomery_pt_u32_mve_loop_end - -montgomery_pt_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_u32_mve_loop_start - -montgomery_pt_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. 
*/ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_acc_u32_mve, %function -.global montgomery_pt_acc_u32_mve -.align 4 -montgomery_pt_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - old .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end - -montgomery_pt_acc_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store-accumulate from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_acc_u32_mve_loop_start - -montgomery_pt_acc_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - 
- /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. */ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_round_acc_u32_mve, %function -.global montgomery_pt_round_acc_u32_mve -.align 4 -montgomery_pt_round_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - oldA .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - oldB .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 8) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b 
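The accumulating pointwise routines above have a simple C-level effect, sketched below with names of our choosing. The assembly additionally software-pipelines the loop (preloading the next `a`/`b` vectors and storing one iteration late), and the `*_round_*` variants use rounding high products (`vqrdmulh`/`vqrdmlah`); this sketch keeps only the arithmetic skeleton.

```c
#include <stdint.h>
#include <stddef.h>

/* dst[i] += a[i]*b[i]*R^{-1} (mod q, up to the representative), R = 2^32. */
static void montgomery_pt_acc_ref(int32_t *dst, const int32_t *a,
                                  const int32_t *b, size_t n,
                                  int32_t q, uint32_t qinv)
{
    for (size_t i = 0; i < n; i++) {
        int64_t prod = (int64_t)a[i] * b[i];
        int32_t t = (int32_t)((uint32_t)prod * qinv);
        dst[i] += (int32_t)((prod - (int64_t)t * q) >> 32);
    }
}
```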
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r11,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_B, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_B, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov 
cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end - 
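The comment pattern in the kernels above (Twist a / High multiply / Twisted low multiply / Correction term) is a rounding Montgomery multiplication built from vmul, vqrdmulh and vqrdmlah. A minimal per-lane integer sketch — assuming mod_q_inv = -q^(-1) mod 2^32, not modelling saturation, and with hypothetical helper names:

```python
import random

# Sketch of one lane of the round-acc kernels: twist (vmul), high multiply
# (vqrdmulh), twisted low multiply (vmul), correction (vqrdmlah), then
# accumulation (vadd). Assumption: mod_q_inv = -q^(-1) mod 2^32.

def srepr(x, bits=32):
    """Centered two's-complement representative of x mod 2^bits."""
    m = x % (1 << bits)
    return m - (1 << bits) if m >= (1 << (bits - 1)) else m

def vqrdmulh(a, b, bits=32):
    """Rounding doubling multiply, high half (saturation not modelled)."""
    return (2 * a * b + (1 << (bits - 1))) >> bits

def round_acc_lane(acc, a, b, q, q_inv):
    a_t = srepr(a * q_inv)              # "Twist a"
    hi = vqrdmulh(a, b)                 # "High multiply"
    lo = srepr(a_t * b)                 # "Twisted low multiply"
    return acc + vqrdmulh(lo, q) + hi   # "Correction term" + "Correction"

q = 3329
q_inv = -pow(q, -1, 1 << 32)            # assumption: -q^(-1) mod 2^32
random.seed(1)
for _ in range(1000):
    # odd inputs sidestep a rare rounding corner case of this simple model
    a = random.randrange(-2**20, 2**20) | 1
    b = random.randrange(-2**20, 2**20) | 1
    acc = random.randrange(-2**10, 2**10)
    res = round_acc_lane(acc, a, b, q, q_inv)
    # the lane computes acc + a*b*2^(-31) (mod q)
    assert (res - acc) * (1 << 31) % q == (a * b) % q
```

The correction works because the twisted low product lo satisfies lo*q ≡ -a*b (mod 2^32), so the two rounded high halves sum to an exact multiple of 2^(-31)*a*b modulo q.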
-montgomery_pt_round_acc_u32_x4_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - le cnt, 
montgomery_pt_round_acc_u32_x4_mve_loop_start - -montgomery_pt_round_acc_u32_x4_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - 
vstrw.s32 oldA, [dst2], #+16 - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_A2 - .unreq in_A3 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq dst2 - .unreq dst3 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - - -.type montgomery_pt_u16_odd_mve, %function -.global montgomery_pt_u16_odd_mve -montgomery_pt_u16_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 8) - - ldrh mod_q, [in_B], #+2 /* Modulus */ - ldrh mod_q_inv, [in_B], #+2 /* Inverse */ - - wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end - -montgomery_pt_u16_odd_mve_loop_start: - - vldrh.s16 l_a, [in_A], #+16 - vmul.u16 l_at, l_a, mod_q_inv - vldrh.s16 l_b, [in_B], #+16 - vqrdmulh.s16 tmp0, l_a, l_b - vmul.u16 tmp1, l_at, l_b - vqrdmlah.s16 tmp0, tmp1, mod_q - vstrh.s16 tmp0, [dst], #+16 - - le lr, montgomery_pt_u16_odd_mve_loop_start - -montgomery_pt_u16_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -#if defined(MODULUS_Q16) - -.type montgomery_u16_core_mve, %function -.global montgomery_u16_core_mve -montgomery_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(-MODULUS_Q16) /* Modulus */ - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0] - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - /* High product */ - vqdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqdmulh.s16 q0, q0, r10 - vsub.s16 q1, q1, q0 - - /* Store result */ - vstrh.s16 q1, [r2] 
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.type montgomery_u16_round_mve, %function -.global montgomery_u16_round_mve -montgomery_u16_round_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - mov r10, #(-3329) /* Modulus */ - mov r8, #8 /* Iterations */ - - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - wls lr, r8, montgomery_u16_round_mve_loop_end -montgomery_u16_round_mve_loop_start: - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0], #16 - - /* High product */ - vqrdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqrdmlah.s16 q1, q0, r10 - - /* Store result */ - vstrh.s16 q1, [r2], #16 - - le lr, montgomery_u16_round_mve_loop_start -montgomery_u16_round_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type cyclic_mul_u16_core_mve, %function -.global cyclic_mul_u16_core_mve -cyclic_mul_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Load polynomials to multiply - * - * Lanes come in pairs representing real and imaginary parts. - */ - vldrh.s16 q0, [r0] - vldrh.s16 q1, [r1] - - /* Step 1: - * - * Apply evaluation at -1, +1: - * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1) - * - * Concretely: - * (a,b) |-> (a-b, a+b) - * - * This can be implemented as a rotate-and-add - * operation, treating (a,b) as a complex number - * a+bi, and noticing that a rotation by 90 - * gives i(a+bi) = -b + ai, so - * a+bi + i(a+bi) = (a-b) + (a+b)i - * - * This rotate-90-and-add is a single - instruction in MVE. 
- */ - vcadd.i16 q0, q0, q0, #90 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - - /* Montgomery multiplications - * - * 1x mul-high - * 1x mul-low - * 1x mul-high - * 1x subtract - * - * Needs 1x free temporary vector register - */ - vqdmulh.s16 q0, q0, q1 - vmul.u16 q1, q2, q1 - /*vmul.u16 q0, q0, r9*/ - vqdmulh.s16 q1, q1, r10 - /* Now we've actually computed twice the desired result, - * but we can compensate by using vhsub */ - vhsub.s16 q0, q0, q1 - - /* - * Finally, interpolation step: - * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y) - * - * This can be done as a single VCHADD, with - * rotate by 270: -i(a+bi) = b - ai - * - * We can't naively use vhcadd here because the - * multiplication by 1/2 is modulo q. - */ - vcadd.s16 q0, q0, q0, #270 - - vstrh.s16 q0, [r2] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -.type cyclic_mul_u16_mve, %function -.global cyclic_mul_u16_mve -cyclic_mul_u16_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Number of inner iterations */ - mov r4, #(VECTOR_LENGTH/16 - 1) - - vldrh.s16 q0, [r0], #16 - vcadd.i16 q0, q0, q0, #90 - vldrh.s16 q1, [r1], #16 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vstrh.s16 q4, [r2] - vmul.u16 q1, q2, q1 - vldrh.s16 q3, [r0], #16 - vqdmulh.s16 q1, q1, r10 - vcadd.i16 q3, q3, q3, #90 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - wls lr, r4, cyclic_mul_u16_loop_end -cyclic_mul_u16_loop_start: - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vldrh.s16 q3, [r0], #16 - vmul.u16 q1, q2, q1 - vcadd.i16 q3, q3, q3, #90 - 
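The comment block in cyclic_mul_u16_core_mve describes the full scheme: evaluate both operands at -1 and +1 (the vcadd #90 rotate-and-add), multiply pointwise, then interpolate with a multiply by 1/2 mod q (vcadd #270, with one halving folded into vhsub). An arithmetic sketch of that scheme for a single degree-1 pair, with the modular halving written out explicitly:

```python
import random

# Degree-1 cyclic multiplication c = a*b in Z_q[X]/(X^2 - 1), following the
# evaluate-at-(-1,+1) / pointwise-multiply / interpolate scheme from the
# comments (vcadd #90 = evaluation, vcadd #270 plus halving = interpolation).

q = 3329
inv2 = pow(2, -1, q)  # the halving must be done mod q, as the comment notes

def cyclic_mul(a, b):
    a0, a1 = a
    b0, b1 = b
    # evaluation: (x, y) = (p(-1), p(+1)) -- the rotate-90-and-add step
    ax, ay = (a0 - a1) % q, (a0 + a1) % q
    bx, by = (b0 - b1) % q, (b0 + b1) % q
    # pointwise multiplication in k[X]/(X+1) x k[X]/(X-1)
    cx, cy = (ax * bx) % q, (ay * by) % q
    # interpolation: c0 = (x+y)/2, c1 = (y-x)/2 (mod q)
    return ((cx + cy) * inv2 % q, (cy - cx) * inv2 % q)

def schoolbook(a, b):
    a0, a1 = a
    b0, b1 = b
    # (a0 + a1*X) * (b0 + b1*X) reduced mod (X^2 - 1)
    return ((a0 * b0 + a1 * b1) % q, (a0 * b1 + a1 * b0) % q)

random.seed(0)
for _ in range(100):
    a = (random.randrange(q), random.randrange(q))
    b = (random.randrange(q), random.randrange(q))
    assert cyclic_mul(a, b) == schoolbook(a, b)
```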
vqdmulh.s16 q1, q1, r10 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - le lr, cyclic_mul_u16_loop_start -cyclic_mul_u16_loop_end: - - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - -.type cyclic_mul_u16_multi_naive_mve, %function -.global cyclic_mul_u16_multi_naive_mve -cyclic_mul_u16_multi_naive_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vpop {d0-d15} - pop {r4-r12, lr} - bx lr - -#endif /* MODULUS_Q16 */ - -#if defined(MODULUS_Q32) - -.type cyclic_mul_u32_mve, %function -.global cyclic_mul_u32_mve -cyclic_mul_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - movw r10, #:lower16:MODULUS_Q32 - movt r10, #:upper16:MODULUS_Q32 - - ldr r9, [r2] - mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */ - wls lr, 
r3, cyclic_mul_u32_loop_end -cyclic_mul_u32_loop_start: - vldrw.s32 q1, [r0], #16 - vcadd.i32 q0, q1, q1, #90 - vldrw.s32 q2, [r1], #16 - vcadd.i32 q1, q2, q2, #90 - vqdmulh.s32 q2, q0, q1 - vmul.u32 q0, q0, r9 - vmul.u32 q1, q0, q1 - vqdmulh.s32 q1, q1, r10 - vhsub.s32 q2, q2, q1 - vcadd.s32 q1, q2, q2, #270 - vstrw.s32 q1, [r2], #16 - le lr, cyclic_mul_u32_loop_start -cyclic_mul_u32_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -#endif /* MODULUS_Q32 */ diff --git a/tests/montgomery/montgomery.mk b/tests/montgomery/montgomery.mk index 7c8c893..5d376c8 100644 --- a/tests/montgomery/montgomery.mk +++ b/tests/montgomery/montgomery.mk @@ -11,5 +11,4 @@ MONTGOMERY_PLATFORMS += m85-an555 MONTGOMERY_SOURCES += main.c # Assembly sources required for this test -MONTGOMERY_ASMS += montgomery.s - +MONTGOMERY_ASMS += ../../asm/manual/montgomery/montgomery.s diff --git a/tests/montgomery/montgomery.s b/tests/montgomery/montgomery.s deleted file mode 100644 index 5ba02ef..0000000 --- a/tests/montgomery/montgomery.s +++ /dev/null @@ -1,3640 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#include "montgomery_const.h" - - .syntax unified - -.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_acc_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_acc_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 /* vmulh requires vector operand */ - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - dst_vect .req q5 // Overlapping with res_lo - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B,#(-16-12)] - 
vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start: - - vldrw.s32 dst_vect, [dst] - vadd.s32 res_hi, res_hi, dst_vect - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi_old, res_hi_old, l_b0 - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi, res_hi, l_b0 - vstrw.s32 res_hi, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq res_lo - .unreq res_hi - - .unreq dst_vect - - .unreq in_A - .unreq in_B - .unreq dst - - 
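The reduction tail in twisted_cyclic_mul_acc_deg4_u32_mve_alt — vmul.u32 with mod_q_inv, vmulh.s32 with the mod_q vector, then vsub.s32 — is a classic (non-rounding) Montgomery reduction of the 64-bit vmlaldav accumulators. A sketch, assuming mod_q_inv = q^(-1) mod 2^32 for this variant:

```python
import random

# Model of the vmul/vmulh/vsub tail: reduce a 64-bit accumulator x to
# x * 2^(-32) mod q. Assumption: mod_q_inv = q^(-1) mod 2^32.

R = 1 << 32

def srepr(x):
    """Centered representative of x mod 2^32 (signed 32-bit lane view)."""
    m = x % R
    return m - R if m >= R // 2 else m

def montgomery_reduce(x, q, q_inv):
    lo = srepr(x * q_inv)               # vmul.u32: low half of res_lo * mod_q_inv
    # x and lo*q agree modulo 2^32, so subtracting the two high halves
    # yields exactly (x - lo*q) / 2^32
    return (x >> 32) - ((lo * q) >> 32)  # vmulh.s32 + vsub.s32

q = 3329
q_inv = pow(q, -1, R)
random.seed(2)
for _ in range(1000):
    x = random.randrange(-2**62, 2**62)
    r = montgomery_reduce(x, q, q_inv)
    assert (r * R - x) % q == 0          # r == x * 2^(-32) (mod q)
    assert abs(r) <= abs(x) // R + q     # stays small for the next step
```

This explains why the asm never needs an explicit modular reduction of res_hi between iterations: the subtraction already leaves a representative of bounded size.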
.unreq loop_cnt - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - -.type twisted_cyclic_mul_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_deg4_u32_mve_alt_loop_start: - - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 
res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_mve_expand, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - twiddle .req r4 
- twiddle_twisted .req r5 - - q_off_rev .req q0 - q_in .req q1 - tmp .req q3 - res .req q2 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - mod_q .req r3 - - consts .req r4 - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - - mov loop_cnt, #(VECTOR_LENGTH/4-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - add.w src, src, #+16 - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - le loop_cnt, 1b -2: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in - .unreq tmp - .unreq res - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q6 - tmp .req q2 - resA .req q4 - resB .req q5 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - - consts .req r7 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, 
#(10*4 + 8*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 resB, q_in0, twiddle - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - 
.unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett: - push {r4-r11,lr} - vpush {d8-d11} - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - consts .req r7 - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - loop_cnt .req r14 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q5 - tmp .req q2 - resA .req q3 - resB .req q4 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(9*4 + 2*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.s32 resB, q_in0, twiddle - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - .align 2 - wls loop_cnt, loop_cnt, 2 -1: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vmul.s32 
resB, q_in0, twiddle
-        vstrw.32 resA, [dst], #+32
-        vqrdmulh.s32 tmp, q_in0, twiddle_twisted
-        vmla.s32 resB, tmp, mod_q
-        vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
-        vmul.s32 resA, q_in0, twiddle_fix
-        ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
-        vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted
-        add.w src, src, #+16
-        vmla.s32 resA, tmp, mod_q
-        le loop_cnt, 1b
-2:
-        vstrw.32 resB, [dst, #+16]
-        vmul.s32 resB, q_in1, twiddle
-        vstrw.32 resA, [dst], #+32
-        vqrdmulh.s32 tmp, q_in1, twiddle_twisted
-        vmla.s32 resB, tmp, mod_q
-        vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
-        vmul.s32 resA, q_in1, twiddle_fix
-        ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
-        vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
-        add.w src, src, #+16
-        vmla.s32 resA, tmp, mod_q
-        vstrw.32 resB, [dst, #+16]
-        vstrw.32 resA, [dst], #+32
-
-        vpop {d8-d11}
-        pop {r4-r11,pc}
-
-        .unreq loop_cnt
-        .unreq mod_q
-        .unreq twiddle
-        .unreq twiddle_twisted
-        .unreq q_off_rev
-        .unreq q_in0
-        .unreq q_in1
-        .unreq tmp
-        .unreq resA
-        .unreq resB
-        .unreq dst
-        .unreq src
-        .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_simple, %function
-.global twisted_cyclic_mul_deg4_u32_mve_simple
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_simple:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mod_q      .req r3
-        mod_q_inv  .req r12
-        mod_q_vect .req q7
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        res_lo .req q5
-        res_hi .req q6
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/4))
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        wls loop_cnt, loop_cnt, 2
-1:
-        vldrw.u32 l_a,  [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vldrw.u32 l_b2, [in_B, #(-32 + 4 )]
-        vldrw.u32 l_b1, [in_B, #(-32 + 8 )]
-        vldrw.u32 l_b0, [in_B, #(-32 + 12)]
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi, res_hi, res_lo
-        vstrw.s32 res_hi, [dst], #+16
-        le loop_cnt, 1b
-2:
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_mve
-twisted_cyclic_mul_deg4_u32_add_sub_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q      .req r11
-        mod_q_inv  .req r12
-        mod_q_vect .req q7
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q4 //q3
-        l_b0 .req q4
-
-        res_lo  .req q5
-        res_hi0 .req q6
-        res_hi1 .req q1
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_a, [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        // From this point onwards, l_b3 and l_b2 are never used
-        // at the same time. Use the same register for them
-        .unreq l_b3
-        l_b3 .req l_b2
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        .align 2
-
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Add/sub with result from previous iteration
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2 // Currently unused
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-        vsub.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi0
-        .unreq res_hi1
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr:
-        .byte 3*4
-        .byte 2*4
-        .byte 1*4
-        .byte 0*4
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q      .req r11
-        mod_q_inv  .req r12
-        mod_q_vect .req q7
-
-        q_rev .req q3
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q4 //q3
-        l_b0 .req q4
-
-        res_lo  .req q5
-        res_hi0 .req q6
-        res_hi1 .req q1
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        tmp .req r5
-        adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr
-        vldrb.u32 q_rev, [tmp]
-        vadd.u32 q_rev, q_rev, in_A
-        .unreq tmp
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vldrw.u32 l_a, [q_rev]
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
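[Reviewer note, not part of the patch: the vmul.u32 / vmulh.s32 / vsub.s32 triple used throughout these kernels is a Montgomery reduction of the 64-bit accumulators produced by vmlaldav. A minimal Python reference model of that step — assuming, as the parameter block suggests, that mod_q_inv is the inverse of mod_q modulo 2^32; names are illustrative, not from this patch:]

```python
R = 1 << 32  # Montgomery radix matching the 32-bit lane width

def montgomery_reduce(prod, mod_q, mod_q_inv):
    # mod_q_inv is assumed to satisfy (mod_q * mod_q_inv) % R == 1.
    # Mirrors the assembly: vmul.u32 multiplies the low half by the
    # inverse, vmulh.s32 takes the high half of m * mod_q, and
    # vsub.s32 computes res_hi - mulh. The result is prod * R^-1
    # modulo mod_q (exact, since prod - m*mod_q is divisible by R).
    m = ((prod & (R - 1)) * mod_q_inv) & (R - 1)
    return (prod - m * mod_q) >> 32

q = 8380417                   # e.g. the Dilithium prime
q_inv = pow(q, -1, R)
r = montgomery_reduce(12345 * 67890, q, q_inv)
assert (r * R - 12345 * 67890) % q == 0
```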
-
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        // From this point onwards, l_b3 and l_b2 are never used
-        // at the same time. Use the same register for them
-        .unreq l_b3
-        l_b3 .req l_b2
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        .align 2
-
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Add/sub with result from previous iteration
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2 // Currently unused
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-        vsub.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi0
-        .unreq res_hi1
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve
-.align 4
-twisted_cyclic_mul_deg4_u32_add_sub_split_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q      .req r11
-        mod_q_inv  .req r12
-        mod_q_vect .req q7
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        res_lo  .req q5
-        res_hi  .req q6
-        res_old .req q5 // Overlaps with res_lo deliberately
-
-        in_A  .req r0
-        in_B  .req r1
-        dst   .req r2
-        dst_h .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        add dst_h, dst, #(4*VECTOR_LENGTH/2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        ldr tmp_params, [sp, #(10*4 + 8*16 + 16)]
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_a, [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vstrw.s32 res_hi, [sp]
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vldrw.s32 res_old, [sp]
-        tmp .req q1 // == l_b3 (currently unused)
-        vadd.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst_h], #+16
-        .unreq tmp
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vstrw.s32 res_hi, [sp]
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-
-        // Add/sub with result from previous iteration
-        vldrw.s32 res_old, [sp]
-        tmp .req q1 // == l_b3 (currently unused)
-        vadd.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst], #16
-        vsub.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst_h], #16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        /* Defer storing of last result */
-        .unreq res_old
-        res_old .req q6
-        .unreq res_hi
-        .unreq l_b1
-        res_hi .req q3
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi, res_hi, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_old, res_hi
-        vstrw.32 mod_q_vect, [dst], #16
-        vsub.s32 mod_q_vect, res_old, res_hi
-        vstrw.32 mod_q_vect, [dst_h], #16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi
-        .unreq res_old
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function
-.global twisted_cyclic_mul_deg4_u32_long_mve_v1
-.align 4
-twisted_cyclic_mul_deg4_u32_long_mve_v1:
-        push {r4-r11,lr}
-        vpush {d0-d9}
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/4))
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r6
-        res1_hi .req r7
-        res2_lo .req r8
-        res2_hi .req r9
-        res0_lo .req r10
-        res0_hi .req r11
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vldrw.u32 l_a, [in_A], #+16       /* (a0, a1, a2, a3) */
-
-        vldrw.u32 l_b3, [in_B], #+32      /* (b3, b2, b1, b0) */
-        vldrw.u32 l_b0, [in_B,#(-32+3*4)] /* (b0, zeta*b3, zeta*b2, zeta*b1) */
-        vldrw.u32 l_b1, [in_B,#(-32+2*4)] /* (b1, b0, zeta*b3, zeta*b2) */
-        vldrw.u32 l_b2, [in_B,#(-32+1*4)] /* (b2, b1, b0, zeta*b3) */
-
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-
-        //strd res0_lo, res1_lo, [dst], #8
-        //strd res2_lo, res3_lo, [dst], #8
-        //strd res0_hi, res1_hi, [dst], #8
-        //strd res2_hi, res3_hi, [dst], #8
-
-        strd res0_lo, res0_hi, [dst], #8
-        strd res1_lo, res1_hi, [dst], #8
-        strd res2_lo, res2_hi, [dst], #8
-        strd res3_lo, res3_hi, [dst], #8
-
-        le loop_cnt, 1b
-2:
-
-        vpop {d0-d9}
-        pop {r4-r11,lr}
-
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res0_lo
-        .unreq res0_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res3_lo
-        .unreq res3_hi
-
-.type twisted_cyclic_mul_deg4_u32_mve, %function
-.global twisted_cyclic_mul_deg4_u32_mve
-twisted_cyclic_mul_deg4_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        /* Preparation -- amortizes when looping */
-
-        mod_q      .req r12
-        mod_q_inv  .req r14
-        mod_q_vect .req q4 /* vmulh requires vector operand */
-
-        ldrd mod_q, mod_q_inv, [r2]
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        tw1 .req r10
-        tw2 .req r11
-        tw3 .req r12
-
-        l_a .req q0
-        l_b .req q1
-
-        res_lo .req q2
-        res_hi .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        /* Input A */
-        vldrw.u32 l_b, [in_B], #+16
-        vmov tw1, tw3, l_b[3], l_b[1]
-        vldrw.u32 l_a, [in_A], #+16
-
-        /* Assume b-input is already reversed */
-
-        /* Extract second half of twisted b into GPRs */
-
-        vmov.s32 tw2, l_b[2]
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res2_lo .req r6
-        res2_hi .req r7
-
-        /* TODO:
-         * For twisted multiplication, add Montgomery multiplication here.
-         * Adds 3 instructions. */
-
-        /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b
-
-        /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */
-        vshlc l_b, tw3, #32
-        .unreq tw3
-
-        /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b
-
-        /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */
-        vshlc l_b, tw2, #32
-        .unreq tw2
-
-        res1_lo .req r8
-        res1_hi .req r9
-
-        /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b
-
-        /* Move low and high results into result vector */
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-
-        res0_lo .req r8
-        res0_hi .req r9
-
-        /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */
-        vshlc l_b, tw1, #32
-        .unreq tw1
-
-        /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b
-
-        /* PRELOAD FOR NEXT ITERATION? */
-
-        /* Move low results into result vector */
-        vmov res_lo[2], res_lo[0], res2_lo, res0_lo
-
-        /* Montgomery 1 */
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        /* Move high results into result vector */
-        vmov res_hi[2], res_hi[0], res2_hi, res0_hi
-        /* Montgomery 2 */
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        /* Montgomery 3 */
-        vsub.s32 res_hi, res_hi, res_lo
-
-        /* Store results */
-        vstrw.s32 res_hi, [dst], #+16
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-        .unreq l_a
-        .unreq l_b
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type cyclic_mul_deg4_u32_mve, %function
-.global cyclic_mul_deg4_u32_mve
-cyclic_mul_deg4_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mov r10, #0x0F0F
-        vmsr p0, r10
-
-        mod_q     .req r10
-        mod_q_inv .req r9
-
-        ldr mod_q, [r2,#0] /* Modulus */
-        ldr mod_q_inv, [r2,#4]
-
-        l_a0 .req q1
-        l_a1 .req q2
-        l_b0 .req q3
-        l_b1 .req q4
-
-        r_a0 .req q0
-        r_a1 .req q1
-        r_b0 .req q2
-        r_b1 .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */
-        vld20.u32 {l_a0,l_a1}, [in_A]
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-
-        /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */
-        vld20.u32 {l_b0,l_b1}, [in_B]
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Compute product in two vectors q4, q5 */
-
-        /* Can use q6, q7 for temporary data; need at least
-         * one temporary vector per subproduct. */
-
-        /*
-         * Ballpark estimates:
-         * - 4 = 2x2 VLD2x to load current polynomials
-         * - 2 = 2x VST2x to store result
-         * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form
-         * - 16 = 4x4 Vector Multiplications, 4 per subproduct
-         * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction
-         *   In fact, use VSUB for first time each target vector is
-         *   used, and VHSUB for the second time.
-         * - 2 = 2x VCADD for interpolation of result --
-         *   Note that we don't need to do this in every
-         *   subproduct.
-         *
-         * Total: 32 instructions
-         *
-         * Pretty promising... if it pipelines well and we have enough
-         * vector registers.
-         */
-
-        /* Transform input into evaluated form */
-        vcadd.i32 r_a0, l_a0, l_a0, #90
-        .unreq l_a0
-
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1
-
-        vcadd.i32 r_b0, l_b0, l_b0, #90
-        .unreq l_b0
-
-        vcadd.i32 r_b1, l_b1, l_b1, #90
-        .unreq l_b1
-
-        /* Subproduct 1: a0*b1 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially:             a0, a1, b0, b1
-         * - Temporary allocations: 1
-         * - Final allocations:     a0, a1, b0, b1, dst1
-         */
-
-        /*
-         * OPTIMIZATION:
-         *
-         * - We have two free vectors at this point --
-         *   could use this for a late store of the results
-         *   of a previous iteration, residing in {q6, q7}.
-         *
-         * - Perform a late evaluation of r_a0, r_b1 here.
-         *
-         */
-
-        dst1 .req q5
-        tmp  .req q4
-
-        vmul.u32 tmp, r_a0, mod_q_inv /* Twist one factor using temporary tmp */
-        vqdmulh.s32 dst1, r_a0, r_b1  /* Initialize dst1 with high part */
-        vmul.u32 tmp, tmp, r_b1       /* Twisted low product */
-        vqdmulh.s32 tmp, tmp, mod_q   /* High product */
-        vsub.s32 dst1, tmp, dst1      /* Correct high product */
-        /* Defer halving for later */
-        /* Actually store _negative_ of result */
-
-        .unreq tmp
-
-        /* Subproduct 2: a1*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially:             a0, a1, b0, b1, dst1
-         * - Temporary allocations: 2
-         * - Final allocations:     a0, a1, b0, b1, dst1
-         */
-
-        tmp0 .req q6
-        tmp1 .req q4
-
-        vqdmulh.s32 tmp1, r_a1, r_b0   /* Write high-product into temporary */
-        vmul.u32 tmp0, q1, mod_q_inv   /* Twist one factor using temporary tmp */
-        vmul.u32 tmp0, tmp0, r_b0      /* Twisted low product */
-        vqdmlah.s32 dst1, tmp0, mod_q  /* High product, accumulate onto dst1,
-        * which stores the _negative_ of
-        * subproduct 1. */
-        vhsub.s32 dst1, tmp1, dst1     /* Correct high product */
-        /* Late halving, encompassing also the
-        * first subproduct. */
-        /* Note that, so far, dst1 contained
-        * -pre + high_correct.
-        * After this step, it's
-        * high - ( -pre + high_correct )
-        * = pre + high - high_correct,
-        * which is what we want. */
-
-        .unreq tmp0
-        .unreq tmp1
-
-        /* Finalize dst1 */
-
-        dst1_final .req q7
-        vcadd.s32 dst1_final, dst1, dst1, #270
-        .unreq dst1
-
-        /* Subproduct 3: a1*b1 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially:             a0, a1, b0, b1, dst1_final
-         * - Temporary allocations: 0
-         * - Final allocations:     a0, b0, dst1_final, dst0
-         */
-
-        dst0 .req q4
-
-        vqdmulh.s32 dst0, r_a1, r_b1   /* Initialize dst0 with high part */
-        vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */
-        vmul.u32 r_a1, r_a1, r_b1      /* Twisted low product */
-
-        .unreq r_b1
-
-        vqdmulh.s32 r_a1, r_a1, mod_q  /* High product */
-        vsub.s32 dst0, r_a1, dst0      /* Correct high product */
-        /* Defer halving for later */
-        /* Actually store _negative_ of result */
-
-        .unreq r_a1
-
-        vpst
-        vnegt.s32 dst0, dst0
-
-        /* Subproduct 4: a0*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially:             a0, b0, dst1_final, dst0
-         * - Temporary allocations: 1
-         * - Final allocations:     dst1_final, dst0
-         */
-
-        tmp .req q5
-
-        vqdmulh.s32 tmp, r_a0, r_b0    /* Write high-product into temporary */
-        vmul.u32 r_a0, r_a0, r_b0      /* Twisted low product */
-
-        .unreq r_b0
-
-        vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */
-        vqdmlah.s32 dst0, r_a0, mod_q  /* High product, accumulate onto dst0,
-        * which stores the _negative_ of
-        * subproduct 3. */
-        vhsub.s32 dst0, tmp, dst0      /* Correct high product */
-        /* Late halving, encompassing also the
-        * first subproduct. */
-        /* Note that, so far, dst0 contained
-        * -pre + high_correct.
-        * After this step, it's
-        * high - ( -pre + high_correct )
-        * = pre + high - high_correct,
-        * which is what we want. */
-
-        .unreq tmp
-
-        /* Finalize dst0 */
-        dst0_final .req q6
-        vcadd.s32 dst0_final, dst0, dst0, #270
-        .unreq dst0
-
-        /* Store results */
-        vst20.s32 {dst0_final, dst1_final}, [dst]
-        vst21.s32 {dst0_final, dst1_final}, [dst]!
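[Reviewer note, not part of the patch: the vcadd #90 transforms above put each coefficient pair into the "(+1,-1)-evaluated form" the comments mention, so a length-2 cyclic convolution becomes two pointwise products plus an interpolation whose halving the assembly defers into vhsub. A small Python sketch of that idea for a single pair, over the integers for clarity:]

```python
def cyclic_mul2(a, b):
    # (a0 + a1*X) * (b0 + b1*X) mod (X^2 - 1),
    # computed via evaluation at the points +1 and -1.
    a_plus, a_minus = a[0] + a[1], a[0] - a[1]   # the vcadd step
    b_plus, b_minus = b[0] + b[1], b[0] - b[1]
    p_plus, p_minus = a_plus * b_plus, a_minus * b_minus
    # Interpolation; the division by 2 is the halving the kernel
    # defers and later folds into vhsub.
    return ((p_plus + p_minus) // 2, (p_plus - p_minus) // 2)

# schoolbook check: c0 = a0*b0 + a1*b1, c1 = a0*b1 + a1*b0
assert cyclic_mul2((3, 5), (7, 2)) == (3*7 + 5*2, 3*2 + 5*7)
```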
- .unreq dst0_final - .unreq dst1_final - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq r_a0 - -.type cyclic_mul_deg4_u32_alt_mve, %function -.global cyclic_mul_deg4_u32_alt_mve -cyclic_mul_deg4_u32_alt_mve: - push {r4-r12,lr} - vpush {d0-d15} - - l_a0 .req q0 - l_a1 .req q1 - l_b0 .req q2 - l_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - cnt .req r4 - - dst0_last_final .req q6 - dst1_last_final .req q7 - - mod_q .req r10 - mod_q_inv .req r9 - pred_helper .req r8 - - vld20.u32 {l_a0,l_a1}, [in_A] - mov pred_helper, #0x0F0F - vld21.u32 {l_a0,l_a1}, [in_A]! - vmsr p0, pred_helper - - vld20.u32 {l_b0,l_b1}, [in_B] - ldr mod_q_inv, [r2,#4] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), - * dst0 (q3) - */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - ldr mod_q, [r2,#0] - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Montgomery twist */ - mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of - * loop counter */ - vqdmulh.s32 tmp1, tmp1, mod_q /* Montgomery high product fix */ - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initial high product */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initial high product */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Fix high product */ - /* Defer halving for later */ - /* Store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Montgomery low product twist */ - - vpst 
- vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* Montgomery high product fix */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Montgomery twist */ - - /* Subproduct 3: a0*b1 */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp // q1 - - /* - * Vector register allocation state: - * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * - Temporary allocations: 1 (q5) - * - Final allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - vmul.u32 tmp0, tmp0, r_b1 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q5 - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. 
*/ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - - /* LOAD r_a1 into q5 here..., - * freeing up q1 as a temporary */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */ - vmul.u32 tmp0, r_a0, r_b0 /* Twisted low product */ - /* Can overwrite rb0 now */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - - vmul.u32 tmp0, tmp0, mod_q_inv - - l_b0 .req q2 - l_b1 .req q3 - /* Preload for next iteration */ - vld20.u32 {l_b0,l_b1}, [in_B] - - vqdmlah.s32 dst0, tmp0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q1 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp // q4 - .unreq dst0_old - - vld21.u32 {l_b0,l_b1}, [in_B]! 
- - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - nop - wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end - -cyclic_mul_deg4_u32_alt_mve_loop_start: - - nop - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3) - */ - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Twisted low product */ - - vst20.s32 {dst0_last_final,dst1_last_final}, [dst] - - vqdmulh.s32 tmp1, tmp1, mod_q /* High product */ - - vst21.s32 {dst0_last_final,dst1_last_final}, [dst]! - .unreq dst0_last_final // q6 - .unreq dst1_last_final // q7 - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initialize dst1 with high part */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Twisted low product */ - - vpst - vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of 
result */ - .unreq tmp // q1 - - /* Subproduct 3: a0*b1 - * - * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) - * T: 1 (q5) - * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1) - */ - - tmp1 .req q0 - vmul.u32 tmp1, tmp0, r_b1 - - - vqdmlah.s32 dst1, tmp1, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp1 // q0 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vmul.u32 tmp, tmp0, r_b0 /* Twisted low product */ - .unreq tmp0 - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - vqdmulh.s32 tmp0, r_a0, r_b0 /* Write high-product into temporary */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - vqdmlah.s32 dst0, tmp, mod_q /* High product, accumulate onto tmp, - * 
which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp // q4 - - /* Preload for next iteration */ - l_b0 .req q2 - l_b1 .req q3 - vld20.u32 {l_b0,l_b1}, [in_B] - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp0, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 // q1 - .unreq dst0_old - - /* Preload for next iteration */ - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - le lr, cyclic_mul_deg4_u32_alt_mve_loop_start - -cyclic_mul_deg4_u32_alt_mve_loop_end: - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
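The high-product/low-product/correction pattern used throughout this loop is an instance of Montgomery reduction: with `qinv = -q^{-1} mod R`, the multiple `m*q` cancels the low half of `t`, so the division by `R` is exact. A plain-integer sketch (the assembly additionally folds in the doubling/halving of `vqdmulh`/`vhsub` and operates on twisted inputs, which this model omits):

```python
def montgomery_reduce(t, q, qinv, R=1 << 32):
    # qinv = -q^{-1} mod R; returns a value congruent to t * R^{-1} mod q
    m = (t * qinv) % R        # low product: makes t + m*q divisible by R
    return (t + m * q) // R   # take the high half; the division is exact
```

For example, with the Dilithium modulus q = 8380417 (used here purely as a sample odd modulus), the result is congruent to `t * R^{-1} mod q` and stays within a small range around q.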
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a0 - .unreq l_b0 - .unreq l_b1 - .unreq r_a1 - - .unreq cnt - -.type montgomery_pt_u32_odd_mve, %function -.global montgomery_pt_u32_odd_mve -montgomery_pt_u32_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 4) - - ldr mod_q, [in_B], #+4 /* Modulus */ - ldr mod_q_inv, [in_B], #+4 /* Inverse */ - - wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end - -montgomery_pt_u32_odd_mve_loop_start: - - vldrw.s32 l_a, [in_A], #+16 - vmul.u32 l_at, l_a, mod_q_inv - vldrw.s32 l_b, [in_B], #+16 - vqrdmulh.s32 tmp0, l_a, l_b - vmul.u32 tmp1, l_at, l_b - vqrdmlah.s32 tmp0, tmp1, mod_q - vstrw.s32 tmp0, [dst], #+16 - - le lr, montgomery_pt_u32_odd_mve_loop_start - -montgomery_pt_u32_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_u32_mve, %function -.global montgomery_pt_u32_mve -.align 4 -montgomery_pt_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, 
montgomery_pt_u32_mve_loop_end - -montgomery_pt_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_u32_mve_loop_start - -montgomery_pt_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. 
*/ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_acc_u32_mve, %function -.global montgomery_pt_acc_u32_mve -.align 4 -montgomery_pt_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - old .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end - -montgomery_pt_acc_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store-accumulate from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_acc_u32_mve_loop_start - -montgomery_pt_acc_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - 
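Structurally, `montgomery_pt_u32_mve` is a two-stage software pipeline: the first iteration is peeled off before `wls`, the steady-state loop overlaps the preloads for element i+1 with the arithmetic for element i and performs a "late store" of the previous result, and the last iteration is peeled off after the loop (hence the `- 2` in the loop count). A minimal scalar sketch of just the deferred-store structure (hypothetical helper names, not from the source):

```python
def two_stage_pipeline(inputs, compute):
    cur = inputs[0]
    pending = compute(cur)      # peeled first iteration, before "wls"
    out = []
    for nxt in inputs[1:]:      # steady state (the wls/le loop)
        out.append(pending)     # late store from the previous iteration
        pending = compute(nxt)  # overlaps with preloads in the assembly
    out.append(pending)         # peeled last iteration / final store
    return out
```

The payoff in the assembly is that every `vldrw`/`vstrw` sits between multiplies, so loads and stores can dual-issue with the arithmetic instead of stalling the pipeline.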
- /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. */ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_round_acc_u32_mve, %function -.global montgomery_pt_round_acc_u32_mve -.align 4 -montgomery_pt_round_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - oldA .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - oldB .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 8) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b 
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_b, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_b, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov 
cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end - 
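The `_x2`/`_x4` variants interleave two or four independent a-streams (each with its own destination) against a single shared b-stream, so the loads and stores of one stream hide the multiply latency of another. Functionally, each stream still computes the same accumulating Montgomery pointwise product; a plain-integer reference in which the rounding behaviour of `vqrdmulh`/`vqrdmlah` is approximated by exact REDC (an assumption made for illustration):

```python
def montgomery_reduce(t, q, qinv, R=1 << 32):
    # qinv = -q^{-1} mod R; exact Montgomery reduction
    m = (t * qinv) % R
    return (t + m * q) // R

def pt_round_acc_x4(a_streams, b, dsts, q, qinv):
    # four independent accumulations sharing one b vector,
    # as in montgomery_pt_round_acc_u32_x4_mve
    for a, dst in zip(a_streams, dsts):
        for i, (x, y) in enumerate(zip(a, b)):
            dst[i] += montgomery_reduce(x * y, q, qinv)
    return dsts
```

Only the scheduling differs between the x1, x2, and x4 kernels; interleaving more streams mainly buys more independent instructions to fill the gaps between dependent multiplies.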
-montgomery_pt_round_acc_u32_x4_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - le cnt, 
montgomery_pt_round_acc_u32_x4_mve_loop_start - -montgomery_pt_round_acc_u32_x4_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - 
vstrw.s32 oldA, [dst2], #+16 - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_A2 - .unreq in_A3 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq dst2 - .unreq dst3 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - - -.type montgomery_pt_u16_odd_mve, %function -.global montgomery_pt_u16_odd_mve -montgomery_pt_u16_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 8) - - ldrh mod_q, [in_B], #+2 /* Modulus */ - ldrh mod_q_inv, [in_B], #+2 /* Inverse */ - - wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end - -montgomery_pt_u16_odd_mve_loop_start: - - vldrh.s16 l_a, [in_A], #+16 - vmul.u16 l_at, l_a, mod_q_inv - vldrh.s16 l_b, [in_B], #+16 - vqrdmulh.s16 tmp0, l_a, l_b - vmul.u16 tmp1, l_at, l_b - vqrdmlah.s16 tmp0, tmp1, mod_q - vstrh.s16 tmp0, [dst], #+16 - - le lr, montgomery_pt_u16_odd_mve_loop_start - -montgomery_pt_u16_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type montgomery_u16_core_mve, %function -.global montgomery_u16_core_mve -montgomery_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(-MODULUS_Q16) /* Modulus */ - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0] - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - /* High product */ - vqdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqdmulh.s16 q0, q0, r10 - vsub.s16 q1, q1, q0 - - /* Store result */ - vstrh.s16 q1, [r2] - - vpop {d0-d15} - pop 
{r4-r12,lr} - - bx lr - -.type montgomery_u16_round_mve, %function -.global montgomery_u16_round_mve -montgomery_u16_round_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - mov r10, #(-3329) /* Modulus */ - mov r8, #8 /* Iterations */ - - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - wls lr, r8, montgomery_u16_round_mve_loop_end -montgomery_u16_round_mve_loop_start: - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0], #16 - - /* High product */ - vqrdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqrdmlah.s16 q1, q0, r10 - - /* Store result */ - vstrh.s16 q1, [r2], #16 - - le lr, montgomery_u16_round_mve_loop_start -montgomery_u16_round_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type cyclic_mul_u16_core_mve, %function -.global cyclic_mul_u16_core_mve -cyclic_mul_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Load polynomials to multiply - * - * Lanes come in pairs representing real and imaginary parts. - */ - vldrh.s16 q0, [r0] - vldrh.s16 q1, [r1] - - /* Step 1: - * - * Apply evaluation at -1, +1: - * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1) - * - * Concretely: - * (a,b) |-> (a-b, a+b) - * - * This can be implemented as a rotate-and-add - * operation, treating (a,b) as a complex number - * a+bi, and noticing that a rotation by 90 - * gives i(a+bi) = -b + ai, so - * a+bi + i(a+bi) = (a-b) + (a+b)i - * - * This rotate-90-and-add is a single - * instruction in MVE.
- */ - vcadd.i16 q0, q0, q0, #90 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - - /* Montgomery multiplications - * - * 1x mul-high - * 1x mul-low - * 1x mul-high - * 1x subtract - * - * Needs 1x free temporary vector register - */ - vqdmulh.s16 q0, q0, q1 - vmul.u16 q1, q2, q1 - /*vmul.u16 q0, q0, r9*/ - vqdmulh.s16 q1, q1, r10 - /* Now we've actually computed twice the desired result, - * but we can compensate by using vhsub */ - vhsub.s16 q0, q0, q1 - - /* - * Finally, interpolation step: - * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y) - * - * This can be done as a single VCHADD, with - * rotate by 270: -i(a+bi) = b - ai - * - * We can't naively use vhcadd here because the - * multiplication by 1/2 is modulo q. - */ - vcadd.s16 q0, q0, q0, #270 - - vstrh.s16 q0, [r2] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -.type cyclic_mul_u16_mve, %function -.global cyclic_mul_u16_mve -cyclic_mul_u16_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Number of inner iterations */ - mov r4, #(VECTOR_LENGTH/16 - 1) - - vldrh.s16 q0, [r0], #16 - vcadd.i16 q0, q0, q0, #90 - vldrh.s16 q1, [r1], #16 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vstrh.s16 q4, [r2] - vmul.u16 q1, q2, q1 - vldrh.s16 q3, [r0], #16 - vqdmulh.s16 q1, q1, r10 - vcadd.i16 q3, q3, q3, #90 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - wls lr, r4, cyclic_mul_u16_loop_end -cyclic_mul_u16_loop_start: - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vldrh.s16 q3, [r0], #16 - vmul.u16 q1, q2, q1 - vcadd.i16 q3, q3, q3, #90 - 
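The steps described in the comments above implement a length-2 cyclic convolution by CRT: evaluate at -1 and +1 (the `vcadd #90` rotate-and-add), multiply pointwise, then interpolate with a halving (`vcadd #270` plus `vhsub`). Over plain integers — no modular reduction, an assumption made to keep the sketch minimal — the forward and inverse maps look like:

```python
def cyclic_mul2(a, b):
    # (a0 + a1*X) * (b0 + b1*X) mod (X^2 - 1), via evaluation at -1, +1
    x = (a[0] - a[1]) * (b[0] - b[1])   # evaluation at -1
    y = (a[0] + a[1]) * (b[0] + b[1])   # evaluation at +1
    # interpolation: both x+y and y-x are even, so the halving is exact
    return ((x + y) // 2, (y - x) // 2)
```

In the modular setting the halving cannot simply be a shift — dividing by 2 must happen mod q, which is why the assembly defers it into the Montgomery-correction subtraction rather than using a plain halving-add throughout.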
vqdmulh.s16 q1, q1, r10 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - le lr, cyclic_mul_u16_loop_start -cyclic_mul_u16_loop_end: - - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - -.type cyclic_mul_u16_multi_naive_mve, %function -.global cyclic_mul_u16_multi_naive_mve -cyclic_mul_u16_multi_naive_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vpop {d0-d15} - pop {r4-r12, lr} - bx lr - -.type cyclic_mul_u32_mve, %function -.global cyclic_mul_u32_mve -cyclic_mul_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - movw r10, #:lower16:MODULUS_Q32 - movt r10, #:upper16:MODULUS_Q32 - - ldr r9, [r2] - mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */ - wls lr, r3, cyclic_mul_u32_loop_end -cyclic_mul_u32_loop_start: 
- vldrw.s32 q1, [r0], #16 - vcadd.i32 q0, q1, q1, #90 - vldrw.s32 q2, [r1], #16 - vcadd.i32 q1, q2, q2, #90 - vqdmulh.s32 q2, q0, q1 - vmul.u32 q0, q0, r9 - vmul.u32 q1, q0, q1 - vqdmulh.s32 q1, q1, r10 - vhsub.s32 q2, q2, q1 - vcadd.s32 q1, q2, q2, #270 - vstrw.s32 q1, [r2], #16 - le lr, cyclic_mul_u32_loop_start -cyclic_mul_u32_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr diff --git a/tests/ntt-1024/montgomery.s b/tests/ntt-1024/montgomery.s deleted file mode 100644 index 196b8a6..0000000 --- a/tests/ntt-1024/montgomery.s +++ /dev/null @@ -1,3647 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- *
- */
-
-#include "montgomery_const.h"
-
- .syntax unified
-
-.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function
-.global twisted_cyclic_mul_acc_deg4_u32_mve_alt
-.align 4
-twisted_cyclic_mul_acc_deg4_u32_mve_alt:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7 /* vmulh requires vector operand */
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
-
- dst_vect .req q5 // Overlapping with res_lo
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end
-twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start:
-
- vldrw.s32 dst_vect, [dst]
- vadd.s32 res_hi, res_hi, dst_vect
- vstrw.s32 res_hi, [dst], #+16
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
- le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start
-
-twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end:
-
- /* Defer storing of last result */
- res_hi_old .req q6
- .unreq res_hi
- .unreq l_b1
- res_hi .req q3
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.s32 l_b0, [dst]
- vadd.s32 res_hi_old, res_hi_old, l_b0
- vstrw.s32 res_hi_old, [dst], #+16
- vsub.s32 res_hi, res_hi, res_lo
- vldrw.s32 l_b0, [dst]
- vadd.s32 res_hi, res_hi, l_b0
- vstrw.s32 res_hi, [dst], #+16
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b0
-
- .unreq res_lo
- .unreq res_hi
-
- .unreq dst_vect
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq loop_cnt
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
-.type twisted_cyclic_mul_deg4_u32_mve_alt, %function
-.global twisted_cyclic_mul_deg4_u32_mve_alt
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_alt:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r3
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B, #(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end
-twisted_cyclic_mul_deg4_u32_mve_alt_loop_start:
-
- vstrw.s32 res_hi, [dst], #+16
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start
-
-twisted_cyclic_mul_deg4_u32_mve_alt_loop_end:
-
- /* Defer storing of last result */
- res_hi_old .req q6
- .unreq res_hi
- .unreq l_b1
- res_hi .req q3
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vstrw.s32 res_hi_old, [dst], #+16
- vsub.s32 res_hi, res_hi, res_lo
- vstrw.s32 res_hi, [dst], #+16
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_mve_expand, %function
-.global twisted_cyclic_mul_deg4_u32_mve_expand
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_expand_consts:
- .byte 3
- .byte 2
- .byte 1
- .byte 0
-
-twisted_cyclic_mul_deg4_u32_mve_expand:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- loop_cnt .req r14
-
- twiddle .req r4
- twiddle_twisted .req r5
-
- q_off_rev .req q0
- q_in .req q1
- tmp .req q3
- res .req q2
-
- dst .req r0
- src .req r1
- twiddle_table .req r2
- mod_q .req r3
-
- consts .req r4
- adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts
- vldrb.u32 q_off_rev, [consts]
- .unreq consts
-
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vldrw.32 q_in, [src, q_off_rev, UXTW #2]
-
- mov loop_cnt, #(VECTOR_LENGTH/4-1)
- wls loop_cnt, loop_cnt, 2
- .align 2
-1:
-
- vqrdmulh.s32 res, q_in, twiddle
- vstrw.32 q_in, [dst], #+32
- vmul.u32 tmp, q_in, twiddle_twisted
- add.w src, src, #+16
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vldrw.32 q_in, [src, q_off_rev, UXTW #2]
- vqrdmlah.s32 res, tmp, mod_q
- vstrw.32 res, [dst, #-16]
-
- le loop_cnt, 1b
-2:
-
- vqrdmulh.s32 res, q_in, twiddle
- vstrw.32 q_in, [dst], #+32
- vmul.u32 tmp, q_in, twiddle_twisted
- vqrdmlah.s32 res, tmp, mod_q
- vstrw.32 res, [dst, #-16]
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq loop_cnt
-
- .unreq mod_q
- .unreq twiddle
- .unreq twiddle_twisted
-
- .unreq q_off_rev
- .unreq q_in
- .unreq tmp
- .unreq res
-
- .unreq dst
- .unreq src
- .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function
-.global twisted_cyclic_mul_deg4_u32_mve_expand_double
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_expand_double_consts:
- .byte 3
- .byte 2
- .byte 1
- .byte 0
-
-twisted_cyclic_mul_deg4_u32_mve_expand_double:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- loop_cnt .req r14
-
- mod_q .req r4
- twiddle .req r5
- twiddle_twisted .req r6
- twiddle_fix .req r7
- twiddle_fix_twisted .req r8
-
- q_off_rev .req q0
- q_in0 .req q1
- q_in1 .req q6
- tmp .req q2
- resA .req q4
- resB .req q5
-
- dst .req r0
- src .req r1
- twiddle_table .req r2
- twiddle_fix_ptr .req r3
-
- consts .req r7
-
- adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts
- vldrb.u32 q_off_rev, [consts]
- .unreq consts
-
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- ldr mod_q, [sp, #(10*4 + 8*16)]
- ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr]
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 resB, q_in0, twiddle
- vmul.u32 tmp, q_in0, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- add.w src, src, #+16
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
-
- mov loop_cnt, #((VECTOR_LENGTH/8)-1)
- wls loop_cnt, loop_cnt, 2
- .align 2
-1:
-
- vstrw.32 resB, [dst, #+16]
- vqrdmulh.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vmul.u32 tmp, q_in1, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
-
- vstrw.32 resB, [dst, #+16]
- vqrdmulh.s32 resB, q_in0, twiddle
- vstrw.32 resA, [dst], #+32
- vmul.u32 tmp, q_in0, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
-
- le loop_cnt, 1b
-2:
- vstrw.32 resB, [dst, #+16]
- vqrdmulh.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vmul.u32 tmp, q_in1, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
- vstrw.32 resB, [dst, #+16]
- vstrw.32 resA, [dst], #+32
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq loop_cnt
-
- .unreq mod_q
- .unreq twiddle
- .unreq twiddle_twisted
-
- .unreq q_off_rev
- .unreq q_in0
- .unreq q_in1
- .unreq tmp
- .unreq resA
- .unreq resB
-
- .unreq dst
- .unreq src
- .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function
-.global twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts:
- .byte 3
- .byte 2
- .byte 1
- .byte 0
-
-twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett:
- push {r4-r11,lr}
- vpush {d8-d11}
-
- dst .req r0
- src .req r1
- twiddle_table .req r2
- twiddle_fix_ptr .req r3
- consts .req r7
- mod_q .req r4
- twiddle .req r5
- twiddle_twisted .req r6
- twiddle_fix .req r7
- twiddle_fix_twisted .req r8
- loop_cnt .req r14
-
- q_off_rev .req q0
- q_in0 .req q1
- q_in1 .req q5
- tmp .req q2
- resA .req q3
- resB .req q4
-
- adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts
- vldrb.u32 q_off_rev, [consts]
- .unreq consts
-
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- ldr mod_q, [sp, #(9*4 + 2*16)]
- ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr]
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.s32 resB, q_in0, twiddle
- vqrdmulh.s32 tmp, q_in0, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- add.w src, src, #+16
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- mov loop_cnt, #((VECTOR_LENGTH/8)-1)
- .align 2
- wls loop_cnt, loop_cnt, 2
-1:
- vstrw.32 resB, [dst, #+16]
- vmul.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vqrdmulh.s32 tmp, q_in1, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- vstrw.32 resB, [dst, #+16]
- vmul.s32 resB, q_in0, twiddle
- vstrw.32 resA, [dst], #+32
- vqrdmulh.s32 tmp, q_in0, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- le loop_cnt, 1b
-2:
- vstrw.32 resB, [dst, #+16]
- vmul.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vqrdmulh.s32 tmp, q_in1, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- vstrw.32 resB, [dst, #+16]
- vstrw.32 resA, [dst], #+32
-
- vpop {d8-d11}
- pop {r4-r11,pc}
-
- .unreq loop_cnt
- .unreq mod_q
- .unreq twiddle
- .unreq twiddle_twisted
- .unreq q_off_rev
- .unreq q_in0
- .unreq q_in1
- .unreq tmp
- .unreq resA
- .unreq resB
- .unreq dst
- .unreq src
- .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_simple, %function
-.global twisted_cyclic_mul_deg4_u32_mve_simple
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_simple:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r3
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4))
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- wls loop_cnt, loop_cnt, 2
-1:
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vldrw.u32 l_b2, [in_B, #(-32 + 4 )]
- vldrw.u32 l_b1, [in_B, #(-32 + 8 )]
- vldrw.u32 l_b0, [in_B, #(-32 + 12)]
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vsub.s32 res_hi, res_hi, res_lo
- vstrw.s32 res_hi, [dst], #+16
- le loop_cnt, 1b
-2:
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b1
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_mve
-twisted_cyclic_mul_deg4_u32_add_sub_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- sub sp, sp, #16
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q4 //q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi0 .req q6
- res_hi1 .req q1
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B, #(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- // From this point onwards, l_b3 and l_b2 are never used
- // at the same time. Use the same register for them
- .unreq l_b3
- l_b3 .req l_b2
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi1, res_hi1, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- tmp .req l_b2
- vadd.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- .unreq tmp
-
- .align 2
-
-
- wls loop_cnt, loop_cnt, 2
-1:
-
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi1, res_hi1, res_lo
-
- // Add/sub with result from previous iteration
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- tmp .req l_b2 // Currently unused
- vadd.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- .unreq tmp
-
- le loop_cnt, 1b
-
-2:
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vsub.s32 res_hi1, res_hi1, res_lo
-
- // Don't need mod_q_vect anymore
- vadd.s32 mod_q_vect, res_hi0, res_hi1
- vstrw.32 mod_q_vect, [dst], #+16
- vsub.s32 mod_q_vect, res_hi0, res_hi1
- vstrw.32 mod_q_vect, [dst], #+16
-
- add sp, sp, #16
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b1
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi0
- .unreq res_hi1
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr:
- .byte 3*4
- .byte 2*4
- .byte 1*4
- .byte 0*4
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- sub sp, sp, #16
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- q_rev .req q3
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q4 //q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi0 .req q6
- res_hi1 .req q1
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- tmp .req r5
- adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr
- vldrb.u32 q_rev, [tmp]
- vadd.u32 q_rev, q_rev, in_A
- .unreq tmp
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vldrw.u32 l_a, [q_rev]
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [q_rev, #+16]!
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B, #(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- // From this point onwards, l_b3 and l_b2 are never used
- // at the same time. Use the same register for them
- .unreq l_b3
- l_b3 .req l_b2
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [q_rev, #+16]!
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi1, res_hi1, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- tmp .req l_b2
- vadd.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- .unreq tmp
-
- .align 2
-
-
- wls loop_cnt, loop_cnt, 2
-1:
-
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [q_rev, #+16]!
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [q_rev, #+16]!
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi1, res_hi1, res_lo
-
- // Add/sub with result from previous iteration
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- tmp .req l_b2 // Currently unused
- vadd.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- .unreq tmp
-
- le loop_cnt, 1b
-
-2:
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [q_rev, #+16]!
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vsub.s32 res_hi1, res_hi1, res_lo
-
- // Don't need mod_q_vect anymore
- vadd.s32 mod_q_vect, res_hi0, res_hi1
- vstrw.32 mod_q_vect, [dst], #+16
- vsub.s32 mod_q_vect, res_hi0, res_hi1
- vstrw.32 mod_q_vect, [dst], #+16
-
- add sp, sp, #16
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b1
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi0
- .unreq res_hi1
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve
-.align 4
-twisted_cyclic_mul_deg4_u32_add_sub_split_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- sub sp, sp, #16
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
- res_old .req q5 // Overlaps with res_lo deliberately
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- dst_h .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
- add dst_h, dst, #(4*VECTOR_LENGTH/2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- ldr tmp_params, [sp, #(10*4 + 8*16 + 16)]
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B, #(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- vstrw.s32 res_hi, [sp]
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- vldrw.s32 res_old, [sp]
- tmp .req q1 // == l_b3 (currently unused)
- vadd.s32 tmp, res_old, res_hi
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_old, res_hi
- vstrw.s32 tmp, [dst_h], #+16
- .unreq tmp
-
- wls loop_cnt, loop_cnt, 2
-1:
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- vstrw.s32 res_hi, [sp]
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
-
- // Add/sub with result from previous iteration
- vldrw.s32 res_old, [sp]
- tmp .req q1 // == l_b3 (currently unused)
- vadd.s32 tmp, res_old, res_hi
- vstrw.s32 tmp, [dst], #16
- vsub.s32 tmp, res_old, res_hi
- vstrw.s32 tmp, [dst_h], #16
- .unreq tmp
-
- le loop_cnt, 1b
-
-2:
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- /* Defer storing of last result */
- .unreq res_old
- res_old .req q6
- .unreq res_hi
- .unreq l_b1
- res_hi .req q3
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vsub.s32 res_hi, res_hi, res_lo
-
- // Don't need mod_q_vect anymore
- vadd.s32 mod_q_vect, res_old, res_hi
- vstrw.32 mod_q_vect, [dst], #16
- vsub.s32 mod_q_vect, res_old, res_hi
- vstrw.32 mod_q_vect, [dst_h], #16
-
- add sp, sp, #16
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi
- .unreq res_old
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function
-.global twisted_cyclic_mul_deg4_u32_long_mve_v1
-.align 4
-twisted_cyclic_mul_deg4_u32_long_mve_v1:
- push {r4-r11,lr}
- vpush {d0-d9}
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- in_A .req r0
- in_B .req r1
- dst .req r2
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4))
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r6
- res1_hi .req r7
- res2_lo .req r8
- res2_hi .req r9
- res0_lo .req r10
- res0_hi .req r11
-
- wls loop_cnt, loop_cnt, 2
-1:
-
- vldrw.u32 l_a, [in_A], #+16 /* (a0, a1, a2, a3) */
-
- vldrw.u32 l_b3, [in_B], #+32 /* (b3, b2, b1, b0) */
- vldrw.u32 l_b0, [in_B,#(-32+3*4)] /* (b0, zeta*b3, zeta*b2, zeta*b1) */
- vldrw.u32 l_b1, [in_B,#(-32+2*4)] /* (b1, b0, zeta*b3, zeta*b2) */
- vldrw.u32 l_b2, [in_B,#(-32+1*4)] /* (b2, b1, b0, zeta*b3) */
-
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-
- //strd res0_lo, res1_lo, [dst], #8
- //strd res2_lo, res3_lo, [dst], #8
- //strd res0_hi, res1_hi, [dst], #8
- //strd res2_hi, res3_hi, [dst], #8
-
- strd res0_lo, res0_hi, [dst], #8
- strd res1_lo, res1_hi, [dst], #8
- strd res2_lo, res2_hi, [dst], #8
- strd res3_lo, res3_hi, [dst], #8
-
- le loop_cnt, 1b
-2:
-
- vpop {d0-d9}
- pop {r4-r11,lr}
-
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b1
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res0_lo
- .unreq res0_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res3_lo
- .unreq res3_hi
-
-.type twisted_cyclic_mul_deg4_u32_mve, %function
-.global twisted_cyclic_mul_deg4_u32_mve
-twisted_cyclic_mul_deg4_u32_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- /* Preparation -- amortizes when looping */
-
- mod_q .req r12
- mod_q_inv .req r14
- mod_q_vect .req q4 /* vmulh requires vector operand */
-
- ldrd mod_q, mod_q_inv, [r2]
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- tw1 .req r10
- tw2 .req r11
- tw3 .req r12
-
- l_a .req q0
- l_b .req q1
-
- res_lo .req q2
- res_hi .req q3
-
- in_A .req r0
- in_B .req r1
- dst .req r2
-
- /* Input A */
- vldrw.u32 l_b, [in_B], #+16
- vmov tw1, tw3, l_b[3], l_b[1]
- vldrw.u32 l_a, [in_A], #+16
-
- /* Assume b-input is already reversed */
-
- /* Extract second half of twisted b into GPRs */
-
- vmov.s32 tw2, l_b[2]
-
- res3_lo .req r4
- res3_hi .req r5
- res2_lo .req r6
- res2_hi .req r7
-
- /* TODO: - * For twisted multiplication, add Montgomery multiplication here. - * Adds 3 instructions. */ - - /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */ - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b - - /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */ - vshlc l_b, tw3, #32 - .unreq tw3 - - /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */ - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b - - /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */ - vshlc l_b, tw2, #32 - .unreq tw2 - - res1_lo .req r8 - res1_hi .req r9 - - /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */ - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b - - /* Move low and high results into result vector */ - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - - res0_lo .req r8 - res0_hi .req r9 - - /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */ - vshlc l_b, tw1, #32 - .unreq tw1 - - /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */ - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b - - /* PRELOAD FOR NEXT ITERATION? 
*/ - - /* Move low results into result vector */ - vmov res_lo[2], res_lo[0], res2_lo, res0_lo - - /* Montgomery 1 */ - vmul.u32 res_lo, res_lo, mod_q_inv - /* Move high results into result vector */ - vmov res_hi[2], res_hi[0], res2_hi, res0_hi - /* Montgomery 2 */ - vmulh.s32 res_lo, res_lo, mod_q_vect - /* Montgomery 3 */ - vsub.s32 res_hi, res_hi, res_lo - - /* Store results */ - vstrw.s32 res_hi, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq mod_q_inv - .unreq mod_q_vect - -.type cyclic_mul_deg4_u32_mve, %function -.global cyclic_mul_deg4_u32_mve -cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #0x0F0F - vmsr p0, r10 - - mod_q .req r10 - mod_q_inv .req r9 - - ldr mod_q, [r2,#0] /* Modulus */ - ldr mod_q_inv, [r2,#4] - - l_a0 .req q1 - l_a1 .req q2 - l_b0 .req q3 - l_b1 .req q4 - - r_a0 .req q0 - r_a1 .req q1 - r_b0 .req q2 - r_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */ - vld20.u32 {l_a0,l_a1}, [in_A] - vld21.u32 {l_a0,l_a1}, [in_A]! - - /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */ - vld20.u32 {l_b0,l_b1}, [in_B] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Compute product in two vectors q4, q5 */ - - /* Can use q6, q7 for temporary data; need at least - * one temporary vector per subproduct. */ - - /* - * Ballpark estimates: - * - 4 = 2x2 VLD2x to load current polynomials - * - 2 = 2x VST2x to store result - * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form - * - 16 = 4x4 Vector Multiplications, 4 per subproduct - * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction - * In fact, use VSUB for first time each target vector is - * used, and VHSUB for the second time. - * - 2 = 2x VCADD for interpolation of result -- - * Note that we don't need to do this in every - * subproduct. - * - * Total: 32 instructions - * - * Pretty promising... 
if it pipelines well and we have enough - * vector registers. - */ - - /* Transform input into evaluated form */ - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 - - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 - - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 - - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 - - /* Subproduct 1: a0*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1 - * - Temporary allocations: 1 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - /* - * OPTIMIZATION: - * - * - We have two free vectors at this point -- - * could use this for a late store of the results - * of a previous iteration, residing in {q6, q7}. - * - * - Perform a late evaluation of r_a0, r_b1 here. - * - */ - - dst1 .req q5 - tmp .req q4 - - vmul.u32 tmp, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - vqdmulh.s32 dst1, r_a0, r_b1 /* Initialize dst1 with high part */ - vmul.u32 tmp, tmp, r_b1 /* Twisted low product */ - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq tmp - - /* Subproduct 2: a1*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1 - * - Temporary allocations: 2 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - tmp0 .req q6 - tmp1 .req q4 - - vqdmulh.s32 tmp1, r_a1, r_b0 /* Write high-product into temporary */ - vmul.u32 tmp0, q1, mod_q_inv /* Twist one factor using temporary tmp */ - vmul.u32 tmp0, tmp0, r_b0 /* Twisted low product */ - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst1, tmp1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. 
- * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 - .unreq tmp1 - - /* Finalize dst1 */ - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 - - /* Subproduct 3: a1*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1_final - * - Temporary allocations: 0 - * - Final allocations: a0, b0, dst1_final, dst0 - */ - - dst0 .req q4 - - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */ - vmul.u32 r_a1, r_a1, r_b1 /* Twisted low product */ - - .unreq r_b1 - - vqdmulh.s32 r_a1, r_a1, mod_q /* High product */ - vsub.s32 dst0, r_a1, dst0 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq r_a1 - - vpst - vnegt.s32 dst0, dst0 - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, b0, dst1_final, dst0 - * - Temporary allocations: 1 - * - Final allocations: dst1_final, dst0 - */ - - tmp .req q5 - - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - vmul.u32 r_a0, r_a0, r_b0 /* Twisted low product */ - - .unreq r_b0 - - vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */ - vqdmlah.s32 dst0, r_a0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst0, tmp, dst0 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp - - /* Finalize dst0 */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
- .unreq dst0_final - .unreq dst1_final - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq r_a0 - -.type cyclic_mul_deg4_u32_alt_mve, %function -.global cyclic_mul_deg4_u32_alt_mve -cyclic_mul_deg4_u32_alt_mve: - push {r4-r12,lr} - vpush {d0-d15} - - l_a0 .req q0 - l_a1 .req q1 - l_b0 .req q2 - l_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - cnt .req r4 - - dst0_last_final .req q6 - dst1_last_final .req q7 - - mod_q .req r10 - mod_q_inv .req r9 - pred_helper .req r8 - - vld20.u32 {l_a0,l_a1}, [in_A] - mov pred_helper, #0x0F0F - vld21.u32 {l_a0,l_a1}, [in_A]! - vmsr p0, pred_helper - - vld20.u32 {l_b0,l_b1}, [in_B] - ldr mod_q_inv, [r2,#4] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), - * dst0 (q3) - */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - ldr mod_q, [r2,#0] - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Montgomery twist */ - mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of - * loop counter */ - vqdmulh.s32 tmp1, tmp1, mod_q /* Montgomery high product fix */ - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initial high product */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initial high product */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Fix high product */ - /* Defer halving for later */ - /* Store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Montgomery low product twist */ - - vpst 
- vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* Montgomery high product fix */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Montgomery twist */ - - /* Subproduct 3: a0*b1 */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp // q1 - - /* - * Vector register allocation state: - * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * - Temporary allocations: 1 (q5) - * - Final allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - vmul.u32 tmp0, tmp0, r_b1 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q5 - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. 
*/ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - - /* LOAD r_a1 into q5 here..., - * freeing up q1 as a temporary */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */ - vmul.u32 tmp0, r_a0, r_b0 /* Twisted low product */ - /* Can overwrite rb0 now */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - - vmul.u32 tmp0, tmp0, mod_q_inv - - l_b0 .req q2 - l_b1 .req q3 - /* Preload for next iteration */ - vld20.u32 {l_b0,l_b1}, [in_B] - - vqdmlah.s32 dst0, tmp0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q1 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp // q4 - .unreq dst0_old - - vld21.u32 {l_b0,l_b1}, [in_B]! 
- - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - nop - wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end - -cyclic_mul_deg4_u32_alt_mve_loop_start: - - nop - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3) - */ - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Twisted low product */ - - vst20.s32 {dst0_last_final,dst1_last_final}, [dst] - - vqdmulh.s32 tmp1, tmp1, mod_q /* High product */ - - vst21.s32 {dst0_last_final,dst1_last_final}, [dst]! - .unreq dst0_last_final // q6 - .unreq dst1_last_final // q7 - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initialize dst1 with high part */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Twisted low product */ - - vpst - vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of 
result */ - .unreq tmp // q1 - - /* Subproduct 3: a0*b1 - * - * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) - * T: 1 (q5) - * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1) - */ - - tmp1 .req q0 - vmul.u32 tmp1, tmp0, r_b1 - - - vqdmlah.s32 dst1, tmp1, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp1 // q0 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vmul.u32 tmp, tmp0, r_b0 /* Twisted low product */ - .unreq tmp0 - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - vqdmulh.s32 tmp0, r_a0, r_b0 /* Write high-product into temporary */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - vqdmlah.s32 dst0, tmp, mod_q /* High product, accumulate onto tmp, - * 
which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp // q4 - - /* Preload for next iteration */ - l_b0 .req q2 - l_b1 .req q3 - vld20.u32 {l_b0,l_b1}, [in_B] - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp0, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 // q1 - .unreq dst0_old - - /* Preload for next iteration */ - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - le lr, cyclic_mul_deg4_u32_alt_mve_loop_start - -cyclic_mul_deg4_u32_alt_mve_loop_end: - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a0 - .unreq l_b0 - .unreq l_b1 - .unreq r_a1 - - .unreq cnt - -.type montgomery_pt_u32_odd_mve, %function -.global montgomery_pt_u32_odd_mve -montgomery_pt_u32_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 4) - - ldr mod_q, [in_B], #+4 /* Modulus */ - ldr mod_q_inv, [in_B], #+4 /* Inverse */ - - wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end - -montgomery_pt_u32_odd_mve_loop_start: - - vldrw.s32 l_a, [in_A], #+16 - vmul.u32 l_at, l_a, mod_q_inv - vldrw.s32 l_b, [in_B], #+16 - vqrdmulh.s32 tmp0, l_a, l_b - vmul.u32 tmp1, l_at, l_b - vqrdmlah.s32 tmp0, tmp1, mod_q - vstrw.s32 tmp0, [dst], #+16 - - le lr, montgomery_pt_u32_odd_mve_loop_start - -montgomery_pt_u32_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_u32_mve, %function -.global montgomery_pt_u32_mve -.align 4 -montgomery_pt_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, 
montgomery_pt_u32_mve_loop_end - -montgomery_pt_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_u32_mve_loop_start - -montgomery_pt_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. 
*/ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_acc_u32_mve, %function -.global montgomery_pt_acc_u32_mve -.align 4 -montgomery_pt_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - old .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end - -montgomery_pt_acc_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store-accumulate from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_acc_u32_mve_loop_start - -montgomery_pt_acc_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - 
- /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. */ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_round_acc_u32_mve, %function -.global montgomery_pt_round_acc_u32_mve -.align 4 -montgomery_pt_round_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - oldA .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - oldB .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 8) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b 
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_B, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_b, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov 
cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end - 
-montgomery_pt_round_acc_u32_x4_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - le cnt, 
montgomery_pt_round_acc_u32_x4_mve_loop_start - -montgomery_pt_round_acc_u32_x4_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - 
vstrw.s32 oldA, [dst2], #+16 - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_A2 - .unreq in_A3 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq dst2 - .unreq dst3 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - - -.type montgomery_pt_u16_odd_mve, %function -.global montgomery_pt_u16_odd_mve -montgomery_pt_u16_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 8) - - ldrh mod_q, [in_B], #+2 /* Modulus */ - ldrh mod_q_inv, [in_B], #+2 /* Inverse */ - - wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end - -montgomery_pt_u16_odd_mve_loop_start: - - vldrh.s16 l_a, [in_A], #+16 - vmul.u16 l_at, l_a, mod_q_inv - vldrh.s16 l_b, [in_B], #+16 - vqrdmulh.s16 tmp0, l_a, l_b - vmul.u16 tmp1, l_at, l_b - vqrdmlah.s16 tmp0, tmp1, mod_q - vstrh.s16 tmp0, [dst], #+16 - - le lr, montgomery_pt_u16_odd_mve_loop_start - -montgomery_pt_u16_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -#if defined(MODULUS_Q16) - -.type montgomery_u16_core_mve, %function -.global montgomery_u16_core_mve -montgomery_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(-MODULUS_Q16) /* Modulus */ - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0] - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - /* High product */ - vqdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqdmulh.s16 q0, q0, r10 - vsub.s16 q1, q1, q0 - - /* Store result */ - vstrh.s16 q1, [r2] 
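The comments in the kernels above describe one recurring pattern: a "high multiply" via vqrdmulh, a "twisted low multiply" of the input against the precomputed Montgomery constant, and a "correction term" folded in against the modulus. A minimal scalar Python model of that rounded Montgomery multiplication is sketched below; it is illustrative only. The subtractive sign convention (equivalent to accumulating with a negated modulus, as montgomery_u16_core_mve does with -MODULUS_Q16), the 32-bit lane width, and the example modulus 33564673 (taken from the ntt-1024 test elsewhere in this series) are assumptions, and vqrdmulh saturation is not modeled.

```python
# Scalar model of the rounded Montgomery multiplication pattern used above:
# high product (vqrdmulh), twisted low product, correction against the modulus.
# Sign convention (subtractive form) and lane width (32 bits) are assumptions.

def rdmulh(x, y, bits=32):
    # Rounded doubling high half, as computed by vqrdmulh (saturation ignored).
    return (2 * x * y + (1 << (bits - 1))) >> bits

def montmul_round(a, b, q, bits=32):
    # Returns a small representative congruent to a*b*2^-(bits-1) mod q.
    R = 1 << bits
    q_inv = pow(q, -1, R)           # Montgomery constant q^-1 mod 2^bits
    a_t = (a * q_inv) % R           # "twist a"
    t = (a_t * b) % R               # twisted low product: t*q == a*b (mod 2^bits)
    if t >= R // 2:                 # reinterpret the low bits as signed
        t -= R
    # High product minus correction term; the 2^bits-divisible part survives.
    return rdmulh(a, b, bits) - rdmulh(t, q, bits)

q = 33564673                        # example modulus (assumption, from ntt-1024)
r = montmul_round(1234567, 7654321, q)
assert (r - 1234567 * 7654321 * pow(2, -31, q)) % q == 0
```

The cancellation is exact: since t*q agrees with a*b modulo 2^32, the difference of the two rounded high halves equals 2*(a*b - t*q)/2^32, which is congruent to a*b*2^-31 mod q.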
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.type montgomery_u16_round_mve, %function -.global montgomery_u16_round_mve -montgomery_u16_round_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - mov r10, #(-3329) /* Modulus */ - mov r8, #8 /* Iterations */ - - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - wls lr, r8, montgomery_u16_round_mve_loop_end -montgomery_u16_round_mve_loop_start: - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0], #16 - - /* High product */ - vqrdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqrdmlah.s16 q1, q0, r10 - - /* Store result */ - vstrh.s16 q1, [r2], #16 - - le lr, montgomery_u16_round_mve_loop_start -montgomery_u16_round_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type cyclic_mul_u16_core_mve, %function -.global cyclic_mul_u16_core_mve -cyclic_mul_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Load polynomials to multiply - * - * Lanes come in pairs representing real and imaginary parts. - */ - vldrh.s16 q0, [r0] - vldrh.s16 q1, [r1] - - /* Step 1: - * - * Apply evaluation at -1, +1: - * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1) - * - * Concretely: - * (a,b) |-> (a-b, a+b) - * - * This can be implemented as a rotate-and-add - * operation, treating (a,b) as a complex number - * a+bi, and noticing that a rotation by 90 - * gives i(a+bi) = -b + ai, so - * a+bi + i(a+bi) = (a-b) + (a+b)i - * - * This rotate-90-and-add is a single - * instruction in MVE. 
- */ - vcadd.i16 q0, q0, q0, #90 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - - /* Montgomery multiplications - * - * 1x mul-high - * 1x mul-low - * 1x mul-high - * 1x subtract - * - * Needs 1x free temporary vector register - */ - vqdmulh.s16 q0, q0, q1 - vmul.u16 q1, q2, q1 - /*vmul.u16 q0, q0, r9*/ - vqdmulh.s16 q1, q1, r10 - /* Now we've actually computed twice the desired result, - * but we can compensate by using vhsub */ - vhsub.s16 q0, q0, q1 - - /* - * Finally, interpolation step: - * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y) - * - * This can be done as a single VCHADD, with - * rotate by 270: -i(a+bi) = b - ai - * - * We can't naively use vhcadd here because the - * multiplication by 1/2 is modulo q. - */ - vcadd.s16 q0, q0, q0, #270 - - vstrh.s16 q0, [r2] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -.type cyclic_mul_u16_mve, %function -.global cyclic_mul_u16_mve -cyclic_mul_u16_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Number of inner iterations */ - mov r4, #(VECTOR_LENGTH/16 - 1) - - vldrh.s16 q0, [r0], #16 - vcadd.i16 q0, q0, q0, #90 - vldrh.s16 q1, [r1], #16 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vstrh.s16 q4, [r2] - vmul.u16 q1, q2, q1 - vldrh.s16 q3, [r0], #16 - vqdmulh.s16 q1, q1, r10 - vcadd.i16 q3, q3, q3, #90 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - wls lr, r4, cyclic_mul_u16_loop_end -cyclic_mul_u16_loop_start: - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vldrh.s16 q3, [r0], #16 - vmul.u16 q1, q2, q1 - vcadd.i16 q3, q3, q3, #90 - 
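The evaluation/interpolation trick explained in the comments of cyclic_mul_u16_core_mve above (evaluate at +/-1, multiply pointwise, interpolate with a modular halving) can be checked against schoolbook multiplication with a scalar model. Below is a minimal Python sketch; q = 3329 is an assumed example modulus, and the precomputed inverse of 2 plays the role of the modular 1/2 the comment mentions.

```python
# Scalar model of degree-2 cyclic multiplication via evaluation at +/-1:
# (a0,a1) -> (a0-a1, a0+a1), pointwise multiply, then interpolate with 1/2 mod q.
# q = 3329 is an illustrative assumption (any odd modulus works).

q = 3329
half = (q + 1) // 2  # inverse of 2 mod q, since q is odd

def cyclic_mul2(a, b):
    """Multiply a0 + a1*X by b0 + b1*X in Z_q[X]/(X^2 - 1)."""
    a0, a1 = a
    b0, b1 = b
    # Evaluation: the map (x0,x1) |-> (x0-x1, x0+x1) from the comment above
    am, ap = (a0 - a1) % q, (a0 + a1) % q
    bm, bp = (b0 - b1) % q, (b0 + b1) % q
    # Pointwise multiplication in k[X]/(X+1) x k[X]/(X-1)
    cm, cp = (am * bm) % q, (ap * bp) % q
    # Interpolation: c0 = (cm+cp)/2, c1 = (cp-cm)/2, both mod q
    return ((cm + cp) * half % q, (cp - cm) * half % q)

def cyclic_mul2_ref(a, b):
    """Schoolbook reference: X^2 wraps around to 1."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 + a1 * b1) % q, (a0 * b1 + a1 * b0) % q)

assert cyclic_mul2((3, 5), (7, 11)) == cyclic_mul2_ref((3, 5), (7, 11))
```

The vectorized kernel fuses the evaluation step into a single vcadd with rotation #90 per operand and the interpolation into a vcadd with rotation #270, exactly as the comments describe; the halving is absorbed into the vhsub of the Montgomery step.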
vqdmulh.s16 q1, q1, r10 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - le lr, cyclic_mul_u16_loop_start -cyclic_mul_u16_loop_end: - - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - -.type cyclic_mul_u16_multi_naive_mve, %function -.global cyclic_mul_u16_multi_naive_mve -cyclic_mul_u16_multi_naive_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vpop {d0-d15} - pop {r4-r12, lr} - bx lr - -#endif /* MODULUS_Q16 */ - -#if defined(MODULUS_Q32) - -.type cyclic_mul_u32_mve, %function -.global cyclic_mul_u32_mve -cyclic_mul_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - movw r10, #:lower16:MODULUS_Q32 - movt r10, #:upper16:MODULUS_Q32 - - ldr r9, [r2] - mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */ - wls lr, 
r3, cyclic_mul_u32_loop_end -cyclic_mul_u32_loop_start: - vldrw.s32 q1, [r0], #16 - vcadd.i32 q0, q1, q1, #90 - vldrw.s32 q2, [r1], #16 - vcadd.i32 q1, q2, q2, #90 - vqdmulh.s32 q2, q0, q1 - vmul.u32 q0, q0, r9 - vmul.u32 q1, q0, q1 - vqdmulh.s32 q1, q1, r10 - vhsub.s32 q2, q2, q1 - vcadd.s32 q1, q2, q2, #270 - vstrw.s32 q1, [r2], #16 - le lr, cyclic_mul_u32_loop_start -cyclic_mul_u32_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -#endif /* MODULUS_Q32 */ diff --git a/tests/ntt-1024/ntt-1024.mk b/tests/ntt-1024/ntt-1024.mk index f83721c..720ebac 100644 --- a/tests/ntt-1024/ntt-1024.mk +++ b/tests/ntt-1024/ntt-1024.mk @@ -12,7 +12,7 @@ NTT_1024_SOURCES += main.c # Assembly sources required for this test NTT_1024_ASM_DIR = ../../asm/auto/ntt_1024 -NTT_1024_ASMS += montgomery.s +NTT_1024_ASMS += ../../asm/manual/montgomery/montgomery.s NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_complete.s NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_bitrev_skipfirst.s NTT_1024_ASMS += $(NTT_1024_ASM_DIR)/ntt_1024_u32_33564673_286215_incomplete_bitrev.s diff --git a/tests/ntt-192/montgomery.s b/tests/ntt-192/montgomery.s deleted file mode 100644 index 196b8a6..0000000 --- a/tests/ntt-192/montgomery.s +++ /dev/null @@ -1,3647 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#include "montgomery_const.h" - - .syntax unified - -.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_acc_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_acc_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 /* vmulh requires vector operand */ - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - dst_vect .req q5 // Overlapping with res_lo - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], 
res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start: - - vldrw.s32 dst_vect, [dst] - vadd.s32 res_hi, res_hi, dst_vect - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi_old, res_hi_old, l_b0 - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi, res_hi, l_b0 - vstrw.s32 res_hi, 
[dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq res_lo - .unreq res_hi - - .unreq dst_vect - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq loop_cnt - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - -.type twisted_cyclic_mul_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, 
#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_deg4_u32_mve_alt_loop_start: - - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_mve_expand, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand 
-.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - twiddle .req r4 - twiddle_twisted .req r5 - - q_off_rev .req q0 - q_in .req q1 - tmp .req q3 - res .req q2 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - mod_q .req r3 - - consts .req r4 - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - - mov loop_cnt, #(VECTOR_LENGTH/4-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - add.w src, src, #+16 - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - le loop_cnt, 1b -2: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in - .unreq tmp - .unreq res - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q6 - tmp .req q2 - resA .req q4 - resB .req q5 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - 
twiddle_fix_ptr .req r3 - - consts .req r7 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(10*4 + 8*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 resB, q_in0, twiddle - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, 
#+16 - vqrdmlah.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett: - push {r4-r11,lr} - vpush {d8-d11} - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - consts .req r7 - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - loop_cnt .req r14 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q5 - tmp .req q2 - resA .req q3 - resB .req q4 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(9*4 + 2*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.s32 resB, q_in0, twiddle - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - .align 2 - wls loop_cnt, loop_cnt, 2 -1: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - 
-        vmul.s32 resA, q_in1, twiddle_fix
-        ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
-        vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
-        add.w src, src, #+16
-        vmla.s32 resA, tmp, mod_q
-        vstrw.32 resB, [dst, #+16]
-        vmul.s32 resB, q_in0, twiddle
-        vstrw.32 resA, [dst], #+32
-        vqrdmulh.s32 tmp, q_in0, twiddle_twisted
-        vmla.s32 resB, tmp, mod_q
-        vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
-        vmul.s32 resA, q_in0, twiddle_fix
-        ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
-        vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted
-        add.w src, src, #+16
-        vmla.s32 resA, tmp, mod_q
-        le loop_cnt, 1b
-2:
-        vstrw.32 resB, [dst, #+16]
-        vmul.s32 resB, q_in1, twiddle
-        vstrw.32 resA, [dst], #+32
-        vqrdmulh.s32 tmp, q_in1, twiddle_twisted
-        vmla.s32 resB, tmp, mod_q
-        vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
-        vmul.s32 resA, q_in1, twiddle_fix
-        ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
-        vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
-        add.w src, src, #+16
-        vmla.s32 resA, tmp, mod_q
-        vstrw.32 resB, [dst, #+16]
-        vstrw.32 resA, [dst], #+32
-
-        vpop {d8-d11}
-        pop {r4-r11,pc}
-
-        .unreq loop_cnt
-        .unreq mod_q
-        .unreq twiddle
-        .unreq twiddle_twisted
-        .unreq q_off_rev
-        .unreq q_in0
-        .unreq q_in1
-        .unreq tmp
-        .unreq resA
-        .unreq resB
-        .unreq dst
-        .unreq src
-        .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_simple, %function
-.global twisted_cyclic_mul_deg4_u32_mve_simple
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_simple:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mod_q .req r3
-        mod_q_inv .req r12
-        mod_q_vect .req q7
-
-        l_a .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        res_lo .req q5
-        res_hi .req q6
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-        params .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/4))
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        wls loop_cnt, loop_cnt, 2
-1:
-        vldrw.u32 l_a, [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vldrw.u32 l_b2, [in_B, #(-32 + 4 )]
-        vldrw.u32 l_b1, [in_B, #(-32 + 8 )]
-        vldrw.u32 l_b0, [in_B, #(-32 + 12)]
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi, res_hi, res_lo
-        vstrw.s32 res_hi, [dst], #+16
-        le loop_cnt, 1b
-2:
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_mve
-twisted_cyclic_mul_deg4_u32_add_sub_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q .req r11
-        mod_q_inv .req r12
-        mod_q_vect .req q7
-
-        l_a .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q4 //q3
-        l_b0 .req q4
-
-        res_lo .req q5
-        res_hi0 .req q6
-        res_hi1 .req q1
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-        params .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_a, [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        // From this point onwards, l_b3 and l_b2 are never used
-        // at the same time. Use the same register for them
-        .unreq l_b3
-        l_b3 .req l_b2
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        .align 2
-
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Add/sub with result from previous iteration
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2 // Currently unused
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-        vsub.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi0
-        .unreq res_hi1
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr:
-        .byte 3*4
-        .byte 2*4
-        .byte 1*4
-        .byte 0*4
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q .req r11
-        mod_q_inv .req r12
-        mod_q_vect .req q7
-
-        q_rev .req q3
-
-        l_a .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q4 //q3
-        l_b0 .req q4
-
-        res_lo .req q5
-        res_hi0 .req q6
-        res_hi1 .req q1
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-        params .req r3
-
-        tmp .req r5
-        adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr
-        vldrb.u32 q_rev, [tmp]
-        vadd.u32 q_rev, q_rev, in_A
-        .unreq tmp
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vldrw.u32 l_a, [q_rev]
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        // From this point onwards, l_b3 and l_b2 are never used
-        // at the same time. Use the same register for them
-        .unreq l_b3
-        l_b3 .req l_b2
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        .align 2
-
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Add/sub with result from previous iteration
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2 // Currently unused
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-        vsub.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi0
-        .unreq res_hi1
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve
-.align 4
-twisted_cyclic_mul_deg4_u32_add_sub_split_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q .req r11
-        mod_q_inv .req r12
-        mod_q_vect .req q7
-
-        l_a .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        res_lo .req q5
-        res_hi .req q6
-        res_old .req q5 // Overlaps with res_lo deliberately
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-        dst_h .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        add dst_h, dst, #(4*VECTOR_LENGTH/2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        ldr tmp_params, [sp, #(10*4 + 8*16 + 16)]
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_a, [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vstrw.s32 res_hi, [sp]
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vldrw.s32 res_old, [sp]
-        tmp .req q1 // == l_b3 (currently unused)
-        vadd.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst_h], #+16
-        .unreq tmp
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vstrw.s32 res_hi, [sp]
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-
-        // Add/sub with result from previous iteration
-        vldrw.s32 res_old, [sp]
-        tmp .req q1 // == l_b3 (currently unused)
-        vadd.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst], #16
-        vsub.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst_h], #16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        /* Defer storing of last result */
-        .unreq res_old
-        res_old .req q6
-        .unreq res_hi
-        .unreq l_b1
-        res_hi .req q3
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi, res_hi, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_old, res_hi
-        vstrw.32 mod_q_vect, [dst], #16
-        vsub.s32 mod_q_vect, res_old, res_hi
-        vstrw.32 mod_q_vect, [dst_h], #16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi
-        .unreq res_old
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function
-.global twisted_cyclic_mul_deg4_u32_long_mve_v1
-.align 4
-twisted_cyclic_mul_deg4_u32_long_mve_v1:
-        push {r4-r11,lr}
-        vpush {d0-d9}
-
-        l_a .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/4))
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r6
-        res1_hi .req r7
-        res2_lo .req r8
-        res2_hi .req r9
-        res0_lo .req r10
-        res0_hi .req r11
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vldrw.u32 l_a, [in_A], #+16       /* (a0, a1, a2, a3) */
-
-        vldrw.u32 l_b3, [in_B], #+32      /* (b3, b2, b1, b0) */
-        vldrw.u32 l_b0, [in_B,#(-32+3*4)] /* (b0, zeta*b3, zeta*b2, zeta*b1) */
-        vldrw.u32 l_b1, [in_B,#(-32+2*4)] /* (b1, b0, zeta*b3, zeta*b2) */
-        vldrw.u32 l_b2, [in_B,#(-32+1*4)] /* (b2, b1, b0, zeta*b3) */
-
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-
-        //strd res0_lo, res1_lo, [dst], #8
-        //strd res2_lo, res3_lo, [dst], #8
-        //strd res0_hi, res1_hi, [dst], #8
-        //strd res2_hi, res3_hi, [dst], #8
-
-        strd res0_lo, res0_hi, [dst], #8
-        strd res1_lo, res1_hi, [dst], #8
-        strd res2_lo, res2_hi, [dst], #8
-        strd res3_lo, res3_hi, [dst], #8
-
-        le loop_cnt, 1b
-2:
-
-        vpop {d0-d9}
-        pop {r4-r11,lr}
-
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res0_lo
-        .unreq res0_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res3_lo
-        .unreq res3_hi
-
-.type twisted_cyclic_mul_deg4_u32_mve, %function
-.global twisted_cyclic_mul_deg4_u32_mve
-twisted_cyclic_mul_deg4_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        /* Preparation -- amortizes when looping */
-
-        mod_q .req r12
-        mod_q_inv .req r14
-        mod_q_vect .req q4 /* vmulh requires vector operand */
-
-        ldrd mod_q, mod_q_inv, [r2]
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        tw1 .req r10
-        tw2 .req r11
-        tw3 .req r12
-
-        l_a .req q0
-        l_b .req q1
-
-        res_lo .req q2
-        res_hi .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-
-        /* Input A */
-        vldrw.u32 l_b, [in_B], #+16
-        vmov tw1, tw3, l_b[3], l_b[1]
-        vldrw.u32 l_a, [in_A], #+16
-
-        /* Assume b-input is already reversed */
-
-        /* Extract second half of twisted b into GPRs */
-
-        vmov.s32 tw2, l_b[2]
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res2_lo .req r6
-        res2_hi .req r7
-
-        /* TODO:
-         * For twisted multiplication, add Montgomery multiplication here.
-         * Adds 3 instructions. */
-
-        /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b
-
-        /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */
-        vshlc l_b, tw3, #32
-        .unreq tw3
-
-        /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b
-
-        /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */
-        vshlc l_b, tw2, #32
-        .unreq tw2
-
-        res1_lo .req r8
-        res1_hi .req r9
-
-        /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b
-
-        /* Move low and high results into result vector */
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-
-        res0_lo .req r8
-        res0_hi .req r9
-
-        /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */
-        vshlc l_b, tw1, #32
-        .unreq tw1
-
-        /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b
-
-        /* PRELOAD FOR NEXT ITERATION? */
-
-        /* Move low results into result vector */
-        vmov res_lo[2], res_lo[0], res2_lo, res0_lo
-
-        /* Montgomery 1 */
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        /* Move high results into result vector */
-        vmov res_hi[2], res_hi[0], res2_hi, res0_hi
-        /* Montgomery 2 */
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        /* Montgomery 3 */
-        vsub.s32 res_hi, res_hi, res_lo
-
-        /* Store results */
-        vstrw.s32 res_hi, [dst], #+16
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-        .unreq l_a
-        .unreq l_b
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type cyclic_mul_deg4_u32_mve, %function
-.global cyclic_mul_deg4_u32_mve
-cyclic_mul_deg4_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mov r10, #0x0F0F
-        vmsr p0, r10
-
-        mod_q .req r10
-        mod_q_inv .req r9
-
-        ldr mod_q, [r2,#0] /* Modulus */
-        ldr mod_q_inv, [r2,#4]
-
-        l_a0 .req q1
-        l_a1 .req q2
-        l_b0 .req q3
-        l_b1 .req q4
-
-        r_a0 .req q0
-        r_a1 .req q1
-        r_b0 .req q2
-        r_b1 .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-
-        /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */
-        vld20.u32 {l_a0,l_a1}, [in_A]
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-
-        /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */
-        vld20.u32 {l_b0,l_b1}, [in_B]
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Compute product in two vectors q4, q5 */
-
-        /* Can use q6, q7 for temporary data; need at least
-         * one temporary vector per subproduct. */
-
-        /*
-         * Ballpark estimates:
-         * - 4 = 2x2 VLD2x to load current polynomials
-         * - 2 = 2x VST2x to store result
-         * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form
-         * - 16 = 4x4 Vector Multiplications, 4 per subproduct
-         * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction
-         *   In fact, use VSUB for first time each target vector is
-         *   used, and VHSUB for the second time.
-         * - 2 = 2x VCADD for interpolation of result --
-         *   Note that we don't need to do this in every
-         *   subproduct.
-         *
-         * Total: 32 instructions
-         *
-         * Pretty promising... if it pipelines well and we have enough
-         * vector registers.
-         */
-
-        /* Transform input into evaluated form */
-        vcadd.i32 r_a0, l_a0, l_a0, #90
-        .unreq l_a0
-
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1
-
-        vcadd.i32 r_b0, l_b0, l_b0, #90
-        .unreq l_b0
-
-        vcadd.i32 r_b1, l_b1, l_b1, #90
-        .unreq l_b1
-
-        /* Subproduct 1: a0*b1 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, a1, b0, b1
-         * - Temporary allocations: 1
-         * - Final allocations: a0, a1, b0, b1, dst1
-         */
-
-        /*
-         * OPTIMIZATION:
-         *
-         * - We have two free vectors at this point --
-         *   could use this for a late store of the results
-         *   of a previous iteration, residing in {q6, q7}.
-         *
-         * - Perform a late evaluation of r_a0, r_b1 here.
-         *
-         */
-
-        dst1 .req q5
-        tmp .req q4
-
-        vmul.u32 tmp, r_a0, mod_q_inv /* Twist one factor using temporary tmp */
-        vqdmulh.s32 dst1, r_a0, r_b1  /* Initialize dst1 with high part */
-        vmul.u32 tmp, tmp, r_b1       /* Twisted low product */
-        vqdmulh.s32 tmp, tmp, mod_q   /* High product */
-        vsub.s32 dst1, tmp, dst1      /* Correct high product */
-                                      /* Defer halving for later */
-                                      /* Actually store _negative_ of result */
-
-        .unreq tmp
-
-        /* Subproduct 2: a1*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, a1, b0, b1, dst1
-         * - Temporary allocations: 2
-         * - Final allocations: a0, a1, b0, b1, dst1
-         */
-
-        tmp0 .req q6
-        tmp1 .req q4
-
-        vqdmulh.s32 tmp1, r_a1, r_b0  /* Write high-product into temporary */
-        vmul.u32 tmp0, q1, mod_q_inv  /* Twist one factor using temporary tmp */
-        vmul.u32 tmp0, tmp0, r_b0     /* Twisted low product */
-        vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1,
-                                       * which stores the _negative_ of the
-                                       * subproduct 1. */
-        vhsub.s32 dst1, tmp1, dst1    /* Correct high product */
-                                      /* Late halving, encompassing also the
-                                       * first subproduct. */
-                                      /* Note that, so far, dst1 contained
-                                       * -pre + high_correct.
-                                       * After this step, it's
-                                       * high - ( -pre + high_correct )
-                                       * = pre + high - high_correct,
-                                       * which is what we want. */
-
-        .unreq tmp0
-        .unreq tmp1
-
-        /* Finalize dst1 */
-
-        dst1_final .req q7
-        vcadd.s32 dst1_final, dst1, dst1, #270
-        .unreq dst1
-
-        /* Subproduct 3: a1*b1 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, a1, b0, b1, dst1_final
-         * - Temporary allocations: 0
-         * - Final allocations: a0, b0, dst1_final, dst0
-         */
-
-        dst0 .req q4
-
-        vqdmulh.s32 dst0, r_a1, r_b1   /* Initialize dst0 with high part */
-        vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */
-        vmul.u32 r_a1, r_a1, r_b1      /* Twisted low product */
-
-        .unreq r_b1
-
-        vqdmulh.s32 r_a1, r_a1, mod_q  /* High product */
-        vsub.s32 dst0, r_a1, dst0      /* Correct high product */
-                                       /* Defer halving for later */
-                                       /* Actually store _negative_ of result */
-
-        .unreq r_a1
-
-        vpst
-        vnegt.s32 dst0, dst0
-
-        /* Subproduct 4: a0*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, b0, dst1_final, dst0
-         * - Temporary allocations: 1
-         * - Final allocations: dst1_final, dst0
-         */
-
-        tmp .req q5
-
-        vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */
-        vmul.u32 r_a0, r_a0, r_b0   /* Twisted low product */
-
-        .unreq r_b0
-
-        vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */
-        vqdmlah.s32 dst0, r_a0, mod_q  /* High product, accumulate onto tmp,
-                                        * which stores the _negative_ of the
-                                        * subproduct 1. */
-        vhsub.s32 dst0, tmp, dst0      /* Correct high product */
-                                       /* Late halving, encompassing also the
-                                        * first subproduct. */
-                                       /* Note that, so far, tmp contained
-                                        * -pre + high_correct.
-                                        * After this step, it's
-                                        * high - ( -pre + high_correct )
-                                        * = pre + high - high_correct,
-                                        * which is what we want. */
-
-        .unreq tmp
-
-        /* Finalize dst0 */
-        dst0_final .req q6
-        vcadd.s32 dst0_final, dst0, dst0, #270
-        .unreq dst0
-
-        /* Store results */
-        vst20.s32 {dst0_final, dst1_final}, [dst]
-        vst21.s32 {dst0_final, dst1_final}, [dst]!
-        .unreq dst0_final
-        .unreq dst1_final
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-        .unreq r_a0
-
-.type cyclic_mul_deg4_u32_alt_mve, %function
-.global cyclic_mul_deg4_u32_alt_mve
-cyclic_mul_deg4_u32_alt_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        l_a0 .req q0
-        l_a1 .req q1
-        l_b0 .req q2
-        l_b1 .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst .req r2
-
-        cnt .req r4
-
-        dst0_last_final .req q6
-        dst1_last_final .req q7
-
-        mod_q .req r10
-        mod_q_inv .req r9
-        pred_helper .req r8
-
-        vld20.u32 {l_a0,l_a1}, [in_A]
-        mov pred_helper, #0x0F0F
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-        vmsr p0, pred_helper
-
-        vld20.u32 {l_b0,l_b1}, [in_B]
-        ldr mod_q_inv, [r2,#4]
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Subproduct 1: a1*b1
-         *
-         * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2)
-         * T: tmp (q1)
-         * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2),
-         *    dst0 (q3)
-         */
-
-        r_a1 .req q5
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1 // q1
-
-        ldr mod_q, [r2,#0]
-
-        tmp .req q1
-        vmul.u32 tmp, r_a1, mod_q_inv
-
-        r_b1 .req q4
-        vcadd.i32 r_b1, l_b1, l_b1, #90
-        .unreq l_b1 // q3
-
-        tmp1 .req q3
-
-        vmul.u32 tmp1, tmp, r_b1        /* Montgomery twist */
-        mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of
-                                         * loop counter */
-        vqdmulh.s32 tmp1, tmp1, mod_q   /* Montgomery high product fix */
-
-        dst0 .req q6
-        vqdmulh.s32 dst0, r_a1, r_b1 /* Initial high product */
-
-        r_b0 .req q7
-        vcadd.i32 r_b0, l_b0, l_b0, #90
-        .unreq l_b0 // q2
-
-        /* Subproduct 2: a1*b0
-         *
-         * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3)
-         * T: 1 (q5)
-         * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2)
-         */
-
-        dst1 .req q2
-        vqdmulh.s32 dst1, r_a1, r_b0 /* Initial high product */
-        .unreq r_a1 // q5
-
-        dst0_old .req q6
-        .unreq dst0
-        dst0 .req q6
-
-        vsub.s32 dst0, tmp1, dst0_old /* Fix high product */
-                                      /* Defer halving for later */
-                                      /* Store _negative_ of result */
-        .unreq tmp1
-        .unreq dst0_old // q6
-
-        vmul.u32 tmp, tmp, r_b0 /* Montgomery low product twist */
-
-        vpst
-        vnegt.s32 dst0, dst0
-
-        vqdmulh.s32 tmp, tmp, mod_q /* Montgomery high product fix */
-
-        r_a0 .req q3
-        vcadd.i32 r_a0, l_a0, l_a0, #90
-        .unreq l_a0 // q0
-
-        tmp0 .req q5
-        vmul.u32 tmp0, r_a0, mod_q_inv /* Montgomery twist */
-
-        /* Subproduct 3: a0*b1 */
-
-        vsub.s32 dst1, tmp, dst1 /* Correct high product */
-                                 /* Defer halving for later */
-                                 /* Actually store _negative_ of result */
-        .unreq tmp // q1
-
-        /*
-         * Vector register allocation state:
-         * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1 (q2)
-         * - Temporary allocations: 1 (q5)
-         * - Final allocations: r_a0 (q3), r_b0 (q7),
-         *                      dst0 (q6), dst1 (q2)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        vmul.u32 tmp0, tmp0, r_b1
-
-        l_a0 .req q0
-        l_a1 .req q1
-        /* Preload for next iteration */
-        vld20.u32 {l_a0,l_a1}, [in_A]
-
-        vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1,
-                                       * which stores the _negative_ of the
-                                       * subproduct 1. */
-        .unreq tmp0 // q5
-
-        /* Preload for next iteration */
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-
-        vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */
-
-        vhsub.s32 dst1, r_b1, dst1 /* Correct high product */
-                                   /* Late halving, encompassing also the
-                                    * first subproduct. */
-                                   /* Note that, so far, dst1 contained
-                                    * -pre + high_correct.
-                                    * After this step, it's
-                                    * high - ( -pre + high_correct )
-                                    * = pre + high - high_correct,
-                                    * which is what we want. */
-
-        .unreq r_b1 // q4
-
-        /* Finalize dst1
-         *
-         * - Initial allocations: r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1 (q2)
-         *                        preloaded l_a0 (q0), l_a1 (q1)
-         * - Final allocations: r_a0 (q5), r_b0 (q7),
-         *                      dst0 (q3), dst1_final (q7)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        /* Subproduct 4: a0*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initial allocations: r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1_final (q7)
-         *                        preloaded l_a0 (q0), l_a1 (q1)
-         * - Temporary allocations: 1 (q4)
-         * - Final allocations: dst1_final (q7), dst0 (q4)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        tmp .req q4
-        vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */
-
-        /* LOAD r_a1 into q5 here...,
-         * freeing up q1 as a temporary */
-
-        r_a1 .req q5
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1 // q1
-
-        tmp0 .req q1
-        /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */
-        vmul.u32 tmp0, r_a0, r_b0 /* Twisted low product */
-        /* Can overwrite rb0 now */
-        .unreq r_a0 // q3
-        .unreq r_b0 // q7
-
-        dst1_final .req q7
-        vcadd.s32 dst1_final, dst1, dst1, #270
-        .unreq dst1 // q2
-
-
-        vmul.u32 tmp0, tmp0, mod_q_inv
-
-        l_b0 .req q2
-        l_b1 .req q3
-        /* Preload for next iteration */
-        vld20.u32 {l_b0,l_b1}, [in_B]
-
-        vqdmlah.s32 dst0, tmp0, mod_q /* High product, accumulate onto tmp,
-                                       * which stores the _negative_ of the
-                                       * subproduct 1. */
-        .unreq tmp0 // q1
-
-        dst0_old .req q6
-        .unreq dst0
-        dst0 .req q1
-        vhsub.s32 dst0, tmp, dst0_old /* Correct high product */
-                                      /* Late halving, encompassing also the
-                                       * first subproduct. */
-                                      /* Note that, so far, tmp contained
-                                       * -pre + high_correct.
-                                       * After this step, it's
-                                       * high - ( -pre + high_correct )
-                                       * = pre + high - high_correct,
-                                       * which is what we want. */
-
-        .unreq tmp // q4
-        .unreq dst0_old
-
-        vld21.u32 {l_b0,l_b1}, [in_B]!
- - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - nop - wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end - -cyclic_mul_deg4_u32_alt_mve_loop_start: - - nop - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3) - */ - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Twisted low product */ - - vst20.s32 {dst0_last_final,dst1_last_final}, [dst] - - vqdmulh.s32 tmp1, tmp1, mod_q /* High product */ - - vst21.s32 {dst0_last_final,dst1_last_final}, [dst]! - .unreq dst0_last_final // q6 - .unreq dst1_last_final // q7 - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initialize dst1 with high part */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Twisted low product */ - - vpst - vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of 
result */ - .unreq tmp // q1 - - /* Subproduct 3: a0*b1 - * - * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) - * T: 1 (q5) - * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1) - */ - - tmp1 .req q0 - vmul.u32 tmp1, tmp0, r_b1 - - - vqdmlah.s32 dst1, tmp1, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp1 // q0 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vmul.u32 tmp, tmp0, r_b0 /* Twisted low product */ - .unreq tmp0 - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - vqdmulh.s32 tmp0, r_a0, r_b0 /* Write high-product into temporary */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - vqdmlah.s32 dst0, tmp, mod_q /* High product, accumulate onto tmp, - * 
which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp // q4 - - /* Preload for next iteration */ - l_b0 .req q2 - l_b1 .req q3 - vld20.u32 {l_b0,l_b1}, [in_B] - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp0, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 // q1 - .unreq dst0_old - - /* Preload for next iteration */ - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - le lr, cyclic_mul_deg4_u32_alt_mve_loop_start - -cyclic_mul_deg4_u32_alt_mve_loop_end: - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
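The `vcadd #90` / `vcadd #270` pairs in the cyclic multiplication above implement an evaluation/interpolation strategy: evaluate both inputs at the roots of `X^2 - 1`, multiply pointwise, and interpolate back with a deferred halving. A minimal model of the degree-2 version of this trick (Python; the modulus is an illustrative choice, not taken from the source):

```python
# Reference model of multiplication in k[X]/(X^2 - 1) via
# evaluation at -1/+1 and interpolation, as in the MVE code above.
Q = 3329  # example modulus (assumption, any odd prime works)

def cyclic_mul_deg2(a, b):
    """(a0 + a1*X) * (b0 + b1*X) mod (X^2 - 1), coefficients mod Q."""
    # Evaluation: vcadd #90 maps (x0, x1) to (x0 - x1, x0 + x1)
    am, ap = (a[0] - a[1]) % Q, (a[0] + a[1]) % Q
    bm, bp = (b[0] - b[1]) % Q, (b[0] + b[1]) % Q
    # Pointwise multiplication in k[X]/(X+1) x k[X]/(X-1)
    cm, cp = (am * bm) % Q, (ap * bp) % Q
    # Interpolation with the deferred halving:
    #   c0 = (cm + cp)/2, c1 = (cp - cm)/2
    inv2 = pow(2, -1, Q)
    return ((cm + cp) * inv2 % Q, (cp - cm) * inv2 % Q)

# Schoolbook check:
# (a0 + a1*X)(b0 + b1*X) = (a0*b0 + a1*b1) + (a0*b1 + a1*b0)*X
a, b = (5, 7), (11, 13)
assert cyclic_mul_deg2(a, b) == \
    ((a[0]*b[0] + a[1]*b[1]) % Q, (a[0]*b[1] + a[1]*b[0]) % Q)
```

In the assembly the halving is fused into `vhsub`, and the final `vcadd #270` performs the interpolation rotate; the model above keeps the two steps separate for clarity.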
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a0 - .unreq l_b0 - .unreq l_b1 - .unreq r_a1 - - .unreq cnt - -.type montgomery_pt_u32_odd_mve, %function -.global montgomery_pt_u32_odd_mve -montgomery_pt_u32_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 4) - - ldr mod_q, [in_B], #+4 /* Modulus */ - ldr mod_q_inv, [in_B], #+4 /* Inverse */ - - wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end - -montgomery_pt_u32_odd_mve_loop_start: - - vldrw.s32 l_a, [in_A], #+16 - vmul.u32 l_at, l_a, mod_q_inv - vldrw.s32 l_b, [in_B], #+16 - vqrdmulh.s32 tmp0, l_a, l_b - vmul.u32 tmp1, l_at, l_b - vqrdmlah.s32 tmp0, tmp1, mod_q - vstrw.s32 tmp0, [dst], #+16 - - le lr, montgomery_pt_u32_odd_mve_loop_start - -montgomery_pt_u32_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_u32_mve, %function -.global montgomery_pt_u32_mve -.align 4 -montgomery_pt_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, 
montgomery_pt_u32_mve_loop_end - -montgomery_pt_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_u32_mve_loop_start - -montgomery_pt_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. 
*/ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_acc_u32_mve, %function -.global montgomery_pt_acc_u32_mve -.align 4 -montgomery_pt_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - old .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end - -montgomery_pt_acc_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store-accumulate from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_acc_u32_mve_loop_start - -montgomery_pt_acc_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - 
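The `vmul`/`vqdmulh`/`vhsub` pattern used throughout these routines is a signed Montgomery multiplication with deferred halving: one factor is "twisted" by `q^-1 mod 2^32`, both high products are taken doubled via `vqdmulh`, and the halving subtract recombines them exactly. A Python sketch of the lane arithmetic (the modulus and function names are ours, chosen for illustration):

```python
# Model of the per-lane arithmetic of the vqdmulh-based Montgomery
# multiplication above (saturation ignored, which is safe for
# inputs of this size).
Q = 3329                       # example odd modulus (assumption)
QINV = pow(Q, -1, 1 << 32)     # q^-1 mod 2^32

def sdmulh(a, b):
    """vqdmulh.s32: (2*a*b) >> 32."""
    return (2 * a * b) >> 32

def montgomery_mul(a, b):
    a_twisted = (a * QINV) & 0xFFFFFFFF      # vmul.u32: a*q^-1 mod 2^32
    hi = sdmulh(a, b)                        # doubled high product
    m = (a_twisted * b) & 0xFFFFFFFF         # twisted low product
    m = m - (1 << 32) if m >= (1 << 31) else m   # reinterpret as signed
    corr = sdmulh(m, Q)                      # doubled correction term
    return (hi - corr) >> 1                  # vhsub: halving subtract

# Since a*b = m*q (mod 2^32), the subtraction is exact and the
# result is congruent to a*b*2^-32 mod q.
assert (montgomery_mul(1234, 5678) - 1234*5678*pow(2, -32, Q)) % Q == 0
```

The key point mirrored from the assembly: `a*b - m*q` is divisible by `2^32`, so the two floor-shifted doubled products differ by exactly twice the Montgomery result, and the halving subtract recovers it without rounding error.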
- /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. */ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_round_acc_u32_mve, %function -.global montgomery_pt_round_acc_u32_mve -.align 4 -montgomery_pt_round_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - oldA .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - oldB .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 8) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b 
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_B, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_B, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov 
cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end - 
-montgomery_pt_round_acc_u32_x4_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - le cnt, 
montgomery_pt_round_acc_u32_x4_mve_loop_start - -montgomery_pt_round_acc_u32_x4_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - 
vstrw.s32 oldA, [dst2], #+16 - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_A2 - .unreq in_A3 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq dst2 - .unreq dst3 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - - -.type montgomery_pt_u16_odd_mve, %function -.global montgomery_pt_u16_odd_mve -montgomery_pt_u16_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 8) - - ldrh mod_q, [in_B], #+2 /* Modulus */ - ldrh mod_q_inv, [in_B], #+2 /* Inverse */ - - wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end - -montgomery_pt_u16_odd_mve_loop_start: - - vldrh.s16 l_a, [in_A], #+16 - vmul.u16 l_at, l_a, mod_q_inv - vldrh.s16 l_b, [in_B], #+16 - vqrdmulh.s16 tmp0, l_a, l_b - vmul.u16 tmp1, l_at, l_b - vqrdmlah.s16 tmp0, tmp1, mod_q - vstrh.s16 tmp0, [dst], #+16 - - le lr, montgomery_pt_u16_odd_mve_loop_start - -montgomery_pt_u16_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -#if defined(MODULUS_Q16) - -.type montgomery_u16_core_mve, %function -.global montgomery_u16_core_mve -montgomery_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(-MODULUS_Q16) /* Modulus */ - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0] - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - /* High product */ - vqdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqdmulh.s16 q0, q0, r10 - vsub.s16 q1, q1, q0 - - /* Store result */ - vstrh.s16 q1, [r2] 
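The "half of the even scalar" noted in the comments above compensates for the implicit doubling in `vqdmulh`: since `vqdmulh.s16` computes `(2*a*b) >> 16`, passing `b/2` for an even scalar `b` yields the plain high product `(a*b) >> 16`. A quick model (Python; the scalar value is illustrative):

```python
def vqdmulh_s16(a, b):
    """Per-lane model of vqdmulh.s16: (2*a*b) >> 16, saturation ignored."""
    return (2 * a * b) >> 16

b = 2468  # even scalar, as required by the half-scalar trick
# Passing b//2 cancels the doubling: 2*a*(b/2) == a*b exactly.
assert vqdmulh_s16(30000, b // 2) == (30000 * b) >> 16
assert vqdmulh_s16(-12345, b // 2) == (-12345 * b) >> 16
```

This is why the precomputed scalar pair in the code stores the halved multiplicand alongside the Montgomery-twisted one.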
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.type montgomery_u16_round_mve, %function -.global montgomery_u16_round_mve -montgomery_u16_round_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - mov r10, #(-3329) /* Modulus */ - mov r8, #8 /* Iterations */ - - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - wls lr, r8, montgomery_u16_round_mve_loop_end -montgomery_u16_round_mve_loop_start: - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0], #16 - - /* High product */ - vqrdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqrdmlah.s16 q1, q0, r10 - - /* Store result */ - vstrh.s16 q1, [r2], #16 - - le lr, montgomery_u16_round_mve_loop_start -montgomery_u16_round_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type cyclic_mul_u16_core_mve, %function -.global cyclic_mul_u16_core_mve -cyclic_mul_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Load polynomials to multiply - * - * Lanes come in pairs representing real and imaginary parts. - */ - vldrh.s16 q0, [r0] - vldrh.s16 q1, [r1] - - /* Step 1: - * - * Apply evaluation at -1, +1: - * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1) - * - * Concretely: - * (a,b) |-> (a-b, a+b) - * - * This can be implemented as a rotate-and-add - * operation, treating (a,b) as a complex number - * a+bi, and noticing that a rotation by 90 - * gives i(a+bi) = -b + ai, so - * a+bi + i(a+bi) = (a-b) + (a+b)i - * - * This rotate-90-and-add is a single - instruction in MVE. 
- */ - vcadd.i16 q0, q0, q0, #90 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - - /* Montgomery multiplications - * - * 1x mul-high - * 1x mul-low - * 1x mul-high - * 1x subtract - * - * Needs 1x free temporary vector register - */ - vqdmulh.s16 q0, q0, q1 - vmul.u16 q1, q2, q1 - /*vmul.u16 q0, q0, r9*/ - vqdmulh.s16 q1, q1, r10 - /* Now we've actually computed twice the desired result, - * but we can compensate by using vhsub */ - vhsub.s16 q0, q0, q1 - - /* - * Finally, interpolation step: - * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y) - * - * This can be done as a single VCHADD, with - * rotate by 270: -i(a+bi) = b - ai - * - * We can't naively use vhcadd here because the - * multiplication by 1/2 is modulo q. - */ - vcadd.s16 q0, q0, q0, #270 - - vstrh.s16 q0, [r2] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -.type cyclic_mul_u16_mve, %function -.global cyclic_mul_u16_mve -cyclic_mul_u16_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Number of inner iterations */ - mov r4, #(VECTOR_LENGTH/16 - 1) - - vldrh.s16 q0, [r0], #16 - vcadd.i16 q0, q0, q0, #90 - vldrh.s16 q1, [r1], #16 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vstrh.s16 q4, [r2] - vmul.u16 q1, q2, q1 - vldrh.s16 q3, [r0], #16 - vqdmulh.s16 q1, q1, r10 - vcadd.i16 q3, q3, q3, #90 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - wls lr, r4, cyclic_mul_u16_loop_end -cyclic_mul_u16_loop_start: - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vldrh.s16 q3, [r0], #16 - vmul.u16 q1, q2, q1 - vcadd.i16 q3, q3, q3, #90 - 
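The comment block in cyclic_mul_u16_core_mve describes multiplication in k[X]/(X^2 - 1) by evaluating at ±1 (the vcadd #90 rotate-and-add), multiplying pointwise, and interpolating (the vcadd #270 step). A minimal arithmetic model of that map, with the halving written out as an explicit modular inverse — the kernel instead folds it into vhsub and absorbs a factor of 2:

```python
def cyclic_mul_deg2(a0, a1, b0, b1, q):
    """Multiply a0 + a1*X by b0 + b1*X in Z_q[X]/(X^2 - 1).

    Step 1 (cf. vcadd #90): evaluate both inputs at -1 and +1.
    Step 2: pointwise products of the evaluations (Montgomery in the kernel).
    Step 3 (cf. vcadd #270): interpolate, dividing by 2 mod q.
    """
    am, ap = a0 - a1, a0 + a1            # (a, b) |-> (a - b, a + b)
    bm, bp = b0 - b1, b0 + b1
    em, ep = am * bm % q, ap * bp % q    # eval(-1) and eval(+1) of the product
    inv2 = pow(2, -1, q)                 # q is odd, so 2 is invertible mod q
    return ((em + ep) * inv2 % q, (ep - em) * inv2 % q)

# schoolbook check: (a0 + a1*X)(b0 + b1*X) mod (X^2 - 1) has
# constant term a0*b0 + a1*b1 and linear term a0*b1 + a1*b0
q = 3329
assert cyclic_mul_deg2(3, 5, 7, 11, q) == ((3*7 + 5*11) % q, (3*11 + 5*7) % q)
```

This also shows why the comment warns against a naive vhcadd: the division by 2 must happen modulo q, not as an integer halving of the raw sums.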
vqdmulh.s16 q1, q1, r10 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - le lr, cyclic_mul_u16_loop_start -cyclic_mul_u16_loop_end: - - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - -.type cyclic_mul_u16_multi_naive_mve, %function -.global cyclic_mul_u16_multi_naive_mve -cyclic_mul_u16_multi_naive_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vpop {d0-d15} - pop {r4-r12, lr} - bx lr - -#endif /* MODULUS_Q16 */ - -#if defined(MODULUS_Q32) - -.type cyclic_mul_u32_mve, %function -.global cyclic_mul_u32_mve -cyclic_mul_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - movw r10, #:lower16:MODULUS_Q32 - movt r10, #:upper16:MODULUS_Q32 - - ldr r9, [r2] - mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */ - wls lr, 
r3, cyclic_mul_u32_loop_end -cyclic_mul_u32_loop_start: - vldrw.s32 q1, [r0], #16 - vcadd.i32 q0, q1, q1, #90 - vldrw.s32 q2, [r1], #16 - vcadd.i32 q1, q2, q2, #90 - vqdmulh.s32 q2, q0, q1 - vmul.u32 q0, q0, r9 - vmul.u32 q1, q0, q1 - vqdmulh.s32 q1, q1, r10 - vhsub.s32 q2, q2, q1 - vcadd.s32 q1, q2, q2, #270 - vstrw.s32 q1, [r2], #16 - le lr, cyclic_mul_u32_loop_start -cyclic_mul_u32_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -#endif /* MODULUS_Q32 */ diff --git a/tests/ntt-192/ntt-192.mk b/tests/ntt-192/ntt-192.mk index ac4c5d2..298f850 100644 --- a/tests/ntt-192/ntt-192.mk +++ b/tests/ntt-192/ntt-192.mk @@ -12,7 +12,7 @@ NTT_192_SOURCES += main.c # Assembly sources required for this test NTT_192_ASM_DIR = ../../asm/auto/ntt_192 -NTT_192_ASMS += montgomery.s +NTT_192_ASMS += ../../asm/manual/montgomery/montgomery.s NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_33556993_27792935_incomplete_good_bitrev.s NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_33556993_27792935_incomplete_good.s NTT_192_ASMS += $(NTT_192_ASM_DIR)/ntt_192_u32_45387457_16877098_incomplete_good_bitrev.s diff --git a/tests/ntt-384/montgomery.s b/tests/ntt-384/montgomery.s deleted file mode 100644 index 196b8a6..0000000 --- a/tests/ntt-384/montgomery.s +++ /dev/null @@ -1,3647 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#include "montgomery_const.h" - - .syntax unified - -.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_acc_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_acc_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 /* vmulh requires vector operand */ - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - dst_vect .req q5 // Overlapping with res_lo - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], 
res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start: - - vldrw.s32 dst_vect, [dst] - vadd.s32 res_hi, res_hi, dst_vect - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi_old, res_hi_old, l_b0 - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi, res_hi, l_b0 - vstrw.s32 res_hi, 
[dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq res_lo - .unreq res_hi - - .unreq dst_vect - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq loop_cnt - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - -.type twisted_cyclic_mul_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, 
#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_deg4_u32_mve_alt_loop_start: - - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_mve_expand, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand 
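The deg-4 kernels accumulate 64-bit dot products with vmlaldav and then reduce each lane with the vmul.u32 / vmulh.s32 / vsub.s32 triple. A sketch of that reduction in Python, with the lane widths and signedness modeled explicitly; the modulus in the example is taken from the ntt-192 file names elsewhere in this patch:

```python
def montgomery_reduce_u32(t, q, qinv):
    """Reduce a 64-bit accumulator t: returns r with r * 2**32 == t (mod q).

    Mirrors the kernel's sequence: split t into 32-bit halves (the lo/hi
    outputs of vmlaldav), multiply the low half by q^-1 mod 2**32
    (vmul.u32 against mod_q_inv), take the signed high product with q
    (vmulh.s32 against mod_q_vect), and subtract (vsub.s32).
    """
    lo = t % 2**32                  # unsigned low half
    hi = (t - lo) >> 32             # arithmetic (signed) high half
    m = (lo * qinv) % 2**32
    if m >= 2**31:                  # vmulh.s32 reads the lane as signed
        m -= 2**32
    return hi - ((m * q) >> 32)

q = 33556993                        # modulus from the ntt-192 test sources
qinv = pow(q, -1, 2**32)
t = 123456789 * 987654321 - 55555
r = montgomery_reduce_u32(t, q, qinv)
assert (r * 2**32 - t) % q == 0
```

Since m ≡ t·q^-1 (mod 2^32), the low 32 bits of m·q equal lo, so the subtraction of high halves is exactly (t − m·q)/2^32, i.e. t·2^-32 mod q.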
-.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - twiddle .req r4 - twiddle_twisted .req r5 - - q_off_rev .req q0 - q_in .req q1 - tmp .req q3 - res .req q2 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - mod_q .req r3 - - consts .req r4 - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - - mov loop_cnt, #(VECTOR_LENGTH/4-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - add.w src, src, #+16 - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - le loop_cnt, 1b -2: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in - .unreq tmp - .unreq res - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q6 - tmp .req q2 - resA .req q4 - resB .req q5 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - 
twiddle_fix_ptr .req r3 - - consts .req r7 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(10*4 + 8*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 resB, q_in0, twiddle - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, 
#+16 - vqrdmlah.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett: - push {r4-r11,lr} - vpush {d8-d11} - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - consts .req r7 - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - loop_cnt .req r14 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q5 - tmp .req q2 - resA .req q3 - resB .req q4 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(9*4 + 2*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.s32 resB, q_in0, twiddle - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - .align 2 - wls loop_cnt, loop_cnt, 2 -1: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - 
vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d8-d11} - pop {r4-r11,pc} - - .unreq loop_cnt - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_simple, %function -.global twisted_cyclic_mul_deg4_u32_mve_simple -.align 4 -twisted_cyclic_mul_deg4_u32_mve_simple: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd 
mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - wls loop_cnt, loop_cnt, 2 -1: - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vldrw.u32 l_b2, [in_B, #(-32 + 4 )] - vldrw.u32 l_b1, [in_B, #(-32 + 8 )] - vldrw.u32 l_b0, [in_B, #(-32 + 12)] - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - le loop_cnt, 1b -2: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - - -.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_mve -twisted_cyclic_mul_deg4_u32_add_sub_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q4 //q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi0 .req q6 - res_hi1 .req q1 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 
mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - // From this point onwards, l_b3 and l_b2 are never used - // at the same time. 
Use the same register for them - .unreq l_b3 - l_b3 .req l_b2 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - .align 2 - - - wls loop_cnt, loop_cnt, 2 -1: - - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 
res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - // Add/sub with result from previous iteration - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 // Currently unused - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi1, res_hi1, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - vsub.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi0 - .unreq res_hi1 - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - 
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr:
-        .byte 3*4
-        .byte 2*4
-        .byte 1*4
-        .byte 0*4
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q      .req r11
-        mod_q_inv  .req r12
-        mod_q_vect .req q7
-
-        q_rev .req q3
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q4 //q3
-        l_b0 .req q4
-
-        res_lo  .req q5
-        res_hi0 .req q6
-        res_hi1 .req q1
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        tmp .req r5
-        adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr
-        vldrb.u32 q_rev, [tmp]
-        vadd.u32 q_rev, q_rev, in_A
-        .unreq tmp
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vldrw.u32 l_a, [q_rev]
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        // From this point onwards, l_b3 and l_b2 are never used
-        // at the same time. Use the same register for them
-        .unreq l_b3
-        l_b3 .req l_b2
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        .align 2
-
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Add/sub with result from previous iteration
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        tmp .req l_b2 // Currently unused
-        vadd.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_hi0, res_hi1
-        vstrw.s32 tmp, [dst], #+16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [q_rev, #+16]!
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi0, res_hi0, res_lo
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi1, res_hi1, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-        vsub.s32 mod_q_vect, res_hi0, res_hi1
-        vstrw.32 mod_q_vect, [dst], #+16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi0
-        .unreq res_hi1
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve
-.align 4
-twisted_cyclic_mul_deg4_u32_add_sub_split_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        sub sp, sp, #16
-
-        mod_q      .req r11
-        mod_q_inv  .req r12
-        mod_q_vect .req q7
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        res_lo  .req q5
-        res_hi  .req q6
-        res_old .req q5 // Overlaps with res_lo deliberately
-
-        in_A  .req r0
-        in_B  .req r1
-        dst   .req r2
-        dst_h .req r3
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
-        add dst_h, dst, #(4*VECTOR_LENGTH/2)
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r8
-        res1_hi .req r9
-        res2_lo .req r6
-        res2_hi .req r7
-        res0_lo .req r10
-        res0_hi .req r11
-
-        tmp_params .req r8
-        ldr tmp_params, [sp, #(10*4 + 8*16 + 16)]
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        vldrw.u32 l_a, [in_A], #+16
-        vldrw.u32 l_b3, [in_B], #+32
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B, #(-32-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B, #(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vldrw.u32 l_b2, [in_B, #(-16-12)]
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B, #(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vstrw.s32 res_hi, [sp]
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vldrw.s32 res_old, [sp]
-        tmp .req q1 // == l_b3 (currently unused)
-        vadd.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst], #+16
-        vsub.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst_h], #+16
-        .unreq tmp
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        vstrw.s32 res_hi, [sp]
-
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-
-        // Add/sub with result from previous iteration
-        vldrw.s32 res_old, [sp]
-        tmp .req q1 // == l_b3 (currently unused)
-        vadd.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst], #16
-        vsub.s32 tmp, res_old, res_hi
-        vstrw.s32 tmp, [dst_h], #16
-        .unreq tmp
-
-        le loop_cnt, 1b
-
-2:
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vldrw.u32 l_b3, [in_B], #+32
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vldrw.u32 l_a, [in_A], #+16
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vldrw.u32 l_b1, [in_B,#(-16-8)]
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vldrw.u32 l_b2, [in_B,#(-16-12)]
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vsub.s32 res_hi, res_hi, res_lo
-
-        /* Defer storing of last result */
-        .unreq res_old
-        res_old .req q6
-        .unreq res_hi
-        .unreq l_b1
-        res_hi .req q3
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-        vldrw.u32 l_b0, [in_B,#(-16-4)]
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmov res_lo[2], res_lo[0], res0_lo, res2_lo
-        vmov res_hi[2], res_hi[0], res0_hi, res2_hi
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        vsub.s32 res_hi, res_hi, res_lo
-
-        // Don't need mod_q_vect anymore
-        vadd.s32 mod_q_vect, res_old, res_hi
-        vstrw.32 mod_q_vect, [dst], #16
-        vsub.s32 mod_q_vect, res_old, res_hi
-        vstrw.32 mod_q_vect, [dst_h], #16
-
-        add sp, sp, #16
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res_lo
-        .unreq res_hi
-        .unreq res_old
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res0_lo
-        .unreq res0_hi
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function
-.global twisted_cyclic_mul_deg4_u32_long_mve_v1
-.align 4
-twisted_cyclic_mul_deg4_u32_long_mve_v1:
-        push {r4-r11,lr}
-        vpush {d0-d9}
-
-        l_a  .req q0
-        l_b3 .req q1
-        l_b2 .req q2
-        l_b1 .req q3
-        l_b0 .req q4
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        loop_cnt .req r14
-        mov loop_cnt, #((VECTOR_LENGTH/4))
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res1_lo .req r6
-        res1_hi .req r7
-        res2_lo .req r8
-        res2_hi .req r9
-        res0_lo .req r10
-        res0_hi .req r11
-
-        wls loop_cnt, loop_cnt, 2
-1:
-
-        vldrw.u32 l_a, [in_A], #+16          /* (a0, a1, a2, a3) */
-
-        vldrw.u32 l_b3, [in_B], #+32         /* (b3, b2, b1, b0) */
-        vldrw.u32 l_b0, [in_B,#(-32+3*4)]    /* (b0, zeta*b3, zeta*b2, zeta*b1) */
-        vldrw.u32 l_b1, [in_B,#(-32+2*4)]    /* (b1, b0, zeta*b3, zeta*b2) */
-        vldrw.u32 l_b2, [in_B,#(-32+1*4)]    /* (b2, b1, b0, zeta*b3) */
-
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
-
-        //strd res0_lo, res1_lo, [dst], #8
-        //strd res2_lo, res3_lo, [dst], #8
-        //strd res0_hi, res1_hi, [dst], #8
-        //strd res2_hi, res3_hi, [dst], #8
-
-        strd res0_lo, res0_hi, [dst], #8
-        strd res1_lo, res1_hi, [dst], #8
-        strd res2_lo, res2_hi, [dst], #8
-        strd res3_lo, res3_hi, [dst], #8
-
-        le loop_cnt, 1b
-2:
-
-        vpop {d0-d9}
-        pop {r4-r11,lr}
-
-        bx lr
-
-        .unreq l_a
-        .unreq l_b3
-        .unreq l_b2
-        .unreq l_b1
-        .unreq l_b0
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq res0_lo
-        .unreq res0_hi
-        .unreq res1_lo
-        .unreq res1_hi
-        .unreq res2_lo
-        .unreq res2_hi
-        .unreq res3_lo
-        .unreq res3_hi
-
-.type twisted_cyclic_mul_deg4_u32_mve, %function
-.global twisted_cyclic_mul_deg4_u32_mve
-twisted_cyclic_mul_deg4_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        /* Preparation -- amortizes when looping */
-
-        mod_q      .req r12
-        mod_q_inv  .req r14
-        mod_q_vect .req q4 /* vmulh requires vector operand */
-
-        ldrd mod_q, mod_q_inv, [r2]
-        vdup.s32 mod_q_vect, mod_q
-        .unreq mod_q
-
-        tw1 .req r10
-        tw2 .req r11
-        tw3 .req r12
-
-        l_a .req q0
-        l_b .req q1
-
-        res_lo .req q2
-        res_hi .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        /* Input A */
-        vldrw.u32 l_b, [in_B], #+16
-        vmov tw1, tw3, l_b[3], l_b[1]
-        vldrw.u32 l_a, [in_A], #+16
-
-        /* Assume b-input is already reversed */
-
-        /* Extract second half of twisted b into GPRs */
-
-        vmov.s32 tw2, l_b[2]
-
-        res3_lo .req r4
-        res3_hi .req r5
-        res2_lo .req r6
-        res2_hi .req r7
-
-        /* TODO:
-         * For twisted multiplication, add Montgomery multiplication here.
-         * Adds 3 instructions. */
-
-        /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */
-        vmlaldav.s32 res3_lo, res3_hi, l_a, l_b
-
-        /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */
-        vshlc l_b, tw3, #32
-        .unreq tw3
-
-        /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */
-        vmlaldav.s32 res2_lo, res2_hi, l_a, l_b
-
-        /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */
-        vshlc l_b, tw2, #32
-        .unreq tw2
-
-        res1_lo .req r8
-        res1_hi .req r9
-
-        /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */
-        vmlaldav.s32 res1_lo, res1_hi, l_a, l_b
-
-        /* Move low and high results into result vector */
-        vmov res_lo[3], res_lo[1], res1_lo, res3_lo
-        vmov res_hi[3], res_hi[1], res1_hi, res3_hi
-
-        .unreq res3_lo
-        .unreq res3_hi
-        .unreq res1_lo
-        .unreq res1_hi
-
-        res0_lo .req r8
-        res0_hi .req r9
-
-        /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */
-        vshlc l_b, tw1, #32
-        .unreq tw1
-
-        /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */
-        vmlaldav.s32 res0_lo, res0_hi, l_a, l_b
-
-        /* PRELOAD FOR NEXT ITERATION? */
-
-        /* Move low results into result vector */
-        vmov res_lo[2], res_lo[0], res2_lo, res0_lo
-
-        /* Montgomery 1 */
-        vmul.u32 res_lo, res_lo, mod_q_inv
-        /* Move high results into result vector */
-        vmov res_hi[2], res_hi[0], res2_hi, res0_hi
-        /* Montgomery 2 */
-        vmulh.s32 res_lo, res_lo, mod_q_vect
-        /* Montgomery 3 */
-        vsub.s32 res_hi, res_hi, res_lo
-
-        /* Store results */
-        vstrw.s32 res_hi, [dst], #+16
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-        .unreq l_a
-        .unreq l_b
-
-        .unreq in_A
-        .unreq in_B
-        .unreq dst
-
-        .unreq mod_q_inv
-        .unreq mod_q_vect
-
-.type cyclic_mul_deg4_u32_mve, %function
-.global cyclic_mul_deg4_u32_mve
-cyclic_mul_deg4_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mov r10, #0x0F0F
-        vmsr p0, r10
-
-        mod_q     .req r10
-        mod_q_inv .req r9
-
-        ldr mod_q, [r2,#0]     /* Modulus */
-        ldr mod_q_inv, [r2,#4]
-
-        l_a0 .req q1
-        l_a1 .req q2
-        l_b0 .req q3
-        l_b1 .req q4
-
-        r_a0 .req q0
-        r_a1 .req q1
-        r_b0 .req q2
-        r_b1 .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */
-        vld20.u32 {l_a0,l_a1}, [in_A]
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-
-        /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */
-        vld20.u32 {l_b0,l_b1}, [in_B]
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Compute product in two vectors q4, q5 */
-
-        /* Can use q6, q7 for temporary data; need at least
-         * one temporary vector per subproduct. */
-
-        /*
-         * Ballpark estimates:
-         * - 4 = 2x2 VLD2x to load current polynomials
-         * - 2 = 2x VST2x to store result
-         * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form
-         * - 16 = 4x4 Vector Multiplications, 4 per subproduct
-         * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction
-         *   In fact, use VSUB for first time each target vector is
-         *   used, and VHSUB for the second time.
-         * - 2 = 2x VCADD for interpolation of result --
-         *   Note that we don't need to do this in every
-         *   subproduct.
-         *
-         * Total: 32 instructions
-         *
-         * Pretty promising... if it pipelines well and we have enough
-         * vector registers.
-         */
-
-        /* Transform input into evaluated form */
-        vcadd.i32 r_a0, l_a0, l_a0, #90
-        .unreq l_a0
-
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1
-
-        vcadd.i32 r_b0, l_b0, l_b0, #90
-        .unreq l_b0
-
-        vcadd.i32 r_b1, l_b1, l_b1, #90
-        .unreq l_b1
-
-        /* Subproduct 1: a0*b1 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, a1, b0, b1
-         * - Temporary allocations: 1
-         * - Final allocations: a0, a1, b0, b1, dst1
-         */
-
-        /*
-         * OPTIMIZATION:
-         *
-         * - We have two free vectors at this point --
-         *   could use this for a late store of the results
-         *   of a previous iteration, residing in {q6, q7}.
-         *
-         * - Perform a late evaluation of r_a0, r_b1 here.
-         *
-         */
-
-        dst1 .req q5
-        tmp  .req q4
-
-        vmul.u32 tmp, r_a0, mod_q_inv  /* Twist one factor using temporary tmp */
-        vqdmulh.s32 dst1, r_a0, r_b1   /* Initialize dst1 with high part */
-        vmul.u32 tmp, tmp, r_b1        /* Twisted low product */
-        vqdmulh.s32 tmp, tmp, mod_q    /* High product */
-        vsub.s32 dst1, tmp, dst1       /* Correct high product */
-                                       /* Defer halving for later */
-                                       /* Actually store _negative_ of result */
-
-        .unreq tmp
-
-        /* Subproduct 2: a1*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, a1, b0, b1, dst1
-         * - Temporary allocations: 2
-         * - Final allocations: a0, a1, b0, b1, dst1
-         */
-
-        tmp0 .req q6
-        tmp1 .req q4
-
-        vqdmulh.s32 tmp1, r_a1, r_b0   /* Write high-product into temporary */
-        vmul.u32 tmp0, q1, mod_q_inv   /* Twist one factor using temporary tmp */
-        vmul.u32 tmp0, tmp0, r_b0      /* Twisted low product */
-        vqdmlah.s32 dst1, tmp0, mod_q  /* High product, accumulate onto dst1,
-                                        * which stores the _negative_ of the
-                                        * subproduct 1. */
-        vhsub.s32 dst1, tmp1, dst1     /* Correct high product */
-                                       /* Late halving, encompassing also the
-                                        * first subproduct. */
-                                       /* Note that, so far, dst1 contained
-                                        * -pre + high_correct.
-                                        * After this step, it's
-                                        *   high - ( -pre + high_correct )
-                                        * = pre + high - high_correct,
-                                        * which is what we want. */
-
-        .unreq tmp0
-        .unreq tmp1
-
-        /* Finalize dst1 */
-
-        dst1_final .req q7
-        vcadd.s32 dst1_final, dst1, dst1, #270
-        .unreq dst1
-
-        /* Subproduct 3: a1*b1 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, a1, b0, b1, dst1_final
-         * - Temporary allocations: 0
-         * - Final allocations: a0, b0, dst1_final, dst0
-         */
-
-        dst0 .req q4
-
-        vqdmulh.s32 dst0, r_a1, r_b1   /* Initialize dst0 with high part */
-        vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */
-        vmul.u32 r_a1, r_a1, r_b1      /* Twisted low product */
-
-        .unreq r_b1
-
-        vqdmulh.s32 r_a1, r_a1, mod_q  /* High product */
-        vsub.s32 dst0, r_a1, dst0      /* Correct high product */
-                                       /* Defer halving for later */
-                                       /* Actually store _negative_ of result */
-
-        .unreq r_a1
-
-        vpst
-        vnegt.s32 dst0, dst0
-
-        /* Subproduct 4: a0*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initially: a0, b0, dst1_final, dst0
-         * - Temporary allocations: 1
-         * - Final allocations: dst1_final, dst0
-         */
-
-        tmp .req q5
-
-        vqdmulh.s32 tmp, r_a0, r_b0    /* Write high-product into temporary */
-        vmul.u32 r_a0, r_a0, r_b0      /* Twisted low product */
-
-        .unreq r_b0
-
-        vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */
-        vqdmlah.s32 dst0, r_a0, mod_q  /* High product, accumulate onto tmp,
-                                        * which stores the _negative_ of the
-                                        * subproduct 1. */
-        vhsub.s32 dst0, tmp, dst0      /* Correct high product */
-                                       /* Late halving, encompassing also the
-                                        * first subproduct. */
-                                       /* Note that, so far, tmp contained
-                                        * -pre + high_correct.
-                                        * After this step, it's
-                                        *   high - ( -pre + high_correct )
-                                        * = pre + high - high_correct,
-                                        * which is what we want. */
-
-        .unreq tmp
-
-        /* Finalize dst0 */
-        dst0_final .req q6
-        vcadd.s32 dst0_final, dst0, dst0, #270
-        .unreq dst0
-
-        /* Store results */
-        vst20.s32 {dst0_final, dst1_final}, [dst]
-        vst21.s32 {dst0_final, dst1_final}, [dst]!
-        .unreq dst0_final
-        .unreq dst1_final
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-        .unreq r_a0
-
-.type cyclic_mul_deg4_u32_alt_mve, %function
-.global cyclic_mul_deg4_u32_alt_mve
-cyclic_mul_deg4_u32_alt_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        l_a0 .req q0
-        l_a1 .req q1
-        l_b0 .req q2
-        l_b1 .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        cnt .req r4
-
-        dst0_last_final .req q6
-        dst1_last_final .req q7
-
-        mod_q       .req r10
-        mod_q_inv   .req r9
-        pred_helper .req r8
-
-        vld20.u32 {l_a0,l_a1}, [in_A]
-        mov pred_helper, #0x0F0F
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-        vmsr p0, pred_helper
-
-        vld20.u32 {l_b0,l_b1}, [in_B]
-        ldr mod_q_inv, [r2,#4]
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Subproduct 1: a1*b1
-         *
-         * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2)
-         * T: tmp (q1)
-         * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2),
-         *    dst0 (q3)
-         */
-
-        r_a1 .req q5
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1 // q1
-
-        ldr mod_q, [r2,#0]
-
-        tmp .req q1
-        vmul.u32 tmp, r_a1, mod_q_inv
-
-        r_b1 .req q4
-        vcadd.i32 r_b1, l_b1, l_b1, #90
-        .unreq l_b1 // q3
-
-        tmp1 .req q3
-
-        vmul.u32 tmp1, tmp, r_b1        /* Montgomery twist */
-        mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of
-                                         * loop counter */
-        vqdmulh.s32 tmp1, tmp1, mod_q   /* Montgomery high product fix */
-
-        dst0 .req q6
-        vqdmulh.s32 dst0, r_a1, r_b1    /* Initial high product */
-
-        r_b0 .req q7
-        vcadd.i32 r_b0, l_b0, l_b0, #90
-        .unreq l_b0 // q2
-
-        /* Subproduct 2: a1*b0
-         *
-         * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3)
-         * T: 1 (q5)
-         * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2)
-         */
-
-        dst1 .req q2
-        vqdmulh.s32 dst1, r_a1, r_b0    /* Initial high product */
-        .unreq r_a1 // q5
-
-        dst0_old .req q6
-        .unreq dst0
-        dst0 .req q6
-
-        vsub.s32 dst0, tmp1, dst0_old   /* Fix high product */
-                                        /* Defer halving for later */
-                                        /* Store _negative_ of result */
-        .unreq tmp1
-        .unreq dst0_old // q6
-
-        vmul.u32 tmp, tmp, r_b0         /* Montgomery low product twist */
-
-        vpst
-        vnegt.s32 dst0, dst0
-
-        vqdmulh.s32 tmp, tmp, mod_q     /* Montgomery high product fix */
-
-        r_a0 .req q3
-        vcadd.i32 r_a0, l_a0, l_a0, #90
-        .unreq l_a0 // q0
-
-        tmp0 .req q5
-        vmul.u32 tmp0, r_a0, mod_q_inv  /* Montgomery twist */
-
-        /* Subproduct 3: a0*b1 */
-
-        vsub.s32 dst1, tmp, dst1        /* Correct high product */
-                                        /* Defer halving for later */
-                                        /* Actually store _negative_ of result */
-        .unreq tmp // q1
-
-        /*
-         * Vector register allocation state:
-         * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1 (q2)
-         * - Temporary allocations: 1 (q5)
-         * - Final allocations: r_a0 (q3), r_b0 (q7),
-         *                      dst0 (q6), dst1 (q2)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        vmul.u32 tmp0, tmp0, r_b1
-
-        l_a0 .req q0
-        l_a1 .req q1
-        /* Preload for next iteration */
-        vld20.u32 {l_a0,l_a1}, [in_A]
-
-        vqdmlah.s32 dst1, tmp0, mod_q   /* High product, accumulate onto dst1,
-                                         * which stores the _negative_ of the
-                                         * subproduct 1. */
-        .unreq tmp0 // q5
-
-        /* Preload for next iteration */
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-
-        vqdmulh.s32 r_b1, r_a0, r_b1    /* Can overwrite r_b1 here */
-
-        vhsub.s32 dst1, r_b1, dst1      /* Correct high product */
-                                        /* Late halving, encompassing also the
-                                         * first subproduct. */
-                                        /* Note that, so far, dst1 contained
-                                         * -pre + high_correct.
-                                         * After this step, it's
-                                         *   high - ( -pre + high_correct )
-                                         * = pre + high - high_correct,
-                                         * which is what we want. */
-
-        .unreq r_b1 // q4
-
-        /* Finalize dst1
-         *
-         * - Initial allocations: r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1 (q2)
-         *                        preloaded l_a0 (q0), l_a1 (q1)
-         * - Final allocations: r_a0 (q5), r_b0 (q7),
-         *                      dst0 (q3), dst1_final (q7)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        /* Subproduct 4: a0*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initial allocations: r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1_final (q7)
-         *                        preloaded l_a0 (q0), l_a1 (q1)
-         * - Temporary allocations: 1 (q4)
-         * - Final allocations: dst1_final (q7), dst0 (q4)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        tmp .req q4
-        vqdmulh.s32 tmp, r_a0, r_b0     /* Write high-product into temporary */
-
-        /* LOAD r_a1 into q5 here...,
-         * freeing up q1 as a temporary */
-
-        r_a1 .req q5
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1 // q1
-
-        tmp0 .req q1
-        /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */
-        vmul.u32 tmp0, r_a0, r_b0       /* Twisted low product */
-        /* Can overwrite rb0 now */
-        .unreq r_a0 // q3
-        .unreq r_b0 // q7
-
-        dst1_final .req q7
-        vcadd.s32 dst1_final, dst1, dst1, #270
-        .unreq dst1 // q2
-
-
-        vmul.u32 tmp0, tmp0, mod_q_inv
-
-        l_b0 .req q2
-        l_b1 .req q3
-        /* Preload for next iteration */
-        vld20.u32 {l_b0,l_b1}, [in_B]
-
-        vqdmlah.s32 dst0, tmp0, mod_q   /* High product, accumulate onto tmp,
-                                         * which stores the _negative_ of the
-                                         * subproduct 1. */
-        .unreq tmp0 // q1
-
-        dst0_old .req q6
-        .unreq dst0
-        dst0 .req q1
-        vhsub.s32 dst0, tmp, dst0_old   /* Correct high product */
-                                        /* Late halving, encompassing also the
-                                         * first subproduct. */
-                                        /* Note that, so far, tmp contained
-                                         * -pre + high_correct.
-                                         * After this step, it's
-                                         *   high - ( -pre + high_correct )
-                                         * = pre + high - high_correct,
-                                         * which is what we want. */
-
-        .unreq tmp // q4
-        .unreq dst0_old
-
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Finalize dst0
-         *
-         * - Initial allocations: dst1_final (q7), dst0 (q5)
-         * - Final allocations: dst0_final (q6), dst1_final (q7)
-         */
-        dst0_final .req q6
-        vcadd.s32 dst0_final, dst0, dst0, #270
-        .unreq dst0 // q1
-
-        nop
-        wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end
-
-cyclic_mul_deg4_u32_alt_mve_loop_start:
-
-        nop
-
-        /* Subproduct 1: a1*b1
-         *
-         * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2)
-         * T: tmp (q1)
-         * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3)
-         */
-
-        tmp .req q1
-        vmul.u32 tmp, r_a1, mod_q_inv
-
-        r_b1 .req q4
-        vcadd.i32 r_b1, l_b1, l_b1, #90
-        .unreq l_b1 // q3
-
-        tmp1 .req q3
-
-        vmul.u32 tmp1, tmp, r_b1        /* Twisted low product */
-
-        vst20.s32 {dst0_last_final,dst1_last_final}, [dst]
-
-        vqdmulh.s32 tmp1, tmp1, mod_q   /* High product */
-
-        vst21.s32 {dst0_last_final,dst1_last_final}, [dst]!
-        .unreq dst0_last_final // q6
-        .unreq dst1_last_final // q7
-
-        dst0 .req q6
-        vqdmulh.s32 dst0, r_a1, r_b1    /* Initialize dst0 with high part */
-
-        r_b0 .req q7
-        vcadd.i32 r_b0, l_b0, l_b0, #90
-        .unreq l_b0 // q2
-
-        /* Subproduct 2: a1*b0
-         *
-         * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3)
-         * T: 1 (q5)
-         * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2)
-         */
-
-        dst1 .req q2
-        vqdmulh.s32 dst1, r_a1, r_b0    /* Initialize dst1 with high part */
-        .unreq r_a1 // q5
-
-        dst0_old .req q6
-        .unreq dst0
-        dst0 .req q6
-
-        vsub.s32 dst0, tmp1, dst0_old   /* Correct high product */
-                                        /* Defer halving for later */
-                                        /* Actually store _negative_ of result */
-        .unreq tmp1
-        .unreq dst0_old // q6
-
-        vmul.u32 tmp, tmp, r_b0         /* Twisted low product */
-
-        vpst
-        vnegt.s32 dst0, dst0
-
-        vqdmulh.s32 tmp, tmp, mod_q     /* High product */
-
-        r_a0 .req q3
-        vcadd.i32 r_a0, l_a0, l_a0, #90
-        .unreq l_a0 // q0
-
-        tmp0 .req q5
-        vmul.u32 tmp0, r_a0, mod_q_inv  /* Twist one factor using temporary tmp */
-
-        vsub.s32 dst1, tmp, dst1        /* Correct high product */
-                                        /* Defer halving for later */
-                                        /* Actually store _negative_ of result */
-        .unreq tmp // q1
-
-        /* Subproduct 3: a0*b1
-         *
-         * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2)
-         * T: 1 (q5)
-         * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1)
-         */
-
-        tmp1 .req q0
-        vmul.u32 tmp1, tmp0, r_b1
-
-
-        vqdmlah.s32 dst1, tmp1, mod_q   /* High product, accumulate onto dst1,
-                                         * which stores the _negative_ of the
-                                         * subproduct 1. */
-        .unreq tmp1 // q0
-
-        l_a0 .req q0
-        l_a1 .req q1
-        /* Preload for next iteration */
-        vld20.u32 {l_a0,l_a1}, [in_A]
-
-        vqdmulh.s32 r_b1, r_a0, r_b1    /* Can overwrite r_b1 here */
-
-        /* Preload for next iteration */
-        vld21.u32 {l_a0,l_a1}, [in_A]!
-
-        vhsub.s32 dst1, r_b1, dst1      /* Correct high product */
-                                        /* Late halving, encompassing also the
-                                         * first subproduct. */
-                                        /* Note that, so far, dst1 contained
-                                         * -pre + high_correct.
-                                         * After this step, it's
-                                         *   high - ( -pre + high_correct )
-                                         * = pre + high - high_correct,
-                                         * which is what we want. */
-
-        .unreq r_b1 // q4
-
-        /* Finalize dst1
-         *
-         * - Initial allocations: r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1 (q2)
-         *                        preloaded l_a0 (q0), l_a1 (q1)
-         * - Final allocations: r_a0 (q5), r_b0 (q7),
-         *                      dst0 (q3), dst1_final (q7)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        /* Subproduct 4: a0*b0 */
-
-        /*
-         * Vector register allocation state:
-         * - Initial allocations: r_a0 (q3), r_b0 (q7),
-         *                        dst0 (q6), dst1_final (q7)
-         *                        preloaded l_a0 (q0), l_a1 (q1)
-         * - Temporary allocations: 1 (q4)
-         * - Final allocations: dst1_final (q7), dst0 (q4)
-         *                      preloaded l_a0 (q0), l_a1 (q1)
-         */
-
-        tmp .req q4
-        vmul.u32 tmp, tmp0, r_b0        /* Twisted low product */
-        .unreq tmp0
-
-        r_a1 .req q5
-        vcadd.i32 r_a1, l_a1, l_a1, #90
-        .unreq l_a1 // q1
-
-        tmp0 .req q1
-        vqdmulh.s32 tmp0, r_a0, r_b0    /* Write high-product into temporary */
-        .unreq r_a0 // q3
-        .unreq r_b0 // q7
-
-        dst1_final .req q7
-        vcadd.s32 dst1_final, dst1, dst1, #270
-        .unreq dst1 // q2
-
-        vqdmlah.s32 dst0, tmp, mod_q    /* High product, accumulate onto tmp,
-                                         * which stores the _negative_ of the
-                                         * subproduct 1. */
-        .unreq tmp // q4
-
-        /* Preload for next iteration */
-        l_b0 .req q2
-        l_b1 .req q3
-        vld20.u32 {l_b0,l_b1}, [in_B]
-
-        dst0_old .req q6
-        .unreq dst0
-        dst0 .req q1
-        vhsub.s32 dst0, tmp0, dst0_old  /* Correct high product */
-                                        /* Late halving, encompassing also the
-                                         * first subproduct. */
-                                        /* Note that, so far, tmp contained
-                                         * -pre + high_correct.
-                                         * After this step, it's
-                                         *   high - ( -pre + high_correct )
-                                         * = pre + high - high_correct,
-                                         * which is what we want. */
-
-        .unreq tmp0 // q1
-        .unreq dst0_old
-
-        /* Preload for next iteration */
-        vld21.u32 {l_b0,l_b1}, [in_B]!
-
-        /* Finalize dst0
-         *
-         * - Initial allocations: dst1_final (q7), dst0 (q5)
-         * - Final allocations: dst0_final (q6), dst1_final (q7)
-         */
-        dst0_final .req q6
-        vcadd.s32 dst0_final, dst0, dst0, #270
-        .unreq dst0 // q1
-
-        le lr, cyclic_mul_deg4_u32_alt_mve_loop_start
-
-cyclic_mul_deg4_u32_alt_mve_loop_end:
-
-        /* Store results */
-        vst20.s32 {dst0_final, dst1_final}, [dst]
-        vst21.s32 {dst0_final, dst1_final}, [dst]!
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-        .unreq l_a0
-        .unreq l_b0
-        .unreq l_b1
-        .unreq r_a1
-
-        .unreq cnt
-
-.type montgomery_pt_u32_odd_mve, %function
-.global montgomery_pt_u32_odd_mve
-montgomery_pt_u32_odd_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mod_q     .req r10
-        mod_q_inv .req r9
-
-        l_a .req q1
-        l_b .req q2
-        l_d .req q3
-
-        in_A .req r0
-        in_B .req r1
-        dst  .req r2
-
-        tmp0 .req q4
-        tmp1 .req q5
-
-        l_at .req q6
-
-        cnt .req r8
-        mov cnt, #(VECTOR_LENGTH / 4)
-
-        ldr mod_q, [in_B], #+4     /* Modulus */
-        ldr mod_q_inv, [in_B], #+4 /* Inverse */
-
-        wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end
-
-montgomery_pt_u32_odd_mve_loop_start:
-
-        vldrw.s32 l_a, [in_A], #+16
-        vmul.u32 l_at, l_a, mod_q_inv
-        vldrw.s32 l_b, [in_B], #+16
-        vqrdmulh.s32 tmp0, l_a, l_b
-        vmul.u32 tmp1, l_at, l_b
-        vqrdmlah.s32 tmp0, tmp1, mod_q
-        vstrw.s32 tmp0, [dst], #+16
-
-        le lr, montgomery_pt_u32_odd_mve_loop_start
-
-montgomery_pt_u32_odd_mve_loop_end:
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-.text
-.type montgomery_pt_u32_mve, %function
-.global montgomery_pt_u32_mve
-.align 4
-montgomery_pt_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mod_q     .req r10
-        mod_q_inv .req r9
-
-        l_a .req q1
-        l_b .req q2
-        l_d .req q3
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        tmp0 .req q4
-        tmp1 .req q5
-        res  .req q7
-
-        l_at .req q6
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        cnt .req r8
-        mov cnt, #((VECTOR_LENGTH / 4) - 2)
-
-        /*
-         * First iteration
-         */
-
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twist a */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* High multiply */
-        vqdmulh.s32 tmp0, l_a, l_b
-
-        /* Preload a for next iteration */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-
-        /* Preload b */
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* Correction term */
-        vqdmulh.s32 tmp1, tmp1, mod_q
-
-        wls lr, cnt, montgomery_pt_u32_mve_loop_end
-
-montgomery_pt_u32_mve_loop_start:
-
-        /* Twisted low multiply */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        /* Correction term from last iteration */
-        vhsub.s32 res, tmp0, tmp1
-
-        /* High multiply */
-        vqdmulh.s32 tmp0, l_a, l_b
-
-        /* Preload l_a for the next iteration */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-
-        /* Preload b */
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* Compute correction */
-        vqdmulh.s32 tmp1, tmp1, mod_q
-
-        /* Late store from last iteration */
-        vstrw.s32 res, [dst], #+16
-
-        le lr, montgomery_pt_u32_mve_loop_start
-
-montgomery_pt_u32_mve_loop_end:
-
-        /*
-         * Last iteration
-         */
-
-        /* Twisted low multiply */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        /* Correction term from last iteration */
-        vhsub.s32 res, tmp0, tmp1
-
-        /* High multiply */
-        vqdmulh.s32 tmp0, l_a, l_b
-
-        /* Late store from last iteration */
-        vstrw.s32 res, [dst], #+16
-
-        /* Can't do anything about the following sequence
-         * which doesn't pipeline well - but it's only one iteration. */
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-        vqdmulh.s32 tmp1, tmp1, mod_q
-        vhsub.s32 res, tmp0, tmp1
-        vstrw.s32 res, [dst], #+16
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-.text
-.type montgomery_pt_acc_u32_mve, %function
-.global montgomery_pt_acc_u32_mve
-.align 4
-montgomery_pt_acc_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mod_q     .req r10
-        mod_q_inv .req r9
-
-        l_a .req q1
-        l_b .req q2
-        old .req q3
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        tmp0 .req q4
-        tmp1 .req q5
-        res  .req q7
-
-        l_at .req q6
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        cnt .req r8
-        mov cnt, #((VECTOR_LENGTH / 4) - 2)
-
-        /*
-         * First iteration
-         */
-
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twist a */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* High multiply */
-        vqdmulh.s32 tmp0, l_a, l_b
-
-        /* Preload a for next iteration */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-
-        /* Preload b */
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* Correction term */
-        vqdmulh.s32 tmp1, tmp1, mod_q
-
-        wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end
-
-montgomery_pt_acc_u32_mve_loop_start:
-
-        /* Twisted low multiply */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        /* Correction term from last iteration */
-        vhsub.s32 res, tmp0, tmp1
-
-        /* High multiply */
-        vqdmulh.s32 tmp0, l_a, l_b
-
-        /* Preload l_a for the next iteration */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-
-        /* Preload b */
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* Compute correction */
-        vqdmulh.s32 tmp1, tmp1, mod_q
-
-        /* Late store-accumulate from last iteration */
-        vldrw.s32 old, [dst]
-        vadd.s32 res, res, old
-        vstrw.s32 res, [dst], #+16
-
-        le lr, montgomery_pt_acc_u32_mve_loop_start
-
-montgomery_pt_acc_u32_mve_loop_end:
-
-        /*
-         * Last iteration
-         */
-
-        /* Twisted low multiply */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        /* Correction term from last iteration */
-        vhsub.s32 res, tmp0, tmp1
-
-        /* High multiply */
-        vqdmulh.s32 tmp0, l_a, l_b
-
-        /* Late store from last iteration */
-        vldrw.s32 old, [dst]
-        vadd.s32 res, res, old
-        vstrw.s32 res, [dst], #+16
-
-        /* Can't do anything about the following sequence
-         * which doesn't pipeline well - but it's only one iteration. */
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-        vqdmulh.s32 tmp1, tmp1, mod_q
-        vhsub.s32 res, tmp0, tmp1
-        vldrw.s32 old, [dst]
-        vadd.s32 res, res, old
-        vstrw.s32 res, [dst], #+16
-
-        vpop {d0-d15}
-        pop {r4-r12,lr}
-
-        bx lr
-
-.text
-.type montgomery_pt_round_acc_u32_mve, %function
-.global montgomery_pt_round_acc_u32_mve
-.align 4
-montgomery_pt_round_acc_u32_mve:
-        push {r4-r12,lr}
-        vpush {d0-d15}
-
-        mod_q     .req r10
-        mod_q_inv .req r9
-
-        l_a  .req q1
-        l_b  .req q2
-        oldA .req q3
-
-        in_A   .req r0
-        in_B   .req r1
-        dst    .req r2
-        params .req r3
-
-        tmp0 .req q4
-        tmp1 .req q5
-        oldB .req q7
-
-        l_at .req q6
-
-        tmp_params .req r8
-        mov tmp_params, params
-        ldrd mod_q, mod_q_inv, [tmp_params]
-        .unreq tmp_params
-
-        cnt .req r8
-        mov cnt, #((VECTOR_LENGTH / 8) - 2)
-
-        /*
-         * First iteration
-         */
-
-        /* Load a-input */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twist a */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        /* Load b-input */
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* High multiply */
-        vqrdmulh.s32 tmp0, l_a, l_b
-
-        /* Preload a for next iteration */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
-
-        /* Load old value to accumulate onto */
-        vldrw.s32 oldA, [dst]
-
-        /* Correction term */
-        vqrdmlah.s32 oldA, tmp1, mod_q
-
-        /* Load b-input */
-        vldrw.s32 l_b, [in_B], #+16
-
-        /* Twist a (already loaded) */
-        vmul.u32 l_at, l_a, mod_q_inv
-
-        /* Correction */
-        vadd.s32 oldA, tmp0, oldA
-
-        /* High multiply */
-        vqrdmulh.s32 tmp0, l_a, l_b
-
-        /* Preload a for next iteration */
-        vldrw.s32 l_a, [in_A], #+16
-
-        /* Twisted low multiply */
-        vmul.u32 tmp1, l_at, l_b
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_b, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_b, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov 
cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end - 
-montgomery_pt_round_acc_u32_x4_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - le cnt, 
montgomery_pt_round_acc_u32_x4_mve_loop_start - -montgomery_pt_round_acc_u32_x4_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - 
vstrw.s32 oldA, [dst2], #+16 - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_A2 - .unreq in_A3 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq dst2 - .unreq dst3 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - - -.type montgomery_pt_u16_odd_mve, %function -.global montgomery_pt_u16_odd_mve -montgomery_pt_u16_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 8) - - ldrh mod_q, [in_B], #+2 /* Modulus */ - ldrh mod_q_inv, [in_B], #+2 /* Inverse */ - - wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end - -montgomery_pt_u16_odd_mve_loop_start: - - vldrh.s16 l_a, [in_A], #+16 - vmul.u16 l_at, l_a, mod_q_inv - vldrh.s16 l_b, [in_B], #+16 - vqrdmulh.s16 tmp0, l_a, l_b - vmul.u16 tmp1, l_at, l_b - vqrdmlah.s16 tmp0, tmp1, mod_q - vstrh.s16 tmp0, [dst], #+16 - - le lr, montgomery_pt_u16_odd_mve_loop_start - -montgomery_pt_u16_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -#if defined(MODULUS_Q16) - -.type montgomery_u16_core_mve, %function -.global montgomery_u16_core_mve -montgomery_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(-MODULUS_Q16) /* Modulus */ - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0] - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - /* High product */ - vqdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqdmulh.s16 q0, q0, r10 - vsub.s16 q1, q1, q0 - - /* Store result */ - vstrh.s16 q1, [r2] 
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.type montgomery_u16_round_mve, %function -.global montgomery_u16_round_mve -montgomery_u16_round_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - mov r10, #(-3329) /* Modulus */ - mov r8, #8 /* Iterations */ - - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - wls lr, r8, montgomery_u16_round_mve_loop_end -montgomery_u16_round_mve_loop_start: - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0], #16 - - /* High product */ - vqrdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqrdmlah.s16 q1, q0, r10 - - /* Store result */ - vstrh.s16 q1, [r2], #16 - - le lr, montgomery_u16_round_mve_loop_start -montgomery_u16_round_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type cyclic_mul_u16_core_mve, %function -.global cyclic_mul_u16_core_mve -cyclic_mul_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Load polynomials to multiply - * - * Lanes come in pairs representing real and imaginary parts. - */ - vldrh.s16 q0, [r0] - vldrh.s16 q1, [r1] - - /* Step 1: - * - * Apply evaluation at -1, +1: - * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1) - * - * Concretely: - * (a,b) |-> (a-b, a+b) - * - * This can be implemented as a rotate-and-add - * operation, treating (a,b) as a complex number - * a+bi, and noticing that a rotation by 90 - * gives i(a+bi) = -b + ai, so - * a+bi + i(a+bi) = (a-b) + (a+b)i - * - * This rotate-90-and-add can is a single - * instruction in MVE. 
- */ - vcadd.i16 q0, q0, q0, #90 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - - /* Montgomery multiplications - * - * 1x mul-high - * 1x mul-low - * 1x mul-high - * 1x subtract - * - * Needs 1x free temporary vector register - */ - vqdmulh.s16 q0, q0, q1 - vmul.u16 q1, q2, q1 - /*vmul.u16 q0, q0, r9*/ - vqdmulh.s16 q1, q1, r10 - /* Now we've actually computed twice the desired result, - * but we can compensate by using vhsub */ - vhsub.s16 q0, q0, q1 - - /* - * Finally, interpolation step: - * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y) - * - * This can be done as a single VCHADD, with - * rotate by 270: -i(a+bi) = b - ai - * - * We can't naively use vhcadd here because the - * multiplication by 1/2 is modulo q. - */ - vcadd.s16 q0, q0, q0, #270 - - vstrh.s16 q0, [r2] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -.type cyclic_mul_u16_mve, %function -.global cyclic_mul_u16_mve -cyclic_mul_u16_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Number of inner iterations */ - mov r4, #(VECTOR_LENGTH/16 - 1) - - vldrh.s16 q0, [r0], #16 - vcadd.i16 q0, q0, q0, #90 - vldrh.s16 q1, [r1], #16 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vstrh.s16 q4, [r2] - vmul.u16 q1, q2, q1 - vldrh.s16 q3, [r0], #16 - vqdmulh.s16 q1, q1, r10 - vcadd.i16 q3, q3, q3, #90 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - wls lr, r4, cyclic_mul_u16_loop_end -cyclic_mul_u16_loop_start: - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vldrh.s16 q3, [r0], #16 - vmul.u16 q1, q2, q1 - vcadd.i16 q3, q3, q3, #90 - 
vqdmulh.s16 q1, q1, r10 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - le lr, cyclic_mul_u16_loop_start -cyclic_mul_u16_loop_end: - - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - -.type cyclic_mul_u16_multi_naive_mve, %function -.global cyclic_mul_u16_multi_naive_mve -cyclic_mul_u16_multi_naive_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vpop {d0-d15} - pop {r4-r12, lr} - bx lr - -#endif /* MODULUS_Q16 */ - -#if defined(MODULUS_Q32) - -.type cyclic_mul_u32_mve, %function -.global cyclic_mul_u32_mve -cyclic_mul_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - movw r10, #:lower16:MODULUS_Q32 - movt r10, #:upper16:MODULUS_Q32 - - ldr r9, [r2] - mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */ - wls lr, 
r3, cyclic_mul_u32_loop_end -cyclic_mul_u32_loop_start: - vldrw.s32 q1, [r0], #16 - vcadd.i32 q0, q1, q1, #90 - vldrw.s32 q2, [r1], #16 - vcadd.i32 q1, q2, q2, #90 - vqdmulh.s32 q2, q0, q1 - vmul.u32 q0, q0, r9 - vmul.u32 q1, q0, q1 - vqdmulh.s32 q1, q1, r10 - vhsub.s32 q2, q2, q1 - vcadd.s32 q1, q2, q2, #270 - vstrw.s32 q1, [r2], #16 - le lr, cyclic_mul_u32_loop_start -cyclic_mul_u32_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -#endif /* MODULUS_Q32 */ diff --git a/tests/ntt-384/ntt-384.mk b/tests/ntt-384/ntt-384.mk index 44d9767..d93de54 100644 --- a/tests/ntt-384/ntt-384.mk +++ b/tests/ntt-384/ntt-384.mk @@ -12,7 +12,7 @@ NTT_384_SOURCES += main.c # Assembly sources required for this test NTT_384_ASM_DIR = ../../asm/auto/ntt_384 -NTT_384_ASMS += montgomery.s +NTT_384_ASMS += ../../asm/manual/montgomery/montgomery.s NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_33556993_15047299_incomplete_good_bitrev.s NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_33556993_15047299_incomplete_good.s NTT_384_ASMS += $(NTT_384_ASM_DIR)/ntt_384_u32_45387457_923104_incomplete_good_bitrev.s diff --git a/tests/ntt-768/montgomery.s b/tests/ntt-768/montgomery.s deleted file mode 100644 index 196b8a6..0000000 --- a/tests/ntt-768/montgomery.s +++ /dev/null @@ -1,3647 +0,0 @@ -/* - * Copyright (c) 2021 Arm Limited - * SPDX-License-Identifier: MIT - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in all - * copies or substantial portions of the Software. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#include "montgomery_const.h" - - .syntax unified - -.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_acc_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_acc_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 /* vmulh requires vector operand */ - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - dst_vect .req q5 // Overlapping with res_lo - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], 
res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start: - - vldrw.s32 dst_vect, [dst] - vadd.s32 res_hi, res_hi, dst_vect - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi_old, res_hi_old, l_b0 - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vldrw.s32 l_b0, [dst] - vadd.s32 res_hi, res_hi, l_b0 - vstrw.s32 res_hi, 
[dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq res_lo - .unreq res_hi - - .unreq dst_vect - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq loop_cnt - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - -.type twisted_cyclic_mul_deg4_u32_mve_alt, %function -.global twisted_cyclic_mul_deg4_u32_mve_alt -.align 4 -twisted_cyclic_mul_deg4_u32_mve_alt: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, 
#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end -twisted_cyclic_mul_deg4_u32_mve_alt_loop_start: - - vstrw.s32 res_hi, [dst], #+16 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start - -twisted_cyclic_mul_deg4_u32_mve_alt_loop_end: - - /* Defer storing of last result */ - res_hi_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vstrw.s32 res_hi_old, [dst], #+16 - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_mve_expand, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand 
-.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - twiddle .req r4 - twiddle_twisted .req r5 - - q_off_rev .req q0 - q_in .req q1 - tmp .req q3 - res .req q2 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - mod_q .req r3 - - consts .req r4 - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - - mov loop_cnt, #(VECTOR_LENGTH/4-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - add.w src, src, #+16 - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vldrw.32 q_in, [src, q_off_rev, UXTW #2] - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - le loop_cnt, 1b -2: - - vqrdmulh.s32 res, q_in, twiddle - vstrw.32 q_in, [dst], #+32 - vmul.u32 tmp, q_in, twiddle_twisted - vqrdmlah.s32 res, tmp, mod_q - vstrw.32 res, [dst, #-16] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in - .unreq tmp - .unreq res - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double: - push {r4-r12,lr} - vpush {d0-d15} - - loop_cnt .req r14 - - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q6 - tmp .req q2 - resA .req q4 - resB .req q5 - - dst .req r0 - src .req r1 - twiddle_table .req r2 - 
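The `_expand` routine above gathers each four-word input block in reversed order (via the `3,2,1,0` byte-offset constants and the `UXTW #2` scaled gather), stores it, and follows it with a copy multiplied by that block's twiddle factor. A rough functional model in Python; plain modular arithmetic stands in for the `vqrdmulh`/`vmul`/`vqrdmlah` rounding-multiply sequence, so the exact output scaling of the real kernel is not reproduced, and the function name and test values are illustrative only:

```python
# Functional sketch of twisted_cyclic_mul_deg4_u32_mve_expand:
# each 4-word block is stored word-reversed, followed by a copy
# multiplied by the block's twiddle. The assembly computes the product
# with vqrdmulh/vmul/vqrdmlah (a Montgomery-style rounding multiply);
# here we model it with exact modular arithmetic instead.

def expand_blocks(src, twiddles, q):
    out = []
    for i in range(0, len(src), 4):
        block = src[i:i + 4][::-1]        # vldrw.32 with q_off_rev gathers lanes 3,2,1,0
        out += block                       # vstrw.32 q_in, [dst], #+32
        out += [(x * twiddles[i // 4]) % q for x in block]  # vstrw.32 res, [dst, #-16]
    return out

q = 3329
print(expand_blocks(list(range(8)), [17, 42], q))
```

One (twiddle, twiddle_twisted) pair is consumed per block, matching the `ldrd ..., [twiddle_table], #+8` in the loop.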
twiddle_fix_ptr .req r3 - - consts .req r7 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(10*4 + 8*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 resB, q_in0, twiddle - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - wls loop_cnt, loop_cnt, 2 - .align 2 -1: - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in0, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vqrdmlah.s32 resA, tmp, mod_q - - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vqrdmulh.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vmul.u32 tmp, q_in1, twiddle_twisted - vqrdmlah.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vqrdmulh.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.u32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, 
#+16 - vqrdmlah.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq loop_cnt - - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function -.global twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett -.align 4 -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts: - .byte 3 - .byte 2 - .byte 1 - .byte 0 - -twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett: - push {r4-r11,lr} - vpush {d8-d11} - - dst .req r0 - src .req r1 - twiddle_table .req r2 - twiddle_fix_ptr .req r3 - consts .req r7 - mod_q .req r4 - twiddle .req r5 - twiddle_twisted .req r6 - twiddle_fix .req r7 - twiddle_fix_twisted .req r8 - loop_cnt .req r14 - - q_off_rev .req q0 - q_in0 .req q1 - q_in1 .req q5 - tmp .req q2 - resA .req q3 - resB .req q4 - - adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts - vldrb.u32 q_off_rev, [consts] - .unreq consts - - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - ldr mod_q, [sp, #(9*4 + 2*16)] - ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr] - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vmul.s32 resB, q_in0, twiddle - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - add.w src, src, #+16 - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - mov loop_cnt, #((VECTOR_LENGTH/8)-1) - .align 2 - wls loop_cnt, loop_cnt, 2 -1: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - 
vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in0, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in0, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in1, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in0, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - le loop_cnt, 1b -2: - vstrw.32 resB, [dst, #+16] - vmul.s32 resB, q_in1, twiddle - vstrw.32 resA, [dst], #+32 - vqrdmulh.s32 tmp, q_in1, twiddle_twisted - vmla.s32 resB, tmp, mod_q - vldrw.32 q_in0, [src, q_off_rev, UXTW #2] - vmul.s32 resA, q_in1, twiddle_fix - ldrd twiddle, twiddle_twisted, [twiddle_table], #+8 - vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted - add.w src, src, #+16 - vmla.s32 resA, tmp, mod_q - vstrw.32 resB, [dst, #+16] - vstrw.32 resA, [dst], #+32 - - vpop {d8-d11} - pop {r4-r11,pc} - - .unreq loop_cnt - .unreq mod_q - .unreq twiddle - .unreq twiddle_twisted - .unreq q_off_rev - .unreq q_in0 - .unreq q_in1 - .unreq tmp - .unreq resA - .unreq resB - .unreq dst - .unreq src - .unreq twiddle_table - -.type twisted_cyclic_mul_deg4_u32_mve_simple, %function -.global twisted_cyclic_mul_deg4_u32_mve_simple -.align 4 -twisted_cyclic_mul_deg4_u32_mve_simple: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r3 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd 
mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - wls loop_cnt, loop_cnt, 2 -1: - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vldrw.u32 l_b2, [in_B, #(-32 + 4 )] - vldrw.u32 l_b1, [in_B, #(-32 + 8 )] - vldrw.u32 l_b0, [in_B, #(-32 + 12)] - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi, res_hi, res_lo - vstrw.s32 res_hi, [dst], #+16 - le loop_cnt, 1b -2: - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - - -.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_mve -twisted_cyclic_mul_deg4_u32_add_sub_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q4 //q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi0 .req q6 - res_hi1 .req q1 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 
mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - // From this point onwards, l_b3 and l_b2 are never used - // at the same time. 
Use the same register for them - .unreq l_b3 - l_b3 .req l_b2 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - .align 2 - - - wls loop_cnt, loop_cnt, 2 -1: - - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 
res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - // Add/sub with result from previous iteration - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 // Currently unused - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi1, res_hi1, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - vsub.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi0 - .unreq res_hi1 - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - 
.unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve -twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr: - .byte 3*4 - .byte 2*4 - .byte 1*4 - .byte 0*4 -twisted_cyclic_mul_deg4_u32_add_sub_rev_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - q_rev .req q3 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q4 //q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi0 .req q6 - res_hi1 .req q1 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp .req r5 - adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr - vldrb.u32 q_rev, [tmp] - vadd.u32 q_rev, q_rev, in_A - .unreq tmp - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vldrw.u32 l_a, [q_rev] - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - // From this point onwards, l_b3 and l_b2 are never used - // at the same time. Use the same register for them - .unreq l_b3 - l_b3 .req l_b2 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - .align 2 - - - wls loop_cnt, loop_cnt, 2 -1: - - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - // Add/sub with result from previous iteration - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 // Currently unused - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi1, res_hi1, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - vsub.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi0 - .unreq res_hi1 - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve -.align 4 -twisted_cyclic_mul_deg4_u32_add_sub_split_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - res_old .req q5 // Overlaps with res_lo deliberately - - in_A .req r0 - in_B .req r1 - dst .req r2 - dst_h .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - 
- add dst_h, dst, #(4*VECTOR_LENGTH/2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 16)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vstrw.s32 res_hi, [sp] - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vldrw.s32 res_old, [sp] - tmp .req q1 // == l_b3 (currently unused) - vadd.s32 tmp, 
res_old, res_hi - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst_h], #+16 - .unreq tmp - - wls loop_cnt, loop_cnt, 2 -1: - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vstrw.s32 res_hi, [sp] - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - - // Add/sub with result from previous iteration - vldrw.s32 res_old, [sp] - tmp .req q1 // == l_b3 (currently unused) - vadd.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst], #16 - vsub.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst_h], #16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, 
res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - /* Defer storing of last result */ - .unreq res_old - res_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi, res_hi, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_old, res_hi - vstrw.32 mod_q_vect, [dst], #16 - vsub.s32 mod_q_vect, res_old, res_hi - vstrw.32 mod_q_vect, [dst_h], #16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - .unreq res_old - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - - -.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function -.global twisted_cyclic_mul_deg4_u32_long_mve_v1 -.align 4 -twisted_cyclic_mul_deg4_u32_long_mve_v1: - push {r4-r11,lr} - vpush {d0-d9} - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r6 - res1_hi .req r7 - res2_lo .req r8 - res2_hi .req r9 - res0_lo .req r10 - res0_hi 
.req r11 - - wls loop_cnt, loop_cnt, 2 -1: - - vldrw.u32 l_a, [in_A], #+16 /* (a0, a1, a2, a3) */ - - vldrw.u32 l_b3, [in_B], #+32 /* (b3, b2, b1, b0) */ - vldrw.u32 l_b0, [in_B,#(-32+3*4)] /* (b0, zeta*b3, zeta*b2, zeta*b1) */ - vldrw.u32 l_b1, [in_B,#(-32+2*4)] /* (b1, b0, zeta*b3, zeta*b2) */ - vldrw.u32 l_b2, [in_B,#(-32+1*4)] /* (b2, b1, b0, zeta*b3) */ - - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - - //strd res0_lo, res1_lo, [dst], #8 - //strd res2_lo, res3_lo, [dst], #8 - //strd res0_hi, res1_hi, [dst], #8 - //strd res2_hi, res3_hi, [dst], #8 - - strd res0_lo, res0_hi, [dst], #8 - strd res1_lo, res1_hi, [dst], #8 - strd res2_lo, res2_hi, [dst], #8 - strd res3_lo, res3_hi, [dst], #8 - - le loop_cnt, 1b -2: - - vpop {d0-d9} - pop {r4-r11,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res0_lo - .unreq res0_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res3_lo - .unreq res3_hi - -.type twisted_cyclic_mul_deg4_u32_mve, %function -.global twisted_cyclic_mul_deg4_u32_mve -twisted_cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - /* Preparation -- amortizes when looping */ - - mod_q .req r12 - mod_q_inv .req r14 - mod_q_vect .req q4 /* vmulh requires vector operand */ - - ldrd mod_q, mod_q_inv, [r2] - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - tw1 .req r10 - tw2 .req r11 - tw3 .req r12 - - l_a .req q0 - l_b .req q1 - - res_lo .req q2 - res_hi .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* Input A */ - vldrw.u32 l_b, [in_B], #+16 - vmov tw1, tw3, l_b[3], l_b[1] - vldrw.u32 l_a, [in_A], #+16 - - /* Assume b-input is already reversed */ - - /* Extract second half of twisted b into GPRs */ - - vmov.s32 tw2, l_b[2] - - res3_lo .req r4 - res3_hi .req r5 - res2_lo .req r6 - res2_hi .req r7 - 
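The load-offset comments in `twisted_cyclic_mul_deg4_u32_long_mve_v1` above spell out the `in_B` layout: next to `(b3, b2, b1, b0)` the buffer holds zeta-premultiplied copies, so that every output coefficient of the product mod `x^4 - zeta` becomes a plain length-4 dot product (one `vmlaldav` each). A reference model in Python; the modulus and zeta values in the usage line are placeholders:

```python
# Reference for one deg-4 twisted cyclic product c = a*b mod (x^4 - zeta).
# With in_B laid out as [b3, b2, b1, b0, z*b3, z*b2, z*b1], the four
# vmlaldav dot products of a against the overlapping 4-word windows
# compute exactly these coefficients.

def twisted_cyclic_mul_deg4(a, b, zeta, q):
    c0 = a[0]*b[0] + zeta*(a[1]*b[3] + a[2]*b[2] + a[3]*b[1])   # window (b0, z*b3, z*b2, z*b1)
    c1 = a[0]*b[1] + a[1]*b[0] + zeta*(a[2]*b[3] + a[3]*b[2])   # window (b1, b0, z*b3, z*b2)
    c2 = a[0]*b[2] + a[1]*b[1] + a[2]*b[0] + zeta*a[3]*b[3]     # window (b2, b1, b0, z*b3)
    c3 = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0]          # window (b3, b2, b1, b0)
    return [c % q for c in (c0, c1, c2, c3)]

print(twisted_cyclic_mul_deg4([1, 2, 3, 4], [5, 6, 7, 8], 3, 17))
```

The assembly defers the modular reduction: it stores the raw 64-bit dot products (`strd resX_lo, resX_hi`), leaving the reduction to a later pass.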
- /* TODO: - * For twisted multiplication, add Montgomery multiplication here. - * Adds 3 instructions. */ - - /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */ - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b - - /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */ - vshlc l_b, tw3, #32 - .unreq tw3 - - /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */ - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b - - /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */ - vshlc l_b, tw2, #32 - .unreq tw2 - - res1_lo .req r8 - res1_hi .req r9 - - /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */ - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b - - /* Move low and high results into result vector */ - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - - res0_lo .req r8 - res0_hi .req r9 - - /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */ - vshlc l_b, tw1, #32 - .unreq tw1 - - /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */ - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b - - /* PRELOAD FOR NEXT ITERATION? 
*/ - - /* Move low results into result vector */ - vmov res_lo[2], res_lo[0], res2_lo, res0_lo - - /* Montgomery 1 */ - vmul.u32 res_lo, res_lo, mod_q_inv - /* Move high results into result vector */ - vmov res_hi[2], res_hi[0], res2_hi, res0_hi - /* Montgomery 2 */ - vmulh.s32 res_lo, res_lo, mod_q_vect - /* Montgomery 3 */ - vsub.s32 res_hi, res_hi, res_lo - - /* Store results */ - vstrw.s32 res_hi, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq mod_q_inv - .unreq mod_q_vect - -.type cyclic_mul_deg4_u32_mve, %function -.global cyclic_mul_deg4_u32_mve -cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #0x0F0F - vmsr p0, r10 - - mod_q .req r10 - mod_q_inv .req r9 - - ldr mod_q, [r2,#0] /* Modulus */ - ldr mod_q_inv, [r2,#4] - - l_a0 .req q1 - l_a1 .req q2 - l_b0 .req q3 - l_b1 .req q4 - - r_a0 .req q0 - r_a1 .req q1 - r_b0 .req q2 - r_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */ - vld20.u32 {l_a0,l_a1}, [in_A] - vld21.u32 {l_a0,l_a1}, [in_A]! - - /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */ - vld20.u32 {l_b0,l_b1}, [in_B] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Compute product in two vectors q4, q5 */ - - /* Can use q6, q7 for temporary data; need at least - * one temporary vector per subproduct. */ - - /* - * Ballpark estimates: - * - 4 = 2x2 VLD2x to load current polynomials - * - 2 = 2x VST2x to store result - * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form - * - 16 = 4x4 Vector Multiplications, 4 per subproduct - * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction - * In fact, use VSUB for first time each target vector is - * used, and VHSUB for the second time. - * - 2 = 2x VCADD for interpolation of result -- - * Note that we don't need to do this in every - * subproduct. - * - * Total: 32 instructions - * - * Pretty promising... 
if it pipelines well and we have enough - * vector registers. - */ - - /* Transform input into evaluated form */ - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 - - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 - - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 - - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 - - /* Subproduct 1: a0*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1 - * - Temporary allocations: 1 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - /* - * OPTIMIZATION: - * - * - We have two free vectors at this point -- - * could use this for a late store of the results - * of a previous iteration, residing in {q6, q7}. - * - * - Perform a late evaluation of r_a0, r_b1 here. - * - */ - - dst1 .req q5 - tmp .req q4 - - vmul.u32 tmp, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - vqdmulh.s32 dst1, r_a0, r_b1 /* Initialize dst1 with high part */ - vmul.u32 tmp, tmp, r_b1 /* Twisted low product */ - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq tmp - - /* Subproduct 2: a1*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1 - * - Temporary allocations: 2 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - tmp0 .req q6 - tmp1 .req q4 - - vqdmulh.s32 tmp1, r_a1, r_b0 /* Write high-product into temporary */ - vmul.u32 tmp0, q1, mod_q_inv /* Twist one factor using temporary tmp */ - vmul.u32 tmp0, tmp0, r_b0 /* Twisted low product */ - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst1, tmp1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. 
- * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 - .unreq tmp1 - - /* Finalize dst1 */ - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 - - /* Subproduct 3: a1*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1_final - * - Temporary allocations: 0 - * - Final allocations: a0, b0, dst1_final, dst0 - */ - - dst0 .req q4 - - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */ - vmul.u32 r_a1, r_a1, r_b1 /* Twisted low product */ - - .unreq r_b1 - - vqdmulh.s32 r_a1, r_a1, mod_q /* High product */ - vsub.s32 dst0, r_a1, dst0 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq r_a1 - - vpst - vnegt.s32 dst0, dst0 - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, b0, dst1_final, dst0 - * - Temporary allocations: 1 - * - Final allocations: dst1_final, dst0 - */ - - tmp .req q5 - - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - vmul.u32 r_a0, r_a0, r_b0 /* Twisted low product */ - - .unreq r_b0 - - vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */ - vqdmlah.s32 dst0, r_a0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst0, tmp, dst0 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp - - /* Finalize dst0 */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
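Each subproduct above follows the same signed-Montgomery pattern: a doubling high product (`vqdmulh`), a "twisted" low product (`vmul` with the precomputed inverse), a high correction, and a halving subtraction (`vhsub`). A minimal scalar C sketch of that pattern, assuming `q_inv = q^-1 mod 2^32`, arithmetic right shift on signed types, and ignoring `vqdmulh` saturation at `INT32_MIN`; helper names are hypothetical, not part of this patch:

```c
#include <stdint.h>

/* result = a*b*2^(-32) mod q, in (-q, q); mirrors one subproduct:
 * vqdmulh / vmul / vqdmulh / vhsub. Assumes q odd, 0 < q < 2^31,
 * q_inv = q^-1 mod 2^32. */
static int32_t montmul_u32(int32_t a, int32_t b, int32_t q, uint32_t q_inv)
{
    int32_t hi   = (int32_t)(((int64_t)a * b) >> 31);             /* vqdmulh.s32 */
    int32_t lo   = (int32_t)((uint32_t)a * q_inv * (uint32_t)b);  /* twisted low product */
    int32_t corr = (int32_t)(((int64_t)lo * q) >> 31);            /* vqdmulh.s32 */
    /* lo*q == a*b mod 2^32, so the floor difference is exact and even */
    return (int32_t)(((int64_t)hi - corr) >> 1);                  /* vhsub.s32 */
}

/* q_inv = q^-1 mod 2^32 via Newton iteration: each step doubles the
 * number of valid low bits (odd q is its own inverse mod 2^3). */
static uint32_t qinv_u32(uint32_t q)
{
    uint32_t x = q;
    for (int i = 0; i < 4; i++)
        x *= 2 - q * x;
    return x;
}
```

Correctness follows from `lo*q == a*b (mod 2^32)`, which makes `a*b - lo*q` an exact multiple of 2^32.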
- .unreq dst0_final - .unreq dst1_final - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq r_a0 - -.type cyclic_mul_deg4_u32_alt_mve, %function -.global cyclic_mul_deg4_u32_alt_mve -cyclic_mul_deg4_u32_alt_mve: - push {r4-r12,lr} - vpush {d0-d15} - - l_a0 .req q0 - l_a1 .req q1 - l_b0 .req q2 - l_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - cnt .req r4 - - dst0_last_final .req q6 - dst1_last_final .req q7 - - mod_q .req r10 - mod_q_inv .req r9 - pred_helper .req r8 - - vld20.u32 {l_a0,l_a1}, [in_A] - mov pred_helper, #0x0F0F - vld21.u32 {l_a0,l_a1}, [in_A]! - vmsr p0, pred_helper - - vld20.u32 {l_b0,l_b1}, [in_B] - ldr mod_q_inv, [r2,#4] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), - * dst0 (q3) - */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - ldr mod_q, [r2,#0] - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Montgomery twist */ - mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of - * loop counter */ - vqdmulh.s32 tmp1, tmp1, mod_q /* Montgomery high product fix */ - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initial high product */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initial high product */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Fix high product */ - /* Defer halving for later */ - /* Store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Montgomery low product twist */ - - vpst 
- vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* Montgomery high product fix */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Montgomery twist */ - - /* Subproduct 3: a0*b1 */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp // q1 - - /* - * Vector register allocation state: - * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * - Temporary allocations: 1 (q5) - * - Final allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - vmul.u32 tmp0, tmp0, r_b1 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q5 - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. 
*/ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - - /* LOAD r_a1 into q5 here..., - * freeing up q1 as a temporary */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */ - vmul.u32 tmp0, r_a0, r_b0 /* Twisted low product */ - /* Can overwrite rb0 now */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - - vmul.u32 tmp0, tmp0, mod_q_inv - - l_b0 .req q2 - l_b1 .req q3 - /* Preload for next iteration */ - vld20.u32 {l_b0,l_b1}, [in_B] - - vqdmlah.s32 dst0, tmp0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q1 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp // q4 - .unreq dst0_old - - vld21.u32 {l_b0,l_b1}, [in_B]! 
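The recurring comment about `dst1` holding "-pre + high_correct" can be summarized as follows. Writing `hi_i` and `corr_i` for the doubling high products of subproduct `i` and `mont_i = (hi_i - corr_i)/2` for its Montgomery result, the `vsub`/`vqdmlah`/`vhsub` sequence computes (a sketch of the deferred-halving sign trick, not part of the patch):

```latex
\begin{align*}
d &\gets \mathrm{corr}_1 - \mathrm{hi}_1
  && \text{(\texttt{vsub}: negative of the first subproduct)} \\
d &\gets d + \mathrm{corr}_2
  && \text{(\texttt{vqdmlah})} \\
d &\gets \tfrac{1}{2}\bigl(\mathrm{hi}_2 - d\bigr)
  = \tfrac{\mathrm{hi}_1 - \mathrm{corr}_1}{2}
  + \tfrac{\mathrm{hi}_2 - \mathrm{corr}_2}{2}
  = \mathrm{mont}_1 + \mathrm{mont}_2
  && \text{(\texttt{vhsub})}
\end{align*}
```

so one halving instruction finalizes both subproducts at once.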
- - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - nop - wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end - -cyclic_mul_deg4_u32_alt_mve_loop_start: - - nop - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3) - */ - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Twisted low product */ - - vst20.s32 {dst0_last_final,dst1_last_final}, [dst] - - vqdmulh.s32 tmp1, tmp1, mod_q /* High product */ - - vst21.s32 {dst0_last_final,dst1_last_final}, [dst]! - .unreq dst0_last_final // q6 - .unreq dst1_last_final // q7 - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initialize dst1 with high part */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Twisted low product */ - - vpst - vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of 
result */ - .unreq tmp // q1 - - /* Subproduct 3: a0*b1 - * - * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) - * T: 1 (q5) - * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1) - */ - - tmp1 .req q0 - vmul.u32 tmp1, tmp0, r_b1 - - - vqdmlah.s32 dst1, tmp1, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp1 // q0 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vmul.u32 tmp, tmp0, r_b0 /* Twisted low product */ - .unreq tmp0 - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - vqdmulh.s32 tmp0, r_a0, r_b0 /* Write high-product into temporary */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - vqdmlah.s32 dst0, tmp, mod_q /* High product, accumulate onto tmp, - * 
which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp // q4 - - /* Preload for next iteration */ - l_b0 .req q2 - l_b1 .req q3 - vld20.u32 {l_b0,l_b1}, [in_B] - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp0, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 // q1 - .unreq dst0_old - - /* Preload for next iteration */ - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - le lr, cyclic_mul_deg4_u32_alt_mve_loop_start - -cyclic_mul_deg4_u32_alt_mve_loop_end: - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
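For checking these routines, a scalar reference for the degree-4 cyclic product could look like the following. This is a hypothetical helper (schoolbook multiplication mod `x^4 - 1`, without the Montgomery scaling), not part of this patch:

```c
#include <stdint.h>

/* Reference: c = a*b mod (x^4 - 1), coefficients reduced to [0, q). */
void cyclic_mul_deg4_ref(const int64_t a[4], const int64_t b[4],
                         int64_t c[4], int64_t q)
{
    for (int k = 0; k < 4; k++) {
        int64_t s = 0;
        for (int i = 0; i < 4; i++)
            s += a[i] * b[(k - i) & 3];   /* index wraps since x^4 = 1 */
        c[k] = ((s % q) + q) % q;
    }
}
```

A vectorized result (after undoing the `2^-32` Montgomery factor) should agree with this coefficient by coefficient.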
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a0 - .unreq l_b0 - .unreq l_b1 - .unreq r_a1 - - .unreq cnt - -.type montgomery_pt_u32_odd_mve, %function -.global montgomery_pt_u32_odd_mve -montgomery_pt_u32_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 4) - - ldr mod_q, [in_B], #+4 /* Modulus */ - ldr mod_q_inv, [in_B], #+4 /* Inverse */ - - wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end - -montgomery_pt_u32_odd_mve_loop_start: - - vldrw.s32 l_a, [in_A], #+16 - vmul.u32 l_at, l_a, mod_q_inv - vldrw.s32 l_b, [in_B], #+16 - vqrdmulh.s32 tmp0, l_a, l_b - vmul.u32 tmp1, l_at, l_b - vqrdmlah.s32 tmp0, tmp1, mod_q - vstrw.s32 tmp0, [dst], #+16 - - le lr, montgomery_pt_u32_odd_mve_loop_start - -montgomery_pt_u32_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_u32_mve, %function -.global montgomery_pt_u32_mve -.align 4 -montgomery_pt_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, 
montgomery_pt_u32_mve_loop_end - -montgomery_pt_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_u32_mve_loop_start - -montgomery_pt_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. 
*/ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_acc_u32_mve, %function -.global montgomery_pt_acc_u32_mve -.align 4 -montgomery_pt_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - old .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end - -montgomery_pt_acc_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store-accumulate from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_acc_u32_mve_loop_start - -montgomery_pt_acc_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - 
- /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. */ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_round_acc_u32_mve, %function -.global montgomery_pt_round_acc_u32_mve -.align 4 -montgomery_pt_round_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - oldA .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - oldB .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 8) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b 
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_b, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_b, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov 
cnt, #((VECTOR_LENGTH / 4) - 2)
-
- /*
- * First iteration
- */
-
- /* Load a-input */
- vldrw.s32 l_a, [in_A0], #+16
- /* Twist a */
- vmul.u32 l_at, l_a, mod_q_inv
- /* Load b-input */
- vldrw.s32 l_b, [in_B], #+16
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A1], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldA, [dst0]
- /* Correction term */
- vqrdmlah.s32 oldA, tmp1, mod_q
- /* Twist a (already loaded)*/
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction */
- vadd.s32 oldA, tmp0, oldA
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A2], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldB, [dst1]
- /* Correction term */
- vqrdmlah.s32 oldB, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldA, [dst0], #+16
- /* Twist a (already loaded)*/
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction from last iteration */
- vadd.s32 oldB, tmp0, oldB
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A3], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldA, [dst2]
- /* Correction term */
- vqrdmlah.s32 oldA, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldB, [dst1], #+16
- /* Twist a */
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction */
- vadd.s32 oldA, tmp0, oldA
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A0], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldB, [dst3]
- /* Correction term */
- vqrdmlah.s32 oldB, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldA, [dst2], #+16
-
- wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end
-
-montgomery_pt_round_acc_u32_x4_mve_loop_start:
-
- /* Twist a (already loaded)*/
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction from last iteration */
- vadd.s32 oldB, tmp0, oldB
- /* Load b-input */
- vldrw.s32 l_b, [in_B], #+16
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A1], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldA, [dst0]
- /* Correction term */
- vqrdmlah.s32 oldA, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldB, [dst3], #+16
- /* Twist a */
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction */
- vadd.s32 oldA, tmp0, oldA
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A2], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldB, [dst1]
- /* Correction term */
- vqrdmlah.s32 oldB, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldA, [dst0], #+16
- /* Twist a (already loaded)*/
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction from last iteration */
- vadd.s32 oldB, tmp0, oldB
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A3], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldA, [dst2]
- /* Correction term */
- vqrdmlah.s32 oldA, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldB, [dst1], #+16
- /* Twist a */
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction */
- vadd.s32 oldA, tmp0, oldA
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A0], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldB, [dst3]
- /* Correction term */
- vqrdmlah.s32 oldB, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldA, [dst2], #+16
-
- le cnt, montgomery_pt_round_acc_u32_x4_mve_loop_start
-
-montgomery_pt_round_acc_u32_x4_mve_loop_end:
-
- /* Twist a (already loaded)*/
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction from last iteration */
- vadd.s32 oldB, tmp0, oldB
- /* Load b-input */
- vldrw.s32 l_b, [in_B], #+16
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A1], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldA, [dst0]
- /* Correction term */
- vqrdmlah.s32 oldA, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldB, [dst3], #+16
- /* Twist a */
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction */
- vadd.s32 oldA, tmp0, oldA
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A2], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldB, [dst1]
- /* Correction term */
- vqrdmlah.s32 oldB, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldA, [dst0], #+16
- /* Twist a (already loaded)*/
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction from last iteration */
- vadd.s32 oldB, tmp0, oldB
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A3], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldA, [dst2]
- /* Correction term */
- vqrdmlah.s32 oldA, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldB, [dst1], #+16
- /* Twist a */
- vmul.u32 l_at, l_a, mod_q_inv
- /* Correction */
- vadd.s32 oldA, tmp0, oldA
- /* High multiply */
- vqrdmulh.s32 tmp0, l_a, l_b
- /* Preload a for next iteration */
- vldrw.s32 l_a, [in_A0], #+16
- /* Twisted low multiply */
- vmul.u32 tmp1, l_at, l_b
- /* Load old value to accumulate onto */
- vldrw.s32 oldB, [dst3]
- /* Correction term */
- vqrdmlah.s32 oldB, tmp1, mod_q
- /* Store old result */
- vstrw.s32 oldA, [dst2], #+16
- /* Correction from last iteration */
- vadd.s32 oldB, tmp0, oldB
- /* Store old result */
- vstrw.s32 oldB, [dst3], #+16
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
- .unreq l_a
- .unreq l_b
- .unreq oldA
- .unreq in_A0
- .unreq in_A1
- .unreq in_A2
- .unreq in_A3
- .unreq in_B
- .unreq dst0
- .unreq dst1
- .unreq dst2
- .unreq dst3
- .unreq tmp0
- .unreq tmp1
- .unreq oldB
- .unreq l_at
- .unreq cnt
-
-
-.type montgomery_pt_u16_odd_mve, %function
-.global montgomery_pt_u16_odd_mve
-montgomery_pt_u16_odd_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r10
- mod_q_inv .req r9
-
- l_a .req q1
- l_b .req q2
- l_d .req q3
-
- in_A .req r0
- in_B .req r1
- dst .req r2
-
- tmp0 .req q4
- tmp1 .req q5
-
- l_at .req q6
-
- cnt .req r8
- mov cnt, #(VECTOR_LENGTH / 8)
-
- ldrh mod_q, [in_B], #+2 /* Modulus */
- ldrh mod_q_inv, [in_B], #+2 /* Inverse */
-
- wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end
-
-montgomery_pt_u16_odd_mve_loop_start:
-
- vldrh.s16 l_a, [in_A], #+16
- vmul.u16 l_at, l_a, mod_q_inv
- vldrh.s16 l_b, [in_B], #+16
- vqrdmulh.s16 tmp0, l_a, l_b
- vmul.u16 tmp1, l_at, l_b
- vqrdmlah.s16 tmp0, tmp1, mod_q
- vstrh.s16 tmp0, [dst], #+16
-
- le lr, montgomery_pt_u16_odd_mve_loop_start
-
-montgomery_pt_u16_odd_mve_loop_end:
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
-#if defined(MODULUS_Q16)
-
-.type montgomery_u16_core_mve, %function
-.global montgomery_u16_core_mve
-montgomery_u16_core_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mov r10, #(-MODULUS_Q16) /* Modulus */
-
- /* Vector of uint16 values to be multiplied */
- vldrh.s16 q0, [r0]
- /* Half of the even scalar to multiply with */
- ldrh r4, [r1,#0]
- /* Precomputed product of scalar and Montgomery constant */
- ldrh r5, [r1,#2]
-
- /* High product */
- vqdmulh.s16 q1, q0, r4
- /* Adjusted low product */
- vmul.u16 q0, q0, r5
-
- /* Double-Multiply with modulus */
- vqdmulh.s16 q0, q0, r10
- vsub.s16 q1, q1, q0
-
- /* Store result */
- vstrh.s16 q1, [r2]
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
-.type montgomery_u16_round_mve, %function
-.global montgomery_u16_round_mve
-montgomery_u16_round_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mov r10, #(MODULUS_Q16)
- movw r9, #:lower16:MODULUS_Q16_INV_U16
- mov r10, #(-3329) /* Modulus */
- mov r8, #8 /* Iterations */
-
- /* Half of the even scalar to multiply with */
- ldrh r4, [r1,#0]
- /* Precomputed product of scalar and Montgomery constant */
- ldrh r5, [r1,#2]
-
- wls lr, r8, montgomery_u16_round_mve_loop_end
-montgomery_u16_round_mve_loop_start:
-
- /* Vector of uint16 values to be multiplied */
- vldrh.s16 q0, [r0], #16
-
- /* High product */
- vqrdmulh.s16 q1, q0, r4
- /* Adjusted low product */
- vmul.u16 q0, q0, r5
-
- /* Double-Multiply with modulus */
- vqrdmlah.s16 q1, q0, r10
-
- /* Store result */
- vstrh.s16 q1, [r2], #16
-
- le lr, montgomery_u16_round_mve_loop_start
-montgomery_u16_round_mve_loop_end:
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
-
-.type cyclic_mul_u16_core_mve, %function
-.global cyclic_mul_u16_core_mve
-cyclic_mul_u16_core_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mov r10, #(MODULUS_Q16)
- movw r9, #:lower16:MODULUS_Q16_INV_U16
-
- /* Load polynomials to multiply
- *
- * Lanes come in pairs representing real and imaginary parts.
- */
- vldrh.s16 q0, [r0]
- vldrh.s16 q1, [r1]
-
- /* Step 1:
- *
- * Apply evaluation at -1, +1:
- * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1)
- *
- * Concretely:
- * (a,b) |-> (a-b, a+b)
- *
- * This can be implemented as a rotate-and-add
- operation, treating (a,b) as a complex number
- a+bi, and noticing that a rotation by 90
- gives i(a+bi) = -b + ai, so
- a+bi + i(a+bi) = (a-b) + (a+b)i
- *
- * This rotate-90-and-add can is a single
- instruction in MVE.
- */
- vcadd.i16 q0, q0, q0, #90
- vmul.u16 q2, q0, r9
- vcadd.i16 q1, q1, q1, #90
-
- /* Montgomery multiplications
- *
- * 1x mul-high
- * 1x mul-low
- * 1x mul-high
- * 1x subtract
- *
- * Needs 1x free temporary vector register
- */
- vqdmulh.s16 q0, q0, q1
- vmul.u16 q1, q2, q1
- /*vmul.u16 q0, q0, r9*/
- vqdmulh.s16 q1, q1, r10
- /* Now we've actually computed twice the desired result,
- * but we can compensate by using vhsub */
- vhsub.s16 q0, q0, q1
-
- /*
- * Finally, interpolation step:
- * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y)
- *
- * This can be done as a single VCHADD, with
- * rotate by 270: -i(a+bi) = b - ai
- *
- * We can't naively use vhcadd here because the
- * multiplication by 1/2 is modulo q.
- */
- vcadd.s16 q0, q0, q0, #270
-
- vstrh.s16 q0, [r2]
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
-.type cyclic_mul_u16_mve, %function
-.global cyclic_mul_u16_mve
-cyclic_mul_u16_mve:
- push {r4-r12, lr}
- vpush {d0-d15}
-
- mov r10, #(MODULUS_Q16)
- movw r9, #:lower16:MODULUS_Q16_INV_U16
-
- /* Number of inner iterations */
- mov r4, #(VECTOR_LENGTH/16 - 1)
-
- vldrh.s16 q0, [r0], #16
- vcadd.i16 q0, q0, q0, #90
- vldrh.s16 q1, [r1], #16
- vmul.u16 q2, q0, r9
- vcadd.i16 q1, q1, q1, #90
- vqdmulh.s16 q0, q0, q1
- vstrh.s16 q4, [r2]
- vmul.u16 q1, q2, q1
- vldrh.s16 q3, [r0], #16
- vqdmulh.s16 q1, q1, r10
- vcadd.i16 q3, q3, q3, #90
- vldrh.s16 q4, [r1], #16
- vhsub.s16 q0, q0, q1
- vmul.u16 q5, q3, r9
- vcadd.s16 q1, q0, q0, #270
- vstrh.s16 q1, [r2], #16
-
- wls lr, r4, cyclic_mul_u16_loop_end
-cyclic_mul_u16_loop_start:
- vcadd.i16 q4, q4, q4, #90
- vqdmulh.s16 q3, q3, q4
- vldrh.s16 q0, [r0], #16
- vmul.u16 q4, q5, q4
- vcadd.i16 q0, q0, q0, #90
- vqdmulh.s16 q4, q4, r10
- vldrh.s16 q1, [r1], #16
- vhsub.s16 q3, q3, q4
- vmul.u16 q2, q0, r9
- vcadd.s16 q4, q3, q3, #270
- vstrh.s16 q4, [r2], #16
-
- vcadd.i16 q1, q1, q1, #90
- vqdmulh.s16 q0, q0, q1
- vldrh.s16 q3, [r0], #16
- vmul.u16 q1, q2, q1
- vcadd.i16 q3, q3, q3, #90
- vqdmulh.s16 q1, q1, r10
- vldrh.s16 q4, [r1], #16
- vhsub.s16 q0, q0, q1
- vmul.u16 q5, q3, r9
- vcadd.s16 q1, q0, q0, #270
- vstrh.s16 q1, [r2], #16
- le lr, cyclic_mul_u16_loop_start
-cyclic_mul_u16_loop_end:
-
- vcadd.i16 q4, q4, q4, #90
- vqdmulh.s16 q3, q3, q4
- vldrh.s16 q0, [r0], #16
- vmul.u16 q4, q5, q4
- vcadd.i16 q0, q0, q0, #90
- vqdmulh.s16 q4, q4, r10
- vldrh.s16 q1, [r1], #16
- vhsub.s16 q3, q3, q4
- vmul.u16 q2, q0, r9
- vcadd.s16 q4, q3, q3, #270
- vstrh.s16 q4, [r2], #16
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
-
-.type cyclic_mul_u16_multi_naive_mve, %function
-.global cyclic_mul_u16_multi_naive_mve
-cyclic_mul_u16_multi_naive_mve:
- push {r4-r12, lr}
- vpush {d0-d15}
-
- mov r10, #(MODULUS_Q16)
- movw r9, #:lower16:MODULUS_Q16_INV_U16
-
- vldrh.s16 q0, [r0], #16
- vldrh.s16 q1, [r1], #16
- vcadd.i16 q2, q0, q0, #90
- vmul.u16 q3, q2, r9
- vcadd.i16 q4, q1, q1, #90
- vqdmulh.s16 q0, q2, q4
- vmul.u16 q1, q3, q4
- vqdmulh.s16 q1, q1, r10
- vhsub.s16 q0, q0, q1
- vcadd.s16 q1, q0, q0, #270
- vstrh.s16 q1, [r2], #16
-
- vldrh.s16 q0, [r0], #16
- vldrh.s16 q1, [r1], #16
- vcadd.i16 q2, q0, q0, #90
- vmul.u16 q3, q2, r9
- vcadd.i16 q4, q1, q1, #90
- vqdmulh.s16 q0, q2, q4
- vmul.u16 q1, q3, q4
- vqdmulh.s16 q1, q1, r10
- vhsub.s16 q0, q0, q1
- vcadd.s16 q1, q0, q0, #270
- vstrh.s16 q1, [r2], #16
-
- vldrh.s16 q0, [r0], #16
- vldrh.s16 q1, [r1], #16
- vcadd.i16 q2, q0, q0, #90
- vmul.u16 q3, q2, r9
- vcadd.i16 q4, q1, q1, #90
- vqdmulh.s16 q0, q2, q4
- vmul.u16 q1, q3, q4
- vqdmulh.s16 q1, q1, r10
- vhsub.s16 q0, q0, q1
- vcadd.s16 q1, q0, q0, #270
- vstrh.s16 q1, [r2], #16
-
- vpop {d0-d15}
- pop {r4-r12, lr}
- bx lr
-
-#endif /* MODULUS_Q16 */
-
-#if defined(MODULUS_Q32)
-
-.type cyclic_mul_u32_mve, %function
-.global cyclic_mul_u32_mve
-cyclic_mul_u32_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- movw r10, #:lower16:MODULUS_Q32
- movt r10, #:upper16:MODULUS_Q32
-
- ldr r9, [r2]
- mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */
- wls lr, r3, cyclic_mul_u32_loop_end
-cyclic_mul_u32_loop_start:
- vldrw.s32 q1, [r0], #16
- vcadd.i32 q0, q1, q1, #90
- vldrw.s32 q2, [r1], #16
- vcadd.i32 q1, q2, q2, #90
- vqdmulh.s32 q2, q0, q1
- vmul.u32 q0, q0, r9
- vmul.u32 q1, q0, q1
- vqdmulh.s32 q1, q1, r10
- vhsub.s32 q2, q2, q1
- vcadd.s32 q1, q2, q2, #270
- vstrw.s32 q1, [r2], #16
- le lr, cyclic_mul_u32_loop_start
-cyclic_mul_u32_loop_end:
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
-#endif /* MODULUS_Q32 */
diff --git a/tests/ntt-768/ntt-768.mk b/tests/ntt-768/ntt-768.mk
index 5e934f4..122c6ad 100644
--- a/tests/ntt-768/ntt-768.mk
+++ b/tests/ntt-768/ntt-768.mk
@@ -14,7 +14,7 @@ NTT_768_SOURCES += main.c
 # Assembly sources required for this test
 NTT_768_ASM_DIR = ../../asm/auto/ntt_768
-NTT_768_ASMS += montgomery.s
+NTT_768_ASMS += ../../asm/manual/montgomery/montgomery.s
 NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_bitrev.s
 NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_double.s
 NTT_768_ASMS += $(NTT_768_ASM_DIR)/ntt_768_u32_33556993_299353_incomplete_good_bitrev.s
diff --git a/tests/poly/montgomery.s b/tests/poly/montgomery.s
deleted file mode 100644
index 5ba02ef..0000000
--- a/tests/poly/montgomery.s
+++ /dev/null
@@ -1,3640 +0,0 @@
-/*
- * Copyright (c) 2021 Arm Limited
- * SPDX-License-Identifier: MIT
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- */
-
-#include "montgomery_const.h"
-
- .syntax unified
-
-.type twisted_cyclic_mul_acc_deg4_u32_mve_alt, %function
-.global twisted_cyclic_mul_acc_deg4_u32_mve_alt
-.align 4
-twisted_cyclic_mul_acc_deg4_u32_mve_alt:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7 /* vmulh requires vector operand */
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
-
- dst_vect .req q5 // Overlapping with res_lo
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- wls loop_cnt, loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end
-twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start:
-
- vldrw.s32 dst_vect, [dst]
- vadd.s32 res_hi, res_hi, dst_vect
- vstrw.s32 res_hi, [dst], #+16
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
- le loop_cnt, twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_start
-
-twisted_cyclic_mul_acc_deg4_u32_mve_alt_loop_end:
-
- /* Defer storing of last result */
- res_hi_old .req q6
- .unreq res_hi
- .unreq l_b1
- res_hi .req q3
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.s32 l_b0, [dst]
- vadd.s32 res_hi_old, res_hi_old, l_b0
- vstrw.s32 res_hi_old, [dst], #+16
- vsub.s32 res_hi, res_hi, res_lo
- vldrw.s32 l_b0, [dst]
- vadd.s32 res_hi, res_hi, l_b0
- vstrw.s32 res_hi, [dst], #+16
-
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b0
-
- .unreq res_lo
- .unreq res_hi
-
- .unreq dst_vect
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq loop_cnt
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
-.type twisted_cyclic_mul_deg4_u32_mve_alt, %function
-.global twisted_cyclic_mul_deg4_u32_mve_alt
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_alt:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r3
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B, #(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- wls loop_cnt, loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_end
-twisted_cyclic_mul_deg4_u32_mve_alt_loop_start:
-
- vstrw.s32 res_hi, [dst], #+16
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi, res_hi, res_lo
-
- le loop_cnt, twisted_cyclic_mul_deg4_u32_mve_alt_loop_start
-
-twisted_cyclic_mul_deg4_u32_mve_alt_loop_end:
-
- /* Defer storing of last result */
- res_hi_old .req q6
- .unreq res_hi
- .unreq l_b1
- res_hi .req q3
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vstrw.s32 res_hi_old, [dst], #+16
- vsub.s32 res_hi, res_hi, res_lo
- vstrw.s32 res_hi, [dst], #+16
- vpop {d0-d15}
- pop {r4-r12,lr}
-
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_mve_expand, %function
-.global twisted_cyclic_mul_deg4_u32_mve_expand
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_expand_consts:
- .byte 3
- .byte 2
- .byte 1
- .byte 0
-
-twisted_cyclic_mul_deg4_u32_mve_expand:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- loop_cnt .req r14
-
- twiddle .req r4
- twiddle_twisted .req r5
-
- q_off_rev .req q0
- q_in .req q1
- tmp .req q3
- res .req q2
-
- dst .req r0
- src .req r1
- twiddle_table .req r2
- mod_q .req r3
-
- consts .req r4
- adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_consts
- vldrb.u32 q_off_rev, [consts]
- .unreq consts
-
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vldrw.32 q_in, [src, q_off_rev, UXTW #2]
-
- mov loop_cnt, #(VECTOR_LENGTH/4-1)
- wls loop_cnt, loop_cnt, 2
- .align 2
-1:
-
- vqrdmulh.s32 res, q_in, twiddle
- vstrw.32 q_in, [dst], #+32
- vmul.u32 tmp, q_in, twiddle_twisted
- add.w src, src, #+16
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vldrw.32 q_in, [src, q_off_rev, UXTW #2]
- vqrdmlah.s32 res, tmp, mod_q
- vstrw.32 res, [dst, #-16]
-
- le loop_cnt, 1b
-2:
-
- vqrdmulh.s32 res, q_in, twiddle
- vstrw.32 q_in, [dst], #+32
- vmul.u32 tmp, q_in, twiddle_twisted
- vqrdmlah.s32 res, tmp, mod_q
- vstrw.32 res, [dst, #-16]
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq loop_cnt
-
- .unreq mod_q
- .unreq twiddle
- .unreq twiddle_twisted
-
- .unreq q_off_rev
- .unreq q_in
- .unreq tmp
- .unreq res
-
- .unreq dst
- .unreq src
- .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_expand_double, %function
-.global twisted_cyclic_mul_deg4_u32_mve_expand_double
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_expand_double_consts:
- .byte 3
- .byte 2
- .byte 1
- .byte 0
-
-twisted_cyclic_mul_deg4_u32_mve_expand_double:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- loop_cnt .req r14
-
- mod_q .req r4
- twiddle .req r5
- twiddle_twisted .req r6
- twiddle_fix .req r7
- twiddle_fix_twisted .req r8
-
- q_off_rev .req q0
- q_in0 .req q1
- q_in1 .req q6
- tmp .req q2
- resA .req q4
- resB .req q5
-
- dst .req r0
- src .req r1
- twiddle_table .req r2
- twiddle_fix_ptr .req r3
-
- consts .req r7
-
- adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_consts
- vldrb.u32 q_off_rev, [consts]
- .unreq consts
-
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- ldr mod_q, [sp, #(10*4 + 8*16)]
- ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr]
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 resB, q_in0, twiddle
- vmul.u32 tmp, q_in0, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- add.w src, src, #+16
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
-
- mov loop_cnt, #((VECTOR_LENGTH/8)-1)
- wls loop_cnt, loop_cnt, 2
- .align 2
-1:
-
- vstrw.32 resB, [dst, #+16]
- vqrdmulh.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vmul.u32 tmp, q_in1, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
-
- vstrw.32 resB, [dst, #+16]
- vqrdmulh.s32 resB, q_in0, twiddle
- vstrw.32 resA, [dst], #+32
- vmul.u32 tmp, q_in0, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
-
- le loop_cnt, 1b
-2:
- vstrw.32 resB, [dst, #+16]
- vqrdmulh.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vmul.u32 tmp, q_in1, twiddle_twisted
- vqrdmlah.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vqrdmulh.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.u32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vqrdmlah.s32 resA, tmp, mod_q
- vstrw.32 resB, [dst, #+16]
- vstrw.32 resA, [dst], #+32
-
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq loop_cnt
-
- .unreq mod_q
- .unreq twiddle
- .unreq twiddle_twisted
-
- .unreq q_off_rev
- .unreq q_in0
- .unreq q_in1
- .unreq tmp
- .unreq resA
- .unreq resB
-
- .unreq dst
- .unreq src
- .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett, %function
-.global twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts:
- .byte 3
- .byte 2
- .byte 1
- .byte 0
-
-twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett:
- push {r4-r11,lr}
- vpush {d8-d11}
-
- dst .req r0
- src .req r1
- twiddle_table .req r2
- twiddle_fix_ptr .req r3
- consts .req r7
- mod_q .req r4
- twiddle .req r5
- twiddle_twisted .req r6
- twiddle_fix .req r7
- twiddle_fix_twisted .req r8
- loop_cnt .req r14
-
- q_off_rev .req q0
- q_in0 .req q1
- q_in1 .req q5
- tmp .req q2
- resA .req q3
- resB .req q4
-
- adr consts, twisted_cyclic_mul_deg4_u32_mve_expand_double_barrett_consts
- vldrb.u32 q_off_rev, [consts]
- .unreq consts
-
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- ldr mod_q, [sp, #(9*4 + 2*16)]
- ldrd twiddle_fix, twiddle_fix_twisted, [twiddle_fix_ptr]
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vmul.s32 resB, q_in0, twiddle
- vqrdmulh.s32 tmp, q_in0, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- add.w src, src, #+16
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- mov loop_cnt, #((VECTOR_LENGTH/8)-1)
- .align 2
- wls loop_cnt, loop_cnt, 2
-1:
- vstrw.32 resB, [dst, #+16]
- vmul.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vqrdmulh.s32 tmp, q_in1, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- vstrw.32 resB, [dst, #+16]
- vmul.s32 resB, q_in0, twiddle
- vstrw.32 resA, [dst], #+32
- vqrdmulh.s32 tmp, q_in0, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- vldrw.32 q_in1, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in0, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in0, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- le loop_cnt, 1b
-2:
- vstrw.32 resB, [dst, #+16]
- vmul.s32 resB, q_in1, twiddle
- vstrw.32 resA, [dst], #+32
- vqrdmulh.s32 tmp, q_in1, twiddle_twisted
- vmla.s32 resB, tmp, mod_q
- vldrw.32 q_in0, [src, q_off_rev, UXTW #2]
- vmul.s32 resA, q_in1, twiddle_fix
- ldrd twiddle, twiddle_twisted, [twiddle_table], #+8
- vqrdmulh.s32 tmp, q_in1, twiddle_fix_twisted
- add.w src, src, #+16
- vmla.s32 resA, tmp, mod_q
- vstrw.32 resB, [dst, #+16]
- vstrw.32 resA, [dst], #+32
-
- vpop {d8-d11}
- pop {r4-r11,pc}
-
- .unreq loop_cnt
- .unreq mod_q
- .unreq twiddle
- .unreq twiddle_twisted
- .unreq q_off_rev
- .unreq q_in0
- .unreq q_in1
- .unreq tmp
- .unreq resA
- .unreq resB
- .unreq dst
- .unreq src
- .unreq twiddle_table
-
-.type twisted_cyclic_mul_deg4_u32_mve_simple, %function
-.global twisted_cyclic_mul_deg4_u32_mve_simple
-.align 4
-twisted_cyclic_mul_deg4_u32_mve_simple:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- mod_q .req r3
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi .req q6
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/4))
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- wls loop_cnt, loop_cnt, 2
-1:
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vldrw.u32 l_b2, [in_B, #(-32 + 4 )]
- vldrw.u32 l_b1, [in_B, #(-32 + 8 )]
- vldrw.u32 l_b0, [in_B, #(-32 + 12)]
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi[3], res_hi[1], res1_hi, res3_hi
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi[2], res_hi[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vsub.s32 res_hi, res_hi, res_lo
- vstrw.s32 res_hi, [dst], #+16
- le loop_cnt, 1b
-2:
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b1
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_mve
-twisted_cyclic_mul_deg4_u32_add_sub_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- sub sp, sp, #16
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q4 //q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi0 .req q6
- res_hi1 .req q1
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_a, [in_A], #+16
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B, #(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- // From this point onwards, l_b3 and l_b2 are never used
- // at the same time. Use the same register for them
- .unreq l_b3
- l_b3 .req l_b2
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi1, res_hi1, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- tmp .req l_b2
- vadd.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- .unreq tmp
-
- .align 2
-
-
- wls loop_cnt, loop_cnt, 2
-1:
-
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi1, res_hi1, res_lo
-
- // Add/sub with result from previous iteration
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- tmp .req l_b2 // Currently unused
- vadd.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- vsub.s32 tmp, res_hi0, res_hi1
- vstrw.s32 tmp, [dst], #+16
- .unreq tmp
-
- le loop_cnt, 1b
-
-2:
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [in_A], #+16
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vldrw.u32 l_b1, [in_B,#(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B,#(-16-12)]
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vsub.s32 res_hi0, res_hi0, res_lo
-
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B,#(-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vmov res_lo[2], res_lo[0], res0_lo, res2_lo
- vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi
- vmul.u32 res_lo, res_lo, mod_q_inv
- vmulh.s32 res_lo, res_lo, mod_q_vect
- vsub.s32 res_hi1, res_hi1, res_lo
-
- // Don't need mod_q_vect anymore
- vadd.s32 mod_q_vect, res_hi0, res_hi1
- vstrw.32 mod_q_vect, [dst], #+16
- vsub.s32 mod_q_vect, res_hi0, res_hi1
- vstrw.32 mod_q_vect, [dst], #+16
-
- add sp, sp, #16
- vpop {d0-d15}
- pop {r4-r12,lr}
- bx lr
-
- .unreq l_a
- .unreq l_b3
- .unreq l_b2
- .unreq l_b1
- .unreq l_b0
-
- .unreq in_A
- .unreq in_B
- .unreq dst
-
- .unreq res_lo
- .unreq res_hi0
- .unreq res_hi1
-
- .unreq res3_lo
- .unreq res3_hi
- .unreq res1_lo
- .unreq res1_hi
- .unreq res2_lo
- .unreq res2_hi
- .unreq res0_lo
- .unreq res0_hi
-
- .unreq mod_q_inv
- .unreq mod_q_vect
-
-.type twisted_cyclic_mul_deg4_u32_add_sub_rev_mve, %function
-.global twisted_cyclic_mul_deg4_u32_add_sub_rev_mve
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr:
- .byte 3*4
- .byte 2*4
- .byte 1*4
- .byte 0*4
-twisted_cyclic_mul_deg4_u32_add_sub_rev_mve:
- push {r4-r12,lr}
- vpush {d0-d15}
-
- sub sp, sp, #16
-
- mod_q .req r11
- mod_q_inv .req r12
- mod_q_vect .req q7
-
- q_rev .req q3
-
- l_a .req q0
- l_b3 .req q1
- l_b2 .req q2
- l_b1 .req q4 //q3
- l_b0 .req q4
-
- res_lo .req q5
- res_hi0 .req q6
- res_hi1 .req q1
-
- in_A .req r0
- in_B .req r1
- dst .req r2
- params .req r3
-
- tmp .req r5
- adr tmp, twisted_cyclic_mul_deg4_u32_add_sub_rev_mve_rev_addr
- vldrb.u32 q_rev, [tmp]
- vadd.u32 q_rev, q_rev, in_A
- .unreq tmp
-
- loop_cnt .req r14
- mov loop_cnt, #((VECTOR_LENGTH/8)-2)
-
- res3_lo .req r4
- res3_hi .req r5
- res1_lo .req r8
- res1_hi .req r9
- res2_lo .req r6
- res2_hi .req r7
- res0_lo .req r10
- res0_hi .req r11
-
- tmp_params .req r8
- mov tmp_params, params
- ldrd mod_q, mod_q_inv, [tmp_params]
- .unreq tmp_params
-
- vldrw.u32 l_a, [q_rev]
- vdup.s32 mod_q_vect, mod_q
- .unreq mod_q
-
- vldrw.u32 l_b3, [in_B], #+32
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3
- vldrw.u32 l_b1, [in_B, #(-16-8)]
- vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1
- vldrw.u32 l_b2, [in_B, #(-16-12)]
- vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2
- vmov res_lo[3], res_lo[1], res1_lo, res3_lo
- vldrw.u32 l_b3, [in_B], #+32
- vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi
- vldrw.u32 l_b0, [in_B, #(-32-16-4)]
- vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0
- vldrw.u32 l_a, [q_rev, #+16]!
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - // From this point onwards, l_b3 and l_b2 are never used - // at the same time. Use the same register for them - .unreq l_b3 - l_b3 .req l_b2 - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - .align 2 - - - wls loop_cnt, loop_cnt, 2 -1: - - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
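The recurring triple `vmul.u32` (by `mod_q_inv`), `vmulh.s32` (by `mod_q_vect`), `vsub.s32` in these kernels is a signed Montgomery reduction of the 64-bit `vmlaldav` accumulators. A scalar C model of the same arithmetic — a sketch, assuming `mod_q_inv` holds `q^{-1} mod 2^32`; the helper name is illustrative, not part of the source:

```c
#include <assert.h>
#include <stdint.h>

/* Montgomery reduction: given prod = a*b as a 64-bit value, odd q, and
 * q_inv = q^{-1} mod 2^32, return a representative of prod * 2^{-32} mod q.
 * The vector code holds prod split across the res*_lo/res*_hi GPR pairs
 * produced by vmlaldav and does the same computation lane-wise. */
static int32_t montgomery_reduce(int64_t prod, int32_t q, uint32_t q_inv)
{
    uint32_t m = (uint32_t)prod * q_inv;   /* asm: vmul.u32 by mod_q_inv      */
    /* Low 32 bits of m*q cancel those of prod, so the difference is an
     * exact multiple of 2^32; asm computes hi(prod) - vmulh(m, q).          */
    return (int32_t)((prod - (int64_t)(int32_t)m * q) >> 32);
}
```

The subtraction order matches the assembly's `vsub.s32 res_hi, res_hi, res_lo`: high accumulator minus the high half of the correction product.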
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi1, res_hi1, res_lo - - // Add/sub with result from previous iteration - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - tmp .req l_b2 // Currently unused - vadd.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_hi0, res_hi1 - vstrw.s32 tmp, [dst], #+16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi0[3], res_hi0[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [q_rev, #+16]! 
- vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi0[2], res_hi0[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi0, res_hi0, res_lo - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi1[3], res_hi1[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi1[2], res_hi1[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi1, res_hi1, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - vsub.s32 mod_q_vect, res_hi0, res_hi1 - vstrw.32 mod_q_vect, [dst], #+16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi0 - .unreq res_hi1 - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - -.type twisted_cyclic_mul_deg4_u32_add_sub_split_mve, %function -.global twisted_cyclic_mul_deg4_u32_add_sub_split_mve -.align 4 -twisted_cyclic_mul_deg4_u32_add_sub_split_mve: - push {r4-r12,lr} - vpush {d0-d15} - - sub sp, sp, #16 - - mod_q .req r11 - mod_q_inv .req r12 - mod_q_vect .req q7 - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - res_lo .req q5 - res_hi .req q6 - res_old .req q5 // Overlaps with res_lo deliberately - - in_A .req r0 - in_B .req r1 - dst .req r2 - dst_h .req r3 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/8)-2) - 
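The `_split` variant fuses a butterfly into the multiplication: `dst_h` is set up at half the output buffer, and each pair of consecutive result vectors is written as a sum to `[dst]` and a difference to `[dst_h]`. A scalar sketch of that output layout, eliding the 4-lane vector granularity — the helper is hypothetical, not part of the source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given 2*n reduced results r[0..2n-1], produced pairwise per loop
 * iteration, emit sums to the first half of dst and differences to the
 * second half, mirroring the dst / dst_h split in the assembly. */
static void addsub_split(int32_t *dst, const int32_t *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i]     = r[2 * i] + r[2 * i + 1]; /* vadd.s32 -> [dst]   */
        dst[n + i] = r[2 * i] - r[2 * i + 1]; /* vsub.s32 -> [dst_h] */
    }
}
```

The `_rev` variant above instead stores sum and difference consecutively through the single `dst` pointer.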
- add dst_h, dst, #(4*VECTOR_LENGTH/2) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r8 - res1_hi .req r9 - res2_lo .req r6 - res2_hi .req r7 - res0_lo .req r10 - res0_hi .req r11 - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 16)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - vldrw.u32 l_a, [in_A], #+16 - vldrw.u32 l_b3, [in_B], #+32 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B, #(-32-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B, #(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vldrw.u32 l_b2, [in_B, #(-16-12)] - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B, #(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vstrw.s32 res_hi, [sp] - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vldrw.s32 res_old, [sp] - tmp .req q1 // == l_b3 (currently unused) - vadd.s32 tmp, 
res_old, res_hi - vstrw.s32 tmp, [dst], #+16 - vsub.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst_h], #+16 - .unreq tmp - - wls loop_cnt, loop_cnt, 2 -1: - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - vstrw.s32 res_hi, [sp] - - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - - // Add/sub with result from previous iteration - vldrw.s32 res_old, [sp] - tmp .req q1 // == l_b3 (currently unused) - vadd.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst], #16 - vsub.s32 tmp, res_old, res_hi - vstrw.s32 tmp, [dst_h], #16 - .unreq tmp - - le loop_cnt, 1b - -2: - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vldrw.u32 l_b3, [in_B], #+32 - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vldrw.u32 l_a, [in_A], #+16 - vmlaldav.s32 res3_lo, 
res3_hi, l_a, l_b3 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vldrw.u32 l_b1, [in_B,#(-16-8)] - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vldrw.u32 l_b2, [in_B,#(-16-12)] - vmul.u32 res_lo, res_lo, mod_q_inv - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmulh.s32 res_lo, res_lo, mod_q_vect - vldrw.u32 l_b0, [in_B,#(-16-4)] - vsub.s32 res_hi, res_hi, res_lo - - /* Defer storing of last result */ - .unreq res_old - res_old .req q6 - .unreq res_hi - .unreq l_b1 - res_hi .req q3 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - vldrw.u32 l_b0, [in_B,#(-16-4)] - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmov res_lo[2], res_lo[0], res0_lo, res2_lo - vmov res_hi[2], res_hi[0], res0_hi, res2_hi - vmul.u32 res_lo, res_lo, mod_q_inv - vmulh.s32 res_lo, res_lo, mod_q_vect - vsub.s32 res_hi, res_hi, res_lo - - // Don't need mod_q_vect anymore - vadd.s32 mod_q_vect, res_old, res_hi - vstrw.32 mod_q_vect, [dst], #16 - vsub.s32 mod_q_vect, res_old, res_hi - vstrw.32 mod_q_vect, [dst_h], #16 - - add sp, sp, #16 - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res_lo - .unreq res_hi - .unreq res_old - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res0_lo - .unreq res0_hi - - .unreq mod_q_inv - .unreq mod_q_vect - - -.type twisted_cyclic_mul_deg4_u32_long_mve_v1, %function -.global twisted_cyclic_mul_deg4_u32_long_mve_v1 -.align 4 -twisted_cyclic_mul_deg4_u32_long_mve_v1: - push {r4-r11,lr} - vpush {d0-d9} - - l_a .req q0 - l_b3 .req q1 - l_b2 .req q2 - l_b1 .req q3 - l_b0 .req q4 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - loop_cnt .req r14 - mov loop_cnt, #((VECTOR_LENGTH/4)) - - res3_lo .req r4 - res3_hi .req r5 - res1_lo .req r6 - res1_hi .req r7 - res2_lo .req r8 - res2_hi .req r9 - res0_lo .req r10 - res0_hi 
.req r11 - - wls loop_cnt, loop_cnt, 2 -1: - - vldrw.u32 l_a, [in_A], #+16 /* (a0, a1, a2, a3) */ - - vldrw.u32 l_b3, [in_B], #+32 /* (b3, b2, b1, b0) */ - vldrw.u32 l_b0, [in_B,#(-32+3*4)] /* (b0, zeta*b3, zeta*b2, zeta*b1) */ - vldrw.u32 l_b1, [in_B,#(-32+2*4)] /* (b1, b0, zeta*b3, zeta*b2) */ - vldrw.u32 l_b2, [in_B,#(-32+1*4)] /* (b2, b1, b0, zeta*b3) */ - - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b0 - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b1 - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b2 - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b3 - - //strd res0_lo, res1_lo, [dst], #8 - //strd res2_lo, res3_lo, [dst], #8 - //strd res0_hi, res1_hi, [dst], #8 - //strd res2_hi, res3_hi, [dst], #8 - - strd res0_lo, res0_hi, [dst], #8 - strd res1_lo, res1_hi, [dst], #8 - strd res2_lo, res2_hi, [dst], #8 - strd res3_lo, res3_hi, [dst], #8 - - le loop_cnt, 1b -2: - - vpop {d0-d9} - pop {r4-r11,lr} - - bx lr - - .unreq l_a - .unreq l_b3 - .unreq l_b2 - .unreq l_b1 - .unreq l_b0 - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq res0_lo - .unreq res0_hi - .unreq res1_lo - .unreq res1_hi - .unreq res2_lo - .unreq res2_hi - .unreq res3_lo - .unreq res3_hi - -.type twisted_cyclic_mul_deg4_u32_mve, %function -.global twisted_cyclic_mul_deg4_u32_mve -twisted_cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - /* Preparation -- amortizes when looping */ - - mod_q .req r12 - mod_q_inv .req r14 - mod_q_vect .req q4 /* vmulh requires vector operand */ - - ldrd mod_q, mod_q_inv, [r2] - vdup.s32 mod_q_vect, mod_q - .unreq mod_q - - tw1 .req r10 - tw2 .req r11 - tw3 .req r12 - - l_a .req q0 - l_b .req q1 - - res_lo .req q2 - res_hi .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* Input A */ - vldrw.u32 l_b, [in_B], #+16 - vmov tw1, tw3, l_b[3], l_b[1] - vldrw.u32 l_a, [in_A], #+16 - - /* Assume b-input is already reversed */ - - /* Extract second half of twisted b into GPRs */ - - vmov.s32 tw2, l_b[2] - - res3_lo .req r4 - res3_hi .req r5 - res2_lo .req r6 - res2_hi .req r7 - 
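The layout comments in `twisted_cyclic_mul_deg4_u32_long_mve_v1` — `(b3, b2, b1, b0)`, `(b0, zeta*b3, zeta*b2, zeta*b1)`, and so on — show that `in_B` stores rotations of `b` with the wrapped coefficients pre-multiplied by `zeta`, so each `vmlaldav` dot product yields one coefficient of `a*b mod (X^4 - zeta)`. A plain C reference of that product — a sketch with illustrative names, applying `zeta` at runtime instead of via the pre-twisted layout:

```c
#include <assert.h>
#include <stdint.h>

/* c = a * b mod (X^4 - zeta), coefficients kept as 64-bit accumulators,
 * matching the vmlaldav results before Montgomery reduction. */
static void twisted_cyclic_mul_deg4_ref(int64_t c[4], const int32_t a[4],
                                        const int32_t b[4], int32_t zeta)
{
    for (int k = 0; k < 4; k++) {
        int64_t acc = 0;
        for (int i = 0; i < 4; i++) {
            int j = (k - i) & 3;               /* index with wraparound     */
            int64_t t = (int64_t)a[i] * b[j];
            if (i + j >= 4)
                t *= zeta;                     /* wrapped term: X^4 == zeta */
            acc += t;
        }
        c[k] = acc;
    }
}
```

Pre-twisting `b` in memory, as the assembly's `in_B` layout does, removes the per-term `zeta` multiplication from the inner loop.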
- /* TODO: - * For twisted multiplication, add Montgomery multiplication here. - * Adds 3 instructions. */ - - /* (a0,a1,a2,a3) * (b3,b2,b1,b0) = c3 */ - vmlaldav.s32 res3_lo, res3_hi, l_a, l_b - - /* Shift zeta*b3 into b vector, giving (b2,b1,b0,zeta*b3) */ - vshlc l_b, tw3, #32 - .unreq tw3 - - /* (a0,a1,a2,a3) * (b2,b1,b0,zeta*b3) = c2 */ - vmlaldav.s32 res2_lo, res2_hi, l_a, l_b - - /* Shift zeta*b2 into b vector, giving (b1,b0,zeta*b3, zeta*b2) */ - vshlc l_b, tw2, #32 - .unreq tw2 - - res1_lo .req r8 - res1_hi .req r9 - - /* (a0,a1,a2,a3) * (b1,b0,zeta*b3,zeta*b2) */ - vmlaldav.s32 res1_lo, res1_hi, l_a, l_b - - /* Move low and high results into result vector */ - vmov res_lo[3], res_lo[1], res1_lo, res3_lo - vmov res_hi[3], res_hi[1], res1_hi, res3_hi - - .unreq res3_lo - .unreq res3_hi - .unreq res1_lo - .unreq res1_hi - - res0_lo .req r8 - res0_hi .req r9 - - /* Shift zeta*b1 into b vector, giving (b0,zeta*b3,zeta*b2,zeta*b1) */ - vshlc l_b, tw1, #32 - .unreq tw1 - - /* (a0,a1,a2,a3) * (b0,zeta*b3,zeta*b2,zeta*b1) = c0 */ - vmlaldav.s32 res0_lo, res0_hi, l_a, l_b - - /* PRELOAD FOR NEXT ITERATION? 
*/ - - /* Move low results into result vector */ - vmov res_lo[2], res_lo[0], res2_lo, res0_lo - - /* Montgomery 1 */ - vmul.u32 res_lo, res_lo, mod_q_inv - /* Move high results into result vector */ - vmov res_hi[2], res_hi[0], res2_hi, res0_hi - /* Montgomery 2 */ - vmulh.s32 res_lo, res_lo, mod_q_vect - /* Montgomery 3 */ - vsub.s32 res_hi, res_hi, res_lo - - /* Store results */ - vstrw.s32 res_hi, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - - .unreq in_A - .unreq in_B - .unreq dst - - .unreq mod_q_inv - .unreq mod_q_vect - -.type cyclic_mul_deg4_u32_mve, %function -.global cyclic_mul_deg4_u32_mve -cyclic_mul_deg4_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #0x0F0F - vmsr p0, r10 - - mod_q .req r10 - mod_q_inv .req r9 - - ldr mod_q, [r2,#0] /* Modulus */ - ldr mod_q_inv, [r2,#4] - - l_a0 .req q1 - l_a1 .req q2 - l_b0 .req q3 - l_b1 .req q4 - - r_a0 .req q0 - r_a1 .req q1 - r_b0 .req q2 - r_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - /* q1 = ((a0,a2),(a4,a6)), q2=((a1,a3),(a5,a7)) */ - vld20.u32 {l_a0,l_a1}, [in_A] - vld21.u32 {l_a0,l_a1}, [in_A]! - - /* q3 = ((b0,b2),(b4,b6)), q4=((b1,b3),(b5,b7)) */ - vld20.u32 {l_b0,l_b1}, [in_B] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Compute product in two vectors q4, q5 */ - - /* Can use q6, q7 for temporary data; need at least - * one temporary vector per subproduct. */ - - /* - * Ballpark estimates: - * - 4 = 2x2 VLD2x to load current polynomials - * - 2 = 2x VST2x to store result - * - 4 = 4x VCADD to get q0-q3 into (+1,-1)-evaluated form - * - 16 = 4x4 Vector Multiplications, 4 per subproduct - * - 4 = 4x1 VHSUB for hi-part correction in Montgomery reduction - * In fact, use VSUB for first time each target vector is - * used, and VHSUB for the second time. - * - 2 = 2x VCADD for interpolation of result -- - * Note that we don't need to do this in every - * subproduct. - * - * Total: 32 instructions - * - * Pretty promising... 
if it pipelines well and we have enough - * vector registers. - */ - - /* Transform input into evaluated form */ - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 - - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 - - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 - - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 - - /* Subproduct 1: a0*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1 - * - Temporary allocations: 1 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - /* - * OPTIMIZATION: - * - * - We have two free vectors at this point -- - * could use this for a late store of the results - * of a previous iteration, residing in {q6, q7}. - * - * - Perform a late evaluation of r_a0, r_b1 here. - * - */ - - dst1 .req q5 - tmp .req q4 - - vmul.u32 tmp, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - vqdmulh.s32 dst1, r_a0, r_b1 /* Initialize dst1 with high part */ - vmul.u32 tmp, tmp, r_b1 /* Twisted low product */ - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq tmp - - /* Subproduct 2: a1*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1 - * - Temporary allocations: 2 - * - Final allocations: a0, a1, b0, b1, dst1 - */ - - tmp0 .req q6 - tmp1 .req q4 - - vqdmulh.s32 tmp1, r_a1, r_b0 /* Write high-product into temporary */ - vmul.u32 tmp0, q1, mod_q_inv /* Twist one factor using temporary tmp */ - vmul.u32 tmp0, tmp0, r_b0 /* Twisted low product */ - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst1, tmp1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. 
- * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 - .unreq tmp1 - - /* Finalize dst1 */ - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 - - /* Subproduct 3: a1*b1 */ - - /* - * Vector register allocation state: - * - Initially: a0, a1, b0, b1, dst1_final - * - Temporary allocations: 0 - * - Final allocations: a0, b0, dst1_final, dst0 - */ - - dst0 .req q4 - - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - vmul.u32 r_a1, r_a1, mod_q_inv /* Can overwrite a1 now */ - vmul.u32 r_a1, r_a1, r_b1 /* Twisted low product */ - - .unreq r_b1 - - vqdmulh.s32 r_a1, r_a1, mod_q /* High product */ - vsub.s32 dst0, r_a1, dst0 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - - .unreq r_a1 - - vpst - vnegt.s32 dst0, dst0 - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initially: a0, b0, dst1_final, dst0 - * - Temporary allocations: 1 - * - Final allocations: dst1_final, dst0 - */ - - tmp .req q5 - - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - vmul.u32 r_a0, r_a0, r_b0 /* Twisted low product */ - - .unreq r_b0 - - vmul.u32 r_a0, r_a0, mod_q_inv /* Can overwrite a0 now */ - vqdmlah.s32 dst0, r_a0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - vhsub.s32 dst0, tmp, dst0 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp - - /* Finalize dst0 */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
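The `vcadd #90` / `vcadd #270` pairs in `cyclic_mul_deg4_u32_mve` put each coefficient pair into the "(+1,-1)-evaluated form" the ballpark comment mentions — evaluations `(x0 - x1, x0 + x1)` — and interpolate back, with the halving deferred into the `vhsub` corrections. A scalar model of the evaluation/interpolation idea for one pair, ignoring the interleaved Montgomery layer — a sketch, not the kernel's exact dataflow:

```c
#include <assert.h>
#include <stdint.h>

/* Cyclic product (c0 + c1*X) = (a0 + a1*X)(b0 + b1*X) mod (X^2 - 1):
 * evaluate both factors at -1 and +1 (vcadd #90), multiply pointwise,
 * then interpolate (vcadd #270) and halve (deferred to vhsub in asm). */
static void cyclic_mul_deg2_eval(int64_t c[2], int32_t a0, int32_t a1,
                                 int32_t b0, int32_t b1)
{
    int64_t am = (int64_t)a0 - a1, ap = (int64_t)a0 + a1; /* evaluate a  */
    int64_t bm = (int64_t)b0 - b1, bp = (int64_t)b0 + b1; /* evaluate b  */
    int64_t cm = am * bm, cp = ap * bp;                   /* pointwise   */
    c[0] = (cp + cm) / 2;                                 /* interpolate */
    c[1] = (cp - cm) / 2;                                 /* and halve   */
}
```

This is why the subproducts above track the negative of intermediate results: the final `vhsub` both corrects the sign and performs the division by two in a single instruction.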
- .unreq dst0_final - .unreq dst1_final - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq r_a0 - -.type cyclic_mul_deg4_u32_alt_mve, %function -.global cyclic_mul_deg4_u32_alt_mve -cyclic_mul_deg4_u32_alt_mve: - push {r4-r12,lr} - vpush {d0-d15} - - l_a0 .req q0 - l_a1 .req q1 - l_b0 .req q2 - l_b1 .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - cnt .req r4 - - dst0_last_final .req q6 - dst1_last_final .req q7 - - mod_q .req r10 - mod_q_inv .req r9 - pred_helper .req r8 - - vld20.u32 {l_a0,l_a1}, [in_A] - mov pred_helper, #0x0F0F - vld21.u32 {l_a0,l_a1}, [in_A]! - vmsr p0, pred_helper - - vld20.u32 {l_b0,l_b1}, [in_B] - ldr mod_q_inv, [r2,#4] - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), - * dst0 (q3) - */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - ldr mod_q, [r2,#0] - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Montgomery twist */ - mov cnt, #((VECTOR_LENGTH)/8-1) /* Interleave initialization of - * loop counter */ - vqdmulh.s32 tmp1, tmp1, mod_q /* Montgomery high product fix */ - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initial high product */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initial high product */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Fix high product */ - /* Defer halving for later */ - /* Store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Montgomery low product twist */ - - vpst 
- vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* Montgomery high product fix */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Montgomery twist */ - - /* Subproduct 3: a0*b1 */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp // q1 - - /* - * Vector register allocation state: - * - Initial allocations: r_b1 (q4), r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * - Temporary allocations: 1 (q5) - * - Final allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - vmul.u32 tmp0, tmp0, r_b1 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmlah.s32 dst1, tmp0, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q5 - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. 
*/ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vqdmulh.s32 tmp, r_a0, r_b0 /* Write high-product into temporary */ - - /* LOAD r_a1 into q5 here..., - * freeing up q1 as a temporary */ - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - /* Use q1 for the result here, freeing both r_a0 and r_b0=q7 */ - vmul.u32 tmp0, r_a0, r_b0 /* Twisted low product */ - /* Can overwrite rb0 now */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - - vmul.u32 tmp0, tmp0, mod_q_inv - - l_b0 .req q2 - l_b1 .req q3 - /* Preload for next iteration */ - vld20.u32 {l_b0,l_b1}, [in_B] - - vqdmlah.s32 dst0, tmp0, mod_q /* High product, accumulate onto tmp, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp0 // q1 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp // q4 - .unreq dst0_old - - vld21.u32 {l_b0,l_b1}, [in_B]! 
- - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - nop - wls lr, cnt, cyclic_mul_deg4_u32_alt_mve_loop_end - -cyclic_mul_deg4_u32_alt_mve_loop_start: - - nop - - /* Subproduct 1: a1*b1 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2) - * T: tmp (q1) - * F: r_a1 (q5), r_b1 (q4), l_a0 (q0), l_b0 (q2), dst0 (q3) - */ - - tmp .req q1 - vmul.u32 tmp, r_a1, mod_q_inv - - r_b1 .req q4 - vcadd.i32 r_b1, l_b1, l_b1, #90 - .unreq l_b1 // q3 - - tmp1 .req q3 - - vmul.u32 tmp1, tmp, r_b1 /* Twisted low product */ - - vst20.s32 {dst0_last_final,dst1_last_final}, [dst] - - vqdmulh.s32 tmp1, tmp1, mod_q /* High product */ - - vst21.s32 {dst0_last_final,dst1_last_final}, [dst]! - .unreq dst0_last_final // q6 - .unreq dst1_last_final // q7 - - dst0 .req q6 - vqdmulh.s32 dst0, r_a1, r_b1 /* Initialize dst0 with high part */ - - r_b0 .req q7 - vcadd.i32 r_b0, l_b0, l_b0, #90 - .unreq l_b0 // q2 - - /* Subproduct 2: a1*b0 - * - * I: r_a1 (q5), r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q3) - * T: 1 (q5) - * F: r_b1 (q4), l_a0 (q0), r_b0 (q7), dst0 (q6), dst1 (q2) - */ - - dst1 .req q2 - vqdmulh.s32 dst1, r_a1, r_b0 /* Initialize dst1 with high part */ - .unreq r_a1 // q5 - - dst0_old .req q6 - .unreq dst0 - dst0 .req q6 - - vsub.s32 dst0, tmp1, dst0_old /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of result */ - .unreq tmp1 - .unreq dst0_old // q6 - - vmul.u32 tmp, tmp, r_b0 /* Twisted low product */ - - vpst - vnegt.s32 dst0, dst0 - - vqdmulh.s32 tmp, tmp, mod_q /* High product */ - - r_a0 .req q3 - vcadd.i32 r_a0, l_a0, l_a0, #90 - .unreq l_a0 // q0 - - tmp0 .req q5 - vmul.u32 tmp0, r_a0, mod_q_inv /* Twist one factor using temporary tmp */ - - vsub.s32 dst1, tmp, dst1 /* Correct high product */ - /* Defer halving for later */ - /* Actually store _negative_ of 
result */ - .unreq tmp // q1 - - /* Subproduct 3: a0*b1 - * - * I: r_b1 (q4), r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) - * T: 1 (q5) - * F: r_a0 (q3), r_b0 (q7), dst0 (q6), dst1 (q2) pre_l_a0 (q0), pre_l_a1 (q1) - */ - - tmp1 .req q0 - vmul.u32 tmp1, tmp0, r_b1 - - - vqdmlah.s32 dst1, tmp1, mod_q /* High product, accumulate onto dst1, - * which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp1 // q0 - - l_a0 .req q0 - l_a1 .req q1 - /* Preload for next iteration */ - vld20.u32 {l_a0,l_a1}, [in_A] - - vqdmulh.s32 r_b1, r_a0, r_b1 /* Can overwrite r_b1 here */ - - /* Preload for next iteration */ - vld21.u32 {l_a0,l_a1}, [in_A]! - - vhsub.s32 dst1, r_b1, dst1 /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, dst1 contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq r_b1 // q4 - - /* Finalize dst1 - * - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1 (q2) - * preloaded l_a0 (q0), l_a1 (q1) - * - Final allocations: r_a0 (q5), r_b0 (q7), - * dst0 (q3), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - /* Subproduct 4: a0*b0 */ - - /* - * Vector register allocation state: - * - Initial allocations: r_a0 (q3), r_b0 (q7), - * dst0 (q6), dst1_final (q7) - * preloaded l_a0 (q0), l_a1 (q1) - * - Temporary allocations: 1 (q4) - * - Final allocations: dst1_final (q7) , dst0 (q4) - * preloaded l_a0 (q0), l_a1 (q1) - */ - - tmp .req q4 - vmul.u32 tmp, tmp0, r_b0 /* Twisted low product */ - .unreq tmp0 - - r_a1 .req q5 - vcadd.i32 r_a1, l_a1, l_a1, #90 - .unreq l_a1 // q1 - - tmp0 .req q1 - vqdmulh.s32 tmp0, r_a0, r_b0 /* Write high-product into temporary */ - .unreq r_a0 // q3 - .unreq r_b0 // q7 - - dst1_final .req q7 - vcadd.s32 dst1_final, dst1, dst1, #270 - .unreq dst1 // q2 - - vqdmlah.s32 dst0, tmp, mod_q /* High product, accumulate onto tmp, - * 
which stores the _negative_ of the - * subproduct 1. */ - .unreq tmp // q4 - - /* Preload for next iteration */ - l_b0 .req q2 - l_b1 .req q3 - vld20.u32 {l_b0,l_b1}, [in_B] - - dst0_old .req q6 - .unreq dst0 - dst0 .req q1 - vhsub.s32 dst0, tmp0, dst0_old /* Correct high product */ - /* Late halving, encompassing also the - * first subproduct. */ - /* Note that, so far, tmp contained - * -pre + high_correct. - * After this step, it's - * high - ( -pre + high_correct ) - * = pre + high - high_correct, - * which is what we want. */ - - .unreq tmp0 // q1 - .unreq dst0_old - - /* Preload for next iteration */ - vld21.u32 {l_b0,l_b1}, [in_B]! - - /* Finalize dst0 - * - * - Initial allocations: dst1_final (q7) , dst0 (q5) - * - Final allocations: dst0_final (q6), dst1_final (q7) - */ - dst0_final .req q6 - vcadd.s32 dst0_final, dst0, dst0, #270 - .unreq dst0 // q1 - - le lr, cyclic_mul_deg4_u32_alt_mve_loop_start - -cyclic_mul_deg4_u32_alt_mve_loop_end: - - /* Store results */ - vst20.s32 {dst0_final, dst1_final}, [dst] - vst21.s32 {dst0_final, dst1_final}, [dst]! 
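The kernels in this file all build on the same Montgomery multiplication pattern: multiply one factor by `mod_q_inv` (the "twisted low product"), take the high half of the product (`vqdmulh`/`vqrdmulh`), and add a correction term (`vqdmlah`/`vqrdmlah`). A plain-integer sketch of the underlying REDC idea follows; the q and R values are illustrative, and the fixed-point doubling/rounding behaviour of the MVE instructions is deliberately ignored:

```python
# Plain-integer model of Montgomery multiplication (REDC). The constants
# are illustrative (Dilithium-style q, R = 2^32); the assembly above uses
# fixed-point doubling variants (vqdmulh/vqdmlah) of the same idea.
R = 1 << 32
q = 8380417
q_inv = pow(-q, -1, R)           # the "mod_q_inv" twist constant: -q^-1 mod R

def montmul(a, b):
    """Return a*b*R^-1 mod q using only multiplies and a shift."""
    t = a * b                    # full product
    m = (t * q_inv) % R          # "twisted low product"
    return (t + m * q) // R % q  # t + m*q is divisible by R by construction

# The result carries a factor R^-1, which callers cancel by pre-twisting
# one input by R mod q.
R_inv = pow(R, -1, q)
assert montmul(1234, 5678) == 1234 * 5678 * R_inv % q
```

Because `m*q ≡ -t (mod R)`, the sum `t + m*q` is an exact multiple of R, so the reduction needs no division by q; this is why the assembly can replace a modular reduction with one low multiply, one high multiply, and one subtract or multiply-accumulate.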
- - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a0 - .unreq l_b0 - .unreq l_b1 - .unreq r_a1 - - .unreq cnt - -.type montgomery_pt_u32_odd_mve, %function -.global montgomery_pt_u32_odd_mve -montgomery_pt_u32_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 4) - - ldr mod_q, [in_B], #+4 /* Modulus */ - ldr mod_q_inv, [in_B], #+4 /* Inverse */ - - wls lr, cnt, montgomery_pt_u32_odd_mve_loop_end - -montgomery_pt_u32_odd_mve_loop_start: - - vldrw.s32 l_a, [in_A], #+16 - vmul.u32 l_at, l_a, mod_q_inv - vldrw.s32 l_b, [in_B], #+16 - vqrdmulh.s32 tmp0, l_a, l_b - vmul.u32 tmp1, l_at, l_b - vqrdmlah.s32 tmp0, tmp1, mod_q - vstrw.s32 tmp0, [dst], #+16 - - le lr, montgomery_pt_u32_odd_mve_loop_start - -montgomery_pt_u32_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_u32_mve, %function -.global montgomery_pt_u32_mve -.align 4 -montgomery_pt_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, 
montgomery_pt_u32_mve_loop_end - -montgomery_pt_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_u32_mve_loop_start - -montgomery_pt_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. 
*/ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_acc_u32_mve, %function -.global montgomery_pt_acc_u32_mve -.align 4 -montgomery_pt_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - old .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - res .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Correction term */ - vqdmulh.s32 tmp1, tmp1, mod_q - - wls lr, cnt, montgomery_pt_acc_u32_mve_loop_end - -montgomery_pt_acc_u32_mve_loop_start: - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Preload l_a for the next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Preload b */ - vldrw.s32 l_b, [in_B], #+16 - - /* Compute correction */ - vqdmulh.s32 tmp1, tmp1, mod_q - - /* Late store-accumulate from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - le lr, montgomery_pt_acc_u32_mve_loop_start - -montgomery_pt_acc_u32_mve_loop_end: - - /* - * Last iteration - */ - - /* Twisted low multiply */ - vmul.u32 l_at, l_a, mod_q_inv - 
- /* Correction term from last iteration */ - vhsub.s32 res, tmp0, tmp1 - - /* High multiply */ - vqdmulh.s32 tmp0, l_a, l_b - - /* Late store from last iteration */ - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - /* Can't do anything about the following sequence - * which doesn't pipeline well - but it's only one iteration. */ - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - vqdmulh.s32 tmp1, tmp1, mod_q - vhsub.s32 res, tmp0, tmp1 - vldrw.s32 old, [dst] - vadd.s32 res, res, old - vstrw.s32 res, [dst], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - -.text -.type montgomery_pt_round_acc_u32_mve, %function -.global montgomery_pt_round_acc_u32_mve -.align 4 -montgomery_pt_round_acc_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - oldA .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - params .req r3 - - tmp0 .req q4 - tmp1 .req q5 - oldB .req q7 - - l_at .req q6 - - tmp_params .req r8 - mov tmp_params, params - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r8 - mov cnt, #((VECTOR_LENGTH / 8) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b 
- - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - wls lr, cnt, montgomery_pt_round_acc_u32_mve_loop_end - -montgomery_pt_round_acc_u32_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - le lr, montgomery_pt_round_acc_u32_mve_loop_start - -montgomery_pt_round_acc_u32_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old 
result */ - vstrw.s32 oldB, [dst], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst, #+16] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - vstrw.s32 oldB, [dst] - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A - .unreq in_B - .unreq dst - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x2_mve, %function -.global montgomery_pt_round_acc_u32_x2_mve -.align 4 -montgomery_pt_round_acc_u32_x2_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - dst0 .req r2 - dst1 .req r3 - - in_B .req r4 - ldr in_B, [sp, #(10*4 + 8*16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4 + 8*16 + 4)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32
oldA, tmp1, mod_q - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x2_mve_loop_end - -montgomery_pt_round_acc_u32_x2_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - le cnt, montgomery_pt_round_acc_u32_x2_mve_loop_start - -montgomery_pt_round_acc_u32_x2_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload 
a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - - /* Correction */ - vadd.s32 oldA, tmp0, oldA - - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - -.text -.type montgomery_pt_round_acc_u32_x4_mve, %function -.global montgomery_pt_round_acc_u32_x4_mve -.align 4 -montgomery_pt_round_acc_u32_x4_mve: - push {r4-r12,lr} // Amount of data: 40 Bytes - vpush {d0-d15} // Amount of data: 128 bytes - // Total: 168 Bytes - - mod_q .req r10 - mod_q_inv .req r9 - - /* q0 still unused */ - l_a .req q1 - l_b .req q2 - tmp0 .req q3 - tmp1 .req q4 - l_at .req q5 - oldA .req q6 - oldB .req q7 - - in_A0 .req r0 - in_A1 .req r1 - in_A2 .req r2 - in_A3 .req r3 - dst0 .req r4 - dst1 .req r5 - dst2 .req r6 - dst3 .req r7 - - in_B .req r12 - - /* Load arguments from stack */ - ldrd dst0, dst1, [sp, #(10*4+8*16+0 )] - ldrd dst2, dst3, [sp, #(10*4+8*16+8 )] - ldr in_B, [sp, #(10*4+8*16+16)] - - tmp_params .req r8 - ldr tmp_params, [sp, #(10*4+8*16+20)] - ldrd mod_q, mod_q_inv, [tmp_params] - .unreq tmp_params - - cnt .req r14 - mov
cnt, #((VECTOR_LENGTH / 4) - 2) - - /* - * First iteration - */ - - /* Load a-input */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - wls cnt, cnt, montgomery_pt_round_acc_u32_x4_mve_loop_end - 
-montgomery_pt_round_acc_u32_x4_mve_loop_start: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst2], #+16 - - le cnt, 
montgomery_pt_round_acc_u32_x4_mve_loop_start - -montgomery_pt_round_acc_u32_x4_mve_loop_end: - - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Load b-input */ - vldrw.s32 l_b, [in_B], #+16 - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A1], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst0] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A2], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst1] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldA, [dst0], #+16 - /* Twist a (already loaded)*/ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A3], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldA, [dst2] - /* Correction term */ - vqrdmlah.s32 oldA, tmp1, mod_q - /* Store old result */ - vstrw.s32 oldB, [dst1], #+16 - /* Twist a */ - vmul.u32 l_at, l_a, mod_q_inv - /* Correction */ - vadd.s32 oldA, tmp0, oldA - /* High multiply */ - vqrdmulh.s32 tmp0, l_a, l_b - /* Preload a for next iteration */ - vldrw.s32 l_a, [in_A0], #+16 - /* Twisted low multiply */ - vmul.u32 tmp1, l_at, l_b - /* Load old value to accumulate onto */ - vldrw.s32 oldB, [dst3] - /* Correction term */ - vqrdmlah.s32 oldB, tmp1, mod_q - /* Store old result */ - 
vstrw.s32 oldA, [dst2], #+16 - /* Correction from last iteration */ - vadd.s32 oldB, tmp0, oldB - /* Store old result */ - vstrw.s32 oldB, [dst3], #+16 - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - .unreq l_a - .unreq l_b - .unreq oldA - .unreq in_A0 - .unreq in_A1 - .unreq in_A2 - .unreq in_A3 - .unreq in_B - .unreq dst0 - .unreq dst1 - .unreq dst2 - .unreq dst3 - .unreq tmp0 - .unreq tmp1 - .unreq oldB - .unreq l_at - .unreq cnt - - -.type montgomery_pt_u16_odd_mve, %function -.global montgomery_pt_u16_odd_mve -montgomery_pt_u16_odd_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mod_q .req r10 - mod_q_inv .req r9 - - l_a .req q1 - l_b .req q2 - l_d .req q3 - - in_A .req r0 - in_B .req r1 - dst .req r2 - - tmp0 .req q4 - tmp1 .req q5 - - l_at .req q6 - - cnt .req r8 - mov cnt, #(VECTOR_LENGTH / 8) - - ldrh mod_q, [in_B], #+2 /* Modulus */ - ldrh mod_q_inv, [in_B], #+2 /* Inverse */ - - wls lr, cnt, montgomery_pt_u16_odd_mve_loop_end - -montgomery_pt_u16_odd_mve_loop_start: - - vldrh.s16 l_a, [in_A], #+16 - vmul.u16 l_at, l_a, mod_q_inv - vldrh.s16 l_b, [in_B], #+16 - vqrdmulh.s16 tmp0, l_a, l_b - vmul.u16 tmp1, l_at, l_b - vqrdmlah.s16 tmp0, tmp1, mod_q - vstrh.s16 tmp0, [dst], #+16 - - le lr, montgomery_pt_u16_odd_mve_loop_start - -montgomery_pt_u16_odd_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type montgomery_u16_core_mve, %function -.global montgomery_u16_core_mve -montgomery_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(-MODULUS_Q16) /* Modulus */ - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0] - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - /* High product */ - vqdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqdmulh.s16 q0, q0, r10 - vsub.s16 q1, q1, q0 - - /* Store result */ - vstrh.s16 q1, [r2] - - vpop {d0-d15} - pop 
{r4-r12,lr} - - bx lr - -.type montgomery_u16_round_mve, %function -.global montgomery_u16_round_mve -montgomery_u16_round_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - mov r10, #(-3329) /* Modulus */ - mov r8, #8 /* Iterations */ - - /* Half of the even scalar to multiply with */ - ldrh r4, [r1,#0] - /* Precomputed product of scalar and Montgomery constant */ - ldrh r5, [r1,#2] - - wls lr, r8, montgomery_u16_round_mve_loop_end -montgomery_u16_round_mve_loop_start: - - /* Vector of uint16 values to be multiplied */ - vldrh.s16 q0, [r0], #16 - - /* High product */ - vqrdmulh.s16 q1, q0, r4 - /* Adjusted low product */ - vmul.u16 q0, q0, r5 - - /* Double-Multiply with modulus */ - vqrdmlah.s16 q1, q0, r10 - - /* Store result */ - vstrh.s16 q1, [r2], #16 - - le lr, montgomery_u16_round_mve_loop_start -montgomery_u16_round_mve_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - - bx lr - - -.type cyclic_mul_u16_core_mve, %function -.global cyclic_mul_u16_core_mve -cyclic_mul_u16_core_mve: - push {r4-r12,lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Load polynomials to multiply - * - * Lanes come in pairs representing real and imaginary parts. - */ - vldrh.s16 q0, [r0] - vldrh.s16 q1, [r1] - - /* Step 1: - * - * Apply evaluation at -1, +1: - * k[X]/(X^2 - 1) -> k[X]/(X+1) x k[X]/(X-1) - * - * Concretely: - * (a,b) |-> (a-b, a+b) - * - * This can be implemented as a rotate-and-add - * operation, treating (a,b) as a complex number - * a+bi, and noticing that a rotation by 90 - * gives i(a+bi) = -b + ai, so - * a+bi + i(a+bi) = (a-b) + (a+b)i - * - * This rotate-90-and-add is a single - instruction in MVE. 
- */ - vcadd.i16 q0, q0, q0, #90 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - - /* Montgomery multiplications - * - * 1x mul-high - * 1x mul-low - * 1x mul-high - * 1x subtract - * - * Needs 1x free temporary vector register - */ - vqdmulh.s16 q0, q0, q1 - vmul.u16 q1, q2, q1 - /*vmul.u16 q0, q0, r9*/ - vqdmulh.s16 q1, q1, r10 - /* Now we've actually computed twice the desired result, - * but we can compensate by using vhsub */ - vhsub.s16 q0, q0, q1 - - /* - * Finally, interpolation step: - * (eval(-1)=x,eval(+1)=y) |-> 1/2 (y-x) + 1/2 (x+y) - * - * This can be done as a single VCHADD, with - * rotate by 270: -i(a+bi) = b - ai - * - * We can't naively use vhcadd here because the - * multiplication by 1/2 is modulo q. - */ - vcadd.s16 q0, q0, q0, #270 - - vstrh.s16 q0, [r2] - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - -.type cyclic_mul_u16_mve, %function -.global cyclic_mul_u16_mve -cyclic_mul_u16_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - /* Number of inner iterations */ - mov r4, #(VECTOR_LENGTH/16 - 1) - - vldrh.s16 q0, [r0], #16 - vcadd.i16 q0, q0, q0, #90 - vldrh.s16 q1, [r1], #16 - vmul.u16 q2, q0, r9 - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vstrh.s16 q4, [r2] - vmul.u16 q1, q2, q1 - vldrh.s16 q3, [r0], #16 - vqdmulh.s16 q1, q1, r10 - vcadd.i16 q3, q3, q3, #90 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - wls lr, r4, cyclic_mul_u16_loop_end -cyclic_mul_u16_loop_start: - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vcadd.i16 q1, q1, q1, #90 - vqdmulh.s16 q0, q0, q1 - vldrh.s16 q3, [r0], #16 - vmul.u16 q1, q2, q1 - vcadd.i16 q3, q3, q3, #90 - 
vqdmulh.s16 q1, q1, r10 - vldrh.s16 q4, [r1], #16 - vhsub.s16 q0, q0, q1 - vmul.u16 q5, q3, r9 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - le lr, cyclic_mul_u16_loop_start -cyclic_mul_u16_loop_end: - - vcadd.i16 q4, q4, q4, #90 - vqdmulh.s16 q3, q3, q4 - vldrh.s16 q0, [r0], #16 - vmul.u16 q4, q5, q4 - vcadd.i16 q0, q0, q0, #90 - vqdmulh.s16 q4, q4, r10 - vldrh.s16 q1, [r1], #16 - vhsub.s16 q3, q3, q4 - vmul.u16 q2, q0, r9 - vcadd.s16 q4, q3, q3, #270 - vstrh.s16 q4, [r2], #16 - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr - - -.type cyclic_mul_u16_multi_naive_mve, %function -.global cyclic_mul_u16_multi_naive_mve -cyclic_mul_u16_multi_naive_mve: - push {r4-r12, lr} - vpush {d0-d15} - - mov r10, #(MODULUS_Q16) - movw r9, #:lower16:MODULUS_Q16_INV_U16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vldrh.s16 q0, [r0], #16 - vldrh.s16 q1, [r1], #16 - vcadd.i16 q2, q0, q0, #90 - vmul.u16 q3, q2, r9 - vcadd.i16 q4, q1, q1, #90 - vqdmulh.s16 q0, q2, q4 - vmul.u16 q1, q3, q4 - vqdmulh.s16 q1, q1, r10 - vhsub.s16 q0, q0, q1 - vcadd.s16 q1, q0, q0, #270 - vstrh.s16 q1, [r2], #16 - - vpop {d0-d15} - pop {r4-r12, lr} - bx lr - -.type cyclic_mul_u32_mve, %function -.global cyclic_mul_u32_mve -cyclic_mul_u32_mve: - push {r4-r12,lr} - vpush {d0-d15} - - movw r10, #:lower16:MODULUS_Q32 - movt r10, #:upper16:MODULUS_Q32 - - ldr r9, [r2] - mov r3, #(VECTOR_LENGTH / 4) /* Number of iterations */ - wls lr, r3, cyclic_mul_u32_loop_end -cyclic_mul_u32_loop_start: 
- vldrw.s32 q1, [r0], #16 - vcadd.i32 q0, q1, q1, #90 - vldrw.s32 q2, [r1], #16 - vcadd.i32 q1, q2, q2, #90 - vqdmulh.s32 q2, q0, q1 - vmul.u32 q0, q0, r9 - vmul.u32 q1, q0, q1 - vqdmulh.s32 q1, q1, r10 - vhsub.s32 q2, q2, q1 - vcadd.s32 q1, q2, q2, #270 - vstrw.s32 q1, [r2], #16 - le lr, cyclic_mul_u32_loop_start -cyclic_mul_u32_loop_end: - - vpop {d0-d15} - pop {r4-r12,lr} - bx lr diff --git a/tests/poly/poly.mk b/tests/poly/poly.mk index a89bf56..eca65ac 100644 --- a/tests/poly/poly.mk +++ b/tests/poly/poly.mk @@ -12,7 +12,7 @@ POLY_SOURCES += main.c # Assembly sources required for this test POLY_ASM_DIR = ./auto -POLY_ASMS += montgomery.s +POLY_ASMS += ../../asm/manual/montgomery/montgomery.s POLY_ASMS += ../../asm/manual/karatsuba/karatsuba.s POLY_ASMS += ../../asm/manual/schoolbook/poly_u16_32.s POLY_ASMS += ../../asm/manual/schoolbook/poly_u16_32_acc.s From b1ec15b03f56b0dae08afffb1b6214bf1903f277 Mon Sep 17 00:00:00 2001 From: "Matthias J. Kannwischer" Date: Fri, 19 Jul 2024 17:06:37 +0800 Subject: [PATCH 28/32] fix m85 --- .gitignore | 1 + envs/m85-an555/Makefile | 9 ++++++--- envs/m85-an555/src/platform/gcc_arm_sse_310.ld | 17 ++++++++++++----- 3 files changed, 19 insertions(+), 8 deletions(-) diff --git a/.gitignore b/.gitignore index 954f187..caf5e08 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,3 @@ **/build/ **/*.elf +**/*.bin diff --git a/envs/m85-an555/Makefile b/envs/m85-an555/Makefile index 4a1607d..4870360 100644 --- a/envs/m85-an555/Makefile +++ b/envs/m85-an555/Makefile @@ -1,6 +1,7 @@ # Makefile for images for AN555 - -CC = arm-none-eabi-gcc +PREFIX = arm-none-eabi +CC = $(PREFIX)-gcc +OBJCOPY = $(PREFIX)-objcopy LD := $(CC) SRC_DIR=./src @@ -72,6 +73,8 @@ $(OBJECTS_ASM): $(BUILD_DIR)/%.o: % $(TARGET): $(OBJECTS) $(LDSCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJECTS) + $(OBJCOPY) -S -Obinary $@ $@.bin + .PHONY: build build: $(TARGET) @@ -84,5 +87,5 @@ check: clean: - rm -f *.elf + rm -f *.elf *.bin rm -rf $(BUILD_DIR) \ No newline at end of file 
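The `cyclic_mul_*` kernels deleted in the first patch above all rely on the evaluation trick spelled out in the `cyclic_mul_u16_core_mve` comments: map k[X]/(X^2-1) to two point evaluations via (a,b) |-> (a-b, a+b), multiply pointwise, and interpolate back with a halving step. A plain-integer sketch for a single degree-1 pair follows; q = 3329 is chosen only for illustration:

```python
# Plain-integer model of the evaluation-at-+/-1 trick from the comments in
# cyclic_mul_u16_core_mve. q = 3329 is an illustrative odd modulus.
q = 3329
inv2 = pow(2, -1, q)            # halving modulo q (the "deferred halving")

def cyclic_mul_deg1(a, b, c, d):
    """(a + b*X) * (c + d*X) in k[X]/(X^2 - 1), coefficients mod q."""
    x = (a - b) * (c - d) % q   # evaluation at X = -1
    y = (a + b) * (c + d) % q   # evaluation at X = +1
    # Interpolation: constant coefficient (x+y)/2, X coefficient (y-x)/2
    return ((x + y) * inv2 % q, (y - x) * inv2 % q)

def schoolbook(a, b, c, d):     # reference: (ac + bd) + (ad + bc)*X
    return ((a * c + b * d) % q, (a * d + b * c) % q)

assert cyclic_mul_deg1(5, 7, 11, 13) == schoolbook(5, 7, 11, 13)
```

In the assembly, `vcadd #90` performs the (a-b, a+b) map on lane pairs, `vcadd #270` performs the interpolation rotation, and the halving-subtract instructions (`vhsub`/`vhcadd`) absorb the factor of 1/2 left over from the doubling `vqdmulh` products.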
diff --git a/envs/m85-an555/src/platform/gcc_arm_sse_310.ld b/envs/m85-an555/src/platform/gcc_arm_sse_310.ld
index e82fddb..a81a4ef 100644
--- a/envs/m85-an555/src/platform/gcc_arm_sse_310.ld
+++ b/envs/m85-an555/src/platform/gcc_arm_sse_310.ld
@@ -39,8 +39,15 @@ __ROM_SIZE = 0x00020000;
   RAM Size (in Bytes) <0x0-0xFFFFFFFF:8>
 *-----------------------------------------------------------------------------*/
-__RAM_BASE = 0x31000000;
-__RAM_SIZE = 0x00010000;
+/* This is DTCM */
+__RAM_BASE = 0x30000000;
+__RAM_SIZE = 0x00008000;
+
+
+
+/* This is internal SRAM which may be slower */
+//__RAM_BASE = 0x21000000;
+//__RAM_SIZE = 0x00010000;
 
 /*--------------------- Stack / Heap Configuration ----------------------------
   Stack / Heap Configuration
@@ -48,8 +55,8 @@ __RAM_SIZE = 0x00010000;
   Heap Size (in Bytes) <0x0-0xFFFFFFFF:8>
 *-----------------------------------------------------------------------------*/
-__STACK_SIZE = 0x00000400;
-__HEAP_SIZE = 0x00000C00;
+__STACK_SIZE = 0x00006000;
+__HEAP_SIZE = 0x00000000;
 
 /*
 *-------------------- <<< end of configuration section >>> -------------------
@@ -296,7 +303,7 @@ SECTIONS
     __StackTop = .;
   } > RAM
   PROVIDE(__stack = __StackTop);
-  
+
   /* ARMv8-M stack sealing:
      to use ARMv8-M stack sealing uncomment '.stackseal' section
   */

From e014bdc4e041d9ca985e5b40f9e71c6445b4c52b Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Fri, 19 Jul 2024 17:22:27 +0800
Subject: [PATCH 29/32] M85: switch over to SRAM, so we do not run into RAM limitations

---
 envs/m85-an555/src/platform/gcc_arm_sse_310.ld | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/envs/m85-an555/src/platform/gcc_arm_sse_310.ld b/envs/m85-an555/src/platform/gcc_arm_sse_310.ld
index a81a4ef..4a01223 100644
--- a/envs/m85-an555/src/platform/gcc_arm_sse_310.ld
+++ b/envs/m85-an555/src/platform/gcc_arm_sse_310.ld
@@ -40,14 +40,14 @@ __ROM_SIZE = 0x00020000;
 *-----------------------------------------------------------------------------*/
 
 /* This is DTCM */
-__RAM_BASE = 0x30000000;
-__RAM_SIZE = 0x00008000;
+//__RAM_BASE = 0x30000000;
+//__RAM_SIZE = 0x00008000;
 
 
 
 /* This is internal SRAM which may be slower */
-//__RAM_BASE = 0x21000000;
-//__RAM_SIZE = 0x00010000;
+__RAM_BASE = 0x21000000;
+__RAM_SIZE = 0x00010000;
 
 /*--------------------- Stack / Heap Configuration ----------------------------
   Stack / Heap Configuration

From f06b3a060440a0802dcb9932434d74aa328cdc24 Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Fri, 19 Jul 2024 17:27:00 +0800
Subject: [PATCH 30/32] remove currently unsupported platforms

---
 envs/core/Cortex-M55.axf                      |  Bin 235876 -> 0 bytes
 envs/core/Makefile                            |   89 -
 envs/core/build/test_common/dummy             |    0
 envs/core/build/test_src/auto/dummy           |    0
 envs/core/build/test_src/external/dummy       |    0
 envs/core/build/test_src/manual/dummy         |    0
 envs/core/inc/hal_env.h                       |    6 -
 envs/core/inc/test_inc                        |    1 -
 envs/core/src/ARMCM55.h                       |  136 -
 envs/core/src/cachel1_armv7.h                 |  411 --
 envs/core/src/cmsis_compiler.h                |  283 --
 envs/core/src/cmsis_gcc.h                     | 2211 ---------
 envs/core/src/cmsis_version.h                 |   39 -
 envs/core/src/core_cm55.h                     | 4289 -----------------
 envs/core/src/hal.c                           |   52 -
 envs/core/src/mps3-an547.mk                   |   56 -
 envs/core/src/mps3.ld                         |  312 --
 envs/core/src/mpu_armv8.h                     |  352 --
 envs/core/src/pmu_armv8.h                     |  337 --
 envs/core/src/semihosting.c                   |   78 -
 envs/core/src/startup_ARMCM55.c               |  165 -
 envs/core/src/system_ARMCM55.c                |  151 -
 envs/core/src/system_ARMCM55.h                |   64 -
 envs/core/src/test_common                     |    1 -
 envs/core/src/test_src                        |    1 -
 envs/core/src/uart.c                          |  101 -
 envs/core/src/uart.h                          |   34 -
 envs/fvp-corstone300-mps2/.gitignore          |    5 -
 envs/fvp-corstone300-mps2/Makefile            |  180 -
 envs/fvp-corstone300-mps2/Makefile.rej        |   18 -
 envs/fvp-corstone300-mps2/README.md           |  100 -
 envs/fvp-corstone300-mps2/inc/hal_env.h       |    6 -
 envs/fvp-corstone300-mps2/src/armclang.sct    |   63 -
 envs/fvp-corstone300-mps2/src/gcc_arm.ld      |  316 --
 envs/fvp-corstone300-mps2/src/hal.c           |   46 -
 envs/fvp-corstone300-mps2/src/scatter_tmp.sct |   60 -
 envs/fvp-corstone300-mps2/src/test_common     |    1 -
 envs/fvp-corstone300-mps3/.gitignore          |    4 -
 envs/fvp-corstone300-mps3/Makefile            |  182 -
 envs/fvp-corstone300-mps3/README.md           |  100 -
 envs/fvp-corstone300-mps3/inc/hal_env.h       |    6 -
 envs/fvp-corstone300-mps3/inc/test_inc        |    1 -
 envs/fvp-corstone300-mps3/src/armclang.sct    |   63 -
 envs/fvp-corstone300-mps3/src/gcc_arm.ld      |  316 -
 envs/fvp-corstone300-mps3/src/hal.c           |   46 -
 envs/fvp-corstone300-mps3/src/handlers.c      |   41 -
 envs/fvp-corstone300-mps3/src/test_common     |    1 -
 47 files changed, 10724 deletions(-)
 delete mode 100755 envs/core/Cortex-M55.axf
 delete mode 100644 envs/core/Makefile
 delete mode 100644 envs/core/build/test_common/dummy
 delete mode 100644 envs/core/build/test_src/auto/dummy
 delete mode 100644 envs/core/build/test_src/external/dummy
 delete mode 100644 envs/core/build/test_src/manual/dummy
 delete mode 100644 envs/core/inc/hal_env.h
 delete mode 120000 envs/core/inc/test_inc
 delete mode 100644 envs/core/src/ARMCM55.h
 delete mode 100644 envs/core/src/cachel1_armv7.h
 delete mode 100644 envs/core/src/cmsis_compiler.h
 delete mode 100644 envs/core/src/cmsis_gcc.h
 delete mode 100644 envs/core/src/cmsis_version.h
 delete mode 100644 envs/core/src/core_cm55.h
 delete mode 100644 envs/core/src/hal.c
 delete mode 100644 envs/core/src/mps3-an547.mk
 delete mode 100644 envs/core/src/mps3.ld
 delete mode 100644 envs/core/src/mpu_armv8.h
 delete mode 100644 envs/core/src/pmu_armv8.h
 delete mode 100644 envs/core/src/semihosting.c
 delete mode 100644 envs/core/src/startup_ARMCM55.c
 delete mode 100644 envs/core/src/system_ARMCM55.c
 delete mode 100644 envs/core/src/system_ARMCM55.h
 delete mode 120000 envs/core/src/test_common
 delete mode 120000 envs/core/src/test_src
 delete mode 100644 envs/core/src/uart.c
 delete mode 100644 envs/core/src/uart.h
 delete mode 100644 envs/fvp-corstone300-mps2/.gitignore
 delete mode 100644 envs/fvp-corstone300-mps2/Makefile
 delete mode 100644 envs/fvp-corstone300-mps2/Makefile.rej
 delete mode 100644 envs/fvp-corstone300-mps2/README.md
 delete mode 100644 envs/fvp-corstone300-mps2/inc/hal_env.h
 delete mode 100644 envs/fvp-corstone300-mps2/src/armclang.sct
 delete mode 100644 envs/fvp-corstone300-mps2/src/gcc_arm.ld
 delete mode 100644 envs/fvp-corstone300-mps2/src/hal.c
 delete mode 100644 envs/fvp-corstone300-mps2/src/scatter_tmp.sct
 delete mode 120000 envs/fvp-corstone300-mps2/src/test_common
 delete mode 100644 envs/fvp-corstone300-mps3/.gitignore
 delete mode 100644 envs/fvp-corstone300-mps3/Makefile
 delete mode 100644 envs/fvp-corstone300-mps3/README.md
 delete mode 100644 envs/fvp-corstone300-mps3/inc/hal_env.h
 delete mode 120000 envs/fvp-corstone300-mps3/inc/test_inc
 delete mode 100644 envs/fvp-corstone300-mps3/src/armclang.sct
 delete mode 100644 envs/fvp-corstone300-mps3/src/gcc_arm.ld
 delete mode 100644 envs/fvp-corstone300-mps3/src/hal.c
 delete mode 100644 envs/fvp-corstone300-mps3/src/handlers.c
 delete mode 120000 envs/fvp-corstone300-mps3/src/test_common

diff --git a/envs/core/Cortex-M55.axf b/envs/core/Cortex-M55.axf
deleted file mode 100755
index 5023fbe5a357cae347211d849b20748825e11612..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 235876
[base85-encoded binary patch data for the deleted Cortex-M55.axf elided]
zCFsT@BC9?)U%guOl=wGc4bcaC{ST)ycAp%NAy3Q~R+FF6>lZeau@$%bWu5%w=n*~K zI;KbOK}e?nVdQ7;`)0gTUfS|W`6s#$I=EAQ$o&l5V`X|nzPMU=s;z9~Cr$$m{Kj}` zgYj;FyG3uz*R1w=O8qz08gZ@gG{W#pwI@G4as1f7KRK#r7Qe(>F9=Kdq4=VvqYm+G z>3~gJS)A~v=eGCe4)2;7xBCZq-`mP!#5>iqx9sCj0=YIn>mB?ko%ZdF@|l6QkpbE* zay-F#>U`B|@hRbNgv%_izh)-#0lGx}ZT|d3?MVUvsapJI>e-Q_pZxpS$%Na9GTYHi zZTEbqywo9$pg%EoH`48hnGtT}KLh?2+25F_$@f{UeoFN>agDIHtsdK#kFXdoFt!n4 z#^X6be|w#`r=|R?0Br?mk&Ij?^nSU+`_&HbKX!O`c6hJr@b1F9*e?F&4(|kd`}Da7;_J?cNHXQ*eYr>Uo_Q`J+^DNcUYj)4n*6Yfkl!4LKB%?JCd=Q)Au zdz`1fjb6WC_+{g{U7xY)vwiTFn^dVf)l@&5R)#ey z@g}VDu<)DwQ#u!r&ct%bi^B zRNfzvpBd}X|2^fsZC*lt#tnDoOGeN~p^=#2Kfn{V4Xc#jn@6RBZVtlCK^PiO$KR<> zS@os|Jp;xxbQGEC96NsE)6Y(VasIXq+_8=iBaPG-ZjE*HJ_GNlTA7~S=d^oo&u^)J zF!Z|x@b6Hr-=szU13+5<+6p{lj?#V}nyAkjuhj=z{2r2jJ@FBJ){BhY5Bl5sYsM@s zO|g^ZU+W#`)xx>DT!ewB zL;Y-IY?7SL?froKo`+v&_)$Nh_qXu=mP{LTYZiyeYr0k+lz_Y#y>zLYz+dnTuRG46 zC@xmySwGMty<724c(vtgJKl+Bx9^_=ytJ$sC z@bmmmlAqz;mTwp6K9=cjn-|M%mN5Hp3)%_%E&8zAXR64+O07Q3;&-q78wDDof95q0 zroW+2&_~UeDR1F_6|`nK?puwht!|>Wdg4j(jP3WJxt%n6)5wS4JUrh+@FW8si5@1% zI3WwS&3h=mHJH_0MHs5v_I2Hnb_o3Uf;MOy=rM2p9gO*xwYLut9pj-in-Nd*8|b6| zAx)x<0uNvGW?OxE=XY@L7%#N};`esYuE6u}{dtEGan3S;S0YRSp6}G3I^stB?iqwJ{?jwp9j6zc<1mSBbDsluuyX1vcibl+ z3_nNs4A3o->2Bv^?Rh4p?E>u?(9(0qyih+04gjBrKlMSR=~+SOo)fTw7^l}^J)psQ z0PV-*rELZ+g=@Fo0CS`jhjAjUACBHya{MMeienAZ>-~XeeS8PJ&@(Y7Fj{COfnN_V zrZ7hl<^hD4^f%WW!L#MnMw0wXxDefd{cY{>RzL1oZlX1S(Qg6m06e!Fvwo9jq8$U; z&p}IV{*U^s5&jvVb%NH6XRC)_NW?EJ?bZcEV>;lekG3#l@J?a8bmKdQXeq?*4SnRmkfN%-F*r*X3n&(V4&n$+}}(HqYh&@I4o(yh7sL-O{}Nl%-k zv__QvX8!Q!_Hj|0A-d;5Hv`X!x3Ki^MfhEX-w61%FH2iE<6*Sz4t^c6Bif8E=!c*k z;Gw-aAtPGy+Y3Jt&mF@OX4~Qa0sLq1;DPNLh0ve#yA%Bx(Cdz1=|S}1R-VlWVH5}O zC@;-+XcS}(B1M&zUg-ceirh(uPxqjepGK1XYju<|0F+cQk!R@ z-_6I2JLWaJEA|2q?oMT+vXg)Lar9mE!S-e6tA5*OiQW$Sw?1X;Y0AeWy*Lqp*lfRP z)M=@FDKQSXpqgdB8Sn#}U|KZr=lF4cZSEPrl;cxt%{9lIuW%#2z zFf=2XDg_U;Jnt#Ld-}J>kFl}s+?V`cjhzXURMnZsU(vLJqP7MiB$z1>BWTlUnyrmn zBeVj7Y>kiz^Qi86MOUcmDr%u=*Ny`bb6mh9QPjv1R7fe@VCT||@lXh)_HO{TM{ARTlR@ZKlZ5s!8zQ>kby)fwx` 
zM-#DR%0&14M7|J1^v|10Mfd!LK_Q!rfBs}TzUL?Nu|lEi^QY7CJ%16a|G6v4*q)!t z2f?DxpGZYV=l?&U%Jlk*S9GK*r5&(~CC1?b;hqxHI;#-n-zu+K!InXu;bSTo57!Vz zebvf)DfYULjq;*OX`jBO`8WqketZS@sbCxmM5w~5T+$=l@9FJ7g5_LZ92=| zig=W_ob8zMcEJOz2%pT4^7k;0bkO-75FE84e3r`poxK_HRMv(gsO|X>Ei?T~Hua6_ z6xFJ}J9IbVR>Yf*zrQrx!^D40$P|f}PgxKQakCUBtf}yi4#x@L|D=!F@^@M7(Rk-GXlgKOy*T@E*ZG0^4BqZJ&@o zA0kJGMRl5LZy=|XVD{g>6SdTihh+BO0U@vKzdxe9TAY=2h2Pa-`j;etiF4Mob{c{L@Mb1&0?Qpc`D?u>B)&(5s&rXQZix@ z&kw2pjxV|NyX_@D**@yWp_5$u*q$AB;WqPCGOPBqkXN?nR@CQhA*Vj~`Zrr>#N+&^ z5zP4!70mf@yI{_@jeI5w4d{ zg3lAo{D+AAUV!{kFjd^HmkWfP`9CjsAiuA0?RS&l61fT)~tt5KQ^if+=4nnDTXk-wGL@_f9SE zzpU6lG5s>Z%>O3AlOFof+^o6cwa^RnBS{H&iMNTQ+_}&<>w8@ z9~E?YO@dW@1ph0fyrn|U_;3HtWCwjEzrlx{D(L**7EJlN5k`#otlvh#l+SG_m(LSS zdF{w@dA(rD_X$?*7fkunqsr6o6ioTmqs!&pf+?Rqrd-}4_|uU78q(6$FvbkKoPB7)&7njRd3Z^`JNx8gRFy#v-mdmq(DX*DaE*~nmPep&SKj<5& zL&Q5RL{7g)D`)=p{78#?W=Q;&X)gZgW9+rwZ?HW2L+YT%ryzJ_NPhH-l<}Y7AxQ;2 zf9Mw}nZA&gS>I)X&kM<)evmT$oq}0k`bAoKGtz$_%;#`iZz!jKqz(~p8sv{v$mu6( z<&00?NXu%x1XE65Nh$9&%Q5Bjh15a!-yXq~(-%_84+*B6zL8Siv&Avx^p(^>=U*e3 za{5k6`9Q&x)0a}p8wFENe@ZEzC-{nx@u5$pjDJY5s_*P_d9OK+gOL2_7pa3DKl(vh zPFKYL0t5$yy;%bJA@ERaQ0)1atkdiBaQBOv7~}ei{8(T5Q0k!jcbDJ|A@S))Y2!0J z{U{~Vk5V%IC?(U6QZoH0CDV^mGW{qeE9X^pL-!Z`D6O3Rv+E4oe*L`Q)e3&Cg5Rj% zw<~y01;1CpdnUyrypH@$ZZF z>Y(dKe^$#(|GenWf1rQJHMnn|55FPghao3hd38^d6+XBhX8Y;;s)H{7LBW*M|JBM* z3Q14@S1V`z_6Y7D5}!V>GXA#f91jYKPd`{2e+1;)!F_SV>cdVUXZh<&`+0p{r{Am& zx_v9V%eh|I&p(Hhf2W9lVTgRAkhA=!1y2l#zf;JkLH;b5KC`wxd&T&%|LGsAgRcL^ z|8T7IS1`-(zpPw7Q84B0H~I$z9uAp z`qIkyHwpeuNc@#T&i+Y#&!uPi^qY(+RzO9y7pPSY?W_R{S0m~#5ES~>ZN zbp~PKncuSYjw!#Xg6WT{gN{FNgJZ@YTEX-~)j`MKxvQM_Rq(c#%H?kh-U9t0_dyh{ zM?VkIr}rw#tNDS;FKqu!KUN)d{U+Y;nC+P&SdG8nCqv4oKdTNpzj+Th-Wd{~KCL!> z<^G1{(~s54UqdsOBt3mvbExMTJw{Z)0)2AUsfw;e)L_ntm-FtMo4=4vdZ{b!OV|7 ztWv&B@D(BX(RWqG-zAv&(SOy-6Cvs8ziQ_#`o*0kL3OVzqU#kvUpXtjgnZB!% z>7y!{{-~B&-b%sDpMI!1=<(bmm~#4}O8LMYjwz1{X8y|rcZalRg^)A;7QxJKtKjd3 zq~9jwcZSI6hpI!w+Z-aN@28bN9wMhNrB!|S`DCLO1MevtG;%^ml 
z=D$Pm;E?!F3pwMzQNi`Xo()5KGP^kff2dI*`Rx(uqljmNeZRHf6?;eflUcUwwe=@s z3Fjey%al(JDQ~YRkNSQ{uq}z7EPn>lZxQG3R`4dl#R~ooc%6{nRlyHcF#8XE6J>dP zA5WJ4CuHfL@oS`iAN%1Z+&=~F+Xa3M`d&-!5%Jyv+w5vk|G$BkV}IIe;vWG24(rW{7#`9`F#Z(EYROCXL)CXZ#)NUoJl_x{O~wDV~Xhdi~}!6 ze{#RX_>;hCA8Wkf8Q{k-zpo@C|8IgBcHooc$MG59e5l$3PJ?4-x%yrU&ZBTX0%QK) z1^=YZ)#nbd%zrKT!bUugF!48oy>9$NVE;Tq_H_T(p#Pr+5BpF3yCSULF7S;zTzw9n zitiqvJ?t-*PhUhn&qwo7|6{#F;Ac(-AW+_~m**`;`BB4z!RKH-vi;Yqzvqp>dNIYw zheQ5s%(Ztkc)(256$LQA^TAzDKqpQ7OTah6f5MLEH^BQbzD-7+0M8zSXJUr01n-`q zf2V}`-2lGse9v2A+R7od#Ym#=jQKU$(LBZ3It3 z<81#mf$jU)8n1^p1&qM#)p*YVZ-)MT7p&$B#;ZQ6?GwxQA>WPtu5JH9u=NM7@v6KG z`29ILL)LFOcn0*fC-Vc}4qo4kIcsH6F<{$L;#N#|~IK~%Us_OR@aOUg!djyPs7TEPy zm3JZdZrjRrL>%gN?pC}piSqWYR{j>dh zH~8Ua^mk7ve}Md)end$A3HXztP9L^|M?a6uOnxtbdoM8UMR~7)XPm3Qlfv@f0?Yb; z1g8Iv?(b@^$MK$b^Lp$7nIGaG2kydPS?&-1_-xPXHu5vTXJCGO#qig_*I@o$ZnzO# zME$J&yAUkPp9vlW{j&3a0hrGtb|O9PUmARHnA7iWFyEh9{a6KlWQWs-yTOh3tL{aa zy}Sp&cD<^BK0N|Pi1N?LJ*vH5fmflwRsGSwFM?-7e>tLT|C`_&(ch;Sejj|_U9LYq z1kb+>?|?C*82>PMH`*UJ@q1%nlT$GMlq3HyfIO zisi?^tDkrJmjv_uu#LY0%=huSJrVD2aK9g5PE_gk-Va^_`!>0{UpZc1sDV8ifjz6?Gr`{;?Z)q1%EkPi4Bm~wjGFi@-~kxV zA%@$*`)9fJ=qm6_-_yUp!Sb#L+wb$xp56lXy73Qz_ULZ1&R+5T;NG*5ALk>|{}?>t zahLzEz^g?4UIM>#owoms|3~n@hI?KUG9`Zm?g#sErJ3JHz_$M=KMDFgG-~ZrwRb-H zOU}o^kbf-t?*gz*&;8mAupHl?;J`6#fm@$^$RELcL9wbn3Gkrr!(JF3-WT@fDcB#w z8OV>v`g^hAtH1}L!wjtNt>9sk-TYh&UiEV~pMD087<{eW9< zUIrh)e9`vhSZ^8`nc1`?S3u>Who2Ss}uz`w`(Fw^9J5##?m-eDP@1b#^DXRZJbxYXJI1z`FM zp*djXb4^#2C%yP~}N!Nag#Sbh94c)$Y3Tfql0-=~=Jp9DWU%GryT!RL$l{|E2~ z=ntwK+xHjnpxa!2e+Ngdhkj#TQT|VG@446?RBPrzWjE?G(4;>JEYFW$2ET*;v+^^+ z?_<8$@fZc}g7Vt-OaS*od!e$byi38-p3VYa1ABrGC7+LrlhMDH)3`{sc-}fAzZxv} zQ>(z+k)FSA!1mk;{`e7R&o_W?KHH7=L*REr|Njas&)3g^&p`QKHRYXyfjsy;!0;a+ z|2-ND6;b8A2j=r`JKhJtYr38O{V#Ys<`0HY#qZe<3?%$R|zWYLVp1uwI49?H>!Hco}U<#@F zJO`HTeHlDktRL@yr9C?UUaivT^rv7ya>h2N4_^k;9}(T5>e~qZ4f-Qt%9{wb-&bY- z&IT{%;_N0*Nx|!V7Y(#EBHR>pPf(pz@N-=`c-u*(qnv9 znEby8wt?&UATZwtM9@Jzj|~G$`#l!Ce~z>F)4=li+c&`x*atLQwI>J83w?YYf_G6r 
zE58nMH$SVq+rXDLx%G4t_$AD*Att}a!1NcjISZ_{mwmkd$Y}8VN8NgO0n=kV8IAc(!i799 z%?2-?2z!hQQXYWkKzXnwP<$2mm$0W$6UEEHe;4Puo56BF^gZy5R;(vq*6}xho9}Y- z=RxqGue;|R+rZCa{bAsKdne^P-1Ed&z-yb_`u8XBFR-7n^Y=q=5%rtG^r-*8z@OZY zcLat%0gsA$(G=Rp_4$MWus>)VhCsz12u4`IKkUCV!I5iSf7OFO7~}TS&0zZb+VSiQ zeIEXpJD;{feh$_XTfb|2;LUg98MfgYz^lPn_Lbbj?+N4z z#jHPW=G5_1M~}uY6AX!aLkeD7u8^$D@FQO+CK9+GOQn4NPC+t=`-#?MA@BKq^F;sR znd7E5`xiA|GU+nk$E|qAUzp0Y#!`N~kjdr!SaGS>mf`Ok;KvQ>Muf-k6UmOGAIs%p zT|RyVA=l+4a0k=i8GBlM*Gllm=AFOtIjrp4OS7QZAkW6;G8vLJ0M>l7%+QFLWm3@l=pU z`$V%F{E@cRI?D!;Cq=6GkOY1vi<%~{43M8pscP_<@v{Mj)+DgGe`bpC8kYu&H9%hN zu@rPJAEcA*nS3GHu@E$~dHS@OEq?Rd@y*j)CQZA{8<&MTw#5p`Oh@z5WI;yv19hV- zUkK7dxr@+tk#&6d>i66IR6Mx^TEX@tTNk#qHBb~^l8*a@6uUW* zfWEZjCUiEQ$WrPjvgpyaMf$4Pp{~=IFqyU-BX-78i|~orQ_S(lKlFtkj}>Cj)+B3$ zvi#WabdYY#cA*elnEq%%y^k;h3_m zpe2cH4qC&>>-+gu4PI+1*0Cs$nRsQ8%Rq7Jy(Oh+ow;NoV1}J2y{OyP#)%&*usI`qxYtA9Fp1kHy7IE=YR``zt4$*0`NmqB6!6S`$kJm`F7z%DnXE9=( zx&fUqLYXvszq5@ODr&4Xqe^WnbY%k{vV3x3hgNWP!R%nk(tf&_iYG=QH$^Z|`C_ZM zL2+`4xPhYMi3S_L*pbP_1t*eASIq!)ZHeGo%b64pG z=HJ2*YOML6F!*uwOtAx*%);{I&zkhjW?aWQy8H}o>ZbZR+|p@bU%CoG2Uj$01DHG& zK)r)l2NuVLsbavD!p~vt%a3eOrZzufj6dG~wvOj>f~qX_m>{dmv1nlhG(T+v*#s#0 z!l{`Key!VJe5M6(!TJmg`t9Fa;ZpR+(b@1hjH_ zjtxJl^^i=A<>D8|im5`WX+8{yo;467QUWHU0PB^{;`fjIVj~kL(y$h?Z;Ne#>H>B# z#zq-X-G*EWiVGs6kS z)*HLxVHbm*SrX*(0hV*ri(3T#){+`_Z#wjS*5#%AE}J^3bnRcHDu-HPGQ^jv#m@I9 zOqgj$Xkmfl2Qdu|KT$4w3|og-4wFvTP1#A@b!O2|d4&MG4Ara#_Jf+K>_pYxQgb<6 zFki6ErfIrQgA7b!EG}as7d^IIrqv39jxb84)xFK=J5?M;r!)p+SRkeHKDJ2E&_b*= z6_iwu8#`)>(%VLF5`Mrcmn&uqBV9-NGpCse$lR2+qgAx?S!%Vd0Ieuyob7w1ZAz{4 zO0E5r>QAi%s)I7rL!H{1HMK~pc?v_RCRM3Ijim}fORyyv56v6TDZ(M)1_Ga*IhZ*7 z#S&N|)t6f9sON~Aq?(vfaSa|_${&?di;SNrc5pQ^7KYYkc0RT+sF)^J%L|#TpT)SD z-|ABBp9q^fC7D(;tu>!7^*T1$`HbJLj0j3FoseumYN1FlFZgd^u;DPunQUl98v@#!DM(=z!X8V;AlhVfCmJFsfqW!?rB!fS%fRDb>Wx(R&KysrO%y8~c!sEmgIG zX@*weJ|-G!luzghsCy@l#)*lg>;+1QW4RouDp68&x)x{yh)W;a!#1%Xsw%6QiRwzrs3s!r zp>vWQ<7c+a(#Jb=pW5B1k<%L4im}&AULJ~}b|@%EO}S!+o^%{*+)HC>Ql$5|J}MPz 
zU2tfiq4!}_VN?B4N7rl2DBBWu-1BAzSSj4>L1(67$ut^|9j=>SV@okgQ@Wva%lF%p z_2xus_B<>|kC4%l%0VhKvTo@sL9JMoKV^1N`)_G4{}i%Hms zQvV!1Y^cRiO;kf^+6eVnSnznVQX1;b;$Gt@JC`r%k1kOy0qxXnnzoV&X&E>EQoDw7 z>!!^K)-+{xjeYa6uSOYA@};G&q@7qu!>!IM+@fZ4nYMiS5~(`WRGa2(LXarmwOFH~ z%pA7_$nY~e&2{}bqU+5R(%UipGlkFq{gl9%JGC~2oTlZxGA62Y-m)!%2`Xgi4gJ@1bsISp@bWjd+NMW)g!4)o!na(P3dE`{(KC7hhV_X~5wnW(q-B(;%%9eUs z7^Ai-T6q_lD@C2j;8=&m z4y^B`UZe$PWpvhDr3#q>9*c32D%-cQGb@T!M;|vO*?|}@FVC%>ZWqisPTgWpY{zrf z<>O0LKsvp{)Qcw`W8j$1R$8Nof@Ar%WYX7%RiZjRrMTkRvoRB>3h7fTkBoND*@|Nm zBJ%kL(wa$Z&&Z~F%F^l|F*U^&ay0NrAodW-lwud5&(nBxVGicp)nmUlNuM{$2gghS z>=v-q7LSVjd{>?$s?StP96Jzj7t&^ z1&rA zCRbl|@>b`0r3JP=A$v46il(2bTP3S)x?G(YRsHz7LPd1i!qr=C5ijBn2X5N=t~5^A z_$=h~XS=;p&kySGnx;-Yaq#M>Id$qd?$yzd+H3VxxeiN3+N;CX%B$m}JLU+>=BW(u zqt3ZxE1rZvh>6VKo%ZUqYN+=sDzd6XtQE!MRng}TcrAywPTD_QCyJE(8OeN4%P= 6010050) - #pragma clang diagnostic push - #pragma clang diagnostic ignored "-Wc11-extensions" - #pragma clang diagnostic ignored "-Wreserved-id-macro" -#elif defined (__GNUC__) - /* anonymous unions are enabled by default */ -#elif defined (__TMS470__) - /* anonymous unions are enabled by default */ -#elif defined (__TASKING__) - #pragma warning 586 -#elif defined (__CSMC__) - /* anonymous unions are enabled by default */ -#else - #warning Not supported compiler type -#endif - - -/* -------- Configuration of Core Peripherals ----------------------------------- */ -#define __CM55_REV 0x0001U /* Core revision r0p1 */ -#define __NVIC_PRIO_BITS 3U /* Number of Bits used for Priority Levels */ -#define __Vendor_SysTickConfig 0U /* Set to 1 if different SysTick Config is used */ -#define __VTOR_PRESENT 1U /* VTOR present */ -#define __MPU_PRESENT 1U /* MPU present */ -#define __FPU_PRESENT 1U /* FPU present */ -#define __FPU_DP 1U /* double precision FPU */ -#define __DSP_PRESENT 1U /* DSP extension present */ -#define __SAUREGION_PRESENT 1U /* SAU regions present */ -#define __PMU_PRESENT 1U /* PMU present */ -#define __PMU_NUM_EVENTCNT 8U /* PMU Event 
Counters */ -#define __ICACHE_PRESENT 1U /* Instruction Cache present */ -#define __DCACHE_PRESENT 1U /* Data Cache present */ - -#include "core_cm55.h" /* Processor and core peripherals */ -#include "system_ARMCM55.h" /* System Header */ - - -/* -------- End of section using anonymous unions and disabling warnings -------- */ -#if defined (__CC_ARM) - #pragma pop -#elif defined (__ICCARM__) - /* leave anonymous unions enabled */ -#elif (defined(__ARMCC_VERSION) && (__ARMCC_VERSION >= 6010050)) - #pragma clang diagnostic pop -#elif defined (__GNUC__) - /* anonymous unions are enabled by default */ -#elif defined (__TMS470__) - /* anonymous unions are enabled by default */ -#elif defined (__TASKING__) - #pragma warning restore -#elif defined (__CSMC__) - /* anonymous unions are enabled by default */ -#else - #warning Not supported compiler type -#endif - - -#ifdef __cplusplus -} -#endif - -#endif /* ARMCM55_H */ diff --git a/envs/core/src/cachel1_armv7.h b/envs/core/src/cachel1_armv7.h deleted file mode 100644 index abebc95..0000000 --- a/envs/core/src/cachel1_armv7.h +++ /dev/null @@ -1,411 +0,0 @@ -/****************************************************************************** - * @file cachel1_armv7.h - * @brief CMSIS Level 1 Cache API for Armv7-M and later - * @version V1.0.1 - * @date 19. April 2021 - ******************************************************************************/ -/* - * Copyright (c) 2020-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#if defined ( __ICCARM__ ) - #pragma system_include /* treat file as system include file for MISRA check */ -#elif defined (__clang__) - #pragma clang system_header /* treat file as system include file */ -#endif - -#ifndef ARM_CACHEL1_ARMV7_H -#define ARM_CACHEL1_ARMV7_H - -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_CacheFunctions Cache Functions - \brief Functions that configure Instruction and Data cache. - @{ - */ - -/* Cache Size ID Register Macros */ -#define CCSIDR_WAYS(x) (((x) & SCB_CCSIDR_ASSOCIATIVITY_Msk) >> SCB_CCSIDR_ASSOCIATIVITY_Pos) -#define CCSIDR_SETS(x) (((x) & SCB_CCSIDR_NUMSETS_Msk ) >> SCB_CCSIDR_NUMSETS_Pos ) - -#ifndef __SCB_DCACHE_LINE_SIZE -#define __SCB_DCACHE_LINE_SIZE 32U /*!< Cortex-M7 cache line size is fixed to 32 bytes (8 words). See also register SCB_CCSIDR */ -#endif - -#ifndef __SCB_ICACHE_LINE_SIZE -#define __SCB_ICACHE_LINE_SIZE 32U /*!< Cortex-M7 cache line size is fixed to 32 bytes (8 words). 
See also register SCB_CCSIDR */ -#endif - -/** - \brief Enable I-Cache - \details Turns on I-Cache - */ -__STATIC_FORCEINLINE void SCB_EnableICache (void) -{ - #if defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U) - if (SCB->CCR & SCB_CCR_IC_Msk) return; /* return if ICache is already enabled */ - - __DSB(); - __ISB(); - SCB->ICIALLU = 0UL; /* invalidate I-Cache */ - __DSB(); - __ISB(); - SCB->CCR |= (uint32_t)SCB_CCR_IC_Msk; /* enable I-Cache */ - __DSB(); - __ISB(); - #endif -} - - -/** - \brief Disable I-Cache - \details Turns off I-Cache - */ -__STATIC_FORCEINLINE void SCB_DisableICache (void) -{ - #if defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U) - __DSB(); - __ISB(); - SCB->CCR &= ~(uint32_t)SCB_CCR_IC_Msk; /* disable I-Cache */ - SCB->ICIALLU = 0UL; /* invalidate I-Cache */ - __DSB(); - __ISB(); - #endif -} - - -/** - \brief Invalidate I-Cache - \details Invalidates I-Cache - */ -__STATIC_FORCEINLINE void SCB_InvalidateICache (void) -{ - #if defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U) - __DSB(); - __ISB(); - SCB->ICIALLU = 0UL; - __DSB(); - __ISB(); - #endif -} - - -/** - \brief I-Cache Invalidate by address - \details Invalidates I-Cache for the given address. - I-Cache is invalidated starting from a 32 byte aligned address in 32 byte granularity. - I-Cache memory blocks which are part of given address + given size are invalidated. 
- \param[in] addr address - \param[in] isize size of memory block (in number of bytes) -*/ -__STATIC_FORCEINLINE void SCB_InvalidateICache_by_Addr (volatile void *addr, int32_t isize) -{ - #if defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U) - if ( isize > 0 ) { - int32_t op_size = isize + (((uint32_t)addr) & (__SCB_ICACHE_LINE_SIZE - 1U)); - uint32_t op_addr = (uint32_t)addr /* & ~(__SCB_ICACHE_LINE_SIZE - 1U) */; - - __DSB(); - - do { - SCB->ICIMVAU = op_addr; /* register accepts only 32byte aligned values, only bits 31..5 are valid */ - op_addr += __SCB_ICACHE_LINE_SIZE; - op_size -= __SCB_ICACHE_LINE_SIZE; - } while ( op_size > 0 ); - - __DSB(); - __ISB(); - } - #endif -} - - -/** - \brief Enable D-Cache - \details Turns on D-Cache - */ -__STATIC_FORCEINLINE void SCB_EnableDCache (void) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - uint32_t ccsidr; - uint32_t sets; - uint32_t ways; - - if (SCB->CCR & SCB_CCR_DC_Msk) return; /* return if DCache is already enabled */ - - SCB->CSSELR = 0U; /* select Level 1 data cache */ - __DSB(); - - ccsidr = SCB->CCSIDR; - - /* invalidate D-Cache */ - sets = (uint32_t)(CCSIDR_SETS(ccsidr)); - do { - ways = (uint32_t)(CCSIDR_WAYS(ccsidr)); - do { - SCB->DCISW = (((sets << SCB_DCISW_SET_Pos) & SCB_DCISW_SET_Msk) | - ((ways << SCB_DCISW_WAY_Pos) & SCB_DCISW_WAY_Msk) ); - #if defined ( __CC_ARM ) - __schedule_barrier(); - #endif - } while (ways-- != 0U); - } while(sets-- != 0U); - __DSB(); - - SCB->CCR |= (uint32_t)SCB_CCR_DC_Msk; /* enable D-Cache */ - - __DSB(); - __ISB(); - #endif -} - - -/** - \brief Disable D-Cache - \details Turns off D-Cache - */ -__STATIC_FORCEINLINE void SCB_DisableDCache (void) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - uint32_t ccsidr; - uint32_t sets; - uint32_t ways; - - SCB->CSSELR = 0U; /* select Level 1 data cache */ - __DSB(); - - SCB->CCR &= ~(uint32_t)SCB_CCR_DC_Msk; /* disable D-Cache */ - __DSB(); - - ccsidr = SCB->CCSIDR; - - /* clean & 
invalidate D-Cache */ - sets = (uint32_t)(CCSIDR_SETS(ccsidr)); - do { - ways = (uint32_t)(CCSIDR_WAYS(ccsidr)); - do { - SCB->DCCISW = (((sets << SCB_DCCISW_SET_Pos) & SCB_DCCISW_SET_Msk) | - ((ways << SCB_DCCISW_WAY_Pos) & SCB_DCCISW_WAY_Msk) ); - #if defined ( __CC_ARM ) - __schedule_barrier(); - #endif - } while (ways-- != 0U); - } while(sets-- != 0U); - - __DSB(); - __ISB(); - #endif -} - - -/** - \brief Invalidate D-Cache - \details Invalidates D-Cache - */ -__STATIC_FORCEINLINE void SCB_InvalidateDCache (void) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - uint32_t ccsidr; - uint32_t sets; - uint32_t ways; - - SCB->CSSELR = 0U; /* select Level 1 data cache */ - __DSB(); - - ccsidr = SCB->CCSIDR; - - /* invalidate D-Cache */ - sets = (uint32_t)(CCSIDR_SETS(ccsidr)); - do { - ways = (uint32_t)(CCSIDR_WAYS(ccsidr)); - do { - SCB->DCISW = (((sets << SCB_DCISW_SET_Pos) & SCB_DCISW_SET_Msk) | - ((ways << SCB_DCISW_WAY_Pos) & SCB_DCISW_WAY_Msk) ); - #if defined ( __CC_ARM ) - __schedule_barrier(); - #endif - } while (ways-- != 0U); - } while(sets-- != 0U); - - __DSB(); - __ISB(); - #endif -} - - -/** - \brief Clean D-Cache - \details Cleans D-Cache - */ -__STATIC_FORCEINLINE void SCB_CleanDCache (void) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - uint32_t ccsidr; - uint32_t sets; - uint32_t ways; - - SCB->CSSELR = 0U; /* select Level 1 data cache */ - __DSB(); - - ccsidr = SCB->CCSIDR; - - /* clean D-Cache */ - sets = (uint32_t)(CCSIDR_SETS(ccsidr)); - do { - ways = (uint32_t)(CCSIDR_WAYS(ccsidr)); - do { - SCB->DCCSW = (((sets << SCB_DCCSW_SET_Pos) & SCB_DCCSW_SET_Msk) | - ((ways << SCB_DCCSW_WAY_Pos) & SCB_DCCSW_WAY_Msk) ); - #if defined ( __CC_ARM ) - __schedule_barrier(); - #endif - } while (ways-- != 0U); - } while(sets-- != 0U); - - __DSB(); - __ISB(); - #endif -} - - -/** - \brief Clean & Invalidate D-Cache - \details Cleans and Invalidates D-Cache - */ -__STATIC_FORCEINLINE void SCB_CleanInvalidateDCache (void) -{ - 
#if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - uint32_t ccsidr; - uint32_t sets; - uint32_t ways; - - SCB->CSSELR = 0U; /* select Level 1 data cache */ - __DSB(); - - ccsidr = SCB->CCSIDR; - - /* clean & invalidate D-Cache */ - sets = (uint32_t)(CCSIDR_SETS(ccsidr)); - do { - ways = (uint32_t)(CCSIDR_WAYS(ccsidr)); - do { - SCB->DCCISW = (((sets << SCB_DCCISW_SET_Pos) & SCB_DCCISW_SET_Msk) | - ((ways << SCB_DCCISW_WAY_Pos) & SCB_DCCISW_WAY_Msk) ); - #if defined ( __CC_ARM ) - __schedule_barrier(); - #endif - } while (ways-- != 0U); - } while(sets-- != 0U); - - __DSB(); - __ISB(); - #endif -} - - -/** - \brief D-Cache Invalidate by address - \details Invalidates D-Cache for the given address. - D-Cache is invalidated starting from a 32 byte aligned address in 32 byte granularity. - D-Cache memory blocks which are part of given address + given size are invalidated. - \param[in] addr address - \param[in] dsize size of memory block (in number of bytes) -*/ -__STATIC_FORCEINLINE void SCB_InvalidateDCache_by_Addr (volatile void *addr, int32_t dsize) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - if ( dsize > 0 ) { - int32_t op_size = dsize + (((uint32_t)addr) & (__SCB_DCACHE_LINE_SIZE - 1U)); - uint32_t op_addr = (uint32_t)addr /* & ~(__SCB_DCACHE_LINE_SIZE - 1U) */; - - __DSB(); - - do { - SCB->DCIMVAC = op_addr; /* register accepts only 32byte aligned values, only bits 31..5 are valid */ - op_addr += __SCB_DCACHE_LINE_SIZE; - op_size -= __SCB_DCACHE_LINE_SIZE; - } while ( op_size > 0 ); - - __DSB(); - __ISB(); - } - #endif -} - - -/** - \brief D-Cache Clean by address - \details Cleans D-Cache for the given address - D-Cache is cleaned starting from a 32 byte aligned address in 32 byte granularity. - D-Cache memory blocks which are part of given address + given size are cleaned. 
- \param[in] addr address - \param[in] dsize size of memory block (in number of bytes) -*/ -__STATIC_FORCEINLINE void SCB_CleanDCache_by_Addr (volatile void *addr, int32_t dsize) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - if ( dsize > 0 ) { - int32_t op_size = dsize + (((uint32_t)addr) & (__SCB_DCACHE_LINE_SIZE - 1U)); - uint32_t op_addr = (uint32_t)addr /* & ~(__SCB_DCACHE_LINE_SIZE - 1U) */; - - __DSB(); - - do { - SCB->DCCMVAC = op_addr; /* register accepts only 32byte aligned values, only bits 31..5 are valid */ - op_addr += __SCB_DCACHE_LINE_SIZE; - op_size -= __SCB_DCACHE_LINE_SIZE; - } while ( op_size > 0 ); - - __DSB(); - __ISB(); - } - #endif -} - - -/** - \brief D-Cache Clean and Invalidate by address - \details Cleans and invalidates D_Cache for the given address - D-Cache is cleaned and invalidated starting from a 32 byte aligned address in 32 byte granularity. - D-Cache memory blocks which are part of given address + given size are cleaned and invalidated. 
- \param[in] addr address (aligned to 32-byte boundary) - \param[in] dsize size of memory block (in number of bytes) -*/ -__STATIC_FORCEINLINE void SCB_CleanInvalidateDCache_by_Addr (volatile void *addr, int32_t dsize) -{ - #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) - if ( dsize > 0 ) { - int32_t op_size = dsize + (((uint32_t)addr) & (__SCB_DCACHE_LINE_SIZE - 1U)); - uint32_t op_addr = (uint32_t)addr /* & ~(__SCB_DCACHE_LINE_SIZE - 1U) */; - - __DSB(); - - do { - SCB->DCCIMVAC = op_addr; /* register accepts only 32byte aligned values, only bits 31..5 are valid */ - op_addr += __SCB_DCACHE_LINE_SIZE; - op_size -= __SCB_DCACHE_LINE_SIZE; - } while ( op_size > 0 ); - - __DSB(); - __ISB(); - } - #endif -} - -/*@} end of CMSIS_Core_CacheFunctions */ - -#endif /* ARM_CACHEL1_ARMV7_H */ diff --git a/envs/core/src/cmsis_compiler.h b/envs/core/src/cmsis_compiler.h deleted file mode 100644 index adbf296..0000000 --- a/envs/core/src/cmsis_compiler.h +++ /dev/null @@ -1,283 +0,0 @@ -/**************************************************************************//** - * @file cmsis_compiler.h - * @brief CMSIS compiler generic header file - * @version V5.1.0 - * @date 09. October 2018 - ******************************************************************************/ -/* - * Copyright (c) 2009-2018 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#ifndef __CMSIS_COMPILER_H -#define __CMSIS_COMPILER_H - -#include - -/* - * Arm Compiler 4/5 - */ -#if defined ( __CC_ARM ) - #include "cmsis_armcc.h" - - -/* - * Arm Compiler 6.6 LTM (armclang) - */ -#elif defined (__ARMCC_VERSION) && (__ARMCC_VERSION >= 6010050) && (__ARMCC_VERSION < 6100100) - #include "cmsis_armclang_ltm.h" - - /* - * Arm Compiler above 6.10.1 (armclang) - */ -#elif defined (__ARMCC_VERSION) && (__ARMCC_VERSION >= 6100100) - #include "cmsis_armclang.h" - - -/* - * GNU Compiler - */ -#elif defined ( __GNUC__ ) - #include "cmsis_gcc.h" - - -/* - * IAR Compiler - */ -#elif defined ( __ICCARM__ ) - #include - - -/* - * TI Arm Compiler - */ -#elif defined ( __TI_ARM__ ) - #include - - #ifndef __ASM - #define __ASM __asm - #endif - #ifndef __INLINE - #define __INLINE inline - #endif - #ifndef __STATIC_INLINE - #define __STATIC_INLINE static inline - #endif - #ifndef __STATIC_FORCEINLINE - #define __STATIC_FORCEINLINE __STATIC_INLINE - #endif - #ifndef __NO_RETURN - #define __NO_RETURN __attribute__((noreturn)) - #endif - #ifndef __USED - #define __USED __attribute__((used)) - #endif - #ifndef __WEAK - #define __WEAK __attribute__((weak)) - #endif - #ifndef __PACKED - #define __PACKED __attribute__((packed)) - #endif - #ifndef __PACKED_STRUCT - #define __PACKED_STRUCT struct __attribute__((packed)) - #endif - #ifndef __PACKED_UNION - #define __PACKED_UNION union __attribute__((packed)) - #endif - #ifndef __UNALIGNED_UINT32 /* deprecated */ - struct __attribute__((packed)) T_UINT32 { uint32_t v; }; - #define __UNALIGNED_UINT32(x) (((struct T_UINT32 *)(x))->v) - #endif - #ifndef __UNALIGNED_UINT16_WRITE - __PACKED_STRUCT T_UINT16_WRITE { uint16_t v; }; - #define __UNALIGNED_UINT16_WRITE(addr, val) (void)((((struct T_UINT16_WRITE *)(void*)(addr))->v) = (val)) - #endif - #ifndef __UNALIGNED_UINT16_READ - __PACKED_STRUCT T_UINT16_READ { uint16_t v; }; - #define __UNALIGNED_UINT16_READ(addr) (((const struct T_UINT16_READ *)(const void *)(addr))->v) 
- #endif - #ifndef __UNALIGNED_UINT32_WRITE - __PACKED_STRUCT T_UINT32_WRITE { uint32_t v; }; - #define __UNALIGNED_UINT32_WRITE(addr, val) (void)((((struct T_UINT32_WRITE *)(void *)(addr))->v) = (val)) - #endif - #ifndef __UNALIGNED_UINT32_READ - __PACKED_STRUCT T_UINT32_READ { uint32_t v; }; - #define __UNALIGNED_UINT32_READ(addr) (((const struct T_UINT32_READ *)(const void *)(addr))->v) - #endif - #ifndef __ALIGNED - #define __ALIGNED(x) __attribute__((aligned(x))) - #endif - #ifndef __RESTRICT - #define __RESTRICT __restrict - #endif - #ifndef __COMPILER_BARRIER - #warning No compiler specific solution for __COMPILER_BARRIER. __COMPILER_BARRIER is ignored. - #define __COMPILER_BARRIER() (void)0 - #endif - - -/* - * TASKING Compiler - */ -#elif defined ( __TASKING__ ) - /* - * The CMSIS functions have been implemented as intrinsics in the compiler. - * Please use "carm -?i" to get an up to date list of all intrinsics, - * Including the CMSIS ones. - */ - - #ifndef __ASM - #define __ASM __asm - #endif - #ifndef __INLINE - #define __INLINE inline - #endif - #ifndef __STATIC_INLINE - #define __STATIC_INLINE static inline - #endif - #ifndef __STATIC_FORCEINLINE - #define __STATIC_FORCEINLINE __STATIC_INLINE - #endif - #ifndef __NO_RETURN - #define __NO_RETURN __attribute__((noreturn)) - #endif - #ifndef __USED - #define __USED __attribute__((used)) - #endif - #ifndef __WEAK - #define __WEAK __attribute__((weak)) - #endif - #ifndef __PACKED - #define __PACKED __packed__ - #endif - #ifndef __PACKED_STRUCT - #define __PACKED_STRUCT struct __packed__ - #endif - #ifndef __PACKED_UNION - #define __PACKED_UNION union __packed__ - #endif - #ifndef __UNALIGNED_UINT32 /* deprecated */ - struct __packed__ T_UINT32 { uint32_t v; }; - #define __UNALIGNED_UINT32(x) (((struct T_UINT32 *)(x))->v) - #endif - #ifndef __UNALIGNED_UINT16_WRITE - __PACKED_STRUCT T_UINT16_WRITE { uint16_t v; }; - #define __UNALIGNED_UINT16_WRITE(addr, val) (void)((((struct T_UINT16_WRITE *)(void 
*)(addr))->v) = (val)) - #endif - #ifndef __UNALIGNED_UINT16_READ - __PACKED_STRUCT T_UINT16_READ { uint16_t v; }; - #define __UNALIGNED_UINT16_READ(addr) (((const struct T_UINT16_READ *)(const void *)(addr))->v) - #endif - #ifndef __UNALIGNED_UINT32_WRITE - __PACKED_STRUCT T_UINT32_WRITE { uint32_t v; }; - #define __UNALIGNED_UINT32_WRITE(addr, val) (void)((((struct T_UINT32_WRITE *)(void *)(addr))->v) = (val)) - #endif - #ifndef __UNALIGNED_UINT32_READ - __PACKED_STRUCT T_UINT32_READ { uint32_t v; }; - #define __UNALIGNED_UINT32_READ(addr) (((const struct T_UINT32_READ *)(const void *)(addr))->v) - #endif - #ifndef __ALIGNED - #define __ALIGNED(x) __align(x) - #endif - #ifndef __RESTRICT - #warning No compiler specific solution for __RESTRICT. __RESTRICT is ignored. - #define __RESTRICT - #endif - #ifndef __COMPILER_BARRIER - #warning No compiler specific solution for __COMPILER_BARRIER. __COMPILER_BARRIER is ignored. - #define __COMPILER_BARRIER() (void)0 - #endif - - -/* - * COSMIC Compiler - */ -#elif defined ( __CSMC__ ) - #include - - #ifndef __ASM - #define __ASM _asm - #endif - #ifndef __INLINE - #define __INLINE inline - #endif - #ifndef __STATIC_INLINE - #define __STATIC_INLINE static inline - #endif - #ifndef __STATIC_FORCEINLINE - #define __STATIC_FORCEINLINE __STATIC_INLINE - #endif - #ifndef __NO_RETURN - // NO RETURN is automatically detected hence no warning here - #define __NO_RETURN - #endif - #ifndef __USED - #warning No compiler specific solution for __USED. __USED is ignored. 
- #define __USED - #endif - #ifndef __WEAK - #define __WEAK __weak - #endif - #ifndef __PACKED - #define __PACKED @packed - #endif - #ifndef __PACKED_STRUCT - #define __PACKED_STRUCT @packed struct - #endif - #ifndef __PACKED_UNION - #define __PACKED_UNION @packed union - #endif - #ifndef __UNALIGNED_UINT32 /* deprecated */ - @packed struct T_UINT32 { uint32_t v; }; - #define __UNALIGNED_UINT32(x) (((struct T_UINT32 *)(x))->v) - #endif - #ifndef __UNALIGNED_UINT16_WRITE - __PACKED_STRUCT T_UINT16_WRITE { uint16_t v; }; - #define __UNALIGNED_UINT16_WRITE(addr, val) (void)((((struct T_UINT16_WRITE *)(void *)(addr))->v) = (val)) - #endif - #ifndef __UNALIGNED_UINT16_READ - __PACKED_STRUCT T_UINT16_READ { uint16_t v; }; - #define __UNALIGNED_UINT16_READ(addr) (((const struct T_UINT16_READ *)(const void *)(addr))->v) - #endif - #ifndef __UNALIGNED_UINT32_WRITE - __PACKED_STRUCT T_UINT32_WRITE { uint32_t v; }; - #define __UNALIGNED_UINT32_WRITE(addr, val) (void)((((struct T_UINT32_WRITE *)(void *)(addr))->v) = (val)) - #endif - #ifndef __UNALIGNED_UINT32_READ - __PACKED_STRUCT T_UINT32_READ { uint32_t v; }; - #define __UNALIGNED_UINT32_READ(addr) (((const struct T_UINT32_READ *)(const void *)(addr))->v) - #endif - #ifndef __ALIGNED - #warning No compiler specific solution for __ALIGNED. __ALIGNED is ignored. - #define __ALIGNED(x) - #endif - #ifndef __RESTRICT - #warning No compiler specific solution for __RESTRICT. __RESTRICT is ignored. - #define __RESTRICT - #endif - #ifndef __COMPILER_BARRIER - #warning No compiler specific solution for __COMPILER_BARRIER. __COMPILER_BARRIER is ignored. - #define __COMPILER_BARRIER() (void)0 - #endif - - -#else - #error Unknown compiler. 
-#endif - - -#endif /* __CMSIS_COMPILER_H */ - diff --git a/envs/core/src/cmsis_gcc.h b/envs/core/src/cmsis_gcc.h deleted file mode 100644 index 67bda4e..0000000 --- a/envs/core/src/cmsis_gcc.h +++ /dev/null @@ -1,2211 +0,0 @@ -/**************************************************************************//** - * @file cmsis_gcc.h - * @brief CMSIS compiler GCC header file - * @version V5.4.1 - * @date 27. May 2021 - ******************************************************************************/ -/* - * Copyright (c) 2009-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#ifndef __CMSIS_GCC_H -#define __CMSIS_GCC_H - -/* ignore some GCC warnings */ -#pragma GCC diagnostic push -#pragma GCC diagnostic ignored "-Wsign-conversion" -#pragma GCC diagnostic ignored "-Wconversion" -#pragma GCC diagnostic ignored "-Wunused-parameter" - -/* Fallback for __has_builtin */ -#ifndef __has_builtin - #define __has_builtin(x) (0) -#endif - -/* CMSIS compiler specific defines */ -#ifndef __ASM - #define __ASM __asm -#endif -#ifndef __INLINE - #define __INLINE inline -#endif -#ifndef __STATIC_INLINE - #define __STATIC_INLINE static inline -#endif -#ifndef __STATIC_FORCEINLINE - #define __STATIC_FORCEINLINE __attribute__((always_inline)) static inline -#endif -#ifndef __NO_RETURN - #define __NO_RETURN __attribute__((__noreturn__)) -#endif -#ifndef __USED - #define __USED __attribute__((used)) -#endif -#ifndef __WEAK - #define __WEAK __attribute__((weak)) -#endif -#ifndef __PACKED - #define __PACKED __attribute__((packed, aligned(1))) -#endif -#ifndef __PACKED_STRUCT - #define __PACKED_STRUCT struct __attribute__((packed, aligned(1))) -#endif -#ifndef __PACKED_UNION - #define __PACKED_UNION union __attribute__((packed, aligned(1))) -#endif -#ifndef __UNALIGNED_UINT32 /* deprecated */ - #pragma GCC diagnostic push - #pragma GCC diagnostic ignored "-Wpacked" - #pragma GCC diagnostic ignored "-Wattributes" - struct __attribute__((packed)) T_UINT32 { uint32_t v; }; - #pragma GCC diagnostic pop - #define __UNALIGNED_UINT32(x) (((struct T_UINT32 *)(x))->v) -#endif -#ifndef __UNALIGNED_UINT16_WRITE - #pragma GCC diagnostic push - #pragma GCC diagnostic ignored "-Wpacked" - #pragma GCC diagnostic ignored "-Wattributes" - __PACKED_STRUCT T_UINT16_WRITE { uint16_t v; }; - #pragma GCC diagnostic pop - #define __UNALIGNED_UINT16_WRITE(addr, val) (void)((((struct T_UINT16_WRITE *)(void *)(addr))->v) = (val)) -#endif -#ifndef __UNALIGNED_UINT16_READ - #pragma GCC diagnostic push - #pragma GCC diagnostic ignored "-Wpacked" - #pragma GCC diagnostic ignored 
"-Wattributes" - __PACKED_STRUCT T_UINT16_READ { uint16_t v; }; - #pragma GCC diagnostic pop - #define __UNALIGNED_UINT16_READ(addr) (((const struct T_UINT16_READ *)(const void *)(addr))->v) -#endif -#ifndef __UNALIGNED_UINT32_WRITE - #pragma GCC diagnostic push - #pragma GCC diagnostic ignored "-Wpacked" - #pragma GCC diagnostic ignored "-Wattributes" - __PACKED_STRUCT T_UINT32_WRITE { uint32_t v; }; - #pragma GCC diagnostic pop - #define __UNALIGNED_UINT32_WRITE(addr, val) (void)((((struct T_UINT32_WRITE *)(void *)(addr))->v) = (val)) -#endif -#ifndef __UNALIGNED_UINT32_READ - #pragma GCC diagnostic push - #pragma GCC diagnostic ignored "-Wpacked" - #pragma GCC diagnostic ignored "-Wattributes" - __PACKED_STRUCT T_UINT32_READ { uint32_t v; }; - #pragma GCC diagnostic pop - #define __UNALIGNED_UINT32_READ(addr) (((const struct T_UINT32_READ *)(const void *)(addr))->v) -#endif -#ifndef __ALIGNED - #define __ALIGNED(x) __attribute__((aligned(x))) -#endif -#ifndef __RESTRICT - #define __RESTRICT __restrict -#endif -#ifndef __COMPILER_BARRIER - #define __COMPILER_BARRIER() __ASM volatile("":::"memory") -#endif - -/* ######################### Startup and Lowlevel Init ######################## */ - -#ifndef __PROGRAM_START - -/** - \brief Initializes data and bss sections - \details This default implementations initialized all data and additional bss - sections relying on .copy.table and .zero.table specified properly - in the used linker script. 
- - */ -__STATIC_FORCEINLINE __NO_RETURN void __cmsis_start(void) -{ - extern void _start(void) __NO_RETURN; - - typedef struct { - uint32_t const* src; - uint32_t* dest; - uint32_t wlen; - } __copy_table_t; - - typedef struct { - uint32_t* dest; - uint32_t wlen; - } __zero_table_t; - - extern const __copy_table_t __copy_table_start__; - extern const __copy_table_t __copy_table_end__; - extern const __zero_table_t __zero_table_start__; - extern const __zero_table_t __zero_table_end__; - - for (__copy_table_t const* pTable = &__copy_table_start__; pTable < &__copy_table_end__; ++pTable) { - for(uint32_t i=0u; i<pTable->wlen; ++i) { - pTable->dest[i] = pTable->src[i]; - } - } - - for (__zero_table_t const* pTable = &__zero_table_start__; pTable < &__zero_table_end__; ++pTable) { - for(uint32_t i=0u; i<pTable->wlen; ++i) { - pTable->dest[i] = 0u; - } - } - - _start(); -} - -#define __PROGRAM_START __cmsis_start -#endif - -#ifndef __INITIAL_SP -#define __INITIAL_SP __StackTop -#endif - -#ifndef __STACK_LIMIT -#define __STACK_LIMIT __StackLimit -#endif - -#ifndef __VECTOR_TABLE -#define __VECTOR_TABLE __Vectors -#endif - -#ifndef __VECTOR_TABLE_ATTRIBUTE -#define __VECTOR_TABLE_ATTRIBUTE __attribute__((used, section(".vectors"))) -#endif - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) -#ifndef __STACK_SEAL -#define __STACK_SEAL __StackSeal -#endif - -#ifndef __TZ_STACK_SEAL_SIZE -#define __TZ_STACK_SEAL_SIZE 8U -#endif - -#ifndef __TZ_STACK_SEAL_VALUE -#define __TZ_STACK_SEAL_VALUE 0xFEF5EDA5FEF5EDA5ULL -#endif - - -__STATIC_FORCEINLINE void __TZ_set_STACKSEAL_S (uint32_t* stackTop) { - *((uint64_t *)stackTop) = __TZ_STACK_SEAL_VALUE; -} -#endif - - -/* ########################## Core Instruction Access ######################### */ -/** \defgroup CMSIS_Core_InstructionInterface CMSIS Core Instruction Interface - Access to dedicated instructions - @{ -*/ - -/* Define macros for porting to both thumb1 and thumb2.
- * For thumb1, use low register (r0-r7), specified by constraint "l" - * Otherwise, use general registers, specified by constraint "r" */ -#if defined (__thumb__) && !defined (__thumb2__) -#define __CMSIS_GCC_OUT_REG(r) "=l" (r) -#define __CMSIS_GCC_RW_REG(r) "+l" (r) -#define __CMSIS_GCC_USE_REG(r) "l" (r) -#else -#define __CMSIS_GCC_OUT_REG(r) "=r" (r) -#define __CMSIS_GCC_RW_REG(r) "+r" (r) -#define __CMSIS_GCC_USE_REG(r) "r" (r) -#endif - -/** - \brief No Operation - \details No Operation does nothing. This instruction can be used for code alignment purposes. - */ -#define __NOP() __ASM volatile ("nop") - -/** - \brief Wait For Interrupt - \details Wait For Interrupt is a hint instruction that suspends execution until one of a number of events occurs. - */ -#define __WFI() __ASM volatile ("wfi":::"memory") - - -/** - \brief Wait For Event - \details Wait For Event is a hint instruction that permits the processor to enter - a low-power state until one of a number of events occurs. - */ -#define __WFE() __ASM volatile ("wfe":::"memory") - - -/** - \brief Send Event - \details Send Event is a hint instruction. It causes an event to be signaled to the CPU. - */ -#define __SEV() __ASM volatile ("sev") - - -/** - \brief Instruction Synchronization Barrier - \details Instruction Synchronization Barrier flushes the pipeline in the processor, - so that all instructions following the ISB are fetched from cache or memory, - after the instruction has been completed. - */ -__STATIC_FORCEINLINE void __ISB(void) -{ - __ASM volatile ("isb 0xF":::"memory"); -} - - -/** - \brief Data Synchronization Barrier - \details Acts as a special kind of Data Memory Barrier. - It completes when all explicit memory accesses before this instruction complete. 
- */ -__STATIC_FORCEINLINE void __DSB(void) -{ - __ASM volatile ("dsb 0xF":::"memory"); -} - - -/** - \brief Data Memory Barrier - \details Ensures the apparent order of the explicit memory operations before - and after the instruction, without ensuring their completion. - */ -__STATIC_FORCEINLINE void __DMB(void) -{ - __ASM volatile ("dmb 0xF":::"memory"); -} - - -/** - \brief Reverse byte order (32 bit) - \details Reverses the byte order in unsigned integer value. For example, 0x12345678 becomes 0x78563412. - \param [in] value Value to reverse - \return Reversed value - */ -__STATIC_FORCEINLINE uint32_t __REV(uint32_t value) -{ -#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5) - return __builtin_bswap32(value); -#else - uint32_t result; - - __ASM ("rev %0, %1" : __CMSIS_GCC_OUT_REG (result) : __CMSIS_GCC_USE_REG (value) ); - return result; -#endif -} - - -/** - \brief Reverse byte order (16 bit) - \details Reverses the byte order within each halfword of a word. For example, 0x12345678 becomes 0x34127856. - \param [in] value Value to reverse - \return Reversed value - */ -__STATIC_FORCEINLINE uint32_t __REV16(uint32_t value) -{ - uint32_t result; - - __ASM ("rev16 %0, %1" : __CMSIS_GCC_OUT_REG (result) : __CMSIS_GCC_USE_REG (value) ); - return result; -} - - -/** - \brief Reverse byte order (16 bit) - \details Reverses the byte order in a 16-bit value and returns the signed 16-bit result. For example, 0x0080 becomes 0x8000. 
- \param [in] value Value to reverse - \return Reversed value - */ -__STATIC_FORCEINLINE int16_t __REVSH(int16_t value) -{ -#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) - return (int16_t)__builtin_bswap16(value); -#else - int16_t result; - - __ASM ("revsh %0, %1" : __CMSIS_GCC_OUT_REG (result) : __CMSIS_GCC_USE_REG (value) ); - return result; -#endif -} - - -/** - \brief Rotate Right in unsigned value (32 bit) - \details Rotate Right (immediate) provides the value of the contents of a register rotated by a variable number of bits. - \param [in] op1 Value to rotate - \param [in] op2 Number of Bits to rotate - \return Rotated value - */ -__STATIC_FORCEINLINE uint32_t __ROR(uint32_t op1, uint32_t op2) -{ - op2 %= 32U; - if (op2 == 0U) - { - return op1; - } - return (op1 >> op2) | (op1 << (32U - op2)); -} - - -/** - \brief Breakpoint - \details Causes the processor to enter Debug state. - Debug tools can use this to investigate system state when the instruction at a particular address is reached. - \param [in] value is ignored by the processor. - If required, a debugger can use it to store additional information about the breakpoint. - */ -#define __BKPT(value) __ASM volatile ("bkpt "#value) - - -/** - \brief Reverse bit order of value - \details Reverses the bit order of the given value. 
- \param [in] value Value to reverse - \return Reversed value - */ -__STATIC_FORCEINLINE uint32_t __RBIT(uint32_t value) -{ - uint32_t result; - -#if ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) ) - __ASM ("rbit %0, %1" : "=r" (result) : "r" (value) ); -#else - uint32_t s = (4U /*sizeof(v)*/ * 8U) - 1U; /* extra shift needed at end */ - - result = value; /* r will be reversed bits of v; first get LSB of v */ - for (value >>= 1U; value != 0U; value >>= 1U) - { - result <<= 1U; - result |= value & 1U; - s--; - } - result <<= s; /* shift when v's highest bits are zero */ -#endif - return result; -} - - -/** - \brief Count leading zeros - \details Counts the number of leading zeros of a data value. - \param [in] value Value to count the leading zeros - \return number of leading zeros in value - */ -__STATIC_FORCEINLINE uint8_t __CLZ(uint32_t value) -{ - /* Even though __builtin_clz produces a CLZ instruction on ARM, formally - __builtin_clz(0) is undefined behaviour, so handle this case specially. - This guarantees ARM-compatible results if happening to compile on a non-ARM - target, and ensures the compiler doesn't decide to activate any - optimisations using the logic "value was passed to __builtin_clz, so it - is non-zero". - ARM GCC 7.3 and possibly earlier will optimise this test away, leaving a - single CLZ instruction. - */ - if (value == 0U) - { - return 32U; - } - return __builtin_clz(value); -} - - -#if ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) || \ - (defined (__ARM_ARCH_8M_BASE__ ) && (__ARM_ARCH_8M_BASE__ == 1)) ) -/** - \brief LDR Exclusive (8 bit) - \details Executes a exclusive LDR instruction for 8 bit value. 
- \param [in] ptr Pointer to data - \return value of type uint8_t at (*ptr) - */ -__STATIC_FORCEINLINE uint8_t __LDREXB(volatile uint8_t *addr) -{ - uint32_t result; - -#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) - __ASM volatile ("ldrexb %0, %1" : "=r" (result) : "Q" (*addr) ); -#else - /* Prior to GCC 4.8, "Q" will be expanded to [rx, #0] which is not - accepted by assembler. So has to use following less efficient pattern. - */ - __ASM volatile ("ldrexb %0, [%1]" : "=r" (result) : "r" (addr) : "memory" ); -#endif - return ((uint8_t) result); /* Add explicit type cast here */ -} - - -/** - \brief LDR Exclusive (16 bit) - \details Executes a exclusive LDR instruction for 16 bit values. - \param [in] ptr Pointer to data - \return value of type uint16_t at (*ptr) - */ -__STATIC_FORCEINLINE uint16_t __LDREXH(volatile uint16_t *addr) -{ - uint32_t result; - -#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) - __ASM volatile ("ldrexh %0, %1" : "=r" (result) : "Q" (*addr) ); -#else - /* Prior to GCC 4.8, "Q" will be expanded to [rx, #0] which is not - accepted by assembler. So has to use following less efficient pattern. - */ - __ASM volatile ("ldrexh %0, [%1]" : "=r" (result) : "r" (addr) : "memory" ); -#endif - return ((uint16_t) result); /* Add explicit type cast here */ -} - - -/** - \brief LDR Exclusive (32 bit) - \details Executes a exclusive LDR instruction for 32 bit values. - \param [in] ptr Pointer to data - \return value of type uint32_t at (*ptr) - */ -__STATIC_FORCEINLINE uint32_t __LDREXW(volatile uint32_t *addr) -{ - uint32_t result; - - __ASM volatile ("ldrex %0, %1" : "=r" (result) : "Q" (*addr) ); - return(result); -} - - -/** - \brief STR Exclusive (8 bit) - \details Executes a exclusive STR instruction for 8 bit values. 
- \param [in] value Value to store - \param [in] ptr Pointer to location - \return 0 Function succeeded - \return 1 Function failed - */ -__STATIC_FORCEINLINE uint32_t __STREXB(uint8_t value, volatile uint8_t *addr) -{ - uint32_t result; - - __ASM volatile ("strexb %0, %2, %1" : "=&r" (result), "=Q" (*addr) : "r" ((uint32_t)value) ); - return(result); -} - - -/** - \brief STR Exclusive (16 bit) - \details Executes a exclusive STR instruction for 16 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - \return 0 Function succeeded - \return 1 Function failed - */ -__STATIC_FORCEINLINE uint32_t __STREXH(uint16_t value, volatile uint16_t *addr) -{ - uint32_t result; - - __ASM volatile ("strexh %0, %2, %1" : "=&r" (result), "=Q" (*addr) : "r" ((uint32_t)value) ); - return(result); -} - - -/** - \brief STR Exclusive (32 bit) - \details Executes a exclusive STR instruction for 32 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - \return 0 Function succeeded - \return 1 Function failed - */ -__STATIC_FORCEINLINE uint32_t __STREXW(uint32_t value, volatile uint32_t *addr) -{ - uint32_t result; - - __ASM volatile ("strex %0, %2, %1" : "=&r" (result), "=Q" (*addr) : "r" (value) ); - return(result); -} - - -/** - \brief Remove the exclusive lock - \details Removes the exclusive lock which is created by LDREX. 
- */ -__STATIC_FORCEINLINE void __CLREX(void) -{ - __ASM volatile ("clrex" ::: "memory"); -} - -#endif /* ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) || \ - (defined (__ARM_ARCH_8M_BASE__ ) && (__ARM_ARCH_8M_BASE__ == 1)) ) */ - - -#if ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) ) -/** - \brief Signed Saturate - \details Saturates a signed value. - \param [in] ARG1 Value to be saturated - \param [in] ARG2 Bit position to saturate to (1..32) - \return Saturated value - */ -#define __SSAT(ARG1, ARG2) \ -__extension__ \ -({ \ - int32_t __RES, __ARG1 = (ARG1); \ - __ASM volatile ("ssat %0, %1, %2" : "=r" (__RES) : "I" (ARG2), "r" (__ARG1) : "cc" ); \ - __RES; \ - }) - - -/** - \brief Unsigned Saturate - \details Saturates an unsigned value. - \param [in] ARG1 Value to be saturated - \param [in] ARG2 Bit position to saturate to (0..31) - \return Saturated value - */ -#define __USAT(ARG1, ARG2) \ -__extension__ \ -({ \ - uint32_t __RES, __ARG1 = (ARG1); \ - __ASM volatile ("usat %0, %1, %2" : "=r" (__RES) : "I" (ARG2), "r" (__ARG1) : "cc" ); \ - __RES; \ - }) - - -/** - \brief Rotate Right with Extend (32 bit) - \details Moves each bit of a bitstring right by one bit. - The carry input is shifted in at the left end of the bitstring. - \param [in] value Value to rotate - \return Rotated value - */ -__STATIC_FORCEINLINE uint32_t __RRX(uint32_t value) -{ - uint32_t result; - - __ASM volatile ("rrx %0, %1" : __CMSIS_GCC_OUT_REG (result) : __CMSIS_GCC_USE_REG (value) ); - return(result); -} - - -/** - \brief LDRT Unprivileged (8 bit) - \details Executes a Unprivileged LDRT instruction for 8 bit value. 
- \param [in] ptr Pointer to data - \return value of type uint8_t at (*ptr) - */ -__STATIC_FORCEINLINE uint8_t __LDRBT(volatile uint8_t *ptr) -{ - uint32_t result; - -#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) - __ASM volatile ("ldrbt %0, %1" : "=r" (result) : "Q" (*ptr) ); -#else - /* Prior to GCC 4.8, "Q" will be expanded to [rx, #0] which is not - accepted by assembler. So has to use following less efficient pattern. - */ - __ASM volatile ("ldrbt %0, [%1]" : "=r" (result) : "r" (ptr) : "memory" ); -#endif - return ((uint8_t) result); /* Add explicit type cast here */ -} - - -/** - \brief LDRT Unprivileged (16 bit) - \details Executes a Unprivileged LDRT instruction for 16 bit values. - \param [in] ptr Pointer to data - \return value of type uint16_t at (*ptr) - */ -__STATIC_FORCEINLINE uint16_t __LDRHT(volatile uint16_t *ptr) -{ - uint32_t result; - -#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) - __ASM volatile ("ldrht %0, %1" : "=r" (result) : "Q" (*ptr) ); -#else - /* Prior to GCC 4.8, "Q" will be expanded to [rx, #0] which is not - accepted by assembler. So has to use following less efficient pattern. - */ - __ASM volatile ("ldrht %0, [%1]" : "=r" (result) : "r" (ptr) : "memory" ); -#endif - return ((uint16_t) result); /* Add explicit type cast here */ -} - - -/** - \brief LDRT Unprivileged (32 bit) - \details Executes a Unprivileged LDRT instruction for 32 bit values. - \param [in] ptr Pointer to data - \return value of type uint32_t at (*ptr) - */ -__STATIC_FORCEINLINE uint32_t __LDRT(volatile uint32_t *ptr) -{ - uint32_t result; - - __ASM volatile ("ldrt %0, %1" : "=r" (result) : "Q" (*ptr) ); - return(result); -} - - -/** - \brief STRT Unprivileged (8 bit) - \details Executes a Unprivileged STRT instruction for 8 bit values. 
- \param [in] value Value to store - \param [in] ptr Pointer to location - */ -__STATIC_FORCEINLINE void __STRBT(uint8_t value, volatile uint8_t *ptr) -{ - __ASM volatile ("strbt %1, %0" : "=Q" (*ptr) : "r" ((uint32_t)value) ); -} - - -/** - \brief STRT Unprivileged (16 bit) - \details Executes a Unprivileged STRT instruction for 16 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - */ -__STATIC_FORCEINLINE void __STRHT(uint16_t value, volatile uint16_t *ptr) -{ - __ASM volatile ("strht %1, %0" : "=Q" (*ptr) : "r" ((uint32_t)value) ); -} - - -/** - \brief STRT Unprivileged (32 bit) - \details Executes a Unprivileged STRT instruction for 32 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - */ -__STATIC_FORCEINLINE void __STRT(uint32_t value, volatile uint32_t *ptr) -{ - __ASM volatile ("strt %1, %0" : "=Q" (*ptr) : "r" (value) ); -} - -#else /* ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) ) */ - -/** - \brief Signed Saturate - \details Saturates a signed value. - \param [in] value Value to be saturated - \param [in] sat Bit position to saturate to (1..32) - \return Saturated value - */ -__STATIC_FORCEINLINE int32_t __SSAT(int32_t val, uint32_t sat) -{ - if ((sat >= 1U) && (sat <= 32U)) - { - const int32_t max = (int32_t)((1U << (sat - 1U)) - 1U); - const int32_t min = -1 - max ; - if (val > max) - { - return max; - } - else if (val < min) - { - return min; - } - } - return val; -} - -/** - \brief Unsigned Saturate - \details Saturates an unsigned value. 
- \param [in] value Value to be saturated - \param [in] sat Bit position to saturate to (0..31) - \return Saturated value - */ -__STATIC_FORCEINLINE uint32_t __USAT(int32_t val, uint32_t sat) -{ - if (sat <= 31U) - { - const uint32_t max = ((1U << sat) - 1U); - if (val > (int32_t)max) - { - return max; - } - else if (val < 0) - { - return 0U; - } - } - return (uint32_t)val; -} - -#endif /* ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) ) */ - - -#if ((defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) || \ - (defined (__ARM_ARCH_8M_BASE__ ) && (__ARM_ARCH_8M_BASE__ == 1)) ) -/** - \brief Load-Acquire (8 bit) - \details Executes a LDAB instruction for 8 bit value. - \param [in] ptr Pointer to data - \return value of type uint8_t at (*ptr) - */ -__STATIC_FORCEINLINE uint8_t __LDAB(volatile uint8_t *ptr) -{ - uint32_t result; - - __ASM volatile ("ldab %0, %1" : "=r" (result) : "Q" (*ptr) : "memory" ); - return ((uint8_t) result); -} - - -/** - \brief Load-Acquire (16 bit) - \details Executes a LDAH instruction for 16 bit values. - \param [in] ptr Pointer to data - \return value of type uint16_t at (*ptr) - */ -__STATIC_FORCEINLINE uint16_t __LDAH(volatile uint16_t *ptr) -{ - uint32_t result; - - __ASM volatile ("ldah %0, %1" : "=r" (result) : "Q" (*ptr) : "memory" ); - return ((uint16_t) result); -} - - -/** - \brief Load-Acquire (32 bit) - \details Executes a LDA instruction for 32 bit values. - \param [in] ptr Pointer to data - \return value of type uint32_t at (*ptr) - */ -__STATIC_FORCEINLINE uint32_t __LDA(volatile uint32_t *ptr) -{ - uint32_t result; - - __ASM volatile ("lda %0, %1" : "=r" (result) : "Q" (*ptr) : "memory" ); - return(result); -} - - -/** - \brief Store-Release (8 bit) - \details Executes a STLB instruction for 8 bit values. 
- \param [in] value Value to store - \param [in] ptr Pointer to location - */ -__STATIC_FORCEINLINE void __STLB(uint8_t value, volatile uint8_t *ptr) -{ - __ASM volatile ("stlb %1, %0" : "=Q" (*ptr) : "r" ((uint32_t)value) : "memory" ); -} - - -/** - \brief Store-Release (16 bit) - \details Executes a STLH instruction for 16 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - */ -__STATIC_FORCEINLINE void __STLH(uint16_t value, volatile uint16_t *ptr) -{ - __ASM volatile ("stlh %1, %0" : "=Q" (*ptr) : "r" ((uint32_t)value) : "memory" ); -} - - -/** - \brief Store-Release (32 bit) - \details Executes a STL instruction for 32 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - */ -__STATIC_FORCEINLINE void __STL(uint32_t value, volatile uint32_t *ptr) -{ - __ASM volatile ("stl %1, %0" : "=Q" (*ptr) : "r" ((uint32_t)value) : "memory" ); -} - - -/** - \brief Load-Acquire Exclusive (8 bit) - \details Executes a LDAB exclusive instruction for 8 bit value. - \param [in] ptr Pointer to data - \return value of type uint8_t at (*ptr) - */ -__STATIC_FORCEINLINE uint8_t __LDAEXB(volatile uint8_t *ptr) -{ - uint32_t result; - - __ASM volatile ("ldaexb %0, %1" : "=r" (result) : "Q" (*ptr) : "memory" ); - return ((uint8_t) result); -} - - -/** - \brief Load-Acquire Exclusive (16 bit) - \details Executes a LDAH exclusive instruction for 16 bit values. - \param [in] ptr Pointer to data - \return value of type uint16_t at (*ptr) - */ -__STATIC_FORCEINLINE uint16_t __LDAEXH(volatile uint16_t *ptr) -{ - uint32_t result; - - __ASM volatile ("ldaexh %0, %1" : "=r" (result) : "Q" (*ptr) : "memory" ); - return ((uint16_t) result); -} - - -/** - \brief Load-Acquire Exclusive (32 bit) - \details Executes a LDA exclusive instruction for 32 bit values. 
- \param [in] ptr Pointer to data - \return value of type uint32_t at (*ptr) - */ -__STATIC_FORCEINLINE uint32_t __LDAEX(volatile uint32_t *ptr) -{ - uint32_t result; - - __ASM volatile ("ldaex %0, %1" : "=r" (result) : "Q" (*ptr) : "memory" ); - return(result); -} - - -/** - \brief Store-Release Exclusive (8 bit) - \details Executes a STLB exclusive instruction for 8 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - \return 0 Function succeeded - \return 1 Function failed - */ -__STATIC_FORCEINLINE uint32_t __STLEXB(uint8_t value, volatile uint8_t *ptr) -{ - uint32_t result; - - __ASM volatile ("stlexb %0, %2, %1" : "=&r" (result), "=Q" (*ptr) : "r" ((uint32_t)value) : "memory" ); - return(result); -} - - -/** - \brief Store-Release Exclusive (16 bit) - \details Executes a STLH exclusive instruction for 16 bit values. - \param [in] value Value to store - \param [in] ptr Pointer to location - \return 0 Function succeeded - \return 1 Function failed - */ -__STATIC_FORCEINLINE uint32_t __STLEXH(uint16_t value, volatile uint16_t *ptr) -{ - uint32_t result; - - __ASM volatile ("stlexh %0, %2, %1" : "=&r" (result), "=Q" (*ptr) : "r" ((uint32_t)value) : "memory" ); - return(result); -} - - -/** - \brief Store-Release Exclusive (32 bit) - \details Executes a STL exclusive instruction for 32 bit values. 
- \param [in] value Value to store - \param [in] ptr Pointer to location - \return 0 Function succeeded - \return 1 Function failed - */ -__STATIC_FORCEINLINE uint32_t __STLEX(uint32_t value, volatile uint32_t *ptr) -{ - uint32_t result; - - __ASM volatile ("stlex %0, %2, %1" : "=&r" (result), "=Q" (*ptr) : "r" ((uint32_t)value) : "memory" ); - return(result); -} - -#endif /* ((defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) || \ - (defined (__ARM_ARCH_8M_BASE__ ) && (__ARM_ARCH_8M_BASE__ == 1)) ) */ - -/*@}*/ /* end of group CMSIS_Core_InstructionInterface */ - - -/* ########################### Core Function Access ########################### */ -/** \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_RegAccFunctions CMSIS Core Register Access Functions - @{ - */ - -/** - \brief Enable IRQ Interrupts - \details Enables IRQ interrupts by clearing special-purpose register PRIMASK. - Can only be executed in Privileged modes. - */ -__STATIC_FORCEINLINE void __enable_irq(void) -{ - __ASM volatile ("cpsie i" : : : "memory"); -} - - -/** - \brief Disable IRQ Interrupts - \details Disables IRQ interrupts by setting special-purpose register PRIMASK. - Can only be executed in Privileged modes. - */ -__STATIC_FORCEINLINE void __disable_irq(void) -{ - __ASM volatile ("cpsid i" : : : "memory"); -} - - -/** - \brief Get Control Register - \details Returns the content of the Control Register. - \return Control Register value - */ -__STATIC_FORCEINLINE uint32_t __get_CONTROL(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, control" : "=r" (result) ); - return(result); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Control Register (non-secure) - \details Returns the content of the non-secure Control Register when in secure mode. 
- \return non-secure Control Register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_CONTROL_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, control_ns" : "=r" (result) ); - return(result); -} -#endif - - -/** - \brief Set Control Register - \details Writes the given value to the Control Register. - \param [in] control Control Register value to set - */ -__STATIC_FORCEINLINE void __set_CONTROL(uint32_t control) -{ - __ASM volatile ("MSR control, %0" : : "r" (control) : "memory"); - __ISB(); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Control Register (non-secure) - \details Writes the given value to the non-secure Control Register when in secure state. - \param [in] control Control Register value to set - */ -__STATIC_FORCEINLINE void __TZ_set_CONTROL_NS(uint32_t control) -{ - __ASM volatile ("MSR control_ns, %0" : : "r" (control) : "memory"); - __ISB(); -} -#endif - - -/** - \brief Get IPSR Register - \details Returns the content of the IPSR Register. - \return IPSR Register value - */ -__STATIC_FORCEINLINE uint32_t __get_IPSR(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, ipsr" : "=r" (result) ); - return(result); -} - - -/** - \brief Get APSR Register - \details Returns the content of the APSR Register. - \return APSR Register value - */ -__STATIC_FORCEINLINE uint32_t __get_APSR(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, apsr" : "=r" (result) ); - return(result); -} - - -/** - \brief Get xPSR Register - \details Returns the content of the xPSR Register. - \return xPSR Register value - */ -__STATIC_FORCEINLINE uint32_t __get_xPSR(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, xpsr" : "=r" (result) ); - return(result); -} - - -/** - \brief Get Process Stack Pointer - \details Returns the current value of the Process Stack Pointer (PSP). 
- \return PSP Register value - */ -__STATIC_FORCEINLINE uint32_t __get_PSP(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, psp" : "=r" (result) ); - return(result); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Process Stack Pointer (non-secure) - \details Returns the current value of the non-secure Process Stack Pointer (PSP) when in secure state. - \return PSP Register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_PSP_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, psp_ns" : "=r" (result) ); - return(result); -} -#endif - - -/** - \brief Set Process Stack Pointer - \details Assigns the given value to the Process Stack Pointer (PSP). - \param [in] topOfProcStack Process Stack Pointer value to set - */ -__STATIC_FORCEINLINE void __set_PSP(uint32_t topOfProcStack) -{ - __ASM volatile ("MSR psp, %0" : : "r" (topOfProcStack) : ); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Process Stack Pointer (non-secure) - \details Assigns the given value to the non-secure Process Stack Pointer (PSP) when in secure state. - \param [in] topOfProcStack Process Stack Pointer value to set - */ -__STATIC_FORCEINLINE void __TZ_set_PSP_NS(uint32_t topOfProcStack) -{ - __ASM volatile ("MSR psp_ns, %0" : : "r" (topOfProcStack) : ); -} -#endif - - -/** - \brief Get Main Stack Pointer - \details Returns the current value of the Main Stack Pointer (MSP). - \return MSP Register value - */ -__STATIC_FORCEINLINE uint32_t __get_MSP(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, msp" : "=r" (result) ); - return(result); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Main Stack Pointer (non-secure) - \details Returns the current value of the non-secure Main Stack Pointer (MSP) when in secure state. 
- \return MSP Register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_MSP_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, msp_ns" : "=r" (result) ); - return(result); -} -#endif - - -/** - \brief Set Main Stack Pointer - \details Assigns the given value to the Main Stack Pointer (MSP). - \param [in] topOfMainStack Main Stack Pointer value to set - */ -__STATIC_FORCEINLINE void __set_MSP(uint32_t topOfMainStack) -{ - __ASM volatile ("MSR msp, %0" : : "r" (topOfMainStack) : ); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Main Stack Pointer (non-secure) - \details Assigns the given value to the non-secure Main Stack Pointer (MSP) when in secure state. - \param [in] topOfMainStack Main Stack Pointer value to set - */ -__STATIC_FORCEINLINE void __TZ_set_MSP_NS(uint32_t topOfMainStack) -{ - __ASM volatile ("MSR msp_ns, %0" : : "r" (topOfMainStack) : ); -} -#endif - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Stack Pointer (non-secure) - \details Returns the current value of the non-secure Stack Pointer (SP) when in secure state. - \return SP Register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_SP_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, sp_ns" : "=r" (result) ); - return(result); -} - - -/** - \brief Set Stack Pointer (non-secure) - \details Assigns the given value to the non-secure Stack Pointer (SP) when in secure state. - \param [in] topOfStack Stack Pointer value to set - */ -__STATIC_FORCEINLINE void __TZ_set_SP_NS(uint32_t topOfStack) -{ - __ASM volatile ("MSR sp_ns, %0" : : "r" (topOfStack) : ); -} -#endif - - -/** - \brief Get Priority Mask - \details Returns the current state of the priority mask bit from the Priority Mask Register. 
- \return Priority Mask value - */ -__STATIC_FORCEINLINE uint32_t __get_PRIMASK(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, primask" : "=r" (result) ); - return(result); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Priority Mask (non-secure) - \details Returns the current state of the non-secure priority mask bit from the Priority Mask Register when in secure state. - \return Priority Mask value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_PRIMASK_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, primask_ns" : "=r" (result) ); - return(result); -} -#endif - - -/** - \brief Set Priority Mask - \details Assigns the given value to the Priority Mask Register. - \param [in] priMask Priority Mask - */ -__STATIC_FORCEINLINE void __set_PRIMASK(uint32_t priMask) -{ - __ASM volatile ("MSR primask, %0" : : "r" (priMask) : "memory"); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Priority Mask (non-secure) - \details Assigns the given value to the non-secure Priority Mask Register when in secure state. - \param [in] priMask Priority Mask - */ -__STATIC_FORCEINLINE void __TZ_set_PRIMASK_NS(uint32_t priMask) -{ - __ASM volatile ("MSR primask_ns, %0" : : "r" (priMask) : "memory"); -} -#endif - - -#if ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) ) -/** - \brief Enable FIQ - \details Enables FIQ interrupts by clearing special-purpose register FAULTMASK. - Can only be executed in Privileged modes. - */ -__STATIC_FORCEINLINE void __enable_fault_irq(void) -{ - __ASM volatile ("cpsie f" : : : "memory"); -} - - -/** - \brief Disable FIQ - \details Disables FIQ interrupts by setting special-purpose register FAULTMASK. - Can only be executed in Privileged modes. 
- */ -__STATIC_FORCEINLINE void __disable_fault_irq(void) -{ - __ASM volatile ("cpsid f" : : : "memory"); -} - - -/** - \brief Get Base Priority - \details Returns the current value of the Base Priority register. - \return Base Priority register value - */ -__STATIC_FORCEINLINE uint32_t __get_BASEPRI(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, basepri" : "=r" (result) ); - return(result); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Base Priority (non-secure) - \details Returns the current value of the non-secure Base Priority register when in secure state. - \return Base Priority register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_BASEPRI_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, basepri_ns" : "=r" (result) ); - return(result); -} -#endif - - -/** - \brief Set Base Priority - \details Assigns the given value to the Base Priority register. - \param [in] basePri Base Priority value to set - */ -__STATIC_FORCEINLINE void __set_BASEPRI(uint32_t basePri) -{ - __ASM volatile ("MSR basepri, %0" : : "r" (basePri) : "memory"); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Base Priority (non-secure) - \details Assigns the given value to the non-secure Base Priority register when in secure state. - \param [in] basePri Base Priority value to set - */ -__STATIC_FORCEINLINE void __TZ_set_BASEPRI_NS(uint32_t basePri) -{ - __ASM volatile ("MSR basepri_ns, %0" : : "r" (basePri) : "memory"); -} -#endif - - -/** - \brief Set Base Priority with condition - \details Assigns the given value to the Base Priority register only if BASEPRI masking is disabled, - or the new value increases the BASEPRI priority level. 
- \param [in] basePri Base Priority value to set - */ -__STATIC_FORCEINLINE void __set_BASEPRI_MAX(uint32_t basePri) -{ - __ASM volatile ("MSR basepri_max, %0" : : "r" (basePri) : "memory"); -} - - -/** - \brief Get Fault Mask - \details Returns the current value of the Fault Mask register. - \return Fault Mask register value - */ -__STATIC_FORCEINLINE uint32_t __get_FAULTMASK(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, faultmask" : "=r" (result) ); - return(result); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Fault Mask (non-secure) - \details Returns the current value of the non-secure Fault Mask register when in secure state. - \return Fault Mask register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_FAULTMASK_NS(void) -{ - uint32_t result; - - __ASM volatile ("MRS %0, faultmask_ns" : "=r" (result) ); - return(result); -} -#endif - - -/** - \brief Set Fault Mask - \details Assigns the given value to the Fault Mask register. - \param [in] faultMask Fault Mask value to set - */ -__STATIC_FORCEINLINE void __set_FAULTMASK(uint32_t faultMask) -{ - __ASM volatile ("MSR faultmask, %0" : : "r" (faultMask) : "memory"); -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Fault Mask (non-secure) - \details Assigns the given value to the non-secure Fault Mask register when in secure state. 
- \param [in] faultMask Fault Mask value to set - */ -__STATIC_FORCEINLINE void __TZ_set_FAULTMASK_NS(uint32_t faultMask) -{ - __ASM volatile ("MSR faultmask_ns, %0" : : "r" (faultMask) : "memory"); -} -#endif - -#endif /* ((defined (__ARM_ARCH_7M__ ) && (__ARM_ARCH_7M__ == 1)) || \ - (defined (__ARM_ARCH_7EM__ ) && (__ARM_ARCH_7EM__ == 1)) || \ - (defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) ) */ - - -#if ((defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) || \ - (defined (__ARM_ARCH_8M_BASE__ ) && (__ARM_ARCH_8M_BASE__ == 1)) ) - -/** - \brief Get Process Stack Pointer Limit - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence zero is returned always in non-secure - mode. - - \details Returns the current value of the Process Stack Pointer Limit (PSPLIM). - \return PSPLIM Register value - */ -__STATIC_FORCEINLINE uint32_t __get_PSPLIM(void) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) && \ - (!defined (__ARM_FEATURE_CMSE) || (__ARM_FEATURE_CMSE < 3))) - // without main extensions, the non-secure PSPLIM is RAZ/WI - return 0U; -#else - uint32_t result; - __ASM volatile ("MRS %0, psplim" : "=r" (result) ); - return result; -#endif -} - -#if (defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Process Stack Pointer Limit (non-secure) - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence zero is returned always. - - \details Returns the current value of the non-secure Process Stack Pointer Limit (PSPLIM) when in secure state. 
- \return PSPLIM Register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_PSPLIM_NS(void) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1))) - // without main extensions, the non-secure PSPLIM is RAZ/WI - return 0U; -#else - uint32_t result; - __ASM volatile ("MRS %0, psplim_ns" : "=r" (result) ); - return result; -#endif -} -#endif - - -/** - \brief Set Process Stack Pointer Limit - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence the write is silently ignored in non-secure - mode. - - \details Assigns the given value to the Process Stack Pointer Limit (PSPLIM). - \param [in] ProcStackPtrLimit Process Stack Pointer Limit value to set - */ -__STATIC_FORCEINLINE void __set_PSPLIM(uint32_t ProcStackPtrLimit) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) && \ - (!defined (__ARM_FEATURE_CMSE) || (__ARM_FEATURE_CMSE < 3))) - // without main extensions, the non-secure PSPLIM is RAZ/WI - (void)ProcStackPtrLimit; -#else - __ASM volatile ("MSR psplim, %0" : : "r" (ProcStackPtrLimit)); -#endif -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Process Stack Pointer (non-secure) - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence the write is silently ignored. - - \details Assigns the given value to the non-secure Process Stack Pointer Limit (PSPLIM) when in secure state. 
- \param [in] ProcStackPtrLimit Process Stack Pointer Limit value to set - */ -__STATIC_FORCEINLINE void __TZ_set_PSPLIM_NS(uint32_t ProcStackPtrLimit) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1))) - // without main extensions, the non-secure PSPLIM is RAZ/WI - (void)ProcStackPtrLimit; -#else - __ASM volatile ("MSR psplim_ns, %0\n" : : "r" (ProcStackPtrLimit)); -#endif -} -#endif - - -/** - \brief Get Main Stack Pointer Limit - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence zero is returned always in non-secure - mode. - - \details Returns the current value of the Main Stack Pointer Limit (MSPLIM). - \return MSPLIM Register value - */ -__STATIC_FORCEINLINE uint32_t __get_MSPLIM(void) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) && \ - (!defined (__ARM_FEATURE_CMSE) || (__ARM_FEATURE_CMSE < 3))) - // without main extensions, the non-secure MSPLIM is RAZ/WI - return 0U; -#else - uint32_t result; - __ASM volatile ("MRS %0, msplim" : "=r" (result) ); - return result; -#endif -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Get Main Stack Pointer Limit (non-secure) - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence zero is returned always. - - \details Returns the current value of the non-secure Main Stack Pointer Limit(MSPLIM) when in secure state. - \return MSPLIM Register value - */ -__STATIC_FORCEINLINE uint32_t __TZ_get_MSPLIM_NS(void) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1))) - // without main extensions, the non-secure MSPLIM is RAZ/WI - return 0U; -#else - uint32_t result; - __ASM volatile ("MRS %0, msplim_ns" : "=r" (result) ); - return result; -#endif -} -#endif - - -/** - \brief Set Main Stack Pointer Limit - Devices without ARMv8-M Main Extensions (i.e. 
Cortex-M23) lack the non-secure - Stack Pointer Limit register hence the write is silently ignored in non-secure - mode. - - \details Assigns the given value to the Main Stack Pointer Limit (MSPLIM). - \param [in] MainStackPtrLimit Main Stack Pointer Limit value to set - */ -__STATIC_FORCEINLINE void __set_MSPLIM(uint32_t MainStackPtrLimit) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) && \ - (!defined (__ARM_FEATURE_CMSE) || (__ARM_FEATURE_CMSE < 3))) - // without main extensions, the non-secure MSPLIM is RAZ/WI - (void)MainStackPtrLimit; -#else - __ASM volatile ("MSR msplim, %0" : : "r" (MainStackPtrLimit)); -#endif -} - - -#if (defined (__ARM_FEATURE_CMSE ) && (__ARM_FEATURE_CMSE == 3)) -/** - \brief Set Main Stack Pointer Limit (non-secure) - Devices without ARMv8-M Main Extensions (i.e. Cortex-M23) lack the non-secure - Stack Pointer Limit register hence the write is silently ignored. - - \details Assigns the given value to the non-secure Main Stack Pointer Limit (MSPLIM) when in secure state. - \param [in] MainStackPtrLimit Main Stack Pointer value to set - */ -__STATIC_FORCEINLINE void __TZ_set_MSPLIM_NS(uint32_t MainStackPtrLimit) -{ -#if (!(defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1))) - // without main extensions, the non-secure MSPLIM is RAZ/WI - (void)MainStackPtrLimit; -#else - __ASM volatile ("MSR msplim_ns, %0" : : "r" (MainStackPtrLimit)); -#endif -} -#endif - -#endif /* ((defined (__ARM_ARCH_8M_MAIN__ ) && (__ARM_ARCH_8M_MAIN__ == 1)) || \ - (defined (__ARM_ARCH_8M_BASE__ ) && (__ARM_ARCH_8M_BASE__ == 1)) ) */ - - -/** - \brief Get FPSCR - \details Returns the current value of the Floating Point Status/Control register. 
- \return Floating Point Status/Control register value - */ -__STATIC_FORCEINLINE uint32_t __get_FPSCR(void) -{ -#if ((defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U)) && \ - (defined (__FPU_USED ) && (__FPU_USED == 1U)) ) -#if __has_builtin(__builtin_arm_get_fpscr) -// Re-enable using built-in when GCC has been fixed -// || (__GNUC__ > 7) || (__GNUC__ == 7 && __GNUC_MINOR__ >= 2) - /* see https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00443.html */ - return __builtin_arm_get_fpscr(); -#else - uint32_t result; - - __ASM volatile ("VMRS %0, fpscr" : "=r" (result) ); - return(result); -#endif -#else - return(0U); -#endif -} - - -/** - \brief Set FPSCR - \details Assigns the given value to the Floating Point Status/Control register. - \param [in] fpscr Floating Point Status/Control value to set - */ -__STATIC_FORCEINLINE void __set_FPSCR(uint32_t fpscr) -{ -#if ((defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U)) && \ - (defined (__FPU_USED ) && (__FPU_USED == 1U)) ) -#if __has_builtin(__builtin_arm_set_fpscr) -// Re-enable using built-in when GCC has been fixed -// || (__GNUC__ > 7) || (__GNUC__ == 7 && __GNUC_MINOR__ >= 2) - /* see https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00443.html */ - __builtin_arm_set_fpscr(fpscr); -#else - __ASM volatile ("VMSR fpscr, %0" : : "r" (fpscr) : "vfpcc", "memory"); -#endif -#else - (void)fpscr; -#endif -} - - -/*@} end of CMSIS_Core_RegAccFunctions */ - - -/* ################### Compiler specific Intrinsics ########################### */ -/** \defgroup CMSIS_SIMD_intrinsics CMSIS SIMD Intrinsics - Access to dedicated SIMD instructions - @{ -*/ - -#if (defined (__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)) - -__STATIC_FORCEINLINE uint32_t __SADD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("sadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __QADD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("qadd8 %0, %1, %2" : "=r" 
(result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SHADD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("shadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UADD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("uadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UQADD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uqadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UHADD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uhadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - - -__STATIC_FORCEINLINE uint32_t __SSUB8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("ssub8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __QSUB8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("qsub8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SHSUB8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("shsub8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __USUB8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("usub8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UQSUB8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uqsub8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UHSUB8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uhsub8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - - -__STATIC_FORCEINLINE uint32_t __SADD16(uint32_t op1, 
uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("sadd16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __QADD16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("qadd16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SHADD16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("shadd16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UADD16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("uadd16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UQADD16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uqadd16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UHADD16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uhadd16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SSUB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("ssub16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __QSUB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("qsub16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SHSUB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("shsub16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __USUB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("usub16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UQSUB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uqsub16 %0, %1, %2" : "=r" (result) : "r" 
(op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UHSUB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uhsub16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SASX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("sasx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __QASX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("qasx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SHASX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("shasx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UASX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("uasx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UQASX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uqasx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UHASX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uhasx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SSAX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("ssax %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __QSAX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("qsax %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SHSAX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("shsax %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __USAX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - 
__ASM volatile ("usax %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UQSAX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uqsax %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UHSAX(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uhsax %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __USAD8(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("usad8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __USADA8(uint32_t op1, uint32_t op2, uint32_t op3) -{ - uint32_t result; - - __ASM ("usada8 %0, %1, %2, %3" : "=r" (result) : "r" (op1), "r" (op2), "r" (op3) ); - return(result); -} - -#define __SSAT16(ARG1, ARG2) \ -__extension__ \ -({ \ - int32_t __RES, __ARG1 = (ARG1); \ - __ASM volatile ("ssat16 %0, %1, %2" : "=r" (__RES) : "I" (ARG2), "r" (__ARG1) : "cc" ); \ - __RES; \ - }) - -#define __USAT16(ARG1, ARG2) \ -__extension__ \ -({ \ - uint32_t __RES, __ARG1 = (ARG1); \ - __ASM volatile ("usat16 %0, %1, %2" : "=r" (__RES) : "I" (ARG2), "r" (__ARG1) : "cc" ); \ - __RES; \ - }) - -__STATIC_FORCEINLINE uint32_t __UXTB16(uint32_t op1) -{ - uint32_t result; - - __ASM ("uxtb16 %0, %1" : "=r" (result) : "r" (op1)); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __UXTAB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("uxtab16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SXTB16(uint32_t op1) -{ - uint32_t result; - - __ASM ("sxtb16 %0, %1" : "=r" (result) : "r" (op1)); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SXTB16_RORn(uint32_t op1, uint32_t rotate) -{ - uint32_t result; - if (__builtin_constant_p(rotate) && ((rotate == 8U) || (rotate == 16U) || (rotate == 24U))) { - __ASM volatile ("sxtb16 %0, %1, 
ROR %2" : "=r" (result) : "r" (op1), "i" (rotate) ); - } else { - result = __SXTB16(__ROR(op1, rotate)) ; - } - return result; -} - -__STATIC_FORCEINLINE uint32_t __SXTAB16(uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM ("sxtab16 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SXTAB16_RORn(uint32_t op1, uint32_t op2, uint32_t rotate) -{ - uint32_t result; - if (__builtin_constant_p(rotate) && ((rotate == 8U) || (rotate == 16U) || (rotate == 24U))) { - __ASM volatile ("sxtab16 %0, %1, %2, ROR %3" : "=r" (result) : "r" (op1) , "r" (op2) , "i" (rotate)); - } else { - result = __SXTAB16(op1, __ROR(op2, rotate)); - } - return result; -} - - -__STATIC_FORCEINLINE uint32_t __SMUAD (uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("smuad %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SMUADX (uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("smuadx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SMLAD (uint32_t op1, uint32_t op2, uint32_t op3) -{ - uint32_t result; - - __ASM volatile ("smlad %0, %1, %2, %3" : "=r" (result) : "r" (op1), "r" (op2), "r" (op3) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SMLADX (uint32_t op1, uint32_t op2, uint32_t op3) -{ - uint32_t result; - - __ASM volatile ("smladx %0, %1, %2, %3" : "=r" (result) : "r" (op1), "r" (op2), "r" (op3) ); - return(result); -} - -__STATIC_FORCEINLINE uint64_t __SMLALD (uint32_t op1, uint32_t op2, uint64_t acc) -{ - union llreg_u{ - uint32_t w32[2]; - uint64_t w64; - } llr; - llr.w64 = acc; - -#ifndef __ARMEB__ /* Little endian */ - __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) ); -#else /* Big endian */ - __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[1]), "=r" 
(llr.w32[0]): "r" (op1), "r" (op2) , "0" (llr.w32[1]), "1" (llr.w32[0]) ); -#endif - - return(llr.w64); -} - -__STATIC_FORCEINLINE uint64_t __SMLALDX (uint32_t op1, uint32_t op2, uint64_t acc) -{ - union llreg_u{ - uint32_t w32[2]; - uint64_t w64; - } llr; - llr.w64 = acc; - -#ifndef __ARMEB__ /* Little endian */ - __ASM volatile ("smlaldx %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) ); -#else /* Big endian */ - __ASM volatile ("smlaldx %0, %1, %2, %3" : "=r" (llr.w32[1]), "=r" (llr.w32[0]): "r" (op1), "r" (op2) , "0" (llr.w32[1]), "1" (llr.w32[0]) ); -#endif - - return(llr.w64); -} - -__STATIC_FORCEINLINE uint32_t __SMUSD (uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("smusd %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SMUSDX (uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("smusdx %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SMLSD (uint32_t op1, uint32_t op2, uint32_t op3) -{ - uint32_t result; - - __ASM volatile ("smlsd %0, %1, %2, %3" : "=r" (result) : "r" (op1), "r" (op2), "r" (op3) ); - return(result); -} - -__STATIC_FORCEINLINE uint32_t __SMLSDX (uint32_t op1, uint32_t op2, uint32_t op3) -{ - uint32_t result; - - __ASM volatile ("smlsdx %0, %1, %2, %3" : "=r" (result) : "r" (op1), "r" (op2), "r" (op3) ); - return(result); -} - -__STATIC_FORCEINLINE uint64_t __SMLSLD (uint32_t op1, uint32_t op2, uint64_t acc) -{ - union llreg_u{ - uint32_t w32[2]; - uint64_t w64; - } llr; - llr.w64 = acc; - -#ifndef __ARMEB__ /* Little endian */ - __ASM volatile ("smlsld %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) ); -#else /* Big endian */ - __ASM volatile ("smlsld %0, %1, %2, %3" : "=r" (llr.w32[1]), "=r" (llr.w32[0]): "r" (op1), "r" (op2) , "0" (llr.w32[1]), "1" 
(llr.w32[0]) ); -#endif - - return(llr.w64); -} - -__STATIC_FORCEINLINE uint64_t __SMLSLDX (uint32_t op1, uint32_t op2, uint64_t acc) -{ - union llreg_u{ - uint32_t w32[2]; - uint64_t w64; - } llr; - llr.w64 = acc; - -#ifndef __ARMEB__ /* Little endian */ - __ASM volatile ("smlsldx %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) ); -#else /* Big endian */ - __ASM volatile ("smlsldx %0, %1, %2, %3" : "=r" (llr.w32[1]), "=r" (llr.w32[0]): "r" (op1), "r" (op2) , "0" (llr.w32[1]), "1" (llr.w32[0]) ); -#endif - - return(llr.w64); -} - -__STATIC_FORCEINLINE uint32_t __SEL (uint32_t op1, uint32_t op2) -{ - uint32_t result; - - __ASM volatile ("sel %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE int32_t __QADD( int32_t op1, int32_t op2) -{ - int32_t result; - - __ASM volatile ("qadd %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - -__STATIC_FORCEINLINE int32_t __QSUB( int32_t op1, int32_t op2) -{ - int32_t result; - - __ASM volatile ("qsub %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) ); - return(result); -} - - -#define __PKHBT(ARG1,ARG2,ARG3) \ -__extension__ \ -({ \ - uint32_t __RES, __ARG1 = (ARG1), __ARG2 = (ARG2); \ - __ASM ("pkhbt %0, %1, %2, lsl %3" : "=r" (__RES) : "r" (__ARG1), "r" (__ARG2), "I" (ARG3) ); \ - __RES; \ - }) - -#define __PKHTB(ARG1,ARG2,ARG3) \ -__extension__ \ -({ \ - uint32_t __RES, __ARG1 = (ARG1), __ARG2 = (ARG2); \ - if (ARG3 == 0) \ - __ASM ("pkhtb %0, %1, %2" : "=r" (__RES) : "r" (__ARG1), "r" (__ARG2) ); \ - else \ - __ASM ("pkhtb %0, %1, %2, asr %3" : "=r" (__RES) : "r" (__ARG1), "r" (__ARG2), "I" (ARG3) ); \ - __RES; \ - }) - - -__STATIC_FORCEINLINE int32_t __SMMLA (int32_t op1, int32_t op2, int32_t op3) -{ - int32_t result; - - __ASM ("smmla %0, %1, %2, %3" : "=r" (result): "r" (op1), "r" (op2), "r" (op3) ); - return(result); -} - -#endif /* (__ARM_FEATURE_DSP == 1) */ -/*@} end of group 
CMSIS_SIMD_intrinsics */ - - -#pragma GCC diagnostic pop - -#endif /* __CMSIS_GCC_H */ diff --git a/envs/core/src/cmsis_version.h b/envs/core/src/cmsis_version.h deleted file mode 100644 index eb9f7ca..0000000 --- a/envs/core/src/cmsis_version.h +++ /dev/null @@ -1,39 +0,0 @@ -/**************************************************************************//** - * @file cmsis_version.h - * @brief CMSIS Core(M) Version definitions - * @version V5.0.4 - * @date 23. July 2019 - ******************************************************************************/ -/* - * Copyright (c) 2009-2019 ARM Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#if defined ( __ICCARM__ ) - #pragma system_include /* treat file as system include file for MISRA check */ -#elif defined (__clang__) - #pragma clang system_header /* treat file as system include file */ -#endif - -#ifndef __CMSIS_VERSION_H -#define __CMSIS_VERSION_H - -/* CMSIS Version definitions */ -#define __CM_CMSIS_VERSION_MAIN ( 5U) /*!< [31:16] CMSIS Core(M) main version */ -#define __CM_CMSIS_VERSION_SUB ( 5U) /*!< [15:0] CMSIS Core(M) sub version */ -#define __CM_CMSIS_VERSION ((__CM_CMSIS_VERSION_MAIN << 16U) | \ - __CM_CMSIS_VERSION_SUB ) /*!< CMSIS Core(M) version number */ -#endif diff --git a/envs/core/src/core_cm55.h b/envs/core/src/core_cm55.h deleted file mode 100644 index 63ac6c4..0000000 --- a/envs/core/src/core_cm55.h +++ /dev/null @@ -1,4289 +0,0 @@ -/**************************************************************************//** - * @file core_cm55.h - * @brief CMSIS Cortex-M55 Core Peripheral Access Layer Header File - * @version V1.2.2 - * @date 13. October 2021 - ******************************************************************************/ -/* - * Copyright (c) 2018-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#if defined ( __ICCARM__ ) - #pragma system_include /* treat file as system include file for MISRA check */ -#elif defined (__clang__) - #pragma clang system_header /* treat file as system include file */ -#elif defined ( __GNUC__ ) - #pragma GCC diagnostic ignored "-Wpedantic" /* disable pedantic warning due to unnamed structs/unions */ -#endif - -#ifndef __CORE_CM55_H_GENERIC -#define __CORE_CM55_H_GENERIC - -#include <stdint.h> - -#ifdef __cplusplus - extern "C" { -#endif - -/** - \page CMSIS_MISRA_Exceptions MISRA-C:2004 Compliance Exceptions - CMSIS violates the following MISRA-C:2004 rules: - - \li Required Rule 8.5, object/function definition in header file.
- Function definitions in header files are used to allow 'inlining'. - - \li Required Rule 18.4, declaration of union type or object of union type: '{...}'.
- Unions are used for effective representation of core registers. - - \li Advisory Rule 19.7, Function-like macro defined.
- Function-like macros are used to allow more efficient code. - */ - - -/******************************************************************************* - * CMSIS definitions - ******************************************************************************/ -/** - \ingroup Cortex_CM55 - @{ - */ - -#include "cmsis_version.h" - -/* CMSIS CM55 definitions */ -#define __CM55_CMSIS_VERSION_MAIN (__CM_CMSIS_VERSION_MAIN) /*!< \deprecated [31:16] CMSIS HAL main version */ -#define __CM55_CMSIS_VERSION_SUB (__CM_CMSIS_VERSION_SUB) /*!< \deprecated [15:0] CMSIS HAL sub version */ -#define __CM55_CMSIS_VERSION ((__CM55_CMSIS_VERSION_MAIN << 16U) | \ - __CM55_CMSIS_VERSION_SUB ) /*!< \deprecated CMSIS HAL version number */ - -#define __CORTEX_M (55U) /*!< Cortex-M Core */ - -#if defined ( __CC_ARM ) - #error Legacy Arm Compiler does not support Armv8.1-M target architecture. -#elif defined (__ARMCC_VERSION) && (__ARMCC_VERSION >= 6010050) - #if defined __ARM_FP - #if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) - #define __FPU_USED 1U - #else - #error "Compiler generates FPU instructions for a device without an FPU (check __FPU_PRESENT)" - #define __FPU_USED 0U - #endif - #else - #define __FPU_USED 0U - #endif - - #if defined(__ARM_FEATURE_DSP) - #if defined(__DSP_PRESENT) && (__DSP_PRESENT == 1U) - #define __DSP_USED 1U - #else - #error "Compiler generates DSP (SIMD) instructions for a devices without DSP extensions (check __DSP_PRESENT)" - #define __DSP_USED 0U - #endif - #else - #define __DSP_USED 0U - #endif - -#elif defined ( __GNUC__ ) - #if defined (__VFP_FP__) && !defined(__SOFTFP__) - #if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) - #define __FPU_USED 1U - #else - #error "Compiler generates FPU instructions for a device without an FPU (check __FPU_PRESENT)" - #define __FPU_USED 0U - #endif - #else - #define __FPU_USED 0U - #endif - - #if defined(__ARM_FEATURE_DSP) - #if defined(__DSP_PRESENT) && (__DSP_PRESENT == 1U) - #define __DSP_USED 1U - #else - #error 
"Compiler generates DSP (SIMD) instructions for a devices without DSP extensions (check __DSP_PRESENT)" - #define __DSP_USED 0U - #endif - #else - #define __DSP_USED 0U - #endif - -#elif defined ( __ICCARM__ ) - #if defined __ARMVFP__ - #if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) - #define __FPU_USED 1U - #else - #error "Compiler generates FPU instructions for a device without an FPU (check __FPU_PRESENT)" - #define __FPU_USED 0U - #endif - #else - #define __FPU_USED 0U - #endif - - #if defined(__ARM_FEATURE_DSP) - #if defined(__DSP_PRESENT) && (__DSP_PRESENT == 1U) - #define __DSP_USED 1U - #else - #error "Compiler generates DSP (SIMD) instructions for a devices without DSP extensions (check __DSP_PRESENT)" - #define __DSP_USED 0U - #endif - #else - #define __DSP_USED 0U - #endif - -#elif defined ( __TI_ARM__ ) - #if defined __TI_VFP_SUPPORT__ - #if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) - #define __FPU_USED 1U - #else - #error "Compiler generates FPU instructions for a device without an FPU (check __FPU_PRESENT)" - #define __FPU_USED 0U - #endif - #else - #define __FPU_USED 0U - #endif - -#elif defined ( __TASKING__ ) - #if defined __FPU_VFP__ - #if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) - #define __FPU_USED 1U - #else - #error "Compiler generates FPU instructions for a device without an FPU (check __FPU_PRESENT)" - #define __FPU_USED 0U - #endif - #else - #define __FPU_USED 0U - #endif - -#elif defined ( __CSMC__ ) - #if ( __CSMC__ & 0x400U) - #if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) - #define __FPU_USED 1U - #else - #error "Compiler generates FPU instructions for a device without an FPU (check __FPU_PRESENT)" - #define __FPU_USED 0U - #endif - #else - #define __FPU_USED 0U - #endif - -#endif - -#include "cmsis_compiler.h" /* CMSIS compiler specific defines */ - - -#ifdef __cplusplus -} -#endif - -#endif /* __CORE_CM55_H_GENERIC */ - -#ifndef __CMSIS_GENERIC - -#ifndef __CORE_CM55_H_DEPENDANT -#define 
__CORE_CM55_H_DEPENDANT - -#ifdef __cplusplus - extern "C" { -#endif - -/* check device defines and use defaults */ -#if defined __CHECK_DEVICE_DEFINES - #ifndef __CM55_REV - #define __CM55_REV 0x0000U - #warning "__CM55_REV not defined in device header file; using default!" - #endif - - #ifndef __FPU_PRESENT - #define __FPU_PRESENT 0U - #warning "__FPU_PRESENT not defined in device header file; using default!" - #endif - - #if __FPU_PRESENT != 0U - #ifndef __FPU_DP - #define __FPU_DP 0U - #warning "__FPU_DP not defined in device header file; using default!" - #endif - #endif - - #ifndef __MPU_PRESENT - #define __MPU_PRESENT 0U - #warning "__MPU_PRESENT not defined in device header file; using default!" - #endif - - #ifndef __ICACHE_PRESENT - #define __ICACHE_PRESENT 0U - #warning "__ICACHE_PRESENT not defined in device header file; using default!" - #endif - - #ifndef __DCACHE_PRESENT - #define __DCACHE_PRESENT 0U - #warning "__DCACHE_PRESENT not defined in device header file; using default!" - #endif - - #ifndef __VTOR_PRESENT - #define __VTOR_PRESENT 1U - #warning "__VTOR_PRESENT not defined in device header file; using default!" - #endif - - #ifndef __PMU_PRESENT - #define __PMU_PRESENT 0U - #warning "__PMU_PRESENT not defined in device header file; using default!" - #endif - - #if __PMU_PRESENT != 0U - #ifndef __PMU_NUM_EVENTCNT - #define __PMU_NUM_EVENTCNT 8U - #warning "__PMU_NUM_EVENTCNT not defined in device header file; using default!" - #elif (__PMU_NUM_EVENTCNT > 8 || __PMU_NUM_EVENTCNT < 2) - #error "__PMU_NUM_EVENTCNT is out of range in device header file!" */ - #endif - #endif - - #ifndef __SAUREGION_PRESENT - #define __SAUREGION_PRESENT 0U - #warning "__SAUREGION_PRESENT not defined in device header file; using default!" - #endif - - #ifndef __DSP_PRESENT - #define __DSP_PRESENT 0U - #warning "__DSP_PRESENT not defined in device header file; using default!" 
- #endif - - #ifndef __NVIC_PRIO_BITS - #define __NVIC_PRIO_BITS 3U - #warning "__NVIC_PRIO_BITS not defined in device header file; using default!" - #endif - - #ifndef __Vendor_SysTickConfig - #define __Vendor_SysTickConfig 0U - #warning "__Vendor_SysTickConfig not defined in device header file; using default!" - #endif -#endif - -/* IO definitions (access restrictions to peripheral registers) */ -/** - \defgroup CMSIS_glob_defs CMSIS Global Defines - - IO Type Qualifiers are used - \li to specify the access to peripheral variables. - \li for automatic generation of peripheral register debug information. -*/ -#ifdef __cplusplus - #define __I volatile /*!< Defines 'read only' permissions */ -#else - #define __I volatile const /*!< Defines 'read only' permissions */ -#endif -#define __O volatile /*!< Defines 'write only' permissions */ -#define __IO volatile /*!< Defines 'read / write' permissions */ - -/* following defines should be used for structure members */ -#define __IM volatile const /*! Defines 'read only' structure member permissions */ -#define __OM volatile /*! Defines 'write only' structure member permissions */ -#define __IOM volatile /*! Defines 'read / write' structure member permissions */ - -/*@} end of group Cortex_M55 */ - - - -/******************************************************************************* - * Register Abstraction - Core Register contain: - - Core Register - - Core NVIC Register - - Core SCB Register - - Core SysTick Register - - Core Debug Register - - Core MPU Register - - Core SAU Register - - Core FPU Register - ******************************************************************************/ -/** - \defgroup CMSIS_core_register Defines and Type Definitions - \brief Type definitions and defines for Cortex-M processor based devices. -*/ - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_CORE Status and Control Registers - \brief Core Register type definitions. 
- @{ - */ - -/** - \brief Union type to access the Application Program Status Register (APSR). - */ -typedef union -{ - struct - { - uint32_t _reserved0:16; /*!< bit: 0..15 Reserved */ - uint32_t GE:4; /*!< bit: 16..19 Greater than or Equal flags */ - uint32_t _reserved1:7; /*!< bit: 20..26 Reserved */ - uint32_t Q:1; /*!< bit: 27 Saturation condition flag */ - uint32_t V:1; /*!< bit: 28 Overflow condition code flag */ - uint32_t C:1; /*!< bit: 29 Carry condition code flag */ - uint32_t Z:1; /*!< bit: 30 Zero condition code flag */ - uint32_t N:1; /*!< bit: 31 Negative condition code flag */ - } b; /*!< Structure used for bit access */ - uint32_t w; /*!< Type used for word access */ -} APSR_Type; - -/* APSR Register Definitions */ -#define APSR_N_Pos 31U /*!< APSR: N Position */ -#define APSR_N_Msk (1UL << APSR_N_Pos) /*!< APSR: N Mask */ - -#define APSR_Z_Pos 30U /*!< APSR: Z Position */ -#define APSR_Z_Msk (1UL << APSR_Z_Pos) /*!< APSR: Z Mask */ - -#define APSR_C_Pos 29U /*!< APSR: C Position */ -#define APSR_C_Msk (1UL << APSR_C_Pos) /*!< APSR: C Mask */ - -#define APSR_V_Pos 28U /*!< APSR: V Position */ -#define APSR_V_Msk (1UL << APSR_V_Pos) /*!< APSR: V Mask */ - -#define APSR_Q_Pos 27U /*!< APSR: Q Position */ -#define APSR_Q_Msk (1UL << APSR_Q_Pos) /*!< APSR: Q Mask */ - -#define APSR_GE_Pos 16U /*!< APSR: GE Position */ -#define APSR_GE_Msk (0xFUL << APSR_GE_Pos) /*!< APSR: GE Mask */ - - -/** - \brief Union type to access the Interrupt Program Status Register (IPSR). - */ -typedef union -{ - struct - { - uint32_t ISR:9; /*!< bit: 0.. 
8 Exception number */ - uint32_t _reserved0:23; /*!< bit: 9..31 Reserved */ - } b; /*!< Structure used for bit access */ - uint32_t w; /*!< Type used for word access */ -} IPSR_Type; - -/* IPSR Register Definitions */ -#define IPSR_ISR_Pos 0U /*!< IPSR: ISR Position */ -#define IPSR_ISR_Msk (0x1FFUL /*<< IPSR_ISR_Pos*/) /*!< IPSR: ISR Mask */ - - -/** - \brief Union type to access the Special-Purpose Program Status Registers (xPSR). - */ -typedef union -{ - struct - { - uint32_t ISR:9; /*!< bit: 0.. 8 Exception number */ - uint32_t _reserved0:7; /*!< bit: 9..15 Reserved */ - uint32_t GE:4; /*!< bit: 16..19 Greater than or Equal flags */ - uint32_t _reserved1:4; /*!< bit: 20..23 Reserved */ - uint32_t T:1; /*!< bit: 24 Thumb bit (read 0) */ - uint32_t IT:2; /*!< bit: 25..26 saved IT state (read 0) */ - uint32_t Q:1; /*!< bit: 27 Saturation condition flag */ - uint32_t V:1; /*!< bit: 28 Overflow condition code flag */ - uint32_t C:1; /*!< bit: 29 Carry condition code flag */ - uint32_t Z:1; /*!< bit: 30 Zero condition code flag */ - uint32_t N:1; /*!< bit: 31 Negative condition code flag */ - } b; /*!< Structure used for bit access */ - uint32_t w; /*!< Type used for word access */ -} xPSR_Type; - -/* xPSR Register Definitions */ -#define xPSR_N_Pos 31U /*!< xPSR: N Position */ -#define xPSR_N_Msk (1UL << xPSR_N_Pos) /*!< xPSR: N Mask */ - -#define xPSR_Z_Pos 30U /*!< xPSR: Z Position */ -#define xPSR_Z_Msk (1UL << xPSR_Z_Pos) /*!< xPSR: Z Mask */ - -#define xPSR_C_Pos 29U /*!< xPSR: C Position */ -#define xPSR_C_Msk (1UL << xPSR_C_Pos) /*!< xPSR: C Mask */ - -#define xPSR_V_Pos 28U /*!< xPSR: V Position */ -#define xPSR_V_Msk (1UL << xPSR_V_Pos) /*!< xPSR: V Mask */ - -#define xPSR_Q_Pos 27U /*!< xPSR: Q Position */ -#define xPSR_Q_Msk (1UL << xPSR_Q_Pos) /*!< xPSR: Q Mask */ - -#define xPSR_IT_Pos 25U /*!< xPSR: IT Position */ -#define xPSR_IT_Msk (3UL << xPSR_IT_Pos) /*!< xPSR: IT Mask */ - -#define xPSR_T_Pos 24U /*!< xPSR: T Position */ -#define xPSR_T_Msk (1UL 
<< xPSR_T_Pos) /*!< xPSR: T Mask */ - -#define xPSR_GE_Pos 16U /*!< xPSR: GE Position */ -#define xPSR_GE_Msk (0xFUL << xPSR_GE_Pos) /*!< xPSR: GE Mask */ - -#define xPSR_ISR_Pos 0U /*!< xPSR: ISR Position */ -#define xPSR_ISR_Msk (0x1FFUL /*<< xPSR_ISR_Pos*/) /*!< xPSR: ISR Mask */ - - -/** - \brief Union type to access the Control Registers (CONTROL). - */ -typedef union -{ - struct - { - uint32_t nPRIV:1; /*!< bit: 0 Execution privilege in Thread mode */ - uint32_t SPSEL:1; /*!< bit: 1 Stack-pointer select */ - uint32_t FPCA:1; /*!< bit: 2 Floating-point context active */ - uint32_t SFPA:1; /*!< bit: 3 Secure floating-point active */ - uint32_t _reserved1:28; /*!< bit: 4..31 Reserved */ - } b; /*!< Structure used for bit access */ - uint32_t w; /*!< Type used for word access */ -} CONTROL_Type; - -/* CONTROL Register Definitions */ -#define CONTROL_SFPA_Pos 3U /*!< CONTROL: SFPA Position */ -#define CONTROL_SFPA_Msk (1UL << CONTROL_SFPA_Pos) /*!< CONTROL: SFPA Mask */ - -#define CONTROL_FPCA_Pos 2U /*!< CONTROL: FPCA Position */ -#define CONTROL_FPCA_Msk (1UL << CONTROL_FPCA_Pos) /*!< CONTROL: FPCA Mask */ - -#define CONTROL_SPSEL_Pos 1U /*!< CONTROL: SPSEL Position */ -#define CONTROL_SPSEL_Msk (1UL << CONTROL_SPSEL_Pos) /*!< CONTROL: SPSEL Mask */ - -#define CONTROL_nPRIV_Pos 0U /*!< CONTROL: nPRIV Position */ -#define CONTROL_nPRIV_Msk (1UL /*<< CONTROL_nPRIV_Pos*/) /*!< CONTROL: nPRIV Mask */ - -/*@} end of group CMSIS_CORE */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_NVIC Nested Vectored Interrupt Controller (NVIC) - \brief Type definitions for the NVIC Registers - @{ - */ - -/** - \brief Structure type to access the Nested Vectored Interrupt Controller (NVIC). 
- */ -typedef struct -{ - __IOM uint32_t ISER[16U]; /*!< Offset: 0x000 (R/W) Interrupt Set Enable Register */ - uint32_t RESERVED0[16U]; - __IOM uint32_t ICER[16U]; /*!< Offset: 0x080 (R/W) Interrupt Clear Enable Register */ - uint32_t RSERVED1[16U]; - __IOM uint32_t ISPR[16U]; /*!< Offset: 0x100 (R/W) Interrupt Set Pending Register */ - uint32_t RESERVED2[16U]; - __IOM uint32_t ICPR[16U]; /*!< Offset: 0x180 (R/W) Interrupt Clear Pending Register */ - uint32_t RESERVED3[16U]; - __IOM uint32_t IABR[16U]; /*!< Offset: 0x200 (R/W) Interrupt Active bit Register */ - uint32_t RESERVED4[16U]; - __IOM uint32_t ITNS[16U]; /*!< Offset: 0x280 (R/W) Interrupt Non-Secure State Register */ - uint32_t RESERVED5[16U]; - __IOM uint8_t IPR[496U]; /*!< Offset: 0x300 (R/W) Interrupt Priority Register (8Bit wide) */ - uint32_t RESERVED6[580U]; - __OM uint32_t STIR; /*!< Offset: 0xE00 ( /W) Software Trigger Interrupt Register */ -} NVIC_Type; - -/* Software Triggered Interrupt Register Definitions */ -#define NVIC_STIR_INTID_Pos 0U /*!< STIR: INTLINESNUM Position */ -#define NVIC_STIR_INTID_Msk (0x1FFUL /*<< NVIC_STIR_INTID_Pos*/) /*!< STIR: INTLINESNUM Mask */ - -/*@} end of group CMSIS_NVIC */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_SCB System Control Block (SCB) - \brief Type definitions for the System Control Block Registers - @{ - */ - -/** - \brief Structure type to access the System Control Block (SCB). 
- */ -typedef struct -{ - __IM uint32_t CPUID; /*!< Offset: 0x000 (R/ ) CPUID Base Register */ - __IOM uint32_t ICSR; /*!< Offset: 0x004 (R/W) Interrupt Control and State Register */ - __IOM uint32_t VTOR; /*!< Offset: 0x008 (R/W) Vector Table Offset Register */ - __IOM uint32_t AIRCR; /*!< Offset: 0x00C (R/W) Application Interrupt and Reset Control Register */ - __IOM uint32_t SCR; /*!< Offset: 0x010 (R/W) System Control Register */ - __IOM uint32_t CCR; /*!< Offset: 0x014 (R/W) Configuration Control Register */ - __IOM uint8_t SHPR[12U]; /*!< Offset: 0x018 (R/W) System Handlers Priority Registers (4-7, 8-11, 12-15) */ - __IOM uint32_t SHCSR; /*!< Offset: 0x024 (R/W) System Handler Control and State Register */ - __IOM uint32_t CFSR; /*!< Offset: 0x028 (R/W) Configurable Fault Status Register */ - __IOM uint32_t HFSR; /*!< Offset: 0x02C (R/W) HardFault Status Register */ - __IOM uint32_t DFSR; /*!< Offset: 0x030 (R/W) Debug Fault Status Register */ - __IOM uint32_t MMFAR; /*!< Offset: 0x034 (R/W) MemManage Fault Address Register */ - __IOM uint32_t BFAR; /*!< Offset: 0x038 (R/W) BusFault Address Register */ - __IOM uint32_t AFSR; /*!< Offset: 0x03C (R/W) Auxiliary Fault Status Register */ - __IM uint32_t ID_PFR[2U]; /*!< Offset: 0x040 (R/ ) Processor Feature Register */ - __IM uint32_t ID_DFR; /*!< Offset: 0x048 (R/ ) Debug Feature Register */ - __IM uint32_t ID_AFR; /*!< Offset: 0x04C (R/ ) Auxiliary Feature Register */ - __IM uint32_t ID_MMFR[4U]; /*!< Offset: 0x050 (R/ ) Memory Model Feature Register */ - __IM uint32_t ID_ISAR[6U]; /*!< Offset: 0x060 (R/ ) Instruction Set Attributes Register */ - __IM uint32_t CLIDR; /*!< Offset: 0x078 (R/ ) Cache Level ID register */ - __IM uint32_t CTR; /*!< Offset: 0x07C (R/ ) Cache Type register */ - __IM uint32_t CCSIDR; /*!< Offset: 0x080 (R/ ) Cache Size ID Register */ - __IOM uint32_t CSSELR; /*!< Offset: 0x084 (R/W) Cache Size Selection Register */ - __IOM uint32_t CPACR; /*!< Offset: 0x088 (R/W) Coprocessor Access 
Control Register */ - __IOM uint32_t NSACR; /*!< Offset: 0x08C (R/W) Non-Secure Access Control Register */ - uint32_t RESERVED7[21U]; - __IOM uint32_t SFSR; /*!< Offset: 0x0E4 (R/W) Secure Fault Status Register */ - __IOM uint32_t SFAR; /*!< Offset: 0x0E8 (R/W) Secure Fault Address Register */ - uint32_t RESERVED3[69U]; - __OM uint32_t STIR; /*!< Offset: 0x200 ( /W) Software Triggered Interrupt Register */ - __IOM uint32_t RFSR; /*!< Offset: 0x204 (R/W) RAS Fault Status Register */ - uint32_t RESERVED4[14U]; - __IM uint32_t MVFR0; /*!< Offset: 0x240 (R/ ) Media and VFP Feature Register 0 */ - __IM uint32_t MVFR1; /*!< Offset: 0x244 (R/ ) Media and VFP Feature Register 1 */ - __IM uint32_t MVFR2; /*!< Offset: 0x248 (R/ ) Media and VFP Feature Register 2 */ - uint32_t RESERVED5[1U]; - __OM uint32_t ICIALLU; /*!< Offset: 0x250 ( /W) I-Cache Invalidate All to PoU */ - uint32_t RESERVED6[1U]; - __OM uint32_t ICIMVAU; /*!< Offset: 0x258 ( /W) I-Cache Invalidate by MVA to PoU */ - __OM uint32_t DCIMVAC; /*!< Offset: 0x25C ( /W) D-Cache Invalidate by MVA to PoC */ - __OM uint32_t DCISW; /*!< Offset: 0x260 ( /W) D-Cache Invalidate by Set-way */ - __OM uint32_t DCCMVAU; /*!< Offset: 0x264 ( /W) D-Cache Clean by MVA to PoU */ - __OM uint32_t DCCMVAC; /*!< Offset: 0x268 ( /W) D-Cache Clean by MVA to PoC */ - __OM uint32_t DCCSW; /*!< Offset: 0x26C ( /W) D-Cache Clean by Set-way */ - __OM uint32_t DCCIMVAC; /*!< Offset: 0x270 ( /W) D-Cache Clean and Invalidate by MVA to PoC */ - __OM uint32_t DCCISW; /*!< Offset: 0x274 ( /W) D-Cache Clean and Invalidate by Set-way */ - __OM uint32_t BPIALL; /*!< Offset: 0x278 ( /W) Branch Predictor Invalidate All */ -} SCB_Type; - -/* SCB CPUID Register Definitions */ -#define SCB_CPUID_IMPLEMENTER_Pos 24U /*!< SCB CPUID: IMPLEMENTER Position */ -#define SCB_CPUID_IMPLEMENTER_Msk (0xFFUL << SCB_CPUID_IMPLEMENTER_Pos) /*!< SCB CPUID: IMPLEMENTER Mask */ - -#define SCB_CPUID_VARIANT_Pos 20U /*!< SCB CPUID: VARIANT Position */ -#define 
SCB_CPUID_VARIANT_Msk (0xFUL << SCB_CPUID_VARIANT_Pos) /*!< SCB CPUID: VARIANT Mask */ - -#define SCB_CPUID_ARCHITECTURE_Pos 16U /*!< SCB CPUID: ARCHITECTURE Position */ -#define SCB_CPUID_ARCHITECTURE_Msk (0xFUL << SCB_CPUID_ARCHITECTURE_Pos) /*!< SCB CPUID: ARCHITECTURE Mask */ - -#define SCB_CPUID_PARTNO_Pos 4U /*!< SCB CPUID: PARTNO Position */ -#define SCB_CPUID_PARTNO_Msk (0xFFFUL << SCB_CPUID_PARTNO_Pos) /*!< SCB CPUID: PARTNO Mask */ - -#define SCB_CPUID_REVISION_Pos 0U /*!< SCB CPUID: REVISION Position */ -#define SCB_CPUID_REVISION_Msk (0xFUL /*<< SCB_CPUID_REVISION_Pos*/) /*!< SCB CPUID: REVISION Mask */ - -/* SCB Interrupt Control State Register Definitions */ -#define SCB_ICSR_PENDNMISET_Pos 31U /*!< SCB ICSR: PENDNMISET Position */ -#define SCB_ICSR_PENDNMISET_Msk (1UL << SCB_ICSR_PENDNMISET_Pos) /*!< SCB ICSR: PENDNMISET Mask */ - -#define SCB_ICSR_NMIPENDSET_Pos SCB_ICSR_PENDNMISET_Pos /*!< SCB ICSR: NMIPENDSET Position, backward compatibility */ -#define SCB_ICSR_NMIPENDSET_Msk SCB_ICSR_PENDNMISET_Msk /*!< SCB ICSR: NMIPENDSET Mask, backward compatibility */ - -#define SCB_ICSR_PENDNMICLR_Pos 30U /*!< SCB ICSR: PENDNMICLR Position */ -#define SCB_ICSR_PENDNMICLR_Msk (1UL << SCB_ICSR_PENDNMICLR_Pos) /*!< SCB ICSR: PENDNMICLR Mask */ - -#define SCB_ICSR_PENDSVSET_Pos 28U /*!< SCB ICSR: PENDSVSET Position */ -#define SCB_ICSR_PENDSVSET_Msk (1UL << SCB_ICSR_PENDSVSET_Pos) /*!< SCB ICSR: PENDSVSET Mask */ - -#define SCB_ICSR_PENDSVCLR_Pos 27U /*!< SCB ICSR: PENDSVCLR Position */ -#define SCB_ICSR_PENDSVCLR_Msk (1UL << SCB_ICSR_PENDSVCLR_Pos) /*!< SCB ICSR: PENDSVCLR Mask */ - -#define SCB_ICSR_PENDSTSET_Pos 26U /*!< SCB ICSR: PENDSTSET Position */ -#define SCB_ICSR_PENDSTSET_Msk (1UL << SCB_ICSR_PENDSTSET_Pos) /*!< SCB ICSR: PENDSTSET Mask */ - -#define SCB_ICSR_PENDSTCLR_Pos 25U /*!< SCB ICSR: PENDSTCLR Position */ -#define SCB_ICSR_PENDSTCLR_Msk (1UL << SCB_ICSR_PENDSTCLR_Pos) /*!< SCB ICSR: PENDSTCLR Mask */ - -#define SCB_ICSR_STTNS_Pos 24U /*!< SCB 
ICSR: STTNS Position (Security Extension) */ -#define SCB_ICSR_STTNS_Msk (1UL << SCB_ICSR_STTNS_Pos) /*!< SCB ICSR: STTNS Mask (Security Extension) */ - -#define SCB_ICSR_ISRPREEMPT_Pos 23U /*!< SCB ICSR: ISRPREEMPT Position */ -#define SCB_ICSR_ISRPREEMPT_Msk (1UL << SCB_ICSR_ISRPREEMPT_Pos) /*!< SCB ICSR: ISRPREEMPT Mask */ - -#define SCB_ICSR_ISRPENDING_Pos 22U /*!< SCB ICSR: ISRPENDING Position */ -#define SCB_ICSR_ISRPENDING_Msk (1UL << SCB_ICSR_ISRPENDING_Pos) /*!< SCB ICSR: ISRPENDING Mask */ - -#define SCB_ICSR_VECTPENDING_Pos 12U /*!< SCB ICSR: VECTPENDING Position */ -#define SCB_ICSR_VECTPENDING_Msk (0x1FFUL << SCB_ICSR_VECTPENDING_Pos) /*!< SCB ICSR: VECTPENDING Mask */ - -#define SCB_ICSR_RETTOBASE_Pos 11U /*!< SCB ICSR: RETTOBASE Position */ -#define SCB_ICSR_RETTOBASE_Msk (1UL << SCB_ICSR_RETTOBASE_Pos) /*!< SCB ICSR: RETTOBASE Mask */ - -#define SCB_ICSR_VECTACTIVE_Pos 0U /*!< SCB ICSR: VECTACTIVE Position */ -#define SCB_ICSR_VECTACTIVE_Msk (0x1FFUL /*<< SCB_ICSR_VECTACTIVE_Pos*/) /*!< SCB ICSR: VECTACTIVE Mask */ - -/* SCB Vector Table Offset Register Definitions */ -#define SCB_VTOR_TBLOFF_Pos 7U /*!< SCB VTOR: TBLOFF Position */ -#define SCB_VTOR_TBLOFF_Msk (0x1FFFFFFUL << SCB_VTOR_TBLOFF_Pos) /*!< SCB VTOR: TBLOFF Mask */ - -/* SCB Application Interrupt and Reset Control Register Definitions */ -#define SCB_AIRCR_VECTKEY_Pos 16U /*!< SCB AIRCR: VECTKEY Position */ -#define SCB_AIRCR_VECTKEY_Msk (0xFFFFUL << SCB_AIRCR_VECTKEY_Pos) /*!< SCB AIRCR: VECTKEY Mask */ - -#define SCB_AIRCR_VECTKEYSTAT_Pos 16U /*!< SCB AIRCR: VECTKEYSTAT Position */ -#define SCB_AIRCR_VECTKEYSTAT_Msk (0xFFFFUL << SCB_AIRCR_VECTKEYSTAT_Pos) /*!< SCB AIRCR: VECTKEYSTAT Mask */ - -#define SCB_AIRCR_ENDIANESS_Pos 15U /*!< SCB AIRCR: ENDIANESS Position */ -#define SCB_AIRCR_ENDIANESS_Msk (1UL << SCB_AIRCR_ENDIANESS_Pos) /*!< SCB AIRCR: ENDIANESS Mask */ - -#define SCB_AIRCR_PRIS_Pos 14U /*!< SCB AIRCR: PRIS Position */ -#define SCB_AIRCR_PRIS_Msk (1UL << SCB_AIRCR_PRIS_Pos) 
/*!< SCB AIRCR: PRIS Mask */ - -#define SCB_AIRCR_BFHFNMINS_Pos 13U /*!< SCB AIRCR: BFHFNMINS Position */ -#define SCB_AIRCR_BFHFNMINS_Msk (1UL << SCB_AIRCR_BFHFNMINS_Pos) /*!< SCB AIRCR: BFHFNMINS Mask */ - -#define SCB_AIRCR_PRIGROUP_Pos 8U /*!< SCB AIRCR: PRIGROUP Position */ -#define SCB_AIRCR_PRIGROUP_Msk (7UL << SCB_AIRCR_PRIGROUP_Pos) /*!< SCB AIRCR: PRIGROUP Mask */ - -#define SCB_AIRCR_IESB_Pos 5U /*!< SCB AIRCR: Implicit ESB Enable Position */ -#define SCB_AIRCR_IESB_Msk (1UL << SCB_AIRCR_IESB_Pos) /*!< SCB AIRCR: Implicit ESB Enable Mask */ - -#define SCB_AIRCR_DIT_Pos 4U /*!< SCB AIRCR: Data Independent Timing Position */ -#define SCB_AIRCR_DIT_Msk (1UL << SCB_AIRCR_DIT_Pos) /*!< SCB AIRCR: Data Independent Timing Mask */ - -#define SCB_AIRCR_SYSRESETREQS_Pos 3U /*!< SCB AIRCR: SYSRESETREQS Position */ -#define SCB_AIRCR_SYSRESETREQS_Msk (1UL << SCB_AIRCR_SYSRESETREQS_Pos) /*!< SCB AIRCR: SYSRESETREQS Mask */ - -#define SCB_AIRCR_SYSRESETREQ_Pos 2U /*!< SCB AIRCR: SYSRESETREQ Position */ -#define SCB_AIRCR_SYSRESETREQ_Msk (1UL << SCB_AIRCR_SYSRESETREQ_Pos) /*!< SCB AIRCR: SYSRESETREQ Mask */ - -#define SCB_AIRCR_VECTCLRACTIVE_Pos 1U /*!< SCB AIRCR: VECTCLRACTIVE Position */ -#define SCB_AIRCR_VECTCLRACTIVE_Msk (1UL << SCB_AIRCR_VECTCLRACTIVE_Pos) /*!< SCB AIRCR: VECTCLRACTIVE Mask */ - -/* SCB System Control Register Definitions */ -#define SCB_SCR_SEVONPEND_Pos 4U /*!< SCB SCR: SEVONPEND Position */ -#define SCB_SCR_SEVONPEND_Msk (1UL << SCB_SCR_SEVONPEND_Pos) /*!< SCB SCR: SEVONPEND Mask */ - -#define SCB_SCR_SLEEPDEEPS_Pos 3U /*!< SCB SCR: SLEEPDEEPS Position */ -#define SCB_SCR_SLEEPDEEPS_Msk (1UL << SCB_SCR_SLEEPDEEPS_Pos) /*!< SCB SCR: SLEEPDEEPS Mask */ - -#define SCB_SCR_SLEEPDEEP_Pos 2U /*!< SCB SCR: SLEEPDEEP Position */ -#define SCB_SCR_SLEEPDEEP_Msk (1UL << SCB_SCR_SLEEPDEEP_Pos) /*!< SCB SCR: SLEEPDEEP Mask */ - -#define SCB_SCR_SLEEPONEXIT_Pos 1U /*!< SCB SCR: SLEEPONEXIT Position */ -#define SCB_SCR_SLEEPONEXIT_Msk (1UL << 
SCB_SCR_SLEEPONEXIT_Pos) /*!< SCB SCR: SLEEPONEXIT Mask */ - -/* SCB Configuration Control Register Definitions */ -#define SCB_CCR_TRD_Pos 20U /*!< SCB CCR: TRD Position */ -#define SCB_CCR_TRD_Msk (1UL << SCB_CCR_TRD_Pos) /*!< SCB CCR: TRD Mask */ - -#define SCB_CCR_LOB_Pos 19U /*!< SCB CCR: LOB Position */ -#define SCB_CCR_LOB_Msk (1UL << SCB_CCR_LOB_Pos) /*!< SCB CCR: LOB Mask */ - -#define SCB_CCR_BP_Pos 18U /*!< SCB CCR: BP Position */ -#define SCB_CCR_BP_Msk (1UL << SCB_CCR_BP_Pos) /*!< SCB CCR: BP Mask */ - -#define SCB_CCR_IC_Pos 17U /*!< SCB CCR: IC Position */ -#define SCB_CCR_IC_Msk (1UL << SCB_CCR_IC_Pos) /*!< SCB CCR: IC Mask */ - -#define SCB_CCR_DC_Pos 16U /*!< SCB CCR: DC Position */ -#define SCB_CCR_DC_Msk (1UL << SCB_CCR_DC_Pos) /*!< SCB CCR: DC Mask */ - -#define SCB_CCR_STKOFHFNMIGN_Pos 10U /*!< SCB CCR: STKOFHFNMIGN Position */ -#define SCB_CCR_STKOFHFNMIGN_Msk (1UL << SCB_CCR_STKOFHFNMIGN_Pos) /*!< SCB CCR: STKOFHFNMIGN Mask */ - -#define SCB_CCR_BFHFNMIGN_Pos 8U /*!< SCB CCR: BFHFNMIGN Position */ -#define SCB_CCR_BFHFNMIGN_Msk (1UL << SCB_CCR_BFHFNMIGN_Pos) /*!< SCB CCR: BFHFNMIGN Mask */ - -#define SCB_CCR_DIV_0_TRP_Pos 4U /*!< SCB CCR: DIV_0_TRP Position */ -#define SCB_CCR_DIV_0_TRP_Msk (1UL << SCB_CCR_DIV_0_TRP_Pos) /*!< SCB CCR: DIV_0_TRP Mask */ - -#define SCB_CCR_UNALIGN_TRP_Pos 3U /*!< SCB CCR: UNALIGN_TRP Position */ -#define SCB_CCR_UNALIGN_TRP_Msk (1UL << SCB_CCR_UNALIGN_TRP_Pos) /*!< SCB CCR: UNALIGN_TRP Mask */ - -#define SCB_CCR_USERSETMPEND_Pos 1U /*!< SCB CCR: USERSETMPEND Position */ -#define SCB_CCR_USERSETMPEND_Msk (1UL << SCB_CCR_USERSETMPEND_Pos) /*!< SCB CCR: USERSETMPEND Mask */ - -/* SCB System Handler Control and State Register Definitions */ -#define SCB_SHCSR_HARDFAULTPENDED_Pos 21U /*!< SCB SHCSR: HARDFAULTPENDED Position */ -#define SCB_SHCSR_HARDFAULTPENDED_Msk (1UL << SCB_SHCSR_HARDFAULTPENDED_Pos) /*!< SCB SHCSR: HARDFAULTPENDED Mask */ - -#define SCB_SHCSR_SECUREFAULTPENDED_Pos 20U /*!< SCB SHCSR: 
SECUREFAULTPENDED Position */ -#define SCB_SHCSR_SECUREFAULTPENDED_Msk (1UL << SCB_SHCSR_SECUREFAULTPENDED_Pos) /*!< SCB SHCSR: SECUREFAULTPENDED Mask */ - -#define SCB_SHCSR_SECUREFAULTENA_Pos 19U /*!< SCB SHCSR: SECUREFAULTENA Position */ -#define SCB_SHCSR_SECUREFAULTENA_Msk (1UL << SCB_SHCSR_SECUREFAULTENA_Pos) /*!< SCB SHCSR: SECUREFAULTENA Mask */ - -#define SCB_SHCSR_USGFAULTENA_Pos 18U /*!< SCB SHCSR: USGFAULTENA Position */ -#define SCB_SHCSR_USGFAULTENA_Msk (1UL << SCB_SHCSR_USGFAULTENA_Pos) /*!< SCB SHCSR: USGFAULTENA Mask */ - -#define SCB_SHCSR_BUSFAULTENA_Pos 17U /*!< SCB SHCSR: BUSFAULTENA Position */ -#define SCB_SHCSR_BUSFAULTENA_Msk (1UL << SCB_SHCSR_BUSFAULTENA_Pos) /*!< SCB SHCSR: BUSFAULTENA Mask */ - -#define SCB_SHCSR_MEMFAULTENA_Pos 16U /*!< SCB SHCSR: MEMFAULTENA Position */ -#define SCB_SHCSR_MEMFAULTENA_Msk (1UL << SCB_SHCSR_MEMFAULTENA_Pos) /*!< SCB SHCSR: MEMFAULTENA Mask */ - -#define SCB_SHCSR_SVCALLPENDED_Pos 15U /*!< SCB SHCSR: SVCALLPENDED Position */ -#define SCB_SHCSR_SVCALLPENDED_Msk (1UL << SCB_SHCSR_SVCALLPENDED_Pos) /*!< SCB SHCSR: SVCALLPENDED Mask */ - -#define SCB_SHCSR_BUSFAULTPENDED_Pos 14U /*!< SCB SHCSR: BUSFAULTPENDED Position */ -#define SCB_SHCSR_BUSFAULTPENDED_Msk (1UL << SCB_SHCSR_BUSFAULTPENDED_Pos) /*!< SCB SHCSR: BUSFAULTPENDED Mask */ - -#define SCB_SHCSR_MEMFAULTPENDED_Pos 13U /*!< SCB SHCSR: MEMFAULTPENDED Position */ -#define SCB_SHCSR_MEMFAULTPENDED_Msk (1UL << SCB_SHCSR_MEMFAULTPENDED_Pos) /*!< SCB SHCSR: MEMFAULTPENDED Mask */ - -#define SCB_SHCSR_USGFAULTPENDED_Pos 12U /*!< SCB SHCSR: USGFAULTPENDED Position */ -#define SCB_SHCSR_USGFAULTPENDED_Msk (1UL << SCB_SHCSR_USGFAULTPENDED_Pos) /*!< SCB SHCSR: USGFAULTPENDED Mask */ - -#define SCB_SHCSR_SYSTICKACT_Pos 11U /*!< SCB SHCSR: SYSTICKACT Position */ -#define SCB_SHCSR_SYSTICKACT_Msk (1UL << SCB_SHCSR_SYSTICKACT_Pos) /*!< SCB SHCSR: SYSTICKACT Mask */ - -#define SCB_SHCSR_PENDSVACT_Pos 10U /*!< SCB SHCSR: PENDSVACT Position */ -#define 
SCB_SHCSR_PENDSVACT_Msk (1UL << SCB_SHCSR_PENDSVACT_Pos) /*!< SCB SHCSR: PENDSVACT Mask */ - -#define SCB_SHCSR_MONITORACT_Pos 8U /*!< SCB SHCSR: MONITORACT Position */ -#define SCB_SHCSR_MONITORACT_Msk (1UL << SCB_SHCSR_MONITORACT_Pos) /*!< SCB SHCSR: MONITORACT Mask */ - -#define SCB_SHCSR_SVCALLACT_Pos 7U /*!< SCB SHCSR: SVCALLACT Position */ -#define SCB_SHCSR_SVCALLACT_Msk (1UL << SCB_SHCSR_SVCALLACT_Pos) /*!< SCB SHCSR: SVCALLACT Mask */ - -#define SCB_SHCSR_NMIACT_Pos 5U /*!< SCB SHCSR: NMIACT Position */ -#define SCB_SHCSR_NMIACT_Msk (1UL << SCB_SHCSR_NMIACT_Pos) /*!< SCB SHCSR: NMIACT Mask */ - -#define SCB_SHCSR_SECUREFAULTACT_Pos 4U /*!< SCB SHCSR: SECUREFAULTACT Position */ -#define SCB_SHCSR_SECUREFAULTACT_Msk (1UL << SCB_SHCSR_SECUREFAULTACT_Pos) /*!< SCB SHCSR: SECUREFAULTACT Mask */ - -#define SCB_SHCSR_USGFAULTACT_Pos 3U /*!< SCB SHCSR: USGFAULTACT Position */ -#define SCB_SHCSR_USGFAULTACT_Msk (1UL << SCB_SHCSR_USGFAULTACT_Pos) /*!< SCB SHCSR: USGFAULTACT Mask */ - -#define SCB_SHCSR_HARDFAULTACT_Pos 2U /*!< SCB SHCSR: HARDFAULTACT Position */ -#define SCB_SHCSR_HARDFAULTACT_Msk (1UL << SCB_SHCSR_HARDFAULTACT_Pos) /*!< SCB SHCSR: HARDFAULTACT Mask */ - -#define SCB_SHCSR_BUSFAULTACT_Pos 1U /*!< SCB SHCSR: BUSFAULTACT Position */ -#define SCB_SHCSR_BUSFAULTACT_Msk (1UL << SCB_SHCSR_BUSFAULTACT_Pos) /*!< SCB SHCSR: BUSFAULTACT Mask */ - -#define SCB_SHCSR_MEMFAULTACT_Pos 0U /*!< SCB SHCSR: MEMFAULTACT Position */ -#define SCB_SHCSR_MEMFAULTACT_Msk (1UL /*<< SCB_SHCSR_MEMFAULTACT_Pos*/) /*!< SCB SHCSR: MEMFAULTACT Mask */ - -/* SCB Configurable Fault Status Register Definitions */ -#define SCB_CFSR_USGFAULTSR_Pos 16U /*!< SCB CFSR: Usage Fault Status Register Position */ -#define SCB_CFSR_USGFAULTSR_Msk (0xFFFFUL << SCB_CFSR_USGFAULTSR_Pos) /*!< SCB CFSR: Usage Fault Status Register Mask */ - -#define SCB_CFSR_BUSFAULTSR_Pos 8U /*!< SCB CFSR: Bus Fault Status Register Position */ -#define SCB_CFSR_BUSFAULTSR_Msk (0xFFUL << SCB_CFSR_BUSFAULTSR_Pos) 
/*!< SCB CFSR: Bus Fault Status Register Mask */ - -#define SCB_CFSR_MEMFAULTSR_Pos 0U /*!< SCB CFSR: Memory Manage Fault Status Register Position */ -#define SCB_CFSR_MEMFAULTSR_Msk (0xFFUL /*<< SCB_CFSR_MEMFAULTSR_Pos*/) /*!< SCB CFSR: Memory Manage Fault Status Register Mask */ - -/* MemManage Fault Status Register (part of SCB Configurable Fault Status Register) */ -#define SCB_CFSR_MMARVALID_Pos (SCB_CFSR_MEMFAULTSR_Pos + 7U) /*!< SCB CFSR (MMFSR): MMARVALID Position */ -#define SCB_CFSR_MMARVALID_Msk (1UL << SCB_CFSR_MMARVALID_Pos) /*!< SCB CFSR (MMFSR): MMARVALID Mask */ - -#define SCB_CFSR_MLSPERR_Pos (SCB_CFSR_MEMFAULTSR_Pos + 5U) /*!< SCB CFSR (MMFSR): MLSPERR Position */ -#define SCB_CFSR_MLSPERR_Msk (1UL << SCB_CFSR_MLSPERR_Pos) /*!< SCB CFSR (MMFSR): MLSPERR Mask */ - -#define SCB_CFSR_MSTKERR_Pos (SCB_CFSR_MEMFAULTSR_Pos + 4U) /*!< SCB CFSR (MMFSR): MSTKERR Position */ -#define SCB_CFSR_MSTKERR_Msk (1UL << SCB_CFSR_MSTKERR_Pos) /*!< SCB CFSR (MMFSR): MSTKERR Mask */ - -#define SCB_CFSR_MUNSTKERR_Pos (SCB_CFSR_MEMFAULTSR_Pos + 3U) /*!< SCB CFSR (MMFSR): MUNSTKERR Position */ -#define SCB_CFSR_MUNSTKERR_Msk (1UL << SCB_CFSR_MUNSTKERR_Pos) /*!< SCB CFSR (MMFSR): MUNSTKERR Mask */ - -#define SCB_CFSR_DACCVIOL_Pos (SCB_CFSR_MEMFAULTSR_Pos + 1U) /*!< SCB CFSR (MMFSR): DACCVIOL Position */ -#define SCB_CFSR_DACCVIOL_Msk (1UL << SCB_CFSR_DACCVIOL_Pos) /*!< SCB CFSR (MMFSR): DACCVIOL Mask */ - -#define SCB_CFSR_IACCVIOL_Pos (SCB_CFSR_MEMFAULTSR_Pos + 0U) /*!< SCB CFSR (MMFSR): IACCVIOL Position */ -#define SCB_CFSR_IACCVIOL_Msk (1UL /*<< SCB_CFSR_IACCVIOL_Pos*/) /*!< SCB CFSR (MMFSR): IACCVIOL Mask */ - -/* BusFault Status Register (part of SCB Configurable Fault Status Register) */ -#define SCB_CFSR_BFARVALID_Pos (SCB_CFSR_BUSFAULTSR_Pos + 7U) /*!< SCB CFSR (BFSR): BFARVALID Position */ -#define SCB_CFSR_BFARVALID_Msk (1UL << SCB_CFSR_BFARVALID_Pos) /*!< SCB CFSR (BFSR): BFARVALID Mask */ - -#define SCB_CFSR_LSPERR_Pos (SCB_CFSR_BUSFAULTSR_Pos + 5U) /*!< SCB 
CFSR (BFSR): LSPERR Position */ -#define SCB_CFSR_LSPERR_Msk (1UL << SCB_CFSR_LSPERR_Pos) /*!< SCB CFSR (BFSR): LSPERR Mask */ - -#define SCB_CFSR_STKERR_Pos (SCB_CFSR_BUSFAULTSR_Pos + 4U) /*!< SCB CFSR (BFSR): STKERR Position */ -#define SCB_CFSR_STKERR_Msk (1UL << SCB_CFSR_STKERR_Pos) /*!< SCB CFSR (BFSR): STKERR Mask */ - -#define SCB_CFSR_UNSTKERR_Pos (SCB_CFSR_BUSFAULTSR_Pos + 3U) /*!< SCB CFSR (BFSR): UNSTKERR Position */ -#define SCB_CFSR_UNSTKERR_Msk (1UL << SCB_CFSR_UNSTKERR_Pos) /*!< SCB CFSR (BFSR): UNSTKERR Mask */ - -#define SCB_CFSR_IMPRECISERR_Pos (SCB_CFSR_BUSFAULTSR_Pos + 2U) /*!< SCB CFSR (BFSR): IMPRECISERR Position */ -#define SCB_CFSR_IMPRECISERR_Msk (1UL << SCB_CFSR_IMPRECISERR_Pos) /*!< SCB CFSR (BFSR): IMPRECISERR Mask */ - -#define SCB_CFSR_PRECISERR_Pos (SCB_CFSR_BUSFAULTSR_Pos + 1U) /*!< SCB CFSR (BFSR): PRECISERR Position */ -#define SCB_CFSR_PRECISERR_Msk (1UL << SCB_CFSR_PRECISERR_Pos) /*!< SCB CFSR (BFSR): PRECISERR Mask */ - -#define SCB_CFSR_IBUSERR_Pos (SCB_CFSR_BUSFAULTSR_Pos + 0U) /*!< SCB CFSR (BFSR): IBUSERR Position */ -#define SCB_CFSR_IBUSERR_Msk (1UL << SCB_CFSR_IBUSERR_Pos) /*!< SCB CFSR (BFSR): IBUSERR Mask */ - -/* UsageFault Status Register (part of SCB Configurable Fault Status Register) */ -#define SCB_CFSR_DIVBYZERO_Pos (SCB_CFSR_USGFAULTSR_Pos + 9U) /*!< SCB CFSR (UFSR): DIVBYZERO Position */ -#define SCB_CFSR_DIVBYZERO_Msk (1UL << SCB_CFSR_DIVBYZERO_Pos) /*!< SCB CFSR (UFSR): DIVBYZERO Mask */ - -#define SCB_CFSR_UNALIGNED_Pos (SCB_CFSR_USGFAULTSR_Pos + 8U) /*!< SCB CFSR (UFSR): UNALIGNED Position */ -#define SCB_CFSR_UNALIGNED_Msk (1UL << SCB_CFSR_UNALIGNED_Pos) /*!< SCB CFSR (UFSR): UNALIGNED Mask */ - -#define SCB_CFSR_STKOF_Pos (SCB_CFSR_USGFAULTSR_Pos + 4U) /*!< SCB CFSR (UFSR): STKOF Position */ -#define SCB_CFSR_STKOF_Msk (1UL << SCB_CFSR_STKOF_Pos) /*!< SCB CFSR (UFSR): STKOF Mask */ - -#define SCB_CFSR_NOCP_Pos (SCB_CFSR_USGFAULTSR_Pos + 3U) /*!< SCB CFSR (UFSR): NOCP Position */ -#define SCB_CFSR_NOCP_Msk 
(1UL << SCB_CFSR_NOCP_Pos) /*!< SCB CFSR (UFSR): NOCP Mask */ - -#define SCB_CFSR_INVPC_Pos (SCB_CFSR_USGFAULTSR_Pos + 2U) /*!< SCB CFSR (UFSR): INVPC Position */ -#define SCB_CFSR_INVPC_Msk (1UL << SCB_CFSR_INVPC_Pos) /*!< SCB CFSR (UFSR): INVPC Mask */ - -#define SCB_CFSR_INVSTATE_Pos (SCB_CFSR_USGFAULTSR_Pos + 1U) /*!< SCB CFSR (UFSR): INVSTATE Position */ -#define SCB_CFSR_INVSTATE_Msk (1UL << SCB_CFSR_INVSTATE_Pos) /*!< SCB CFSR (UFSR): INVSTATE Mask */ - -#define SCB_CFSR_UNDEFINSTR_Pos (SCB_CFSR_USGFAULTSR_Pos + 0U) /*!< SCB CFSR (UFSR): UNDEFINSTR Position */ -#define SCB_CFSR_UNDEFINSTR_Msk (1UL << SCB_CFSR_UNDEFINSTR_Pos) /*!< SCB CFSR (UFSR): UNDEFINSTR Mask */ - -/* SCB Hard Fault Status Register Definitions */ -#define SCB_HFSR_DEBUGEVT_Pos 31U /*!< SCB HFSR: DEBUGEVT Position */ -#define SCB_HFSR_DEBUGEVT_Msk (1UL << SCB_HFSR_DEBUGEVT_Pos) /*!< SCB HFSR: DEBUGEVT Mask */ - -#define SCB_HFSR_FORCED_Pos 30U /*!< SCB HFSR: FORCED Position */ -#define SCB_HFSR_FORCED_Msk (1UL << SCB_HFSR_FORCED_Pos) /*!< SCB HFSR: FORCED Mask */ - -#define SCB_HFSR_VECTTBL_Pos 1U /*!< SCB HFSR: VECTTBL Position */ -#define SCB_HFSR_VECTTBL_Msk (1UL << SCB_HFSR_VECTTBL_Pos) /*!< SCB HFSR: VECTTBL Mask */ - -/* SCB Debug Fault Status Register Definitions */ -#define SCB_DFSR_PMU_Pos 5U /*!< SCB DFSR: PMU Position */ -#define SCB_DFSR_PMU_Msk (1UL << SCB_DFSR_PMU_Pos) /*!< SCB DFSR: PMU Mask */ - -#define SCB_DFSR_EXTERNAL_Pos 4U /*!< SCB DFSR: EXTERNAL Position */ -#define SCB_DFSR_EXTERNAL_Msk (1UL << SCB_DFSR_EXTERNAL_Pos) /*!< SCB DFSR: EXTERNAL Mask */ - -#define SCB_DFSR_VCATCH_Pos 3U /*!< SCB DFSR: VCATCH Position */ -#define SCB_DFSR_VCATCH_Msk (1UL << SCB_DFSR_VCATCH_Pos) /*!< SCB DFSR: VCATCH Mask */ - -#define SCB_DFSR_DWTTRAP_Pos 2U /*!< SCB DFSR: DWTTRAP Position */ -#define SCB_DFSR_DWTTRAP_Msk (1UL << SCB_DFSR_DWTTRAP_Pos) /*!< SCB DFSR: DWTTRAP Mask */ - -#define SCB_DFSR_BKPT_Pos 1U /*!< SCB DFSR: BKPT Position */ -#define SCB_DFSR_BKPT_Msk (1UL << 
SCB_DFSR_BKPT_Pos) /*!< SCB DFSR: BKPT Mask */ - -#define SCB_DFSR_HALTED_Pos 0U /*!< SCB DFSR: HALTED Position */ -#define SCB_DFSR_HALTED_Msk (1UL /*<< SCB_DFSR_HALTED_Pos*/) /*!< SCB DFSR: HALTED Mask */ - -/* SCB Non-Secure Access Control Register Definitions */ -#define SCB_NSACR_CP11_Pos 11U /*!< SCB NSACR: CP11 Position */ -#define SCB_NSACR_CP11_Msk (1UL << SCB_NSACR_CP11_Pos) /*!< SCB NSACR: CP11 Mask */ - -#define SCB_NSACR_CP10_Pos 10U /*!< SCB NSACR: CP10 Position */ -#define SCB_NSACR_CP10_Msk (1UL << SCB_NSACR_CP10_Pos) /*!< SCB NSACR: CP10 Mask */ - -#define SCB_NSACR_CP7_Pos 7U /*!< SCB NSACR: CP7 Position */ -#define SCB_NSACR_CP7_Msk (1UL << SCB_NSACR_CP7_Pos) /*!< SCB NSACR: CP7 Mask */ - -#define SCB_NSACR_CP6_Pos 6U /*!< SCB NSACR: CP6 Position */ -#define SCB_NSACR_CP6_Msk (1UL << SCB_NSACR_CP6_Pos) /*!< SCB NSACR: CP6 Mask */ - -#define SCB_NSACR_CP5_Pos 5U /*!< SCB NSACR: CP5 Position */ -#define SCB_NSACR_CP5_Msk (1UL << SCB_NSACR_CP5_Pos) /*!< SCB NSACR: CP5 Mask */ - -#define SCB_NSACR_CP4_Pos 4U /*!< SCB NSACR: CP4 Position */ -#define SCB_NSACR_CP4_Msk (1UL << SCB_NSACR_CP4_Pos) /*!< SCB NSACR: CP4 Mask */ - -#define SCB_NSACR_CP3_Pos 3U /*!< SCB NSACR: CP3 Position */ -#define SCB_NSACR_CP3_Msk (1UL << SCB_NSACR_CP3_Pos) /*!< SCB NSACR: CP3 Mask */ - -#define SCB_NSACR_CP2_Pos 2U /*!< SCB NSACR: CP2 Position */ -#define SCB_NSACR_CP2_Msk (1UL << SCB_NSACR_CP2_Pos) /*!< SCB NSACR: CP2 Mask */ - -#define SCB_NSACR_CP1_Pos 1U /*!< SCB NSACR: CP1 Position */ -#define SCB_NSACR_CP1_Msk (1UL << SCB_NSACR_CP1_Pos) /*!< SCB NSACR: CP1 Mask */ - -#define SCB_NSACR_CP0_Pos 0U /*!< SCB NSACR: CP0 Position */ -#define SCB_NSACR_CP0_Msk (1UL /*<< SCB_NSACR_CP0_Pos*/) /*!< SCB NSACR: CP0 Mask */ - -/* SCB Debug Feature Register 0 Definitions */ -#define SCB_ID_DFR_UDE_Pos 28U /*!< SCB ID_DFR: UDE Position */ -#define SCB_ID_DFR_UDE_Msk (0xFUL << SCB_ID_DFR_UDE_Pos) /*!< SCB ID_DFR: UDE Mask */ - -#define SCB_ID_DFR_MProfDbg_Pos 20U /*!< SCB ID_DFR: 
MProfDbg Position */ -#define SCB_ID_DFR_MProfDbg_Msk (0xFUL << SCB_ID_DFR_MProfDbg_Pos) /*!< SCB ID_DFR: MProfDbg Mask */ - -/* SCB Cache Level ID Register Definitions */ -#define SCB_CLIDR_LOUU_Pos 27U /*!< SCB CLIDR: LoUU Position */ -#define SCB_CLIDR_LOUU_Msk (7UL << SCB_CLIDR_LOUU_Pos) /*!< SCB CLIDR: LoUU Mask */ - -#define SCB_CLIDR_LOC_Pos 24U /*!< SCB CLIDR: LoC Position */ -#define SCB_CLIDR_LOC_Msk (7UL << SCB_CLIDR_LOC_Pos) /*!< SCB CLIDR: LoC Mask */ - -/* SCB Cache Type Register Definitions */ -#define SCB_CTR_FORMAT_Pos 29U /*!< SCB CTR: Format Position */ -#define SCB_CTR_FORMAT_Msk (7UL << SCB_CTR_FORMAT_Pos) /*!< SCB CTR: Format Mask */ - -#define SCB_CTR_CWG_Pos 24U /*!< SCB CTR: CWG Position */ -#define SCB_CTR_CWG_Msk (0xFUL << SCB_CTR_CWG_Pos) /*!< SCB CTR: CWG Mask */ - -#define SCB_CTR_ERG_Pos 20U /*!< SCB CTR: ERG Position */ -#define SCB_CTR_ERG_Msk (0xFUL << SCB_CTR_ERG_Pos) /*!< SCB CTR: ERG Mask */ - -#define SCB_CTR_DMINLINE_Pos 16U /*!< SCB CTR: DminLine Position */ -#define SCB_CTR_DMINLINE_Msk (0xFUL << SCB_CTR_DMINLINE_Pos) /*!< SCB CTR: DminLine Mask */ - -#define SCB_CTR_IMINLINE_Pos 0U /*!< SCB CTR: ImInLine Position */ -#define SCB_CTR_IMINLINE_Msk (0xFUL /*<< SCB_CTR_IMINLINE_Pos*/) /*!< SCB CTR: ImInLine Mask */ - -/* SCB Cache Size ID Register Definitions */ -#define SCB_CCSIDR_WT_Pos 31U /*!< SCB CCSIDR: WT Position */ -#define SCB_CCSIDR_WT_Msk (1UL << SCB_CCSIDR_WT_Pos) /*!< SCB CCSIDR: WT Mask */ - -#define SCB_CCSIDR_WB_Pos 30U /*!< SCB CCSIDR: WB Position */ -#define SCB_CCSIDR_WB_Msk (1UL << SCB_CCSIDR_WB_Pos) /*!< SCB CCSIDR: WB Mask */ - -#define SCB_CCSIDR_RA_Pos 29U /*!< SCB CCSIDR: RA Position */ -#define SCB_CCSIDR_RA_Msk (1UL << SCB_CCSIDR_RA_Pos) /*!< SCB CCSIDR: RA Mask */ - -#define SCB_CCSIDR_WA_Pos 28U /*!< SCB CCSIDR: WA Position */ -#define SCB_CCSIDR_WA_Msk (1UL << SCB_CCSIDR_WA_Pos) /*!< SCB CCSIDR: WA Mask */ - -#define SCB_CCSIDR_NUMSETS_Pos 13U /*!< SCB CCSIDR: NumSets Position */ -#define 
SCB_CCSIDR_NUMSETS_Msk (0x7FFFUL << SCB_CCSIDR_NUMSETS_Pos) /*!< SCB CCSIDR: NumSets Mask */ - -#define SCB_CCSIDR_ASSOCIATIVITY_Pos 3U /*!< SCB CCSIDR: Associativity Position */ -#define SCB_CCSIDR_ASSOCIATIVITY_Msk (0x3FFUL << SCB_CCSIDR_ASSOCIATIVITY_Pos) /*!< SCB CCSIDR: Associativity Mask */ - -#define SCB_CCSIDR_LINESIZE_Pos 0U /*!< SCB CCSIDR: LineSize Position */ -#define SCB_CCSIDR_LINESIZE_Msk (7UL /*<< SCB_CCSIDR_LINESIZE_Pos*/) /*!< SCB CCSIDR: LineSize Mask */ - -/* SCB Cache Size Selection Register Definitions */ -#define SCB_CSSELR_LEVEL_Pos 1U /*!< SCB CSSELR: Level Position */ -#define SCB_CSSELR_LEVEL_Msk (7UL << SCB_CSSELR_LEVEL_Pos) /*!< SCB CSSELR: Level Mask */ - -#define SCB_CSSELR_IND_Pos 0U /*!< SCB CSSELR: InD Position */ -#define SCB_CSSELR_IND_Msk (1UL /*<< SCB_CSSELR_IND_Pos*/) /*!< SCB CSSELR: InD Mask */ - -/* SCB Software Triggered Interrupt Register Definitions */ -#define SCB_STIR_INTID_Pos 0U /*!< SCB STIR: INTID Position */ -#define SCB_STIR_INTID_Msk (0x1FFUL /*<< SCB_STIR_INTID_Pos*/) /*!< SCB STIR: INTID Mask */ - -/* SCB RAS Fault Status Register Definitions */ -#define SCB_RFSR_V_Pos 31U /*!< SCB RFSR: V Position */ -#define SCB_RFSR_V_Msk (1UL << SCB_RFSR_V_Pos) /*!< SCB RFSR: V Mask */ - -#define SCB_RFSR_IS_Pos 16U /*!< SCB RFSR: IS Position */ -#define SCB_RFSR_IS_Msk (0x7FFFUL << SCB_RFSR_IS_Pos) /*!< SCB RFSR: IS Mask */ - -#define SCB_RFSR_UET_Pos 0U /*!< SCB RFSR: UET Position */ -#define SCB_RFSR_UET_Msk (3UL /*<< SCB_RFSR_UET_Pos*/) /*!< SCB RFSR: UET Mask */ - -/* SCB D-Cache Invalidate by Set-way Register Definitions */ -#define SCB_DCISW_WAY_Pos 30U /*!< SCB DCISW: Way Position */ -#define SCB_DCISW_WAY_Msk (3UL << SCB_DCISW_WAY_Pos) /*!< SCB DCISW: Way Mask */ - -#define SCB_DCISW_SET_Pos 5U /*!< SCB DCISW: Set Position */ -#define SCB_DCISW_SET_Msk (0x1FFUL << SCB_DCISW_SET_Pos) /*!< SCB DCISW: Set Mask */ - -/* SCB D-Cache Clean by Set-way Register Definitions */ -#define SCB_DCCSW_WAY_Pos 30U /*!< SCB DCCSW: 
Way Position */ -#define SCB_DCCSW_WAY_Msk (3UL << SCB_DCCSW_WAY_Pos) /*!< SCB DCCSW: Way Mask */ - -#define SCB_DCCSW_SET_Pos 5U /*!< SCB DCCSW: Set Position */ -#define SCB_DCCSW_SET_Msk (0x1FFUL << SCB_DCCSW_SET_Pos) /*!< SCB DCCSW: Set Mask */ - -/* SCB D-Cache Clean and Invalidate by Set-way Register Definitions */ -#define SCB_DCCISW_WAY_Pos 30U /*!< SCB DCCISW: Way Position */ -#define SCB_DCCISW_WAY_Msk (3UL << SCB_DCCISW_WAY_Pos) /*!< SCB DCCISW: Way Mask */ - -#define SCB_DCCISW_SET_Pos 5U /*!< SCB DCCISW: Set Position */ -#define SCB_DCCISW_SET_Msk (0x1FFUL << SCB_DCCISW_SET_Pos) /*!< SCB DCCISW: Set Mask */ - -/*@} end of group CMSIS_SCB */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_SCnSCB System Controls not in SCB (SCnSCB) - \brief Type definitions for the System Control and ID Register not in the SCB - @{ - */ - -/** - \brief Structure type to access the System Control and ID Register not in the SCB. - */ -typedef struct -{ - uint32_t RESERVED0[1U]; - __IM uint32_t ICTR; /*!< Offset: 0x004 (R/ ) Interrupt Controller Type Register */ - __IOM uint32_t ACTLR; /*!< Offset: 0x008 (R/W) Auxiliary Control Register */ - __IOM uint32_t CPPWR; /*!< Offset: 0x00C (R/W) Coprocessor Power Control Register */ -} SCnSCB_Type; - -/* Interrupt Controller Type Register Definitions */ -#define SCnSCB_ICTR_INTLINESNUM_Pos 0U /*!< ICTR: INTLINESNUM Position */ -#define SCnSCB_ICTR_INTLINESNUM_Msk (0xFUL /*<< SCnSCB_ICTR_INTLINESNUM_Pos*/) /*!< ICTR: INTLINESNUM Mask */ - -/*@} end of group CMSIS_SCnotSCB */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_SysTick System Tick Timer (SysTick) - \brief Type definitions for the System Timer Registers. - @{ - */ - -/** - \brief Structure type to access the System Timer (SysTick). 
- */ -typedef struct -{ - __IOM uint32_t CTRL; /*!< Offset: 0x000 (R/W) SysTick Control and Status Register */ - __IOM uint32_t LOAD; /*!< Offset: 0x004 (R/W) SysTick Reload Value Register */ - __IOM uint32_t VAL; /*!< Offset: 0x008 (R/W) SysTick Current Value Register */ - __IM uint32_t CALIB; /*!< Offset: 0x00C (R/ ) SysTick Calibration Register */ -} SysTick_Type; - -/* SysTick Control / Status Register Definitions */ -#define SysTick_CTRL_COUNTFLAG_Pos 16U /*!< SysTick CTRL: COUNTFLAG Position */ -#define SysTick_CTRL_COUNTFLAG_Msk (1UL << SysTick_CTRL_COUNTFLAG_Pos) /*!< SysTick CTRL: COUNTFLAG Mask */ - -#define SysTick_CTRL_CLKSOURCE_Pos 2U /*!< SysTick CTRL: CLKSOURCE Position */ -#define SysTick_CTRL_CLKSOURCE_Msk (1UL << SysTick_CTRL_CLKSOURCE_Pos) /*!< SysTick CTRL: CLKSOURCE Mask */ - -#define SysTick_CTRL_TICKINT_Pos 1U /*!< SysTick CTRL: TICKINT Position */ -#define SysTick_CTRL_TICKINT_Msk (1UL << SysTick_CTRL_TICKINT_Pos) /*!< SysTick CTRL: TICKINT Mask */ - -#define SysTick_CTRL_ENABLE_Pos 0U /*!< SysTick CTRL: ENABLE Position */ -#define SysTick_CTRL_ENABLE_Msk (1UL /*<< SysTick_CTRL_ENABLE_Pos*/) /*!< SysTick CTRL: ENABLE Mask */ - -/* SysTick Reload Register Definitions */ -#define SysTick_LOAD_RELOAD_Pos 0U /*!< SysTick LOAD: RELOAD Position */ -#define SysTick_LOAD_RELOAD_Msk (0xFFFFFFUL /*<< SysTick_LOAD_RELOAD_Pos*/) /*!< SysTick LOAD: RELOAD Mask */ - -/* SysTick Current Register Definitions */ -#define SysTick_VAL_CURRENT_Pos 0U /*!< SysTick VAL: CURRENT Position */ -#define SysTick_VAL_CURRENT_Msk (0xFFFFFFUL /*<< SysTick_VAL_CURRENT_Pos*/) /*!< SysTick VAL: CURRENT Mask */ - -/* SysTick Calibration Register Definitions */ -#define SysTick_CALIB_NOREF_Pos 31U /*!< SysTick CALIB: NOREF Position */ -#define SysTick_CALIB_NOREF_Msk (1UL << SysTick_CALIB_NOREF_Pos) /*!< SysTick CALIB: NOREF Mask */ - -#define SysTick_CALIB_SKEW_Pos 30U /*!< SysTick CALIB: SKEW Position */ -#define SysTick_CALIB_SKEW_Msk (1UL << SysTick_CALIB_SKEW_Pos) /*!< 
SysTick CALIB: SKEW Mask */ - -#define SysTick_CALIB_TENMS_Pos 0U /*!< SysTick CALIB: TENMS Position */ -#define SysTick_CALIB_TENMS_Msk (0xFFFFFFUL /*<< SysTick_CALIB_TENMS_Pos*/) /*!< SysTick CALIB: TENMS Mask */ - -/*@} end of group CMSIS_SysTick */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_ITM Instrumentation Trace Macrocell (ITM) - \brief Type definitions for the Instrumentation Trace Macrocell (ITM) - @{ - */ - -/** - \brief Structure type to access the Instrumentation Trace Macrocell Register (ITM). - */ -typedef struct -{ - __OM union - { - __OM uint8_t u8; /*!< Offset: 0x000 ( /W) ITM Stimulus Port 8-bit */ - __OM uint16_t u16; /*!< Offset: 0x000 ( /W) ITM Stimulus Port 16-bit */ - __OM uint32_t u32; /*!< Offset: 0x000 ( /W) ITM Stimulus Port 32-bit */ - } PORT [32U]; /*!< Offset: 0x000 ( /W) ITM Stimulus Port Registers */ - uint32_t RESERVED0[864U]; - __IOM uint32_t TER; /*!< Offset: 0xE00 (R/W) ITM Trace Enable Register */ - uint32_t RESERVED1[15U]; - __IOM uint32_t TPR; /*!< Offset: 0xE40 (R/W) ITM Trace Privilege Register */ - uint32_t RESERVED2[15U]; - __IOM uint32_t TCR; /*!< Offset: 0xE80 (R/W) ITM Trace Control Register */ - uint32_t RESERVED3[32U]; - uint32_t RESERVED4[43U]; - __OM uint32_t LAR; /*!< Offset: 0xFB0 ( /W) ITM Lock Access Register */ - __IM uint32_t LSR; /*!< Offset: 0xFB4 (R/ ) ITM Lock Status Register */ - uint32_t RESERVED5[1U]; - __IM uint32_t DEVARCH; /*!< Offset: 0xFBC (R/ ) ITM Device Architecture Register */ - uint32_t RESERVED6[3U]; - __IM uint32_t DEVTYPE; /*!< Offset: 0xFCC (R/ ) ITM Device Type Register */ - __IM uint32_t PID4; /*!< Offset: 0xFD0 (R/ ) ITM Peripheral Identification Register #4 */ - __IM uint32_t PID5; /*!< Offset: 0xFD4 (R/ ) ITM Peripheral Identification Register #5 */ - __IM uint32_t PID6; /*!< Offset: 0xFD8 (R/ ) ITM Peripheral Identification Register #6 */ - __IM uint32_t PID7; /*!< Offset: 0xFDC (R/ ) ITM Peripheral Identification Register #7 */ - __IM uint32_t PID0; /*!< Offset: 
0xFE0 (R/ ) ITM Peripheral Identification Register #0 */ - __IM uint32_t PID1; /*!< Offset: 0xFE4 (R/ ) ITM Peripheral Identification Register #1 */ - __IM uint32_t PID2; /*!< Offset: 0xFE8 (R/ ) ITM Peripheral Identification Register #2 */ - __IM uint32_t PID3; /*!< Offset: 0xFEC (R/ ) ITM Peripheral Identification Register #3 */ - __IM uint32_t CID0; /*!< Offset: 0xFF0 (R/ ) ITM Component Identification Register #0 */ - __IM uint32_t CID1; /*!< Offset: 0xFF4 (R/ ) ITM Component Identification Register #1 */ - __IM uint32_t CID2; /*!< Offset: 0xFF8 (R/ ) ITM Component Identification Register #2 */ - __IM uint32_t CID3; /*!< Offset: 0xFFC (R/ ) ITM Component Identification Register #3 */ -} ITM_Type; - -/* ITM Stimulus Port Register Definitions */ -#define ITM_STIM_DISABLED_Pos 1U /*!< ITM STIM: DISABLED Position */ -#define ITM_STIM_DISABLED_Msk (0x1UL << ITM_STIM_DISABLED_Pos) /*!< ITM STIM: DISABLED Mask */ - -#define ITM_STIM_FIFOREADY_Pos 0U /*!< ITM STIM: FIFOREADY Position */ -#define ITM_STIM_FIFOREADY_Msk (0x1UL /*<< ITM_STIM_FIFOREADY_Pos*/) /*!< ITM STIM: FIFOREADY Mask */ - -/* ITM Trace Privilege Register Definitions */ -#define ITM_TPR_PRIVMASK_Pos 0U /*!< ITM TPR: PRIVMASK Position */ -#define ITM_TPR_PRIVMASK_Msk (0xFUL /*<< ITM_TPR_PRIVMASK_Pos*/) /*!< ITM TPR: PRIVMASK Mask */ - -/* ITM Trace Control Register Definitions */ -#define ITM_TCR_BUSY_Pos 23U /*!< ITM TCR: BUSY Position */ -#define ITM_TCR_BUSY_Msk (1UL << ITM_TCR_BUSY_Pos) /*!< ITM TCR: BUSY Mask */ - -#define ITM_TCR_TRACEBUSID_Pos 16U /*!< ITM TCR: ATBID Position */ -#define ITM_TCR_TRACEBUSID_Msk (0x7FUL << ITM_TCR_TRACEBUSID_Pos) /*!< ITM TCR: ATBID Mask */ - -#define ITM_TCR_GTSFREQ_Pos 10U /*!< ITM TCR: Global timestamp frequency Position */ -#define ITM_TCR_GTSFREQ_Msk (3UL << ITM_TCR_GTSFREQ_Pos) /*!< ITM TCR: Global timestamp frequency Mask */ - -#define ITM_TCR_TSPRESCALE_Pos 8U /*!< ITM TCR: TSPRESCALE Position */ -#define ITM_TCR_TSPRESCALE_Msk (3UL << 
ITM_TCR_TSPRESCALE_Pos) /*!< ITM TCR: TSPRESCALE Mask */ - -#define ITM_TCR_STALLENA_Pos 5U /*!< ITM TCR: STALLENA Position */ -#define ITM_TCR_STALLENA_Msk (1UL << ITM_TCR_STALLENA_Pos) /*!< ITM TCR: STALLENA Mask */ - -#define ITM_TCR_SWOENA_Pos 4U /*!< ITM TCR: SWOENA Position */ -#define ITM_TCR_SWOENA_Msk (1UL << ITM_TCR_SWOENA_Pos) /*!< ITM TCR: SWOENA Mask */ - -#define ITM_TCR_DWTENA_Pos 3U /*!< ITM TCR: DWTENA Position */ -#define ITM_TCR_DWTENA_Msk (1UL << ITM_TCR_DWTENA_Pos) /*!< ITM TCR: DWTENA Mask */ - -#define ITM_TCR_SYNCENA_Pos 2U /*!< ITM TCR: SYNCENA Position */ -#define ITM_TCR_SYNCENA_Msk (1UL << ITM_TCR_SYNCENA_Pos) /*!< ITM TCR: SYNCENA Mask */ - -#define ITM_TCR_TSENA_Pos 1U /*!< ITM TCR: TSENA Position */ -#define ITM_TCR_TSENA_Msk (1UL << ITM_TCR_TSENA_Pos) /*!< ITM TCR: TSENA Mask */ - -#define ITM_TCR_ITMENA_Pos 0U /*!< ITM TCR: ITM Enable bit Position */ -#define ITM_TCR_ITMENA_Msk (1UL /*<< ITM_TCR_ITMENA_Pos*/) /*!< ITM TCR: ITM Enable bit Mask */ - -/* ITM Lock Status Register Definitions */ -#define ITM_LSR_ByteAcc_Pos 2U /*!< ITM LSR: ByteAcc Position */ -#define ITM_LSR_ByteAcc_Msk (1UL << ITM_LSR_ByteAcc_Pos) /*!< ITM LSR: ByteAcc Mask */ - -#define ITM_LSR_Access_Pos 1U /*!< ITM LSR: Access Position */ -#define ITM_LSR_Access_Msk (1UL << ITM_LSR_Access_Pos) /*!< ITM LSR: Access Mask */ - -#define ITM_LSR_Present_Pos 0U /*!< ITM LSR: Present Position */ -#define ITM_LSR_Present_Msk (1UL /*<< ITM_LSR_Present_Pos*/) /*!< ITM LSR: Present Mask */ - -/*@}*/ /* end of group CMSIS_ITM */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_DWT Data Watchpoint and Trace (DWT) - \brief Type definitions for the Data Watchpoint and Trace (DWT) - @{ - */ - -/** - \brief Structure type to access the Data Watchpoint and Trace Register (DWT). 
- */ -typedef struct -{ - __IOM uint32_t CTRL; /*!< Offset: 0x000 (R/W) Control Register */ - __IOM uint32_t CYCCNT; /*!< Offset: 0x004 (R/W) Cycle Count Register */ - __IOM uint32_t CPICNT; /*!< Offset: 0x008 (R/W) CPI Count Register */ - __IOM uint32_t EXCCNT; /*!< Offset: 0x00C (R/W) Exception Overhead Count Register */ - __IOM uint32_t SLEEPCNT; /*!< Offset: 0x010 (R/W) Sleep Count Register */ - __IOM uint32_t LSUCNT; /*!< Offset: 0x014 (R/W) LSU Count Register */ - __IOM uint32_t FOLDCNT; /*!< Offset: 0x018 (R/W) Folded-instruction Count Register */ - __IM uint32_t PCSR; /*!< Offset: 0x01C (R/ ) Program Counter Sample Register */ - __IOM uint32_t COMP0; /*!< Offset: 0x020 (R/W) Comparator Register 0 */ - uint32_t RESERVED1[1U]; - __IOM uint32_t FUNCTION0; /*!< Offset: 0x028 (R/W) Function Register 0 */ - uint32_t RESERVED2[1U]; - __IOM uint32_t COMP1; /*!< Offset: 0x030 (R/W) Comparator Register 1 */ - uint32_t RESERVED3[1U]; - __IOM uint32_t FUNCTION1; /*!< Offset: 0x038 (R/W) Function Register 1 */ - uint32_t RESERVED4[1U]; - __IOM uint32_t COMP2; /*!< Offset: 0x040 (R/W) Comparator Register 2 */ - uint32_t RESERVED5[1U]; - __IOM uint32_t FUNCTION2; /*!< Offset: 0x048 (R/W) Function Register 2 */ - uint32_t RESERVED6[1U]; - __IOM uint32_t COMP3; /*!< Offset: 0x050 (R/W) Comparator Register 3 */ - uint32_t RESERVED7[1U]; - __IOM uint32_t FUNCTION3; /*!< Offset: 0x058 (R/W) Function Register 3 */ - uint32_t RESERVED8[1U]; - __IOM uint32_t COMP4; /*!< Offset: 0x060 (R/W) Comparator Register 4 */ - uint32_t RESERVED9[1U]; - __IOM uint32_t FUNCTION4; /*!< Offset: 0x068 (R/W) Function Register 4 */ - uint32_t RESERVED10[1U]; - __IOM uint32_t COMP5; /*!< Offset: 0x070 (R/W) Comparator Register 5 */ - uint32_t RESERVED11[1U]; - __IOM uint32_t FUNCTION5; /*!< Offset: 0x078 (R/W) Function Register 5 */ - uint32_t RESERVED12[1U]; - __IOM uint32_t COMP6; /*!< Offset: 0x080 (R/W) Comparator Register 6 */ - uint32_t RESERVED13[1U]; - __IOM uint32_t FUNCTION6; /*!< Offset: 
0x088 (R/W) Function Register 6 */ - uint32_t RESERVED14[1U]; - __IOM uint32_t COMP7; /*!< Offset: 0x090 (R/W) Comparator Register 7 */ - uint32_t RESERVED15[1U]; - __IOM uint32_t FUNCTION7; /*!< Offset: 0x098 (R/W) Function Register 7 */ - uint32_t RESERVED16[1U]; - __IOM uint32_t COMP8; /*!< Offset: 0x0A0 (R/W) Comparator Register 8 */ - uint32_t RESERVED17[1U]; - __IOM uint32_t FUNCTION8; /*!< Offset: 0x0A8 (R/W) Function Register 8 */ - uint32_t RESERVED18[1U]; - __IOM uint32_t COMP9; /*!< Offset: 0x0B0 (R/W) Comparator Register 9 */ - uint32_t RESERVED19[1U]; - __IOM uint32_t FUNCTION9; /*!< Offset: 0x0B8 (R/W) Function Register 9 */ - uint32_t RESERVED20[1U]; - __IOM uint32_t COMP10; /*!< Offset: 0x0C0 (R/W) Comparator Register 10 */ - uint32_t RESERVED21[1U]; - __IOM uint32_t FUNCTION10; /*!< Offset: 0x0C8 (R/W) Function Register 10 */ - uint32_t RESERVED22[1U]; - __IOM uint32_t COMP11; /*!< Offset: 0x0D0 (R/W) Comparator Register 11 */ - uint32_t RESERVED23[1U]; - __IOM uint32_t FUNCTION11; /*!< Offset: 0x0D8 (R/W) Function Register 11 */ - uint32_t RESERVED24[1U]; - __IOM uint32_t COMP12; /*!< Offset: 0x0E0 (R/W) Comparator Register 12 */ - uint32_t RESERVED25[1U]; - __IOM uint32_t FUNCTION12; /*!< Offset: 0x0E8 (R/W) Function Register 12 */ - uint32_t RESERVED26[1U]; - __IOM uint32_t COMP13; /*!< Offset: 0x0F0 (R/W) Comparator Register 13 */ - uint32_t RESERVED27[1U]; - __IOM uint32_t FUNCTION13; /*!< Offset: 0x0F8 (R/W) Function Register 13 */ - uint32_t RESERVED28[1U]; - __IOM uint32_t COMP14; /*!< Offset: 0x100 (R/W) Comparator Register 14 */ - uint32_t RESERVED29[1U]; - __IOM uint32_t FUNCTION14; /*!< Offset: 0x108 (R/W) Function Register 14 */ - uint32_t RESERVED30[1U]; - __IOM uint32_t COMP15; /*!< Offset: 0x110 (R/W) Comparator Register 15 */ - uint32_t RESERVED31[1U]; - __IOM uint32_t FUNCTION15; /*!< Offset: 0x118 (R/W) Function Register 15 */ - uint32_t RESERVED32[934U]; - __IM uint32_t LSR; /*!< Offset: 0xFB4 (R ) Lock Status Register */ - 
uint32_t RESERVED33[1U]; - __IM uint32_t DEVARCH; /*!< Offset: 0xFBC (R/ ) Device Architecture Register */ -} DWT_Type; - -/* DWT Control Register Definitions */ -#define DWT_CTRL_NUMCOMP_Pos 28U /*!< DWT CTRL: NUMCOMP Position */ -#define DWT_CTRL_NUMCOMP_Msk (0xFUL << DWT_CTRL_NUMCOMP_Pos) /*!< DWT CTRL: NUMCOMP Mask */ - -#define DWT_CTRL_NOTRCPKT_Pos 27U /*!< DWT CTRL: NOTRCPKT Position */ -#define DWT_CTRL_NOTRCPKT_Msk (0x1UL << DWT_CTRL_NOTRCPKT_Pos) /*!< DWT CTRL: NOTRCPKT Mask */ - -#define DWT_CTRL_NOEXTTRIG_Pos 26U /*!< DWT CTRL: NOEXTTRIG Position */ -#define DWT_CTRL_NOEXTTRIG_Msk (0x1UL << DWT_CTRL_NOEXTTRIG_Pos) /*!< DWT CTRL: NOEXTTRIG Mask */ - -#define DWT_CTRL_NOCYCCNT_Pos 25U /*!< DWT CTRL: NOCYCCNT Position */ -#define DWT_CTRL_NOCYCCNT_Msk (0x1UL << DWT_CTRL_NOCYCCNT_Pos) /*!< DWT CTRL: NOCYCCNT Mask */ - -#define DWT_CTRL_NOPRFCNT_Pos 24U /*!< DWT CTRL: NOPRFCNT Position */ -#define DWT_CTRL_NOPRFCNT_Msk (0x1UL << DWT_CTRL_NOPRFCNT_Pos) /*!< DWT CTRL: NOPRFCNT Mask */ - -#define DWT_CTRL_CYCDISS_Pos 23U /*!< DWT CTRL: CYCDISS Position */ -#define DWT_CTRL_CYCDISS_Msk (0x1UL << DWT_CTRL_CYCDISS_Pos) /*!< DWT CTRL: CYCDISS Mask */ - -#define DWT_CTRL_CYCEVTENA_Pos 22U /*!< DWT CTRL: CYCEVTENA Position */ -#define DWT_CTRL_CYCEVTENA_Msk (0x1UL << DWT_CTRL_CYCEVTENA_Pos) /*!< DWT CTRL: CYCEVTENA Mask */ - -#define DWT_CTRL_FOLDEVTENA_Pos 21U /*!< DWT CTRL: FOLDEVTENA Position */ -#define DWT_CTRL_FOLDEVTENA_Msk (0x1UL << DWT_CTRL_FOLDEVTENA_Pos) /*!< DWT CTRL: FOLDEVTENA Mask */ - -#define DWT_CTRL_LSUEVTENA_Pos 20U /*!< DWT CTRL: LSUEVTENA Position */ -#define DWT_CTRL_LSUEVTENA_Msk (0x1UL << DWT_CTRL_LSUEVTENA_Pos) /*!< DWT CTRL: LSUEVTENA Mask */ - -#define DWT_CTRL_SLEEPEVTENA_Pos 19U /*!< DWT CTRL: SLEEPEVTENA Position */ -#define DWT_CTRL_SLEEPEVTENA_Msk (0x1UL << DWT_CTRL_SLEEPEVTENA_Pos) /*!< DWT CTRL: SLEEPEVTENA Mask */ - -#define DWT_CTRL_EXCEVTENA_Pos 18U /*!< DWT CTRL: EXCEVTENA Position */ -#define DWT_CTRL_EXCEVTENA_Msk (0x1UL << 
DWT_CTRL_EXCEVTENA_Pos) /*!< DWT CTRL: EXCEVTENA Mask */ - -#define DWT_CTRL_CPIEVTENA_Pos 17U /*!< DWT CTRL: CPIEVTENA Position */ -#define DWT_CTRL_CPIEVTENA_Msk (0x1UL << DWT_CTRL_CPIEVTENA_Pos) /*!< DWT CTRL: CPIEVTENA Mask */ - -#define DWT_CTRL_EXCTRCENA_Pos 16U /*!< DWT CTRL: EXCTRCENA Position */ -#define DWT_CTRL_EXCTRCENA_Msk (0x1UL << DWT_CTRL_EXCTRCENA_Pos) /*!< DWT CTRL: EXCTRCENA Mask */ - -#define DWT_CTRL_PCSAMPLENA_Pos 12U /*!< DWT CTRL: PCSAMPLENA Position */ -#define DWT_CTRL_PCSAMPLENA_Msk (0x1UL << DWT_CTRL_PCSAMPLENA_Pos) /*!< DWT CTRL: PCSAMPLENA Mask */ - -#define DWT_CTRL_SYNCTAP_Pos 10U /*!< DWT CTRL: SYNCTAP Position */ -#define DWT_CTRL_SYNCTAP_Msk (0x3UL << DWT_CTRL_SYNCTAP_Pos) /*!< DWT CTRL: SYNCTAP Mask */ - -#define DWT_CTRL_CYCTAP_Pos 9U /*!< DWT CTRL: CYCTAP Position */ -#define DWT_CTRL_CYCTAP_Msk (0x1UL << DWT_CTRL_CYCTAP_Pos) /*!< DWT CTRL: CYCTAP Mask */ - -#define DWT_CTRL_POSTINIT_Pos 5U /*!< DWT CTRL: POSTINIT Position */ -#define DWT_CTRL_POSTINIT_Msk (0xFUL << DWT_CTRL_POSTINIT_Pos) /*!< DWT CTRL: POSTINIT Mask */ - -#define DWT_CTRL_POSTPRESET_Pos 1U /*!< DWT CTRL: POSTPRESET Position */ -#define DWT_CTRL_POSTPRESET_Msk (0xFUL << DWT_CTRL_POSTPRESET_Pos) /*!< DWT CTRL: POSTPRESET Mask */ - -#define DWT_CTRL_CYCCNTENA_Pos 0U /*!< DWT CTRL: CYCCNTENA Position */ -#define DWT_CTRL_CYCCNTENA_Msk (0x1UL /*<< DWT_CTRL_CYCCNTENA_Pos*/) /*!< DWT CTRL: CYCCNTENA Mask */ - -/* DWT CPI Count Register Definitions */ -#define DWT_CPICNT_CPICNT_Pos 0U /*!< DWT CPICNT: CPICNT Position */ -#define DWT_CPICNT_CPICNT_Msk (0xFFUL /*<< DWT_CPICNT_CPICNT_Pos*/) /*!< DWT CPICNT: CPICNT Mask */ - -/* DWT Exception Overhead Count Register Definitions */ -#define DWT_EXCCNT_EXCCNT_Pos 0U /*!< DWT EXCCNT: EXCCNT Position */ -#define DWT_EXCCNT_EXCCNT_Msk (0xFFUL /*<< DWT_EXCCNT_EXCCNT_Pos*/) /*!< DWT EXCCNT: EXCCNT Mask */ - -/* DWT Sleep Count Register Definitions */ -#define DWT_SLEEPCNT_SLEEPCNT_Pos 0U /*!< DWT SLEEPCNT: SLEEPCNT Position */ 
-#define DWT_SLEEPCNT_SLEEPCNT_Msk (0xFFUL /*<< DWT_SLEEPCNT_SLEEPCNT_Pos*/) /*!< DWT SLEEPCNT: SLEEPCNT Mask */ - -/* DWT LSU Count Register Definitions */ -#define DWT_LSUCNT_LSUCNT_Pos 0U /*!< DWT LSUCNT: LSUCNT Position */ -#define DWT_LSUCNT_LSUCNT_Msk (0xFFUL /*<< DWT_LSUCNT_LSUCNT_Pos*/) /*!< DWT LSUCNT: LSUCNT Mask */ - -/* DWT Folded-instruction Count Register Definitions */ -#define DWT_FOLDCNT_FOLDCNT_Pos 0U /*!< DWT FOLDCNT: FOLDCNT Position */ -#define DWT_FOLDCNT_FOLDCNT_Msk (0xFFUL /*<< DWT_FOLDCNT_FOLDCNT_Pos*/) /*!< DWT FOLDCNT: FOLDCNT Mask */ - -/* DWT Comparator Function Register Definitions */ -#define DWT_FUNCTION_ID_Pos 27U /*!< DWT FUNCTION: ID Position */ -#define DWT_FUNCTION_ID_Msk (0x1FUL << DWT_FUNCTION_ID_Pos) /*!< DWT FUNCTION: ID Mask */ - -#define DWT_FUNCTION_MATCHED_Pos 24U /*!< DWT FUNCTION: MATCHED Position */ -#define DWT_FUNCTION_MATCHED_Msk (0x1UL << DWT_FUNCTION_MATCHED_Pos) /*!< DWT FUNCTION: MATCHED Mask */ - -#define DWT_FUNCTION_DATAVSIZE_Pos 10U /*!< DWT FUNCTION: DATAVSIZE Position */ -#define DWT_FUNCTION_DATAVSIZE_Msk (0x3UL << DWT_FUNCTION_DATAVSIZE_Pos) /*!< DWT FUNCTION: DATAVSIZE Mask */ - -#define DWT_FUNCTION_ACTION_Pos 4U /*!< DWT FUNCTION: ACTION Position */ -#define DWT_FUNCTION_ACTION_Msk (0x1UL << DWT_FUNCTION_ACTION_Pos) /*!< DWT FUNCTION: ACTION Mask */ - -#define DWT_FUNCTION_MATCH_Pos 0U /*!< DWT FUNCTION: MATCH Position */ -#define DWT_FUNCTION_MATCH_Msk (0xFUL /*<< DWT_FUNCTION_MATCH_Pos*/) /*!< DWT FUNCTION: MATCH Mask */ - -/*@}*/ /* end of group CMSIS_DWT */ - - -/** - \ingroup CMSIS_core_register - \defgroup PwrModCtl_Type Power Mode Control Registers - \brief Type definitions for the Power Mode Control Registers (PWRMODCTL) - @{ - */ - -/** - \brief Structure type to access the Power Mode Control Registers (PWRMODCTL). 
- */ -typedef struct -{ - __IOM uint32_t CPDLPSTATE; - __IOM uint32_t DPDLPSTATE; -} PwrModCtl_Type; - - -/* PWRMODCTL Core Power Domain Low Power State (CPDLPSTATE) Register Definitions */ -#define PWRMODCTL_CPDLPSTATE_CLPSTATE_Pos 0U /*!< PWRMODCTL CPDLPSTATE CLPSTATE Position */ -#define PWRMODCTL_CPDLPSTATE_CLPSTATE_Msk 3UL /*!< PWRMODCTL CPDLPSTATE CLPSTATE Mask */ - -#define PWRMODCTL_CPDLPSTATE_ELPSTATE_Pos 4U /*!< PWRMODCTL CPDLPSTATE ELPSTATE Position */ -#define PWRMODCTL_CPDLPSTATE_ELPSTATE_Msk 3UL /*!< PWRMODCTL CPDLPSTATE ELPSTATE Mask */ - -#define PWRMODCTL_CPDLPSTATE_RLPSTATE_Pos 8U /*!< PWRMODCTL CPDLPSTATE RLPSTATE Position */ -#define PWRMODCTL_CPDLPSTATE_RLPSTATE_Msk 3UL /*!< PWRMODCTL CPDLPSTATE RLPSTATE Mask */ - -/* PWRMODCTL Debug Power Domain Low Power State (DPDLPSTATE) Register Definitions */ -#define PWRMODCTL_DPDLPSTATE_DLPSTATE_Pos 0U /*!< PWRMODCTL DPDLPSTATE DLPSTATE Position */ -#define PWRMODCTL_DPDLPSTATE_DLPSTATE_Msk 3UL /*!< PWRMODCTL DPDLPSTATE DLPSTATE Mask */ - -/*@}*/ /* end of group CMSIS_PWRMODCTL */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_TPI Trace Port Interface (TPI) - \brief Type definitions for the Trace Port Interface (TPI) - @{ - */ - -/** - \brief Structure type to access the Trace Port Interface Register (TPI). 
- */ -typedef struct -{ - __IM uint32_t SSPSR; /*!< Offset: 0x000 (R/ ) Supported Parallel Port Sizes Register */ - __IOM uint32_t CSPSR; /*!< Offset: 0x004 (R/W) Current Parallel Port Sizes Register */ - uint32_t RESERVED0[2U]; - __IOM uint32_t ACPR; /*!< Offset: 0x010 (R/W) Asynchronous Clock Prescaler Register */ - uint32_t RESERVED1[55U]; - __IOM uint32_t SPPR; /*!< Offset: 0x0F0 (R/W) Selected Pin Protocol Register */ - uint32_t RESERVED2[131U]; - __IM uint32_t FFSR; /*!< Offset: 0x300 (R/ ) Formatter and Flush Status Register */ - __IOM uint32_t FFCR; /*!< Offset: 0x304 (R/W) Formatter and Flush Control Register */ - __IOM uint32_t PSCR; /*!< Offset: 0x308 (R/W) Periodic Synchronization Control Register */ - uint32_t RESERVED3[809U]; - __OM uint32_t LAR; /*!< Offset: 0xFB0 ( /W) Software Lock Access Register */ - __IM uint32_t LSR; /*!< Offset: 0xFB4 (R/ ) Software Lock Status Register */ - uint32_t RESERVED4[4U]; - __IM uint32_t TYPE; /*!< Offset: 0xFC8 (R/ ) Device Identifier Register */ - __IM uint32_t DEVTYPE; /*!< Offset: 0xFCC (R/ ) Device Type Register */ -} TPI_Type; - -/* TPI Asynchronous Clock Prescaler Register Definitions */ -#define TPI_ACPR_SWOSCALER_Pos 0U /*!< TPI ACPR: SWOSCALER Position */ -#define TPI_ACPR_SWOSCALER_Msk (0xFFFFUL /*<< TPI_ACPR_SWOSCALER_Pos*/) /*!< TPI ACPR: SWOSCALER Mask */ - -/* TPI Selected Pin Protocol Register Definitions */ -#define TPI_SPPR_TXMODE_Pos 0U /*!< TPI SPPR: TXMODE Position */ -#define TPI_SPPR_TXMODE_Msk (0x3UL /*<< TPI_SPPR_TXMODE_Pos*/) /*!< TPI SPPR: TXMODE Mask */ - -/* TPI Formatter and Flush Status Register Definitions */ -#define TPI_FFSR_FtNonStop_Pos 3U /*!< TPI FFSR: FtNonStop Position */ -#define TPI_FFSR_FtNonStop_Msk (0x1UL << TPI_FFSR_FtNonStop_Pos) /*!< TPI FFSR: FtNonStop Mask */ - -#define TPI_FFSR_TCPresent_Pos 2U /*!< TPI FFSR: TCPresent Position */ -#define TPI_FFSR_TCPresent_Msk (0x1UL << TPI_FFSR_TCPresent_Pos) /*!< TPI FFSR: TCPresent Mask */ - -#define TPI_FFSR_FtStopped_Pos 1U 
/*!< TPI FFSR: FtStopped Position */
-#define TPI_FFSR_FtStopped_Msk (0x1UL << TPI_FFSR_FtStopped_Pos) /*!< TPI FFSR: FtStopped Mask */
-
-#define TPI_FFSR_FlInProg_Pos 0U /*!< TPI FFSR: FlInProg Position */
-#define TPI_FFSR_FlInProg_Msk (0x1UL /*<< TPI_FFSR_FlInProg_Pos*/) /*!< TPI FFSR: FlInProg Mask */
-
-/* TPI Formatter and Flush Control Register Definitions */
-#define TPI_FFCR_TrigIn_Pos 8U /*!< TPI FFCR: TrigIn Position */
-#define TPI_FFCR_TrigIn_Msk (0x1UL << TPI_FFCR_TrigIn_Pos) /*!< TPI FFCR: TrigIn Mask */
-
-#define TPI_FFCR_FOnMan_Pos 6U /*!< TPI FFCR: FOnMan Position */
-#define TPI_FFCR_FOnMan_Msk (0x1UL << TPI_FFCR_FOnMan_Pos) /*!< TPI FFCR: FOnMan Mask */
-
-#define TPI_FFCR_EnFmt_Pos 0U /*!< TPI FFCR: EnFmt Position */
-#define TPI_FFCR_EnFmt_Msk (0x3UL /*<< TPI_FFCR_EnFmt_Pos*/) /*!< TPI FFCR: EnFmt Mask */
-
-/* TPI Periodic Synchronization Control Register Definitions */
-#define TPI_PSCR_PSCount_Pos 0U /*!< TPI PSCR: PSCount Position */
-#define TPI_PSCR_PSCount_Msk (0x1FUL /*<< TPI_PSCR_PSCount_Pos*/) /*!< TPI PSCR: PSCount Mask */
-
-/* TPI Software Lock Status Register Definitions */
-#define TPI_LSR_nTT_Pos 2U /*!< TPI LSR: Not thirty-two bit. Position */
-#define TPI_LSR_nTT_Msk (0x1UL << TPI_LSR_nTT_Pos) /*!< TPI LSR: Not thirty-two bit. Mask */
-
-#define TPI_LSR_SLK_Pos 1U /*!< TPI LSR: Software Lock status Position */
-#define TPI_LSR_SLK_Msk (0x1UL << TPI_LSR_SLK_Pos) /*!< TPI LSR: Software Lock status Mask */
-
-#define TPI_LSR_SLI_Pos 0U /*!< TPI LSR: Software Lock implemented Position */
-#define TPI_LSR_SLI_Msk (0x1UL /*<< TPI_LSR_SLI_Pos*/) /*!< TPI LSR: Software Lock implemented Mask */
-
-/* TPI DEVID Register Definitions */
-#define TPI_DEVID_NRZVALID_Pos 11U /*!< TPI DEVID: NRZVALID Position */
-#define TPI_DEVID_NRZVALID_Msk (0x1UL << TPI_DEVID_NRZVALID_Pos) /*!< TPI DEVID: NRZVALID Mask */
-
-#define TPI_DEVID_MANCVALID_Pos 10U /*!< TPI DEVID: MANCVALID Position */
-#define TPI_DEVID_MANCVALID_Msk (0x1UL << TPI_DEVID_MANCVALID_Pos) /*!< TPI DEVID: MANCVALID Mask */
-
-#define TPI_DEVID_PTINVALID_Pos 9U /*!< TPI DEVID: PTINVALID Position */
-#define TPI_DEVID_PTINVALID_Msk (0x1UL << TPI_DEVID_PTINVALID_Pos) /*!< TPI DEVID: PTINVALID Mask */
-
-#define TPI_DEVID_FIFOSZ_Pos 6U /*!< TPI DEVID: FIFO depth Position */
-#define TPI_DEVID_FIFOSZ_Msk (0x7UL << TPI_DEVID_FIFOSZ_Pos) /*!< TPI DEVID: FIFO depth Mask */
-
-/* TPI DEVTYPE Register Definitions */
-#define TPI_DEVTYPE_SubType_Pos 4U /*!< TPI DEVTYPE: SubType Position */
-#define TPI_DEVTYPE_SubType_Msk (0xFUL /*<< TPI_DEVTYPE_SubType_Pos*/) /*!< TPI DEVTYPE: SubType Mask */
-
-#define TPI_DEVTYPE_MajorType_Pos 0U /*!< TPI DEVTYPE: MajorType Position */
-#define TPI_DEVTYPE_MajorType_Msk (0xFUL << TPI_DEVTYPE_MajorType_Pos) /*!< TPI DEVTYPE: MajorType Mask */
-
-/*@}*/ /* end of group CMSIS_TPI */
-
-#if defined (__PMU_PRESENT) && (__PMU_PRESENT == 1U)
-/**
- \ingroup CMSIS_core_register
- \defgroup CMSIS_PMU Performance Monitoring Unit (PMU)
- \brief Type definitions for the Performance Monitoring Unit (PMU)
- @{
- */
-
-/**
- \brief Structure type to access the Performance Monitoring Unit (PMU).
- */
-typedef struct
-{
- __IOM uint32_t EVCNTR[__PMU_NUM_EVENTCNT]; /*!< Offset: 0x0 (R/W) PMU Event Counter Registers */
-#if __PMU_NUM_EVENTCNT<31
- uint32_t RESERVED0[31U-__PMU_NUM_EVENTCNT];
-#endif
- __IOM uint32_t CCNTR; /*!< Offset: 0x7C (R/W) PMU Cycle Counter Register */
- uint32_t RESERVED1[224];
- __IOM uint32_t EVTYPER[__PMU_NUM_EVENTCNT]; /*!< Offset: 0x400 (R/W) PMU Event Type and Filter Registers */
-#if __PMU_NUM_EVENTCNT<31
- uint32_t RESERVED2[31U-__PMU_NUM_EVENTCNT];
-#endif
- __IOM uint32_t CCFILTR; /*!< Offset: 0x47C (R/W) PMU Cycle Counter Filter Register */
- uint32_t RESERVED3[480];
- __IOM uint32_t CNTENSET; /*!< Offset: 0xC00 (R/W) PMU Count Enable Set Register */
- uint32_t RESERVED4[7];
- __IOM uint32_t CNTENCLR; /*!< Offset: 0xC20 (R/W) PMU Count Enable Clear Register */
- uint32_t RESERVED5[7];
- __IOM uint32_t INTENSET; /*!< Offset: 0xC40 (R/W) PMU Interrupt Enable Set Register */
- uint32_t RESERVED6[7];
- __IOM uint32_t INTENCLR; /*!< Offset: 0xC60 (R/W) PMU Interrupt Enable Clear Register */
- uint32_t RESERVED7[7];
- __IOM uint32_t OVSCLR; /*!< Offset: 0xC80 (R/W) PMU Overflow Flag Status Clear Register */
- uint32_t RESERVED8[7];
- __IOM uint32_t SWINC; /*!< Offset: 0xCA0 (R/W) PMU Software Increment Register */
- uint32_t RESERVED9[7];
- __IOM uint32_t OVSSET; /*!< Offset: 0xCC0 (R/W) PMU Overflow Flag Status Set Register */
- uint32_t RESERVED10[79];
- __IOM uint32_t TYPE; /*!< Offset: 0xE00 (R/W) PMU Type Register */
- __IOM uint32_t CTRL; /*!< Offset: 0xE04 (R/W) PMU Control Register */
- uint32_t RESERVED11[108];
- __IOM uint32_t AUTHSTATUS; /*!< Offset: 0xFB8 (R/W) PMU Authentication Status Register */
- __IOM uint32_t DEVARCH; /*!< Offset: 0xFBC (R/W) PMU Device Architecture Register */
- uint32_t RESERVED12[3];
- __IOM uint32_t DEVTYPE; /*!< Offset: 0xFCC (R/W) PMU Device Type Register */
- __IOM uint32_t PIDR4; /*!< Offset: 0xFD0 (R/W) PMU Peripheral Identification Register 4 */
- uint32_t RESERVED13[3];
- __IOM
uint32_t PIDR0; /*!< Offset: 0xFE0 (R/W) PMU Peripheral Identification Register 0 */ - __IOM uint32_t PIDR1; /*!< Offset: 0xFE4 (R/W) PMU Peripheral Identification Register 1 */ - __IOM uint32_t PIDR2; /*!< Offset: 0xFE8 (R/W) PMU Peripheral Identification Register 2 */ - __IOM uint32_t PIDR3; /*!< Offset: 0xFEC (R/W) PMU Peripheral Identification Register 3 */ - __IOM uint32_t CIDR0; /*!< Offset: 0xFF0 (R/W) PMU Component Identification Register 0 */ - __IOM uint32_t CIDR1; /*!< Offset: 0xFF4 (R/W) PMU Component Identification Register 1 */ - __IOM uint32_t CIDR2; /*!< Offset: 0xFF8 (R/W) PMU Component Identification Register 2 */ - __IOM uint32_t CIDR3; /*!< Offset: 0xFFC (R/W) PMU Component Identification Register 3 */ -} PMU_Type; - -/** \brief PMU Event Counter Registers (0-30) Definitions */ - -#define PMU_EVCNTR_CNT_Pos 0U /*!< PMU EVCNTR: Counter Position */ -#define PMU_EVCNTR_CNT_Msk (0xFFFFUL /*<< PMU_EVCNTRx_CNT_Pos*/) /*!< PMU EVCNTR: Counter Mask */ - -/** \brief PMU Event Type and Filter Registers (0-30) Definitions */ - -#define PMU_EVTYPER_EVENTTOCNT_Pos 0U /*!< PMU EVTYPER: Event to Count Position */ -#define PMU_EVTYPER_EVENTTOCNT_Msk (0xFFFFUL /*<< EVTYPERx_EVENTTOCNT_Pos*/) /*!< PMU EVTYPER: Event to Count Mask */ - -/** \brief PMU Count Enable Set Register Definitions */ - -#define PMU_CNTENSET_CNT0_ENABLE_Pos 0U /*!< PMU CNTENSET: Event Counter 0 Enable Set Position */ -#define PMU_CNTENSET_CNT0_ENABLE_Msk (1UL /*<< PMU_CNTENSET_CNT0_ENABLE_Pos*/) /*!< PMU CNTENSET: Event Counter 0 Enable Set Mask */ - -#define PMU_CNTENSET_CNT1_ENABLE_Pos 1U /*!< PMU CNTENSET: Event Counter 1 Enable Set Position */ -#define PMU_CNTENSET_CNT1_ENABLE_Msk (1UL << PMU_CNTENSET_CNT1_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 1 Enable Set Mask */ - -#define PMU_CNTENSET_CNT2_ENABLE_Pos 2U /*!< PMU CNTENSET: Event Counter 2 Enable Set Position */ -#define PMU_CNTENSET_CNT2_ENABLE_Msk (1UL << PMU_CNTENSET_CNT2_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 2 Enable 
Set Mask */ - -#define PMU_CNTENSET_CNT3_ENABLE_Pos 3U /*!< PMU CNTENSET: Event Counter 3 Enable Set Position */ -#define PMU_CNTENSET_CNT3_ENABLE_Msk (1UL << PMU_CNTENSET_CNT3_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 3 Enable Set Mask */ - -#define PMU_CNTENSET_CNT4_ENABLE_Pos 4U /*!< PMU CNTENSET: Event Counter 4 Enable Set Position */ -#define PMU_CNTENSET_CNT4_ENABLE_Msk (1UL << PMU_CNTENSET_CNT4_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 4 Enable Set Mask */ - -#define PMU_CNTENSET_CNT5_ENABLE_Pos 5U /*!< PMU CNTENSET: Event Counter 5 Enable Set Position */ -#define PMU_CNTENSET_CNT5_ENABLE_Msk (1UL << PMU_CNTENSET_CNT5_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 5 Enable Set Mask */ - -#define PMU_CNTENSET_CNT6_ENABLE_Pos 6U /*!< PMU CNTENSET: Event Counter 6 Enable Set Position */ -#define PMU_CNTENSET_CNT6_ENABLE_Msk (1UL << PMU_CNTENSET_CNT6_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 6 Enable Set Mask */ - -#define PMU_CNTENSET_CNT7_ENABLE_Pos 7U /*!< PMU CNTENSET: Event Counter 7 Enable Set Position */ -#define PMU_CNTENSET_CNT7_ENABLE_Msk (1UL << PMU_CNTENSET_CNT7_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 7 Enable Set Mask */ - -#define PMU_CNTENSET_CNT8_ENABLE_Pos 8U /*!< PMU CNTENSET: Event Counter 8 Enable Set Position */ -#define PMU_CNTENSET_CNT8_ENABLE_Msk (1UL << PMU_CNTENSET_CNT8_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 8 Enable Set Mask */ - -#define PMU_CNTENSET_CNT9_ENABLE_Pos 9U /*!< PMU CNTENSET: Event Counter 9 Enable Set Position */ -#define PMU_CNTENSET_CNT9_ENABLE_Msk (1UL << PMU_CNTENSET_CNT9_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 9 Enable Set Mask */ - -#define PMU_CNTENSET_CNT10_ENABLE_Pos 10U /*!< PMU CNTENSET: Event Counter 10 Enable Set Position */ -#define PMU_CNTENSET_CNT10_ENABLE_Msk (1UL << PMU_CNTENSET_CNT10_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 10 Enable Set Mask */ - -#define PMU_CNTENSET_CNT11_ENABLE_Pos 11U /*!< PMU CNTENSET: Event Counter 11 Enable Set Position */ -#define 
PMU_CNTENSET_CNT11_ENABLE_Msk (1UL << PMU_CNTENSET_CNT11_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 11 Enable Set Mask */ - -#define PMU_CNTENSET_CNT12_ENABLE_Pos 12U /*!< PMU CNTENSET: Event Counter 12 Enable Set Position */ -#define PMU_CNTENSET_CNT12_ENABLE_Msk (1UL << PMU_CNTENSET_CNT12_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 12 Enable Set Mask */ - -#define PMU_CNTENSET_CNT13_ENABLE_Pos 13U /*!< PMU CNTENSET: Event Counter 13 Enable Set Position */ -#define PMU_CNTENSET_CNT13_ENABLE_Msk (1UL << PMU_CNTENSET_CNT13_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 13 Enable Set Mask */ - -#define PMU_CNTENSET_CNT14_ENABLE_Pos 14U /*!< PMU CNTENSET: Event Counter 14 Enable Set Position */ -#define PMU_CNTENSET_CNT14_ENABLE_Msk (1UL << PMU_CNTENSET_CNT14_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 14 Enable Set Mask */ - -#define PMU_CNTENSET_CNT15_ENABLE_Pos 15U /*!< PMU CNTENSET: Event Counter 15 Enable Set Position */ -#define PMU_CNTENSET_CNT15_ENABLE_Msk (1UL << PMU_CNTENSET_CNT15_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 15 Enable Set Mask */ - -#define PMU_CNTENSET_CNT16_ENABLE_Pos 16U /*!< PMU CNTENSET: Event Counter 16 Enable Set Position */ -#define PMU_CNTENSET_CNT16_ENABLE_Msk (1UL << PMU_CNTENSET_CNT16_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 16 Enable Set Mask */ - -#define PMU_CNTENSET_CNT17_ENABLE_Pos 17U /*!< PMU CNTENSET: Event Counter 17 Enable Set Position */ -#define PMU_CNTENSET_CNT17_ENABLE_Msk (1UL << PMU_CNTENSET_CNT17_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 17 Enable Set Mask */ - -#define PMU_CNTENSET_CNT18_ENABLE_Pos 18U /*!< PMU CNTENSET: Event Counter 18 Enable Set Position */ -#define PMU_CNTENSET_CNT18_ENABLE_Msk (1UL << PMU_CNTENSET_CNT18_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 18 Enable Set Mask */ - -#define PMU_CNTENSET_CNT19_ENABLE_Pos 19U /*!< PMU CNTENSET: Event Counter 19 Enable Set Position */ -#define PMU_CNTENSET_CNT19_ENABLE_Msk (1UL << PMU_CNTENSET_CNT19_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 
19 Enable Set Mask */ - -#define PMU_CNTENSET_CNT20_ENABLE_Pos 20U /*!< PMU CNTENSET: Event Counter 20 Enable Set Position */ -#define PMU_CNTENSET_CNT20_ENABLE_Msk (1UL << PMU_CNTENSET_CNT20_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 20 Enable Set Mask */ - -#define PMU_CNTENSET_CNT21_ENABLE_Pos 21U /*!< PMU CNTENSET: Event Counter 21 Enable Set Position */ -#define PMU_CNTENSET_CNT21_ENABLE_Msk (1UL << PMU_CNTENSET_CNT21_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 21 Enable Set Mask */ - -#define PMU_CNTENSET_CNT22_ENABLE_Pos 22U /*!< PMU CNTENSET: Event Counter 22 Enable Set Position */ -#define PMU_CNTENSET_CNT22_ENABLE_Msk (1UL << PMU_CNTENSET_CNT22_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 22 Enable Set Mask */ - -#define PMU_CNTENSET_CNT23_ENABLE_Pos 23U /*!< PMU CNTENSET: Event Counter 23 Enable Set Position */ -#define PMU_CNTENSET_CNT23_ENABLE_Msk (1UL << PMU_CNTENSET_CNT23_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 23 Enable Set Mask */ - -#define PMU_CNTENSET_CNT24_ENABLE_Pos 24U /*!< PMU CNTENSET: Event Counter 24 Enable Set Position */ -#define PMU_CNTENSET_CNT24_ENABLE_Msk (1UL << PMU_CNTENSET_CNT24_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 24 Enable Set Mask */ - -#define PMU_CNTENSET_CNT25_ENABLE_Pos 25U /*!< PMU CNTENSET: Event Counter 25 Enable Set Position */ -#define PMU_CNTENSET_CNT25_ENABLE_Msk (1UL << PMU_CNTENSET_CNT25_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 25 Enable Set Mask */ - -#define PMU_CNTENSET_CNT26_ENABLE_Pos 26U /*!< PMU CNTENSET: Event Counter 26 Enable Set Position */ -#define PMU_CNTENSET_CNT26_ENABLE_Msk (1UL << PMU_CNTENSET_CNT26_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 26 Enable Set Mask */ - -#define PMU_CNTENSET_CNT27_ENABLE_Pos 27U /*!< PMU CNTENSET: Event Counter 27 Enable Set Position */ -#define PMU_CNTENSET_CNT27_ENABLE_Msk (1UL << PMU_CNTENSET_CNT27_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 27 Enable Set Mask */ - -#define PMU_CNTENSET_CNT28_ENABLE_Pos 28U /*!< PMU CNTENSET: Event Counter 28 
Enable Set Position */
-#define PMU_CNTENSET_CNT28_ENABLE_Msk (1UL << PMU_CNTENSET_CNT28_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 28 Enable Set Mask */
-
-#define PMU_CNTENSET_CNT29_ENABLE_Pos 29U /*!< PMU CNTENSET: Event Counter 29 Enable Set Position */
-#define PMU_CNTENSET_CNT29_ENABLE_Msk (1UL << PMU_CNTENSET_CNT29_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 29 Enable Set Mask */
-
-#define PMU_CNTENSET_CNT30_ENABLE_Pos 30U /*!< PMU CNTENSET: Event Counter 30 Enable Set Position */
-#define PMU_CNTENSET_CNT30_ENABLE_Msk (1UL << PMU_CNTENSET_CNT30_ENABLE_Pos) /*!< PMU CNTENSET: Event Counter 30 Enable Set Mask */
-
-#define PMU_CNTENSET_CCNTR_ENABLE_Pos 31U /*!< PMU CNTENSET: Cycle Counter Enable Set Position */
-#define PMU_CNTENSET_CCNTR_ENABLE_Msk (1UL << PMU_CNTENSET_CCNTR_ENABLE_Pos) /*!< PMU CNTENSET: Cycle Counter Enable Set Mask */
-
-/** \brief PMU Count Enable Clear Register Definitions */
-
-#define PMU_CNTENCLR_CNT0_ENABLE_Pos 0U /*!< PMU CNTENCLR: Event Counter 0 Enable Clear Position */
-#define PMU_CNTENCLR_CNT0_ENABLE_Msk (1UL /*<< PMU_CNTENCLR_CNT0_ENABLE_Pos*/) /*!< PMU CNTENCLR: Event Counter 0 Enable Clear Mask */
-
-#define PMU_CNTENCLR_CNT1_ENABLE_Pos 1U /*!< PMU CNTENCLR: Event Counter 1 Enable Clear Position */
-#define PMU_CNTENCLR_CNT1_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT1_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 1 Enable Clear Mask */
-
-#define PMU_CNTENCLR_CNT2_ENABLE_Pos 2U /*!< PMU CNTENCLR: Event Counter 2 Enable Clear Position */
-#define PMU_CNTENCLR_CNT2_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT2_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 2 Enable Clear Mask */
-
-#define PMU_CNTENCLR_CNT3_ENABLE_Pos 3U /*!< PMU CNTENCLR: Event Counter 3 Enable Clear Position */
-#define PMU_CNTENCLR_CNT3_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT3_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 3 Enable Clear Mask */
-
-#define PMU_CNTENCLR_CNT4_ENABLE_Pos 4U /*!< PMU CNTENCLR: Event Counter 4 Enable Clear Position */
-#define
PMU_CNTENCLR_CNT4_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT4_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 4 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT5_ENABLE_Pos 5U /*!< PMU CNTENCLR: Event Counter 5 Enable Clear Position */ -#define PMU_CNTENCLR_CNT5_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT5_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 5 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT6_ENABLE_Pos 6U /*!< PMU CNTENCLR: Event Counter 6 Enable Clear Position */ -#define PMU_CNTENCLR_CNT6_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT6_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 6 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT7_ENABLE_Pos 7U /*!< PMU CNTENCLR: Event Counter 7 Enable Clear Position */ -#define PMU_CNTENCLR_CNT7_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT7_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 7 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT8_ENABLE_Pos 8U /*!< PMU CNTENCLR: Event Counter 8 Enable Clear Position */ -#define PMU_CNTENCLR_CNT8_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT8_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 8 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT9_ENABLE_Pos 9U /*!< PMU CNTENCLR: Event Counter 9 Enable Clear Position */ -#define PMU_CNTENCLR_CNT9_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT9_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 9 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT10_ENABLE_Pos 10U /*!< PMU CNTENCLR: Event Counter 10 Enable Clear Position */ -#define PMU_CNTENCLR_CNT10_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT10_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 10 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT11_ENABLE_Pos 11U /*!< PMU CNTENCLR: Event Counter 11 Enable Clear Position */ -#define PMU_CNTENCLR_CNT11_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT11_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 11 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT12_ENABLE_Pos 12U /*!< PMU CNTENCLR: Event Counter 12 Enable Clear Position */ -#define PMU_CNTENCLR_CNT12_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT12_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 12 
Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT13_ENABLE_Pos 13U /*!< PMU CNTENCLR: Event Counter 13 Enable Clear Position */ -#define PMU_CNTENCLR_CNT13_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT13_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 13 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT14_ENABLE_Pos 14U /*!< PMU CNTENCLR: Event Counter 14 Enable Clear Position */ -#define PMU_CNTENCLR_CNT14_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT14_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 14 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT15_ENABLE_Pos 15U /*!< PMU CNTENCLR: Event Counter 15 Enable Clear Position */ -#define PMU_CNTENCLR_CNT15_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT15_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 15 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT16_ENABLE_Pos 16U /*!< PMU CNTENCLR: Event Counter 16 Enable Clear Position */ -#define PMU_CNTENCLR_CNT16_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT16_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 16 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT17_ENABLE_Pos 17U /*!< PMU CNTENCLR: Event Counter 17 Enable Clear Position */ -#define PMU_CNTENCLR_CNT17_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT17_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 17 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT18_ENABLE_Pos 18U /*!< PMU CNTENCLR: Event Counter 18 Enable Clear Position */ -#define PMU_CNTENCLR_CNT18_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT18_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 18 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT19_ENABLE_Pos 19U /*!< PMU CNTENCLR: Event Counter 19 Enable Clear Position */ -#define PMU_CNTENCLR_CNT19_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT19_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 19 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT20_ENABLE_Pos 20U /*!< PMU CNTENCLR: Event Counter 20 Enable Clear Position */ -#define PMU_CNTENCLR_CNT20_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT20_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 20 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT21_ENABLE_Pos 21U /*!< 
PMU CNTENCLR: Event Counter 21 Enable Clear Position */ -#define PMU_CNTENCLR_CNT21_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT21_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 21 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT22_ENABLE_Pos 22U /*!< PMU CNTENCLR: Event Counter 22 Enable Clear Position */ -#define PMU_CNTENCLR_CNT22_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT22_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 22 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT23_ENABLE_Pos 23U /*!< PMU CNTENCLR: Event Counter 23 Enable Clear Position */ -#define PMU_CNTENCLR_CNT23_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT23_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 23 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT24_ENABLE_Pos 24U /*!< PMU CNTENCLR: Event Counter 24 Enable Clear Position */ -#define PMU_CNTENCLR_CNT24_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT24_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 24 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT25_ENABLE_Pos 25U /*!< PMU CNTENCLR: Event Counter 25 Enable Clear Position */ -#define PMU_CNTENCLR_CNT25_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT25_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 25 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT26_ENABLE_Pos 26U /*!< PMU CNTENCLR: Event Counter 26 Enable Clear Position */ -#define PMU_CNTENCLR_CNT26_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT26_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 26 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT27_ENABLE_Pos 27U /*!< PMU CNTENCLR: Event Counter 27 Enable Clear Position */ -#define PMU_CNTENCLR_CNT27_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT27_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 27 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT28_ENABLE_Pos 28U /*!< PMU CNTENCLR: Event Counter 28 Enable Clear Position */ -#define PMU_CNTENCLR_CNT28_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT28_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 28 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT29_ENABLE_Pos 29U /*!< PMU CNTENCLR: Event Counter 29 Enable Clear Position */ -#define 
PMU_CNTENCLR_CNT29_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT29_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 29 Enable Clear Mask */ - -#define PMU_CNTENCLR_CNT30_ENABLE_Pos 30U /*!< PMU CNTENCLR: Event Counter 30 Enable Clear Position */ -#define PMU_CNTENCLR_CNT30_ENABLE_Msk (1UL << PMU_CNTENCLR_CNT30_ENABLE_Pos) /*!< PMU CNTENCLR: Event Counter 30 Enable Clear Mask */ - -#define PMU_CNTENCLR_CCNTR_ENABLE_Pos 31U /*!< PMU CNTENCLR: Cycle Counter Enable Clear Position */ -#define PMU_CNTENCLR_CCNTR_ENABLE_Msk (1UL << PMU_CNTENCLR_CCNTR_ENABLE_Pos) /*!< PMU CNTENCLR: Cycle Counter Enable Clear Mask */ - -/** \brief PMU Interrupt Enable Set Register Definitions */ - -#define PMU_INTENSET_CNT0_ENABLE_Pos 0U /*!< PMU INTENSET: Event Counter 0 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT0_ENABLE_Msk (1UL /*<< PMU_INTENSET_CNT0_ENABLE_Pos*/) /*!< PMU INTENSET: Event Counter 0 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT1_ENABLE_Pos 1U /*!< PMU INTENSET: Event Counter 1 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT1_ENABLE_Msk (1UL << PMU_INTENSET_CNT1_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 1 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT2_ENABLE_Pos 2U /*!< PMU INTENSET: Event Counter 2 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT2_ENABLE_Msk (1UL << PMU_INTENSET_CNT2_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 2 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT3_ENABLE_Pos 3U /*!< PMU INTENSET: Event Counter 3 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT3_ENABLE_Msk (1UL << PMU_INTENSET_CNT3_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 3 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT4_ENABLE_Pos 4U /*!< PMU INTENSET: Event Counter 4 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT4_ENABLE_Msk (1UL << PMU_INTENSET_CNT4_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 4 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT5_ENABLE_Pos 5U /*!< PMU INTENSET: Event Counter 
5 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT5_ENABLE_Msk (1UL << PMU_INTENSET_CNT5_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 5 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT6_ENABLE_Pos 6U /*!< PMU INTENSET: Event Counter 6 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT6_ENABLE_Msk (1UL << PMU_INTENSET_CNT6_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 6 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT7_ENABLE_Pos 7U /*!< PMU INTENSET: Event Counter 7 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT7_ENABLE_Msk (1UL << PMU_INTENSET_CNT7_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 7 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT8_ENABLE_Pos 8U /*!< PMU INTENSET: Event Counter 8 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT8_ENABLE_Msk (1UL << PMU_INTENSET_CNT8_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 8 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT9_ENABLE_Pos 9U /*!< PMU INTENSET: Event Counter 9 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT9_ENABLE_Msk (1UL << PMU_INTENSET_CNT9_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 9 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT10_ENABLE_Pos 10U /*!< PMU INTENSET: Event Counter 10 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT10_ENABLE_Msk (1UL << PMU_INTENSET_CNT10_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 10 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT11_ENABLE_Pos 11U /*!< PMU INTENSET: Event Counter 11 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT11_ENABLE_Msk (1UL << PMU_INTENSET_CNT11_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 11 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT12_ENABLE_Pos 12U /*!< PMU INTENSET: Event Counter 12 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT12_ENABLE_Msk (1UL << PMU_INTENSET_CNT12_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 12 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT13_ENABLE_Pos 13U /*!< 
PMU INTENSET: Event Counter 13 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT13_ENABLE_Msk (1UL << PMU_INTENSET_CNT13_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 13 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT14_ENABLE_Pos 14U /*!< PMU INTENSET: Event Counter 14 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT14_ENABLE_Msk (1UL << PMU_INTENSET_CNT14_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 14 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT15_ENABLE_Pos 15U /*!< PMU INTENSET: Event Counter 15 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT15_ENABLE_Msk (1UL << PMU_INTENSET_CNT15_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 15 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT16_ENABLE_Pos 16U /*!< PMU INTENSET: Event Counter 16 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT16_ENABLE_Msk (1UL << PMU_INTENSET_CNT16_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 16 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT17_ENABLE_Pos 17U /*!< PMU INTENSET: Event Counter 17 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT17_ENABLE_Msk (1UL << PMU_INTENSET_CNT17_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 17 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT18_ENABLE_Pos 18U /*!< PMU INTENSET: Event Counter 18 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT18_ENABLE_Msk (1UL << PMU_INTENSET_CNT18_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 18 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT19_ENABLE_Pos 19U /*!< PMU INTENSET: Event Counter 19 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT19_ENABLE_Msk (1UL << PMU_INTENSET_CNT19_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 19 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT20_ENABLE_Pos 20U /*!< PMU INTENSET: Event Counter 20 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT20_ENABLE_Msk (1UL << PMU_INTENSET_CNT20_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 20 Interrupt Enable Set 
Mask */ - -#define PMU_INTENSET_CNT21_ENABLE_Pos 21U /*!< PMU INTENSET: Event Counter 21 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT21_ENABLE_Msk (1UL << PMU_INTENSET_CNT21_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 21 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT22_ENABLE_Pos 22U /*!< PMU INTENSET: Event Counter 22 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT22_ENABLE_Msk (1UL << PMU_INTENSET_CNT22_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 22 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT23_ENABLE_Pos 23U /*!< PMU INTENSET: Event Counter 23 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT23_ENABLE_Msk (1UL << PMU_INTENSET_CNT23_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 23 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT24_ENABLE_Pos 24U /*!< PMU INTENSET: Event Counter 24 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT24_ENABLE_Msk (1UL << PMU_INTENSET_CNT24_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 24 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT25_ENABLE_Pos 25U /*!< PMU INTENSET: Event Counter 25 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT25_ENABLE_Msk (1UL << PMU_INTENSET_CNT25_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 25 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT26_ENABLE_Pos 26U /*!< PMU INTENSET: Event Counter 26 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT26_ENABLE_Msk (1UL << PMU_INTENSET_CNT26_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 26 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT27_ENABLE_Pos 27U /*!< PMU INTENSET: Event Counter 27 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT27_ENABLE_Msk (1UL << PMU_INTENSET_CNT27_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 27 Interrupt Enable Set Mask */ - -#define PMU_INTENSET_CNT28_ENABLE_Pos 28U /*!< PMU INTENSET: Event Counter 28 Interrupt Enable Set Position */ -#define PMU_INTENSET_CNT28_ENABLE_Msk (1UL << PMU_INTENSET_CNT28_ENABLE_Pos) 
/*!< PMU INTENSET: Event Counter 28 Interrupt Enable Set Mask */
-
-#define PMU_INTENSET_CNT29_ENABLE_Pos 29U /*!< PMU INTENSET: Event Counter 29 Interrupt Enable Set Position */
-#define PMU_INTENSET_CNT29_ENABLE_Msk (1UL << PMU_INTENSET_CNT29_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 29 Interrupt Enable Set Mask */
-
-#define PMU_INTENSET_CNT30_ENABLE_Pos 30U /*!< PMU INTENSET: Event Counter 30 Interrupt Enable Set Position */
-#define PMU_INTENSET_CNT30_ENABLE_Msk (1UL << PMU_INTENSET_CNT30_ENABLE_Pos) /*!< PMU INTENSET: Event Counter 30 Interrupt Enable Set Mask */
-
-#define PMU_INTENSET_CYCCNT_ENABLE_Pos 31U /*!< PMU INTENSET: Cycle Counter Interrupt Enable Set Position */
-#define PMU_INTENSET_CYCCNT_ENABLE_Msk (1UL << PMU_INTENSET_CYCCNT_ENABLE_Pos) /*!< PMU INTENSET: Cycle Counter Interrupt Enable Set Mask */
-
-/** \brief PMU Interrupt Enable Clear Register Definitions */
-
-#define PMU_INTENCLR_CNT0_ENABLE_Pos 0U /*!< PMU INTENCLR: Event Counter 0 Interrupt Enable Clear Position */
-#define PMU_INTENCLR_CNT0_ENABLE_Msk (1UL /*<< PMU_INTENCLR_CNT0_ENABLE_Pos*/) /*!< PMU INTENCLR: Event Counter 0 Interrupt Enable Clear Mask */
-
-#define PMU_INTENCLR_CNT1_ENABLE_Pos 1U /*!< PMU INTENCLR: Event Counter 1 Interrupt Enable Clear Position */
-#define PMU_INTENCLR_CNT1_ENABLE_Msk (1UL << PMU_INTENCLR_CNT1_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 1 Interrupt Enable Clear Mask */
-
-#define PMU_INTENCLR_CNT2_ENABLE_Pos 2U /*!< PMU INTENCLR: Event Counter 2 Interrupt Enable Clear Position */
-#define PMU_INTENCLR_CNT2_ENABLE_Msk (1UL << PMU_INTENCLR_CNT2_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 2 Interrupt Enable Clear Mask */
-
-#define PMU_INTENCLR_CNT3_ENABLE_Pos 3U /*!< PMU INTENCLR: Event Counter 3 Interrupt Enable Clear Position */
-#define PMU_INTENCLR_CNT3_ENABLE_Msk (1UL << PMU_INTENCLR_CNT3_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 3 Interrupt Enable Clear Mask */
-
-#define PMU_INTENCLR_CNT4_ENABLE_Pos 4U /*!< PMU INTENCLR: Event Counter 4
Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT4_ENABLE_Msk (1UL << PMU_INTENCLR_CNT4_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 4 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT5_ENABLE_Pos 5U /*!< PMU INTENCLR: Event Counter 5 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT5_ENABLE_Msk (1UL << PMU_INTENCLR_CNT5_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 5 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT6_ENABLE_Pos 6U /*!< PMU INTENCLR: Event Counter 6 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT6_ENABLE_Msk (1UL << PMU_INTENCLR_CNT6_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 6 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT7_ENABLE_Pos 7U /*!< PMU INTENCLR: Event Counter 7 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT7_ENABLE_Msk (1UL << PMU_INTENCLR_CNT7_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 7 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT8_ENABLE_Pos 8U /*!< PMU INTENCLR: Event Counter 8 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT8_ENABLE_Msk (1UL << PMU_INTENCLR_CNT8_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 8 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT9_ENABLE_Pos 9U /*!< PMU INTENCLR: Event Counter 9 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT9_ENABLE_Msk (1UL << PMU_INTENCLR_CNT9_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 9 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT10_ENABLE_Pos 10U /*!< PMU INTENCLR: Event Counter 10 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT10_ENABLE_Msk (1UL << PMU_INTENCLR_CNT10_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 10 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT11_ENABLE_Pos 11U /*!< PMU INTENCLR: Event Counter 11 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT11_ENABLE_Msk (1UL << PMU_INTENCLR_CNT11_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 11 Interrupt Enable Clear Mask */ - -#define 
PMU_INTENCLR_CNT12_ENABLE_Pos 12U /*!< PMU INTENCLR: Event Counter 12 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT12_ENABLE_Msk (1UL << PMU_INTENCLR_CNT12_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 12 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT13_ENABLE_Pos 13U /*!< PMU INTENCLR: Event Counter 13 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT13_ENABLE_Msk (1UL << PMU_INTENCLR_CNT13_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 13 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT14_ENABLE_Pos 14U /*!< PMU INTENCLR: Event Counter 14 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT14_ENABLE_Msk (1UL << PMU_INTENCLR_CNT14_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 14 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT15_ENABLE_Pos 15U /*!< PMU INTENCLR: Event Counter 15 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT15_ENABLE_Msk (1UL << PMU_INTENCLR_CNT15_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 15 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT16_ENABLE_Pos 16U /*!< PMU INTENCLR: Event Counter 16 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT16_ENABLE_Msk (1UL << PMU_INTENCLR_CNT16_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 16 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT17_ENABLE_Pos 17U /*!< PMU INTENCLR: Event Counter 17 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT17_ENABLE_Msk (1UL << PMU_INTENCLR_CNT17_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 17 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT18_ENABLE_Pos 18U /*!< PMU INTENCLR: Event Counter 18 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT18_ENABLE_Msk (1UL << PMU_INTENCLR_CNT18_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 18 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT19_ENABLE_Pos 19U /*!< PMU INTENCLR: Event Counter 19 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT19_ENABLE_Msk (1UL << 
PMU_INTENCLR_CNT19_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 19 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT20_ENABLE_Pos 20U /*!< PMU INTENCLR: Event Counter 20 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT20_ENABLE_Msk (1UL << PMU_INTENCLR_CNT20_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 20 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT21_ENABLE_Pos 21U /*!< PMU INTENCLR: Event Counter 21 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT21_ENABLE_Msk (1UL << PMU_INTENCLR_CNT21_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 21 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT22_ENABLE_Pos 22U /*!< PMU INTENCLR: Event Counter 22 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT22_ENABLE_Msk (1UL << PMU_INTENCLR_CNT22_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 22 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT23_ENABLE_Pos 23U /*!< PMU INTENCLR: Event Counter 23 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT23_ENABLE_Msk (1UL << PMU_INTENCLR_CNT23_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 23 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT24_ENABLE_Pos 24U /*!< PMU INTENCLR: Event Counter 24 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT24_ENABLE_Msk (1UL << PMU_INTENCLR_CNT24_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 24 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT25_ENABLE_Pos 25U /*!< PMU INTENCLR: Event Counter 25 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT25_ENABLE_Msk (1UL << PMU_INTENCLR_CNT25_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 25 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT26_ENABLE_Pos 26U /*!< PMU INTENCLR: Event Counter 26 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT26_ENABLE_Msk (1UL << PMU_INTENCLR_CNT26_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 26 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT27_ENABLE_Pos 27U /*!< PMU INTENCLR: Event 
Counter 27 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT27_ENABLE_Msk (1UL << PMU_INTENCLR_CNT27_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 27 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT28_ENABLE_Pos 28U /*!< PMU INTENCLR: Event Counter 28 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT28_ENABLE_Msk (1UL << PMU_INTENCLR_CNT28_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 28 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT29_ENABLE_Pos 29U /*!< PMU INTENCLR: Event Counter 29 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT29_ENABLE_Msk (1UL << PMU_INTENCLR_CNT29_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 29 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CNT30_ENABLE_Pos 30U /*!< PMU INTENCLR: Event Counter 30 Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CNT30_ENABLE_Msk (1UL << PMU_INTENCLR_CNT30_ENABLE_Pos) /*!< PMU INTENCLR: Event Counter 30 Interrupt Enable Clear Mask */ - -#define PMU_INTENCLR_CYCCNT_ENABLE_Pos 31U /*!< PMU INTENCLR: Cycle Counter Interrupt Enable Clear Position */ -#define PMU_INTENCLR_CYCCNT_ENABLE_Msk (1UL << PMU_INTENCLR_CYCCNT_ENABLE_Pos) /*!< PMU INTENCLR: Cycle Counter Interrupt Enable Clear Mask */ - -/** \brief PMU Overflow Flag Status Set Register Definitions */ - -#define PMU_OVSSET_CNT0_STATUS_Pos 0U /*!< PMU OVSSET: Event Counter 0 Overflow Set Position */ -#define PMU_OVSSET_CNT0_STATUS_Msk (1UL /*<< PMU_OVSSET_CNT0_STATUS_Pos*/) /*!< PMU OVSSET: Event Counter 0 Overflow Set Mask */ - -#define PMU_OVSSET_CNT1_STATUS_Pos 1U /*!< PMU OVSSET: Event Counter 1 Overflow Set Position */ -#define PMU_OVSSET_CNT1_STATUS_Msk (1UL << PMU_OVSSET_CNT1_STATUS_Pos) /*!< PMU OVSSET: Event Counter 1 Overflow Set Mask */ - -#define PMU_OVSSET_CNT2_STATUS_Pos 2U /*!< PMU OVSSET: Event Counter 2 Overflow Set Position */ -#define PMU_OVSSET_CNT2_STATUS_Msk (1UL << PMU_OVSSET_CNT2_STATUS_Pos) /*!< PMU OVSSET: Event Counter 2 Overflow Set Mask */ - -#define 
PMU_OVSSET_CNT3_STATUS_Pos 3U /*!< PMU OVSSET: Event Counter 3 Overflow Set Position */ -#define PMU_OVSSET_CNT3_STATUS_Msk (1UL << PMU_OVSSET_CNT3_STATUS_Pos) /*!< PMU OVSSET: Event Counter 3 Overflow Set Mask */ - -#define PMU_OVSSET_CNT4_STATUS_Pos 4U /*!< PMU OVSSET: Event Counter 4 Overflow Set Position */ -#define PMU_OVSSET_CNT4_STATUS_Msk (1UL << PMU_OVSSET_CNT4_STATUS_Pos) /*!< PMU OVSSET: Event Counter 4 Overflow Set Mask */ - -#define PMU_OVSSET_CNT5_STATUS_Pos 5U /*!< PMU OVSSET: Event Counter 5 Overflow Set Position */ -#define PMU_OVSSET_CNT5_STATUS_Msk (1UL << PMU_OVSSET_CNT5_STATUS_Pos) /*!< PMU OVSSET: Event Counter 5 Overflow Set Mask */ - -#define PMU_OVSSET_CNT6_STATUS_Pos 6U /*!< PMU OVSSET: Event Counter 6 Overflow Set Position */ -#define PMU_OVSSET_CNT6_STATUS_Msk (1UL << PMU_OVSSET_CNT6_STATUS_Pos) /*!< PMU OVSSET: Event Counter 6 Overflow Set Mask */ - -#define PMU_OVSSET_CNT7_STATUS_Pos 7U /*!< PMU OVSSET: Event Counter 7 Overflow Set Position */ -#define PMU_OVSSET_CNT7_STATUS_Msk (1UL << PMU_OVSSET_CNT7_STATUS_Pos) /*!< PMU OVSSET: Event Counter 7 Overflow Set Mask */ - -#define PMU_OVSSET_CNT8_STATUS_Pos 8U /*!< PMU OVSSET: Event Counter 8 Overflow Set Position */ -#define PMU_OVSSET_CNT8_STATUS_Msk (1UL << PMU_OVSSET_CNT8_STATUS_Pos) /*!< PMU OVSSET: Event Counter 8 Overflow Set Mask */ - -#define PMU_OVSSET_CNT9_STATUS_Pos 9U /*!< PMU OVSSET: Event Counter 9 Overflow Set Position */ -#define PMU_OVSSET_CNT9_STATUS_Msk (1UL << PMU_OVSSET_CNT9_STATUS_Pos) /*!< PMU OVSSET: Event Counter 9 Overflow Set Mask */ - -#define PMU_OVSSET_CNT10_STATUS_Pos 10U /*!< PMU OVSSET: Event Counter 10 Overflow Set Position */ -#define PMU_OVSSET_CNT10_STATUS_Msk (1UL << PMU_OVSSET_CNT10_STATUS_Pos) /*!< PMU OVSSET: Event Counter 10 Overflow Set Mask */ - -#define PMU_OVSSET_CNT11_STATUS_Pos 11U /*!< PMU OVSSET: Event Counter 11 Overflow Set Position */ -#define PMU_OVSSET_CNT11_STATUS_Msk (1UL << PMU_OVSSET_CNT11_STATUS_Pos) /*!< PMU OVSSET: Event 
Counter 11 Overflow Set Mask */ - -#define PMU_OVSSET_CNT12_STATUS_Pos 12U /*!< PMU OVSSET: Event Counter 12 Overflow Set Position */ -#define PMU_OVSSET_CNT12_STATUS_Msk (1UL << PMU_OVSSET_CNT12_STATUS_Pos) /*!< PMU OVSSET: Event Counter 12 Overflow Set Mask */ - -#define PMU_OVSSET_CNT13_STATUS_Pos 13U /*!< PMU OVSSET: Event Counter 13 Overflow Set Position */ -#define PMU_OVSSET_CNT13_STATUS_Msk (1UL << PMU_OVSSET_CNT13_STATUS_Pos) /*!< PMU OVSSET: Event Counter 13 Overflow Set Mask */ - -#define PMU_OVSSET_CNT14_STATUS_Pos 14U /*!< PMU OVSSET: Event Counter 14 Overflow Set Position */ -#define PMU_OVSSET_CNT14_STATUS_Msk (1UL << PMU_OVSSET_CNT14_STATUS_Pos) /*!< PMU OVSSET: Event Counter 14 Overflow Set Mask */ - -#define PMU_OVSSET_CNT15_STATUS_Pos 15U /*!< PMU OVSSET: Event Counter 15 Overflow Set Position */ -#define PMU_OVSSET_CNT15_STATUS_Msk (1UL << PMU_OVSSET_CNT15_STATUS_Pos) /*!< PMU OVSSET: Event Counter 15 Overflow Set Mask */ - -#define PMU_OVSSET_CNT16_STATUS_Pos 16U /*!< PMU OVSSET: Event Counter 16 Overflow Set Position */ -#define PMU_OVSSET_CNT16_STATUS_Msk (1UL << PMU_OVSSET_CNT16_STATUS_Pos) /*!< PMU OVSSET: Event Counter 16 Overflow Set Mask */ - -#define PMU_OVSSET_CNT17_STATUS_Pos 17U /*!< PMU OVSSET: Event Counter 17 Overflow Set Position */ -#define PMU_OVSSET_CNT17_STATUS_Msk (1UL << PMU_OVSSET_CNT17_STATUS_Pos) /*!< PMU OVSSET: Event Counter 17 Overflow Set Mask */ - -#define PMU_OVSSET_CNT18_STATUS_Pos 18U /*!< PMU OVSSET: Event Counter 18 Overflow Set Position */ -#define PMU_OVSSET_CNT18_STATUS_Msk (1UL << PMU_OVSSET_CNT18_STATUS_Pos) /*!< PMU OVSSET: Event Counter 18 Overflow Set Mask */ - -#define PMU_OVSSET_CNT19_STATUS_Pos 19U /*!< PMU OVSSET: Event Counter 19 Overflow Set Position */ -#define PMU_OVSSET_CNT19_STATUS_Msk (1UL << PMU_OVSSET_CNT19_STATUS_Pos) /*!< PMU OVSSET: Event Counter 19 Overflow Set Mask */ - -#define PMU_OVSSET_CNT20_STATUS_Pos 20U /*!< PMU OVSSET: Event Counter 20 Overflow Set Position */ -#define 
PMU_OVSSET_CNT20_STATUS_Msk (1UL << PMU_OVSSET_CNT20_STATUS_Pos) /*!< PMU OVSSET: Event Counter 20 Overflow Set Mask */ - -#define PMU_OVSSET_CNT21_STATUS_Pos 21U /*!< PMU OVSSET: Event Counter 21 Overflow Set Position */ -#define PMU_OVSSET_CNT21_STATUS_Msk (1UL << PMU_OVSSET_CNT21_STATUS_Pos) /*!< PMU OVSSET: Event Counter 21 Overflow Set Mask */ - -#define PMU_OVSSET_CNT22_STATUS_Pos 22U /*!< PMU OVSSET: Event Counter 22 Overflow Set Position */ -#define PMU_OVSSET_CNT22_STATUS_Msk (1UL << PMU_OVSSET_CNT22_STATUS_Pos) /*!< PMU OVSSET: Event Counter 22 Overflow Set Mask */ - -#define PMU_OVSSET_CNT23_STATUS_Pos 23U /*!< PMU OVSSET: Event Counter 23 Overflow Set Position */ -#define PMU_OVSSET_CNT23_STATUS_Msk (1UL << PMU_OVSSET_CNT23_STATUS_Pos) /*!< PMU OVSSET: Event Counter 23 Overflow Set Mask */ - -#define PMU_OVSSET_CNT24_STATUS_Pos 24U /*!< PMU OVSSET: Event Counter 24 Overflow Set Position */ -#define PMU_OVSSET_CNT24_STATUS_Msk (1UL << PMU_OVSSET_CNT24_STATUS_Pos) /*!< PMU OVSSET: Event Counter 24 Overflow Set Mask */ - -#define PMU_OVSSET_CNT25_STATUS_Pos 25U /*!< PMU OVSSET: Event Counter 25 Overflow Set Position */ -#define PMU_OVSSET_CNT25_STATUS_Msk (1UL << PMU_OVSSET_CNT25_STATUS_Pos) /*!< PMU OVSSET: Event Counter 25 Overflow Set Mask */ - -#define PMU_OVSSET_CNT26_STATUS_Pos 26U /*!< PMU OVSSET: Event Counter 26 Overflow Set Position */ -#define PMU_OVSSET_CNT26_STATUS_Msk (1UL << PMU_OVSSET_CNT26_STATUS_Pos) /*!< PMU OVSSET: Event Counter 26 Overflow Set Mask */ - -#define PMU_OVSSET_CNT27_STATUS_Pos 27U /*!< PMU OVSSET: Event Counter 27 Overflow Set Position */ -#define PMU_OVSSET_CNT27_STATUS_Msk (1UL << PMU_OVSSET_CNT27_STATUS_Pos) /*!< PMU OVSSET: Event Counter 27 Overflow Set Mask */ - -#define PMU_OVSSET_CNT28_STATUS_Pos 28U /*!< PMU OVSSET: Event Counter 28 Overflow Set Position */ -#define PMU_OVSSET_CNT28_STATUS_Msk (1UL << PMU_OVSSET_CNT28_STATUS_Pos) /*!< PMU OVSSET: Event Counter 28 Overflow Set Mask */ - -#define 
PMU_OVSSET_CNT29_STATUS_Pos 29U /*!< PMU OVSSET: Event Counter 29 Overflow Set Position */ -#define PMU_OVSSET_CNT29_STATUS_Msk (1UL << PMU_OVSSET_CNT29_STATUS_Pos) /*!< PMU OVSSET: Event Counter 29 Overflow Set Mask */ - -#define PMU_OVSSET_CNT30_STATUS_Pos 30U /*!< PMU OVSSET: Event Counter 30 Overflow Set Position */ -#define PMU_OVSSET_CNT30_STATUS_Msk (1UL << PMU_OVSSET_CNT30_STATUS_Pos) /*!< PMU OVSSET: Event Counter 30 Overflow Set Mask */ - -#define PMU_OVSSET_CYCCNT_STATUS_Pos 31U /*!< PMU OVSSET: Cycle Counter Overflow Set Position */ -#define PMU_OVSSET_CYCCNT_STATUS_Msk (1UL << PMU_OVSSET_CYCCNT_STATUS_Pos) /*!< PMU OVSSET: Cycle Counter Overflow Set Mask */ - -/** \brief PMU Overflow Flag Status Clear Register Definitions */ - -#define PMU_OVSCLR_CNT0_STATUS_Pos 0U /*!< PMU OVSCLR: Event Counter 0 Overflow Clear Position */ -#define PMU_OVSCLR_CNT0_STATUS_Msk (1UL /*<< PMU_OVSCLR_CNT0_STATUS_Pos*/) /*!< PMU OVSCLR: Event Counter 0 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT1_STATUS_Pos 1U /*!< PMU OVSCLR: Event Counter 1 Overflow Clear Position */ -#define PMU_OVSCLR_CNT1_STATUS_Msk (1UL << PMU_OVSCLR_CNT1_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 1 Overflow Clear */ - -#define PMU_OVSCLR_CNT2_STATUS_Pos 2U /*!< PMU OVSCLR: Event Counter 2 Overflow Clear Position */ -#define PMU_OVSCLR_CNT2_STATUS_Msk (1UL << PMU_OVSCLR_CNT2_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 2 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT3_STATUS_Pos 3U /*!< PMU OVSCLR: Event Counter 3 Overflow Clear Position */ -#define PMU_OVSCLR_CNT3_STATUS_Msk (1UL << PMU_OVSCLR_CNT3_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 3 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT4_STATUS_Pos 4U /*!< PMU OVSCLR: Event Counter 4 Overflow Clear Position */ -#define PMU_OVSCLR_CNT4_STATUS_Msk (1UL << PMU_OVSCLR_CNT4_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 4 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT5_STATUS_Pos 5U /*!< PMU OVSCLR: Event Counter 5 Overflow Clear Position */ 
-#define PMU_OVSCLR_CNT5_STATUS_Msk (1UL << PMU_OVSCLR_CNT5_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 5 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT6_STATUS_Pos 6U /*!< PMU OVSCLR: Event Counter 6 Overflow Clear Position */ -#define PMU_OVSCLR_CNT6_STATUS_Msk (1UL << PMU_OVSCLR_CNT6_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 6 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT7_STATUS_Pos 7U /*!< PMU OVSCLR: Event Counter 7 Overflow Clear Position */ -#define PMU_OVSCLR_CNT7_STATUS_Msk (1UL << PMU_OVSCLR_CNT7_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 7 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT8_STATUS_Pos 8U /*!< PMU OVSCLR: Event Counter 8 Overflow Clear Position */ -#define PMU_OVSCLR_CNT8_STATUS_Msk (1UL << PMU_OVSCLR_CNT8_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 8 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT9_STATUS_Pos 9U /*!< PMU OVSCLR: Event Counter 9 Overflow Clear Position */ -#define PMU_OVSCLR_CNT9_STATUS_Msk (1UL << PMU_OVSCLR_CNT9_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 9 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT10_STATUS_Pos 10U /*!< PMU OVSCLR: Event Counter 10 Overflow Clear Position */ -#define PMU_OVSCLR_CNT10_STATUS_Msk (1UL << PMU_OVSCLR_CNT10_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 10 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT11_STATUS_Pos 11U /*!< PMU OVSCLR: Event Counter 11 Overflow Clear Position */ -#define PMU_OVSCLR_CNT11_STATUS_Msk (1UL << PMU_OVSCLR_CNT11_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 11 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT12_STATUS_Pos 12U /*!< PMU OVSCLR: Event Counter 12 Overflow Clear Position */ -#define PMU_OVSCLR_CNT12_STATUS_Msk (1UL << PMU_OVSCLR_CNT12_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 12 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT13_STATUS_Pos 13U /*!< PMU OVSCLR: Event Counter 13 Overflow Clear Position */ -#define PMU_OVSCLR_CNT13_STATUS_Msk (1UL << PMU_OVSCLR_CNT13_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 13 Overflow Clear Mask */ - -#define 
PMU_OVSCLR_CNT14_STATUS_Pos 14U /*!< PMU OVSCLR: Event Counter 14 Overflow Clear Position */ -#define PMU_OVSCLR_CNT14_STATUS_Msk (1UL << PMU_OVSCLR_CNT14_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 14 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT15_STATUS_Pos 15U /*!< PMU OVSCLR: Event Counter 15 Overflow Clear Position */ -#define PMU_OVSCLR_CNT15_STATUS_Msk (1UL << PMU_OVSCLR_CNT15_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 15 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT16_STATUS_Pos 16U /*!< PMU OVSCLR: Event Counter 16 Overflow Clear Position */ -#define PMU_OVSCLR_CNT16_STATUS_Msk (1UL << PMU_OVSCLR_CNT16_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 16 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT17_STATUS_Pos 17U /*!< PMU OVSCLR: Event Counter 17 Overflow Clear Position */ -#define PMU_OVSCLR_CNT17_STATUS_Msk (1UL << PMU_OVSCLR_CNT17_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 17 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT18_STATUS_Pos 18U /*!< PMU OVSCLR: Event Counter 18 Overflow Clear Position */ -#define PMU_OVSCLR_CNT18_STATUS_Msk (1UL << PMU_OVSCLR_CNT18_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 18 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT19_STATUS_Pos 19U /*!< PMU OVSCLR: Event Counter 19 Overflow Clear Position */ -#define PMU_OVSCLR_CNT19_STATUS_Msk (1UL << PMU_OVSCLR_CNT19_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 19 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT20_STATUS_Pos 20U /*!< PMU OVSCLR: Event Counter 20 Overflow Clear Position */ -#define PMU_OVSCLR_CNT20_STATUS_Msk (1UL << PMU_OVSCLR_CNT20_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 20 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT21_STATUS_Pos 21U /*!< PMU OVSCLR: Event Counter 21 Overflow Clear Position */ -#define PMU_OVSCLR_CNT21_STATUS_Msk (1UL << PMU_OVSCLR_CNT21_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 21 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT22_STATUS_Pos 22U /*!< PMU OVSCLR: Event Counter 22 Overflow Clear Position */ -#define 
PMU_OVSCLR_CNT22_STATUS_Msk (1UL << PMU_OVSCLR_CNT22_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 22 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT23_STATUS_Pos 23U /*!< PMU OVSCLR: Event Counter 23 Overflow Clear Position */ -#define PMU_OVSCLR_CNT23_STATUS_Msk (1UL << PMU_OVSCLR_CNT23_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 23 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT24_STATUS_Pos 24U /*!< PMU OVSCLR: Event Counter 24 Overflow Clear Position */ -#define PMU_OVSCLR_CNT24_STATUS_Msk (1UL << PMU_OVSCLR_CNT24_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 24 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT25_STATUS_Pos 25U /*!< PMU OVSCLR: Event Counter 25 Overflow Clear Position */ -#define PMU_OVSCLR_CNT25_STATUS_Msk (1UL << PMU_OVSCLR_CNT25_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 25 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT26_STATUS_Pos 26U /*!< PMU OVSCLR: Event Counter 26 Overflow Clear Position */ -#define PMU_OVSCLR_CNT26_STATUS_Msk (1UL << PMU_OVSCLR_CNT26_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 26 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT27_STATUS_Pos 27U /*!< PMU OVSCLR: Event Counter 27 Overflow Clear Position */ -#define PMU_OVSCLR_CNT27_STATUS_Msk (1UL << PMU_OVSCLR_CNT27_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 27 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT28_STATUS_Pos 28U /*!< PMU OVSCLR: Event Counter 28 Overflow Clear Position */ -#define PMU_OVSCLR_CNT28_STATUS_Msk (1UL << PMU_OVSCLR_CNT28_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 28 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT29_STATUS_Pos 29U /*!< PMU OVSCLR: Event Counter 29 Overflow Clear Position */ -#define PMU_OVSCLR_CNT29_STATUS_Msk (1UL << PMU_OVSCLR_CNT29_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 29 Overflow Clear Mask */ - -#define PMU_OVSCLR_CNT30_STATUS_Pos 30U /*!< PMU OVSCLR: Event Counter 30 Overflow Clear Position */ -#define PMU_OVSCLR_CNT30_STATUS_Msk (1UL << PMU_OVSCLR_CNT30_STATUS_Pos) /*!< PMU OVSCLR: Event Counter 30 Overflow Clear Mask 
*/ - -#define PMU_OVSCLR_CYCCNT_STATUS_Pos 31U /*!< PMU OVSCLR: Cycle Counter Overflow Clear Position */ -#define PMU_OVSCLR_CYCCNT_STATUS_Msk (1UL << PMU_OVSCLR_CYCCNT_STATUS_Pos) /*!< PMU OVSCLR: Cycle Counter Overflow Clear Mask */ - -/** \brief PMU Software Increment Counter */ - -#define PMU_SWINC_CNT0_Pos 0U /*!< PMU SWINC: Event Counter 0 Software Increment Position */ -#define PMU_SWINC_CNT0_Msk (1UL /*<< PMU_SWINC_CNT0_Pos */) /*!< PMU SWINC: Event Counter 0 Software Increment Mask */ - -#define PMU_SWINC_CNT1_Pos 1U /*!< PMU SWINC: Event Counter 1 Software Increment Position */ -#define PMU_SWINC_CNT1_Msk (1UL << PMU_SWINC_CNT1_Pos) /*!< PMU SWINC: Event Counter 1 Software Increment Mask */ - -#define PMU_SWINC_CNT2_Pos 2U /*!< PMU SWINC: Event Counter 2 Software Increment Position */ -#define PMU_SWINC_CNT2_Msk (1UL << PMU_SWINC_CNT2_Pos) /*!< PMU SWINC: Event Counter 2 Software Increment Mask */ - -#define PMU_SWINC_CNT3_Pos 3U /*!< PMU SWINC: Event Counter 3 Software Increment Position */ -#define PMU_SWINC_CNT3_Msk (1UL << PMU_SWINC_CNT3_Pos) /*!< PMU SWINC: Event Counter 3 Software Increment Mask */ - -#define PMU_SWINC_CNT4_Pos 4U /*!< PMU SWINC: Event Counter 4 Software Increment Position */ -#define PMU_SWINC_CNT4_Msk (1UL << PMU_SWINC_CNT4_Pos) /*!< PMU SWINC: Event Counter 4 Software Increment Mask */ - -#define PMU_SWINC_CNT5_Pos 5U /*!< PMU SWINC: Event Counter 5 Software Increment Position */ -#define PMU_SWINC_CNT5_Msk (1UL << PMU_SWINC_CNT5_Pos) /*!< PMU SWINC: Event Counter 5 Software Increment Mask */ - -#define PMU_SWINC_CNT6_Pos 6U /*!< PMU SWINC: Event Counter 6 Software Increment Position */ -#define PMU_SWINC_CNT6_Msk (1UL << PMU_SWINC_CNT6_Pos) /*!< PMU SWINC: Event Counter 6 Software Increment Mask */ - -#define PMU_SWINC_CNT7_Pos 7U /*!< PMU SWINC: Event Counter 7 Software Increment Position */ -#define PMU_SWINC_CNT7_Msk (1UL << PMU_SWINC_CNT7_Pos) /*!< PMU SWINC: Event Counter 7 Software Increment Mask */ - -#define 
PMU_SWINC_CNT8_Pos 8U /*!< PMU SWINC: Event Counter 8 Software Increment Position */ -#define PMU_SWINC_CNT8_Msk (1UL << PMU_SWINC_CNT8_Pos) /*!< PMU SWINC: Event Counter 8 Software Increment Mask */ - -#define PMU_SWINC_CNT9_Pos 9U /*!< PMU SWINC: Event Counter 9 Software Increment Position */ -#define PMU_SWINC_CNT9_Msk (1UL << PMU_SWINC_CNT9_Pos) /*!< PMU SWINC: Event Counter 9 Software Increment Mask */ - -#define PMU_SWINC_CNT10_Pos 10U /*!< PMU SWINC: Event Counter 10 Software Increment Position */ -#define PMU_SWINC_CNT10_Msk (1UL << PMU_SWINC_CNT10_Pos) /*!< PMU SWINC: Event Counter 10 Software Increment Mask */ - -#define PMU_SWINC_CNT11_Pos 11U /*!< PMU SWINC: Event Counter 11 Software Increment Position */ -#define PMU_SWINC_CNT11_Msk (1UL << PMU_SWINC_CNT11_Pos) /*!< PMU SWINC: Event Counter 11 Software Increment Mask */ - -#define PMU_SWINC_CNT12_Pos 12U /*!< PMU SWINC: Event Counter 12 Software Increment Position */ -#define PMU_SWINC_CNT12_Msk (1UL << PMU_SWINC_CNT12_Pos) /*!< PMU SWINC: Event Counter 12 Software Increment Mask */ - -#define PMU_SWINC_CNT13_Pos 13U /*!< PMU SWINC: Event Counter 13 Software Increment Position */ -#define PMU_SWINC_CNT13_Msk (1UL << PMU_SWINC_CNT13_Pos) /*!< PMU SWINC: Event Counter 13 Software Increment Mask */ - -#define PMU_SWINC_CNT14_Pos 14U /*!< PMU SWINC: Event Counter 14 Software Increment Position */ -#define PMU_SWINC_CNT14_Msk (1UL << PMU_SWINC_CNT14_Pos) /*!< PMU SWINC: Event Counter 14 Software Increment Mask */ - -#define PMU_SWINC_CNT15_Pos 15U /*!< PMU SWINC: Event Counter 15 Software Increment Position */ -#define PMU_SWINC_CNT15_Msk (1UL << PMU_SWINC_CNT15_Pos) /*!< PMU SWINC: Event Counter 15 Software Increment Mask */ - -#define PMU_SWINC_CNT16_Pos 16U /*!< PMU SWINC: Event Counter 16 Software Increment Position */ -#define PMU_SWINC_CNT16_Msk (1UL << PMU_SWINC_CNT16_Pos) /*!< PMU SWINC: Event Counter 16 Software Increment Mask */ - -#define PMU_SWINC_CNT17_Pos 17U /*!< PMU SWINC: Event Counter 17 
Software Increment Position */ -#define PMU_SWINC_CNT17_Msk (1UL << PMU_SWINC_CNT17_Pos) /*!< PMU SWINC: Event Counter 17 Software Increment Mask */ - -#define PMU_SWINC_CNT18_Pos 18U /*!< PMU SWINC: Event Counter 18 Software Increment Position */ -#define PMU_SWINC_CNT18_Msk (1UL << PMU_SWINC_CNT18_Pos) /*!< PMU SWINC: Event Counter 18 Software Increment Mask */ - -#define PMU_SWINC_CNT19_Pos 19U /*!< PMU SWINC: Event Counter 19 Software Increment Position */ -#define PMU_SWINC_CNT19_Msk (1UL << PMU_SWINC_CNT19_Pos) /*!< PMU SWINC: Event Counter 19 Software Increment Mask */ - -#define PMU_SWINC_CNT20_Pos 20U /*!< PMU SWINC: Event Counter 20 Software Increment Position */ -#define PMU_SWINC_CNT20_Msk (1UL << PMU_SWINC_CNT20_Pos) /*!< PMU SWINC: Event Counter 20 Software Increment Mask */ - -#define PMU_SWINC_CNT21_Pos 21U /*!< PMU SWINC: Event Counter 21 Software Increment Position */ -#define PMU_SWINC_CNT21_Msk (1UL << PMU_SWINC_CNT21_Pos) /*!< PMU SWINC: Event Counter 21 Software Increment Mask */ - -#define PMU_SWINC_CNT22_Pos 22U /*!< PMU SWINC: Event Counter 22 Software Increment Position */ -#define PMU_SWINC_CNT22_Msk (1UL << PMU_SWINC_CNT22_Pos) /*!< PMU SWINC: Event Counter 22 Software Increment Mask */ - -#define PMU_SWINC_CNT23_Pos 23U /*!< PMU SWINC: Event Counter 23 Software Increment Position */ -#define PMU_SWINC_CNT23_Msk (1UL << PMU_SWINC_CNT23_Pos) /*!< PMU SWINC: Event Counter 23 Software Increment Mask */ - -#define PMU_SWINC_CNT24_Pos 24U /*!< PMU SWINC: Event Counter 24 Software Increment Position */ -#define PMU_SWINC_CNT24_Msk (1UL << PMU_SWINC_CNT24_Pos) /*!< PMU SWINC: Event Counter 24 Software Increment Mask */ - -#define PMU_SWINC_CNT25_Pos 25U /*!< PMU SWINC: Event Counter 25 Software Increment Position */ -#define PMU_SWINC_CNT25_Msk (1UL << PMU_SWINC_CNT25_Pos) /*!< PMU SWINC: Event Counter 25 Software Increment Mask */ - -#define PMU_SWINC_CNT26_Pos 26U /*!< PMU SWINC: Event Counter 26 Software Increment Position */ -#define 
PMU_SWINC_CNT26_Msk (1UL << PMU_SWINC_CNT26_Pos) /*!< PMU SWINC: Event Counter 26 Software Increment Mask */ - -#define PMU_SWINC_CNT27_Pos 27U /*!< PMU SWINC: Event Counter 27 Software Increment Position */ -#define PMU_SWINC_CNT27_Msk (1UL << PMU_SWINC_CNT27_Pos) /*!< PMU SWINC: Event Counter 27 Software Increment Mask */ - -#define PMU_SWINC_CNT28_Pos 28U /*!< PMU SWINC: Event Counter 28 Software Increment Position */ -#define PMU_SWINC_CNT28_Msk (1UL << PMU_SWINC_CNT28_Pos) /*!< PMU SWINC: Event Counter 28 Software Increment Mask */ - -#define PMU_SWINC_CNT29_Pos 29U /*!< PMU SWINC: Event Counter 29 Software Increment Position */ -#define PMU_SWINC_CNT29_Msk (1UL << PMU_SWINC_CNT29_Pos) /*!< PMU SWINC: Event Counter 29 Software Increment Mask */ - -#define PMU_SWINC_CNT30_Pos 30U /*!< PMU SWINC: Event Counter 30 Software Increment Position */ -#define PMU_SWINC_CNT30_Msk (1UL << PMU_SWINC_CNT30_Pos) /*!< PMU SWINC: Event Counter 30 Software Increment Mask */ - -/** \brief PMU Control Register Definitions */ - -#define PMU_CTRL_ENABLE_Pos 0U /*!< PMU CTRL: ENABLE Position */ -#define PMU_CTRL_ENABLE_Msk (1UL /*<< PMU_CTRL_ENABLE_Pos*/) /*!< PMU CTRL: ENABLE Mask */ - -#define PMU_CTRL_EVENTCNT_RESET_Pos 1U /*!< PMU CTRL: Event Counter Reset Position */ -#define PMU_CTRL_EVENTCNT_RESET_Msk (1UL << PMU_CTRL_EVENTCNT_RESET_Pos) /*!< PMU CTRL: Event Counter Reset Mask */ - -#define PMU_CTRL_CYCCNT_RESET_Pos 2U /*!< PMU CTRL: Cycle Counter Reset Position */ -#define PMU_CTRL_CYCCNT_RESET_Msk (1UL << PMU_CTRL_CYCCNT_RESET_Pos) /*!< PMU CTRL: Cycle Counter Reset Mask */ - -#define PMU_CTRL_CYCCNT_DISABLE_Pos 5U /*!< PMU CTRL: Disable Cycle Counter Position */ -#define PMU_CTRL_CYCCNT_DISABLE_Msk (1UL << PMU_CTRL_CYCCNT_DISABLE_Pos) /*!< PMU CTRL: Disable Cycle Counter Mask */ - -#define PMU_CTRL_FRZ_ON_OV_Pos 9U /*!< PMU CTRL: Freeze-on-overflow Position */ -#define PMU_CTRL_FRZ_ON_OV_Msk (1UL << PMU_CTRL_FRZ_ON_OVERFLOW_Pos) /*!< PMU CTRL: Freeze-on-overflow Mask */ - 
-#define PMU_CTRL_TRACE_ON_OV_Pos 11U /*!< PMU CTRL: Trace-on-overflow Position */ -#define PMU_CTRL_TRACE_ON_OV_Msk (1UL << PMU_CTRL_TRACE_ON_OVERFLOW_Pos) /*!< PMU CTRL: Trace-on-overflow Mask */ - -/** \brief PMU Type Register Definitions */ - -#define PMU_TYPE_NUM_CNTS_Pos 0U /*!< PMU TYPE: Number of Counters Position */ -#define PMU_TYPE_NUM_CNTS_Msk (0xFFUL /*<< PMU_TYPE_NUM_CNTS_Pos*/) /*!< PMU TYPE: Number of Counters Mask */ - -#define PMU_TYPE_SIZE_CNTS_Pos 8U /*!< PMU TYPE: Size of Counters Position */ -#define PMU_TYPE_SIZE_CNTS_Msk (0x3FUL << PMU_TYPE_SIZE_CNTS_Pos) /*!< PMU TYPE: Size of Counters Mask */ - -#define PMU_TYPE_CYCCNT_PRESENT_Pos 14U /*!< PMU TYPE: Cycle Counter Present Position */ -#define PMU_TYPE_CYCCNT_PRESENT_Msk (1UL << PMU_TYPE_CYCCNT_PRESENT_Pos) /*!< PMU TYPE: Cycle Counter Present Mask */ - -#define PMU_TYPE_FRZ_OV_SUPPORT_Pos 21U /*!< PMU TYPE: Freeze-on-overflow Support Position */ -#define PMU_TYPE_FRZ_OV_SUPPORT_Msk (1UL << PMU_TYPE_FRZ_OV_SUPPORT_Pos) /*!< PMU TYPE: Freeze-on-overflow Support Mask */ - -#define PMU_TYPE_TRACE_ON_OV_SUPPORT_Pos 23U /*!< PMU TYPE: Trace-on-overflow Support Position */ -#define PMU_TYPE_TRACE_ON_OV_SUPPORT_Msk (1UL << PMU_TYPE_FRZ_OV_SUPPORT_Pos) /*!< PMU TYPE: Trace-on-overflow Support Mask */ - -/** \brief PMU Authentication Status Register Definitions */ - -#define PMU_AUTHSTATUS_NSID_Pos 0U /*!< PMU AUTHSTATUS: Non-secure Invasive Debug Position */ -#define PMU_AUTHSTATUS_NSID_Msk (0x3UL /*<< PMU_AUTHSTATUS_NSID_Pos*/) /*!< PMU AUTHSTATUS: Non-secure Invasive Debug Mask */ - -#define PMU_AUTHSTATUS_NSNID_Pos 2U /*!< PMU AUTHSTATUS: Non-secure Non-invasive Debug Position */ -#define PMU_AUTHSTATUS_NSNID_Msk (0x3UL << PMU_AUTHSTATUS_NSNID_Pos) /*!< PMU AUTHSTATUS: Non-secure Non-invasive Debug Mask */ - -#define PMU_AUTHSTATUS_SID_Pos 4U /*!< PMU AUTHSTATUS: Secure Invasive Debug Position */ -#define PMU_AUTHSTATUS_SID_Msk (0x3UL << PMU_AUTHSTATUS_SID_Pos) /*!< PMU AUTHSTATUS: Secure 
Invasive Debug Mask */
-
-#define PMU_AUTHSTATUS_SNID_Pos 6U /*!< PMU AUTHSTATUS: Secure Non-invasive Debug Position */
-#define PMU_AUTHSTATUS_SNID_Msk (0x3UL << PMU_AUTHSTATUS_SNID_Pos) /*!< PMU AUTHSTATUS: Secure Non-invasive Debug Mask */
-
-#define PMU_AUTHSTATUS_NSUID_Pos 16U /*!< PMU AUTHSTATUS: Non-secure Unprivileged Invasive Debug Position */
-#define PMU_AUTHSTATUS_NSUID_Msk (0x3UL << PMU_AUTHSTATUS_NSUID_Pos) /*!< PMU AUTHSTATUS: Non-secure Unprivileged Invasive Debug Mask */
-
-#define PMU_AUTHSTATUS_NSUNID_Pos 18U /*!< PMU AUTHSTATUS: Non-secure Unprivileged Non-invasive Debug Position */
-#define PMU_AUTHSTATUS_NSUNID_Msk (0x3UL << PMU_AUTHSTATUS_NSUNID_Pos) /*!< PMU AUTHSTATUS: Non-secure Unprivileged Non-invasive Debug Mask */
-
-#define PMU_AUTHSTATUS_SUID_Pos 20U /*!< PMU AUTHSTATUS: Secure Unprivileged Invasive Debug Position */
-#define PMU_AUTHSTATUS_SUID_Msk (0x3UL << PMU_AUTHSTATUS_SUID_Pos) /*!< PMU AUTHSTATUS: Secure Unprivileged Invasive Debug Mask */
-
-#define PMU_AUTHSTATUS_SUNID_Pos 22U /*!< PMU AUTHSTATUS: Secure Unprivileged Non-invasive Debug Position */
-#define PMU_AUTHSTATUS_SUNID_Msk (0x3UL << PMU_AUTHSTATUS_SUNID_Pos) /*!< PMU AUTHSTATUS: Secure Unprivileged Non-invasive Debug Mask */
-
-
-/*@} end of group CMSIS_PMU */
-#endif
-
-#if defined (__MPU_PRESENT) && (__MPU_PRESENT == 1U)
-/**
-  \ingroup CMSIS_core_register
-  \defgroup CMSIS_MPU Memory Protection Unit (MPU)
-  \brief Type definitions for the Memory Protection Unit (MPU)
-  @{
- */
-
-/**
-  \brief Structure type to access the Memory Protection Unit (MPU).
- */
-typedef struct
-{
-  __IM uint32_t TYPE; /*!< Offset: 0x000 (R/ ) MPU Type Register */
-  __IOM uint32_t CTRL; /*!< Offset: 0x004 (R/W) MPU Control Register */
-  __IOM uint32_t RNR; /*!< Offset: 0x008 (R/W) MPU Region Number Register */
-  __IOM uint32_t RBAR; /*!< Offset: 0x00C (R/W) MPU Region Base Address Register */
-  __IOM uint32_t RLAR; /*!< Offset: 0x010 (R/W) MPU Region Limit Address Register */
-  __IOM uint32_t RBAR_A1; /*!< Offset: 0x014 (R/W) MPU Region Base Address Register Alias 1 */
-  __IOM uint32_t RLAR_A1; /*!< Offset: 0x018 (R/W) MPU Region Limit Address Register Alias 1 */
-  __IOM uint32_t RBAR_A2; /*!< Offset: 0x01C (R/W) MPU Region Base Address Register Alias 2 */
-  __IOM uint32_t RLAR_A2; /*!< Offset: 0x020 (R/W) MPU Region Limit Address Register Alias 2 */
-  __IOM uint32_t RBAR_A3; /*!< Offset: 0x024 (R/W) MPU Region Base Address Register Alias 3 */
-  __IOM uint32_t RLAR_A3; /*!< Offset: 0x028 (R/W) MPU Region Limit Address Register Alias 3 */
-  uint32_t RESERVED0[1];
-  union {
-    __IOM uint32_t MAIR[2];
-    struct {
-      __IOM uint32_t MAIR0; /*!< Offset: 0x030 (R/W) MPU Memory Attribute Indirection Register 0 */
-      __IOM uint32_t MAIR1; /*!< Offset: 0x034 (R/W) MPU Memory Attribute Indirection Register 1 */
-    };
-  };
-} MPU_Type;
-
-#define MPU_TYPE_RALIASES 4U
-
-/* MPU Type Register Definitions */
-#define MPU_TYPE_IREGION_Pos 16U /*!< MPU TYPE: IREGION Position */
-#define MPU_TYPE_IREGION_Msk (0xFFUL << MPU_TYPE_IREGION_Pos) /*!< MPU TYPE: IREGION Mask */
-
-#define MPU_TYPE_DREGION_Pos 8U /*!< MPU TYPE: DREGION Position */
-#define MPU_TYPE_DREGION_Msk (0xFFUL << MPU_TYPE_DREGION_Pos) /*!< MPU TYPE: DREGION Mask */
-
-#define MPU_TYPE_SEPARATE_Pos 0U /*!< MPU TYPE: SEPARATE Position */
-#define MPU_TYPE_SEPARATE_Msk (1UL /*<< MPU_TYPE_SEPARATE_Pos*/) /*!< MPU TYPE: SEPARATE Mask */
-
-/* MPU Control Register Definitions */
-#define MPU_CTRL_PRIVDEFENA_Pos 2U /*!< MPU CTRL: PRIVDEFENA Position */
-#define MPU_CTRL_PRIVDEFENA_Msk (1UL << MPU_CTRL_PRIVDEFENA_Pos) /*!< MPU CTRL: PRIVDEFENA Mask */
-
-#define MPU_CTRL_HFNMIENA_Pos 1U /*!< MPU CTRL: HFNMIENA Position */
-#define MPU_CTRL_HFNMIENA_Msk (1UL << MPU_CTRL_HFNMIENA_Pos) /*!< MPU CTRL: HFNMIENA Mask */
-
-#define MPU_CTRL_ENABLE_Pos 0U /*!< MPU CTRL: ENABLE Position */
-#define MPU_CTRL_ENABLE_Msk (1UL /*<< MPU_CTRL_ENABLE_Pos*/) /*!< MPU CTRL: ENABLE Mask */
-
-/* MPU Region Number Register Definitions */
-#define MPU_RNR_REGION_Pos 0U /*!< MPU RNR: REGION Position */
-#define MPU_RNR_REGION_Msk (0xFFUL /*<< MPU_RNR_REGION_Pos*/) /*!< MPU RNR: REGION Mask */
-
-/* MPU Region Base Address Register Definitions */
-#define MPU_RBAR_BASE_Pos 5U /*!< MPU RBAR: BASE Position */
-#define MPU_RBAR_BASE_Msk (0x7FFFFFFUL << MPU_RBAR_BASE_Pos) /*!< MPU RBAR: BASE Mask */
-
-#define MPU_RBAR_SH_Pos 3U /*!< MPU RBAR: SH Position */
-#define MPU_RBAR_SH_Msk (0x3UL << MPU_RBAR_SH_Pos) /*!< MPU RBAR: SH Mask */
-
-#define MPU_RBAR_AP_Pos 1U /*!< MPU RBAR: AP Position */
-#define MPU_RBAR_AP_Msk (0x3UL << MPU_RBAR_AP_Pos) /*!< MPU RBAR: AP Mask */
-
-#define MPU_RBAR_XN_Pos 0U /*!< MPU RBAR: XN Position */
-#define MPU_RBAR_XN_Msk (01UL /*<< MPU_RBAR_XN_Pos*/) /*!< MPU RBAR: XN Mask */
-
-/* MPU Region Limit Address Register Definitions */
-#define MPU_RLAR_LIMIT_Pos 5U /*!< MPU RLAR: LIMIT Position */
-#define MPU_RLAR_LIMIT_Msk (0x7FFFFFFUL << MPU_RLAR_LIMIT_Pos) /*!< MPU RLAR: LIMIT Mask */
-
-#define MPU_RLAR_PXN_Pos 4U /*!< MPU RLAR: PXN Position */
-#define MPU_RLAR_PXN_Msk (1UL << MPU_RLAR_PXN_Pos) /*!< MPU RLAR: PXN Mask */
-
-#define MPU_RLAR_AttrIndx_Pos 1U /*!< MPU RLAR: AttrIndx Position */
-#define MPU_RLAR_AttrIndx_Msk (7UL << MPU_RLAR_AttrIndx_Pos) /*!< MPU RLAR: AttrIndx Mask */
-
-#define MPU_RLAR_EN_Pos 0U /*!< MPU RLAR: Region enable bit Position */
-#define MPU_RLAR_EN_Msk (1UL /*<< MPU_RLAR_EN_Pos*/) /*!< MPU RLAR: Region enable bit Disable Mask */
-
-/* MPU Memory Attribute Indirection Register 0 Definitions */
-#define MPU_MAIR0_Attr3_Pos 24U /*!< MPU MAIR0: Attr3 Position */
-#define MPU_MAIR0_Attr3_Msk (0xFFUL << MPU_MAIR0_Attr3_Pos) /*!< MPU MAIR0: Attr3 Mask */
-
-#define MPU_MAIR0_Attr2_Pos 16U /*!< MPU MAIR0: Attr2 Position */
-#define MPU_MAIR0_Attr2_Msk (0xFFUL << MPU_MAIR0_Attr2_Pos) /*!< MPU MAIR0: Attr2 Mask */
-
-#define MPU_MAIR0_Attr1_Pos 8U /*!< MPU MAIR0: Attr1 Position */
-#define MPU_MAIR0_Attr1_Msk (0xFFUL << MPU_MAIR0_Attr1_Pos) /*!< MPU MAIR0: Attr1 Mask */
-
-#define MPU_MAIR0_Attr0_Pos 0U /*!< MPU MAIR0: Attr0 Position */
-#define MPU_MAIR0_Attr0_Msk (0xFFUL /*<< MPU_MAIR0_Attr0_Pos*/) /*!< MPU MAIR0: Attr0 Mask */
-
-/* MPU Memory Attribute Indirection Register 1 Definitions */
-#define MPU_MAIR1_Attr7_Pos 24U /*!< MPU MAIR1: Attr7 Position */
-#define MPU_MAIR1_Attr7_Msk (0xFFUL << MPU_MAIR1_Attr7_Pos) /*!< MPU MAIR1: Attr7 Mask */
-
-#define MPU_MAIR1_Attr6_Pos 16U /*!< MPU MAIR1: Attr6 Position */
-#define MPU_MAIR1_Attr6_Msk (0xFFUL << MPU_MAIR1_Attr6_Pos) /*!< MPU MAIR1: Attr6 Mask */
-
-#define MPU_MAIR1_Attr5_Pos 8U /*!< MPU MAIR1: Attr5 Position */
-#define MPU_MAIR1_Attr5_Msk (0xFFUL << MPU_MAIR1_Attr5_Pos) /*!< MPU MAIR1: Attr5 Mask */
-
-#define MPU_MAIR1_Attr4_Pos 0U /*!< MPU MAIR1: Attr4 Position */
-#define MPU_MAIR1_Attr4_Msk (0xFFUL /*<< MPU_MAIR1_Attr4_Pos*/) /*!< MPU MAIR1: Attr4 Mask */
-
-/*@} end of group CMSIS_MPU */
-#endif
-
-
-#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U)
-/**
-  \ingroup CMSIS_core_register
-  \defgroup CMSIS_SAU Security Attribution Unit (SAU)
-  \brief Type definitions for the Security Attribution Unit (SAU)
-  @{
- */
-
-/**
-  \brief Structure type to access the Security Attribution Unit (SAU).
- */
-typedef struct
-{
-  __IOM uint32_t CTRL; /*!< Offset: 0x000 (R/W) SAU Control Register */
-  __IM uint32_t TYPE; /*!< Offset: 0x004 (R/ ) SAU Type Register */
-#if defined (__SAUREGION_PRESENT) && (__SAUREGION_PRESENT == 1U)
-  __IOM uint32_t RNR; /*!< Offset: 0x008 (R/W) SAU Region Number Register */
-  __IOM uint32_t RBAR; /*!< Offset: 0x00C (R/W) SAU Region Base Address Register */
-  __IOM uint32_t RLAR; /*!< Offset: 0x010 (R/W) SAU Region Limit Address Register */
-#else
-  uint32_t RESERVED0[3];
-#endif
-  __IOM uint32_t SFSR; /*!< Offset: 0x014 (R/W) Secure Fault Status Register */
-  __IOM uint32_t SFAR; /*!< Offset: 0x018 (R/W) Secure Fault Address Register */
-} SAU_Type;
-
-/* SAU Control Register Definitions */
-#define SAU_CTRL_ALLNS_Pos 1U /*!< SAU CTRL: ALLNS Position */
-#define SAU_CTRL_ALLNS_Msk (1UL << SAU_CTRL_ALLNS_Pos) /*!< SAU CTRL: ALLNS Mask */
-
-#define SAU_CTRL_ENABLE_Pos 0U /*!< SAU CTRL: ENABLE Position */
-#define SAU_CTRL_ENABLE_Msk (1UL /*<< SAU_CTRL_ENABLE_Pos*/) /*!< SAU CTRL: ENABLE Mask */
-
-/* SAU Type Register Definitions */
-#define SAU_TYPE_SREGION_Pos 0U /*!< SAU TYPE: SREGION Position */
-#define SAU_TYPE_SREGION_Msk (0xFFUL /*<< SAU_TYPE_SREGION_Pos*/) /*!< SAU TYPE: SREGION Mask */
-
-#if defined (__SAUREGION_PRESENT) && (__SAUREGION_PRESENT == 1U)
-/* SAU Region Number Register Definitions */
-#define SAU_RNR_REGION_Pos 0U /*!< SAU RNR: REGION Position */
-#define SAU_RNR_REGION_Msk (0xFFUL /*<< SAU_RNR_REGION_Pos*/) /*!< SAU RNR: REGION Mask */
-
-/* SAU Region Base Address Register Definitions */
-#define SAU_RBAR_BADDR_Pos 5U /*!< SAU RBAR: BADDR Position */
-#define SAU_RBAR_BADDR_Msk (0x7FFFFFFUL << SAU_RBAR_BADDR_Pos) /*!< SAU RBAR: BADDR Mask */
-
-/* SAU Region Limit Address Register Definitions */
-#define SAU_RLAR_LADDR_Pos 5U /*!< SAU RLAR: LADDR Position */
-#define SAU_RLAR_LADDR_Msk (0x7FFFFFFUL << SAU_RLAR_LADDR_Pos) /*!< SAU RLAR: LADDR Mask */
-
-#define SAU_RLAR_NSC_Pos 1U /*!< SAU RLAR: NSC Position */
-#define SAU_RLAR_NSC_Msk (1UL << SAU_RLAR_NSC_Pos) /*!< SAU RLAR: NSC Mask */
-
-#define SAU_RLAR_ENABLE_Pos 0U /*!< SAU RLAR: ENABLE Position */
-#define SAU_RLAR_ENABLE_Msk (1UL /*<< SAU_RLAR_ENABLE_Pos*/) /*!< SAU RLAR: ENABLE Mask */
-
-#endif /* defined (__SAUREGION_PRESENT) && (__SAUREGION_PRESENT == 1U) */
-
-/* Secure Fault Status Register Definitions */
-#define SAU_SFSR_LSERR_Pos 7U /*!< SAU SFSR: LSERR Position */
-#define SAU_SFSR_LSERR_Msk (1UL << SAU_SFSR_LSERR_Pos) /*!< SAU SFSR: LSERR Mask */
-
-#define SAU_SFSR_SFARVALID_Pos 6U /*!< SAU SFSR: SFARVALID Position */
-#define SAU_SFSR_SFARVALID_Msk (1UL << SAU_SFSR_SFARVALID_Pos) /*!< SAU SFSR: SFARVALID Mask */
-
-#define SAU_SFSR_LSPERR_Pos 5U /*!< SAU SFSR: LSPERR Position */
-#define SAU_SFSR_LSPERR_Msk (1UL << SAU_SFSR_LSPERR_Pos) /*!< SAU SFSR: LSPERR Mask */
-
-#define SAU_SFSR_INVTRAN_Pos 4U /*!< SAU SFSR: INVTRAN Position */
-#define SAU_SFSR_INVTRAN_Msk (1UL << SAU_SFSR_INVTRAN_Pos) /*!< SAU SFSR: INVTRAN Mask */
-
-#define SAU_SFSR_AUVIOL_Pos 3U /*!< SAU SFSR: AUVIOL Position */
-#define SAU_SFSR_AUVIOL_Msk (1UL << SAU_SFSR_AUVIOL_Pos) /*!< SAU SFSR: AUVIOL Mask */
-
-#define SAU_SFSR_INVER_Pos 2U /*!< SAU SFSR: INVER Position */
-#define SAU_SFSR_INVER_Msk (1UL << SAU_SFSR_INVER_Pos) /*!< SAU SFSR: INVER Mask */
-
-#define SAU_SFSR_INVIS_Pos 1U /*!< SAU SFSR: INVIS Position */
-#define SAU_SFSR_INVIS_Msk (1UL << SAU_SFSR_INVIS_Pos) /*!< SAU SFSR: INVIS Mask */
-
-#define SAU_SFSR_INVEP_Pos 0U /*!< SAU SFSR: INVEP Position */
-#define SAU_SFSR_INVEP_Msk (1UL /*<< SAU_SFSR_INVEP_Pos*/) /*!< SAU SFSR: INVEP Mask */
-
-/*@} end of group CMSIS_SAU */
-#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */
-
-
-/**
-  \ingroup CMSIS_core_register
-  \defgroup CMSIS_FPU Floating Point Unit (FPU)
-  \brief Type definitions for the Floating Point Unit (FPU)
-  @{
- */
-
-/**
-  \brief Structure type to access the Floating Point Unit (FPU).
- */
-typedef struct
-{
-  uint32_t RESERVED0[1U];
-  __IOM uint32_t FPCCR; /*!< Offset: 0x004 (R/W) Floating-Point Context Control Register */
-  __IOM uint32_t FPCAR; /*!< Offset: 0x008 (R/W) Floating-Point Context Address Register */
-  __IOM uint32_t FPDSCR; /*!< Offset: 0x00C (R/W) Floating-Point Default Status Control Register */
-  __IM uint32_t MVFR0; /*!< Offset: 0x010 (R/ ) Media and VFP Feature Register 0 */
-  __IM uint32_t MVFR1; /*!< Offset: 0x014 (R/ ) Media and VFP Feature Register 1 */
-  __IM uint32_t MVFR2; /*!< Offset: 0x018 (R/ ) Media and VFP Feature Register 2 */
-} FPU_Type;
-
-/* Floating-Point Context Control Register Definitions */
-#define FPU_FPCCR_ASPEN_Pos 31U /*!< FPCCR: ASPEN bit Position */
-#define FPU_FPCCR_ASPEN_Msk (1UL << FPU_FPCCR_ASPEN_Pos) /*!< FPCCR: ASPEN bit Mask */
-
-#define FPU_FPCCR_LSPEN_Pos 30U /*!< FPCCR: LSPEN Position */
-#define FPU_FPCCR_LSPEN_Msk (1UL << FPU_FPCCR_LSPEN_Pos) /*!< FPCCR: LSPEN bit Mask */
-
-#define FPU_FPCCR_LSPENS_Pos 29U /*!< FPCCR: LSPENS Position */
-#define FPU_FPCCR_LSPENS_Msk (1UL << FPU_FPCCR_LSPENS_Pos) /*!< FPCCR: LSPENS bit Mask */
-
-#define FPU_FPCCR_CLRONRET_Pos 28U /*!< FPCCR: CLRONRET Position */
-#define FPU_FPCCR_CLRONRET_Msk (1UL << FPU_FPCCR_CLRONRET_Pos) /*!< FPCCR: CLRONRET bit Mask */
-
-#define FPU_FPCCR_CLRONRETS_Pos 27U /*!< FPCCR: CLRONRETS Position */
-#define FPU_FPCCR_CLRONRETS_Msk (1UL << FPU_FPCCR_CLRONRETS_Pos) /*!< FPCCR: CLRONRETS bit Mask */
-
-#define FPU_FPCCR_TS_Pos 26U /*!< FPCCR: TS Position */
-#define FPU_FPCCR_TS_Msk (1UL << FPU_FPCCR_TS_Pos) /*!< FPCCR: TS bit Mask */
-
-#define FPU_FPCCR_UFRDY_Pos 10U /*!< FPCCR: UFRDY Position */
-#define FPU_FPCCR_UFRDY_Msk (1UL << FPU_FPCCR_UFRDY_Pos) /*!< FPCCR: UFRDY bit Mask */
-
-#define FPU_FPCCR_SPLIMVIOL_Pos 9U /*!< FPCCR: SPLIMVIOL Position */
-#define FPU_FPCCR_SPLIMVIOL_Msk (1UL << FPU_FPCCR_SPLIMVIOL_Pos) /*!< FPCCR: SPLIMVIOL bit Mask */
-
-#define FPU_FPCCR_MONRDY_Pos 8U /*!< FPCCR: MONRDY Position */
-#define FPU_FPCCR_MONRDY_Msk (1UL << FPU_FPCCR_MONRDY_Pos) /*!< FPCCR: MONRDY bit Mask */
-
-#define FPU_FPCCR_SFRDY_Pos 7U /*!< FPCCR: SFRDY Position */
-#define FPU_FPCCR_SFRDY_Msk (1UL << FPU_FPCCR_SFRDY_Pos) /*!< FPCCR: SFRDY bit Mask */
-
-#define FPU_FPCCR_BFRDY_Pos 6U /*!< FPCCR: BFRDY Position */
-#define FPU_FPCCR_BFRDY_Msk (1UL << FPU_FPCCR_BFRDY_Pos) /*!< FPCCR: BFRDY bit Mask */
-
-#define FPU_FPCCR_MMRDY_Pos 5U /*!< FPCCR: MMRDY Position */
-#define FPU_FPCCR_MMRDY_Msk (1UL << FPU_FPCCR_MMRDY_Pos) /*!< FPCCR: MMRDY bit Mask */
-
-#define FPU_FPCCR_HFRDY_Pos 4U /*!< FPCCR: HFRDY Position */
-#define FPU_FPCCR_HFRDY_Msk (1UL << FPU_FPCCR_HFRDY_Pos) /*!< FPCCR: HFRDY bit Mask */
-
-#define FPU_FPCCR_THREAD_Pos 3U /*!< FPCCR: processor mode bit Position */
-#define FPU_FPCCR_THREAD_Msk (1UL << FPU_FPCCR_THREAD_Pos) /*!< FPCCR: processor mode active bit Mask */
-
-#define FPU_FPCCR_S_Pos 2U /*!< FPCCR: Security status of the FP context bit Position */
-#define FPU_FPCCR_S_Msk (1UL << FPU_FPCCR_S_Pos) /*!< FPCCR: Security status of the FP context bit Mask */
-
-#define FPU_FPCCR_USER_Pos 1U /*!< FPCCR: privilege level bit Position */
-#define FPU_FPCCR_USER_Msk (1UL << FPU_FPCCR_USER_Pos) /*!< FPCCR: privilege level bit Mask */
-
-#define FPU_FPCCR_LSPACT_Pos 0U /*!< FPCCR: Lazy state preservation active bit Position */
-#define FPU_FPCCR_LSPACT_Msk (1UL /*<< FPU_FPCCR_LSPACT_Pos*/) /*!< FPCCR: Lazy state preservation active bit Mask */
-
-/* Floating-Point Context Address Register Definitions */
-#define FPU_FPCAR_ADDRESS_Pos 3U /*!< FPCAR: ADDRESS bit Position */
-#define FPU_FPCAR_ADDRESS_Msk (0x1FFFFFFFUL << FPU_FPCAR_ADDRESS_Pos) /*!< FPCAR: ADDRESS bit Mask */
-
-/* Floating-Point Default Status Control Register Definitions */
-#define FPU_FPDSCR_AHP_Pos 26U /*!< FPDSCR: AHP bit Position */
-#define FPU_FPDSCR_AHP_Msk (1UL << FPU_FPDSCR_AHP_Pos) /*!< FPDSCR: AHP bit Mask */
-
-#define FPU_FPDSCR_DN_Pos 25U /*!< FPDSCR: DN bit Position */
-#define FPU_FPDSCR_DN_Msk (1UL << FPU_FPDSCR_DN_Pos) /*!< FPDSCR: DN bit Mask */
-
-#define FPU_FPDSCR_FZ_Pos 24U /*!< FPDSCR: FZ bit Position */
-#define FPU_FPDSCR_FZ_Msk (1UL << FPU_FPDSCR_FZ_Pos) /*!< FPDSCR: FZ bit Mask */
-
-#define FPU_FPDSCR_RMode_Pos 22U /*!< FPDSCR: RMode bit Position */
-#define FPU_FPDSCR_RMode_Msk (3UL << FPU_FPDSCR_RMode_Pos) /*!< FPDSCR: RMode bit Mask */
-
-#define FPU_FPDSCR_FZ16_Pos 19U /*!< FPDSCR: FZ16 bit Position */
-#define FPU_FPDSCR_FZ16_Msk (1UL << FPU_FPDSCR_FZ16_Pos) /*!< FPDSCR: FZ16 bit Mask */
-
-#define FPU_FPDSCR_LTPSIZE_Pos 16U /*!< FPDSCR: LTPSIZE bit Position */
-#define FPU_FPDSCR_LTPSIZE_Msk (7UL << FPU_FPDSCR_LTPSIZE_Pos) /*!< FPDSCR: LTPSIZE bit Mask */
-
-/* Media and VFP Feature Register 0 Definitions */
-#define FPU_MVFR0_FPRound_Pos 28U /*!< MVFR0: FPRound bits Position */
-#define FPU_MVFR0_FPRound_Msk (0xFUL << FPU_MVFR0_FPRound_Pos) /*!< MVFR0: FPRound bits Mask */
-
-#define FPU_MVFR0_FPSqrt_Pos 20U /*!< MVFR0: FPSqrt bits Position */
-#define FPU_MVFR0_FPSqrt_Msk (0xFUL << FPU_MVFR0_FPSqrt_Pos) /*!< MVFR0: FPSqrt bits Mask */
-
-#define FPU_MVFR0_FPDivide_Pos 16U /*!< MVFR0: FPDivide bits Position */
-#define FPU_MVFR0_FPDivide_Msk (0xFUL << FPU_MVFR0_FPDivide_Pos) /*!< MVFR0: Divide bits Mask */
-
-#define FPU_MVFR0_FPDP_Pos 8U /*!< MVFR0: FPDP bits Position */
-#define FPU_MVFR0_FPDP_Msk (0xFUL << FPU_MVFR0_FPDP_Pos) /*!< MVFR0: FPDP bits Mask */
-
-#define FPU_MVFR0_FPSP_Pos 4U /*!< MVFR0: FPSP bits Position */
-#define FPU_MVFR0_FPSP_Msk (0xFUL << FPU_MVFR0_FPSP_Pos) /*!< MVFR0: FPSP bits Mask */
-
-#define FPU_MVFR0_SIMDReg_Pos 0U /*!< MVFR0: SIMDReg bits Position */
-#define FPU_MVFR0_SIMDReg_Msk (0xFUL /*<< FPU_MVFR0_SIMDReg_Pos*/) /*!< MVFR0: SIMDReg bits Mask */
-
-/* Media and VFP Feature Register 1 Definitions */
-#define FPU_MVFR1_FMAC_Pos 28U /*!< MVFR1: FMAC bits Position */
-#define FPU_MVFR1_FMAC_Msk (0xFUL << FPU_MVFR1_FMAC_Pos) /*!< MVFR1: FMAC bits Mask */
-
-#define FPU_MVFR1_FPHP_Pos 24U /*!< MVFR1: FPHP bits Position */
-#define FPU_MVFR1_FPHP_Msk (0xFUL << FPU_MVFR1_FPHP_Pos) /*!< MVFR1: FPHP bits Mask */
-
-#define FPU_MVFR1_FP16_Pos 20U /*!< MVFR1: FP16 bits Position */
-#define FPU_MVFR1_FP16_Msk (0xFUL << FPU_MVFR1_FP16_Pos) /*!< MVFR1: FP16 bits Mask */
-
-#define FPU_MVFR1_MVE_Pos 8U /*!< MVFR1: MVE bits Position */
-#define FPU_MVFR1_MVE_Msk (0xFUL << FPU_MVFR1_MVE_Pos) /*!< MVFR1: MVE bits Mask */
-
-#define FPU_MVFR1_FPDNaN_Pos 4U /*!< MVFR1: FPDNaN bits Position */
-#define FPU_MVFR1_FPDNaN_Msk (0xFUL << FPU_MVFR1_FPDNaN_Pos) /*!< MVFR1: FPDNaN bits Mask */
-
-#define FPU_MVFR1_FPFtZ_Pos 0U /*!< MVFR1: FPFtZ bits Position */
-#define FPU_MVFR1_FPFtZ_Msk (0xFUL /*<< FPU_MVFR1_FPFtZ_Pos*/) /*!< MVFR1: FPFtZ bits Mask */
-
-/* Media and VFP Feature Register 2 Definitions */
-#define FPU_MVFR2_FPMisc_Pos 4U /*!< MVFR2: FPMisc bits Position */
-#define FPU_MVFR2_FPMisc_Msk (0xFUL << FPU_MVFR2_FPMisc_Pos) /*!< MVFR2: FPMisc bits Mask */
-
-/*@} end of group CMSIS_FPU */
-
-/* CoreDebug is deprecated. replaced by DCB (Debug Control Block) */
-/**
-  \ingroup CMSIS_core_register
-  \defgroup CMSIS_CoreDebug Core Debug Registers (CoreDebug)
-  \brief Type definitions for the Core Debug Registers
-  @{
- */
-
-/**
-  \brief \deprecated Structure type to access the Core Debug Register (CoreDebug).
- */
-typedef struct
-{
-  __IOM uint32_t DHCSR; /*!< Offset: 0x000 (R/W) Debug Halting Control and Status Register */
-  __OM uint32_t DCRSR; /*!< Offset: 0x004 ( /W) Debug Core Register Selector Register */
-  __IOM uint32_t DCRDR; /*!< Offset: 0x008 (R/W) Debug Core Register Data Register */
-  __IOM uint32_t DEMCR; /*!< Offset: 0x00C (R/W) Debug Exception and Monitor Control Register */
-  __OM uint32_t DSCEMCR; /*!< Offset: 0x010 ( /W) Debug Set Clear Exception and Monitor Control Register */
-  __IOM uint32_t DAUTHCTRL; /*!< Offset: 0x014 (R/W) Debug Authentication Control Register */
-  __IOM uint32_t DSCSR; /*!< Offset: 0x018 (R/W) Debug Security Control and Status Register */
-} CoreDebug_Type;
-
-/* Debug Halting Control and Status Register Definitions */
-#define CoreDebug_DHCSR_DBGKEY_Pos 16U /*!< \deprecated CoreDebug DHCSR: DBGKEY Position */
-#define CoreDebug_DHCSR_DBGKEY_Msk (0xFFFFUL << CoreDebug_DHCSR_DBGKEY_Pos) /*!< \deprecated CoreDebug DHCSR: DBGKEY Mask */
-
-#define CoreDebug_DHCSR_S_RESTART_ST_Pos 26U /*!< \deprecated CoreDebug DHCSR: S_RESTART_ST Position */
-#define CoreDebug_DHCSR_S_RESTART_ST_Msk (1UL << CoreDebug_DHCSR_S_RESTART_ST_Pos) /*!< \deprecated CoreDebug DHCSR: S_RESTART_ST Mask */
-
-#define CoreDebug_DHCSR_S_RESET_ST_Pos 25U /*!< \deprecated CoreDebug DHCSR: S_RESET_ST Position */
-#define CoreDebug_DHCSR_S_RESET_ST_Msk (1UL << CoreDebug_DHCSR_S_RESET_ST_Pos) /*!< \deprecated CoreDebug DHCSR: S_RESET_ST Mask */
-
-#define CoreDebug_DHCSR_S_RETIRE_ST_Pos 24U /*!< \deprecated CoreDebug DHCSR: S_RETIRE_ST Position */
-#define CoreDebug_DHCSR_S_RETIRE_ST_Msk (1UL << CoreDebug_DHCSR_S_RETIRE_ST_Pos) /*!< \deprecated CoreDebug DHCSR: S_RETIRE_ST Mask */
-
-#define CoreDebug_DHCSR_S_FPD_Pos 23U /*!< \deprecated CoreDebug DHCSR: S_FPD Position */
-#define CoreDebug_DHCSR_S_FPD_Msk (1UL << CoreDebug_DHCSR_S_FPD_Pos) /*!< \deprecated CoreDebug DHCSR: S_FPD Mask */
-
-#define CoreDebug_DHCSR_S_SUIDE_Pos 22U /*!< \deprecated CoreDebug DHCSR: S_SUIDE Position */
-#define CoreDebug_DHCSR_S_SUIDE_Msk (1UL << CoreDebug_DHCSR_S_SUIDE_Pos) /*!< \deprecated CoreDebug DHCSR: S_SUIDE Mask */
-
-#define CoreDebug_DHCSR_S_NSUIDE_Pos 21U /*!< \deprecated CoreDebug DHCSR: S_NSUIDE Position */
-#define CoreDebug_DHCSR_S_NSUIDE_Msk (1UL << CoreDebug_DHCSR_S_NSUIDE_Pos) /*!< \deprecated CoreDebug DHCSR: S_NSUIDE Mask */
-
-#define CoreDebug_DHCSR_S_SDE_Pos 20U /*!< \deprecated CoreDebug DHCSR: S_SDE Position */
-#define CoreDebug_DHCSR_S_SDE_Msk (1UL << CoreDebug_DHCSR_S_SDE_Pos) /*!< \deprecated CoreDebug DHCSR: S_SDE Mask */
-
-#define CoreDebug_DHCSR_S_LOCKUP_Pos 19U /*!< \deprecated CoreDebug DHCSR: S_LOCKUP Position */
-#define CoreDebug_DHCSR_S_LOCKUP_Msk (1UL << CoreDebug_DHCSR_S_LOCKUP_Pos) /*!< \deprecated CoreDebug DHCSR: S_LOCKUP Mask */
-
-#define CoreDebug_DHCSR_S_SLEEP_Pos 18U /*!< \deprecated CoreDebug DHCSR: S_SLEEP Position */
-#define CoreDebug_DHCSR_S_SLEEP_Msk (1UL << CoreDebug_DHCSR_S_SLEEP_Pos) /*!< \deprecated CoreDebug DHCSR: S_SLEEP Mask */
-
-#define CoreDebug_DHCSR_S_HALT_Pos 17U /*!< \deprecated CoreDebug DHCSR: S_HALT Position */
-#define CoreDebug_DHCSR_S_HALT_Msk (1UL << CoreDebug_DHCSR_S_HALT_Pos) /*!< \deprecated CoreDebug DHCSR: S_HALT Mask */
-
-#define CoreDebug_DHCSR_S_REGRDY_Pos 16U /*!< \deprecated CoreDebug DHCSR: S_REGRDY Position */
-#define CoreDebug_DHCSR_S_REGRDY_Msk (1UL << CoreDebug_DHCSR_S_REGRDY_Pos) /*!< \deprecated CoreDebug DHCSR: S_REGRDY Mask */
-
-#define CoreDebug_DHCSR_C_PMOV_Pos 6U /*!< \deprecated CoreDebug DHCSR: C_PMOV Position */
-#define CoreDebug_DHCSR_C_PMOV_Msk (1UL << CoreDebug_DHCSR_C_PMOV_Pos) /*!< \deprecated CoreDebug DHCSR: C_PMOV Mask */
-
-#define CoreDebug_DHCSR_C_SNAPSTALL_Pos 5U /*!< \deprecated CoreDebug DHCSR: C_SNAPSTALL Position */
-#define CoreDebug_DHCSR_C_SNAPSTALL_Msk (1UL << CoreDebug_DHCSR_C_SNAPSTALL_Pos) /*!< \deprecated CoreDebug DHCSR: C_SNAPSTALL Mask */
-
-#define CoreDebug_DHCSR_C_MASKINTS_Pos 3U /*!< \deprecated CoreDebug DHCSR: C_MASKINTS Position */
-#define CoreDebug_DHCSR_C_MASKINTS_Msk (1UL << CoreDebug_DHCSR_C_MASKINTS_Pos) /*!< \deprecated CoreDebug DHCSR: C_MASKINTS Mask */
-
-#define CoreDebug_DHCSR_C_STEP_Pos 2U /*!< \deprecated CoreDebug DHCSR: C_STEP Position */
-#define CoreDebug_DHCSR_C_STEP_Msk (1UL << CoreDebug_DHCSR_C_STEP_Pos) /*!< \deprecated CoreDebug DHCSR: C_STEP Mask */
-
-#define CoreDebug_DHCSR_C_HALT_Pos 1U /*!< \deprecated CoreDebug DHCSR: C_HALT Position */
-#define CoreDebug_DHCSR_C_HALT_Msk (1UL << CoreDebug_DHCSR_C_HALT_Pos) /*!< \deprecated CoreDebug DHCSR: C_HALT Mask */
-
-#define CoreDebug_DHCSR_C_DEBUGEN_Pos 0U /*!< \deprecated CoreDebug DHCSR: C_DEBUGEN Position */
-#define CoreDebug_DHCSR_C_DEBUGEN_Msk (1UL /*<< CoreDebug_DHCSR_C_DEBUGEN_Pos*/) /*!< \deprecated CoreDebug DHCSR: C_DEBUGEN Mask */
-
-/* Debug Core Register Selector Register Definitions */
-#define CoreDebug_DCRSR_REGWnR_Pos 16U /*!< \deprecated CoreDebug DCRSR: REGWnR Position */
-#define CoreDebug_DCRSR_REGWnR_Msk (1UL << CoreDebug_DCRSR_REGWnR_Pos) /*!< \deprecated CoreDebug DCRSR: REGWnR Mask */
-
-#define CoreDebug_DCRSR_REGSEL_Pos 0U /*!< \deprecated CoreDebug DCRSR: REGSEL Position */
-#define CoreDebug_DCRSR_REGSEL_Msk (0x1FUL /*<< CoreDebug_DCRSR_REGSEL_Pos*/) /*!< \deprecated CoreDebug DCRSR: REGSEL Mask */
-
-/* Debug Exception and Monitor Control Register Definitions */
-#define CoreDebug_DEMCR_TRCENA_Pos 24U /*!< \deprecated CoreDebug DEMCR: TRCENA Position */
-#define CoreDebug_DEMCR_TRCENA_Msk (1UL << CoreDebug_DEMCR_TRCENA_Pos) /*!< \deprecated CoreDebug DEMCR: TRCENA Mask */
-
-#define CoreDebug_DEMCR_MON_REQ_Pos 19U /*!< \deprecated CoreDebug DEMCR: MON_REQ Position */
-#define CoreDebug_DEMCR_MON_REQ_Msk (1UL << CoreDebug_DEMCR_MON_REQ_Pos) /*!< \deprecated CoreDebug DEMCR: MON_REQ Mask */
-
-#define CoreDebug_DEMCR_MON_STEP_Pos 18U /*!< \deprecated CoreDebug DEMCR: MON_STEP Position */
-#define CoreDebug_DEMCR_MON_STEP_Msk (1UL << CoreDebug_DEMCR_MON_STEP_Pos) /*!< \deprecated CoreDebug DEMCR: MON_STEP Mask */
-
-#define CoreDebug_DEMCR_MON_PEND_Pos 17U /*!< \deprecated CoreDebug DEMCR: MON_PEND Position */
-#define CoreDebug_DEMCR_MON_PEND_Msk (1UL << CoreDebug_DEMCR_MON_PEND_Pos) /*!< \deprecated CoreDebug DEMCR: MON_PEND Mask */
-
-#define CoreDebug_DEMCR_MON_EN_Pos 16U /*!< \deprecated CoreDebug DEMCR: MON_EN Position */
-#define CoreDebug_DEMCR_MON_EN_Msk (1UL << CoreDebug_DEMCR_MON_EN_Pos) /*!< \deprecated CoreDebug DEMCR: MON_EN Mask */
-
-#define CoreDebug_DEMCR_VC_HARDERR_Pos 10U /*!< \deprecated CoreDebug DEMCR: VC_HARDERR Position */
-#define CoreDebug_DEMCR_VC_HARDERR_Msk (1UL << CoreDebug_DEMCR_VC_HARDERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_HARDERR Mask */
-
-#define CoreDebug_DEMCR_VC_INTERR_Pos 9U /*!< \deprecated CoreDebug DEMCR: VC_INTERR Position */
-#define CoreDebug_DEMCR_VC_INTERR_Msk (1UL << CoreDebug_DEMCR_VC_INTERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_INTERR Mask */
-
-#define CoreDebug_DEMCR_VC_BUSERR_Pos 8U /*!< \deprecated CoreDebug DEMCR: VC_BUSERR Position */
-#define CoreDebug_DEMCR_VC_BUSERR_Msk (1UL << CoreDebug_DEMCR_VC_BUSERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_BUSERR Mask */
-
-#define CoreDebug_DEMCR_VC_STATERR_Pos 7U /*!< \deprecated CoreDebug DEMCR: VC_STATERR Position */
-#define CoreDebug_DEMCR_VC_STATERR_Msk (1UL << CoreDebug_DEMCR_VC_STATERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_STATERR Mask */
-
-#define CoreDebug_DEMCR_VC_CHKERR_Pos 6U /*!< \deprecated CoreDebug DEMCR: VC_CHKERR Position */
-#define CoreDebug_DEMCR_VC_CHKERR_Msk (1UL << CoreDebug_DEMCR_VC_CHKERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_CHKERR Mask */
-
-#define CoreDebug_DEMCR_VC_NOCPERR_Pos 5U /*!< \deprecated CoreDebug DEMCR: VC_NOCPERR Position */
-#define CoreDebug_DEMCR_VC_NOCPERR_Msk (1UL << CoreDebug_DEMCR_VC_NOCPERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_NOCPERR Mask */
-
-#define CoreDebug_DEMCR_VC_MMERR_Pos 4U /*!< \deprecated CoreDebug DEMCR: VC_MMERR Position */
-#define CoreDebug_DEMCR_VC_MMERR_Msk (1UL << CoreDebug_DEMCR_VC_MMERR_Pos) /*!< \deprecated CoreDebug DEMCR: VC_MMERR Mask */
-
-#define CoreDebug_DEMCR_VC_CORERESET_Pos 0U /*!< \deprecated CoreDebug DEMCR: VC_CORERESET Position */
-#define CoreDebug_DEMCR_VC_CORERESET_Msk (1UL /*<< CoreDebug_DEMCR_VC_CORERESET_Pos*/) /*!< \deprecated CoreDebug DEMCR: VC_CORERESET Mask */
-
-/* Debug Set Clear Exception and Monitor Control Register Definitions */
-#define CoreDebug_DSCEMCR_CLR_MON_REQ_Pos 19U /*!< \deprecated CoreDebug DSCEMCR: CLR_MON_REQ, Position */
-#define CoreDebug_DSCEMCR_CLR_MON_REQ_Msk (1UL << CoreDebug_DSCEMCR_CLR_MON_REQ_Pos) /*!< \deprecated CoreDebug DSCEMCR: CLR_MON_REQ, Mask */
-
-#define CoreDebug_DSCEMCR_CLR_MON_PEND_Pos 17U /*!< \deprecated CoreDebug DSCEMCR: CLR_MON_PEND, Position */
-#define CoreDebug_DSCEMCR_CLR_MON_PEND_Msk (1UL << CoreDebug_DSCEMCR_CLR_MON_PEND_Pos) /*!< \deprecated CoreDebug DSCEMCR: CLR_MON_PEND, Mask */
-
-#define CoreDebug_DSCEMCR_SET_MON_REQ_Pos 3U /*!< \deprecated CoreDebug DSCEMCR: SET_MON_REQ, Position */
-#define CoreDebug_DSCEMCR_SET_MON_REQ_Msk (1UL << CoreDebug_DSCEMCR_SET_MON_REQ_Pos) /*!< \deprecated CoreDebug DSCEMCR: SET_MON_REQ, Mask */
-
-#define CoreDebug_DSCEMCR_SET_MON_PEND_Pos 1U /*!< \deprecated CoreDebug DSCEMCR: SET_MON_PEND, Position */
-#define CoreDebug_DSCEMCR_SET_MON_PEND_Msk (1UL << CoreDebug_DSCEMCR_SET_MON_PEND_Pos) /*!< \deprecated CoreDebug DSCEMCR: SET_MON_PEND, Mask */
-
-/* Debug Authentication Control Register Definitions */
-#define CoreDebug_DAUTHCTRL_UIDEN_Pos 10U /*!< \deprecated CoreDebug DAUTHCTRL: UIDEN, Position */
-#define CoreDebug_DAUTHCTRL_UIDEN_Msk (1UL << CoreDebug_DAUTHCTRL_UIDEN_Pos) /*!< \deprecated CoreDebug DAUTHCTRL: UIDEN, Mask */
-
-#define CoreDebug_DAUTHCTRL_UIDAPEN_Pos 9U /*!< \deprecated CoreDebug DAUTHCTRL: UIDAPEN, Position */
-#define CoreDebug_DAUTHCTRL_UIDAPEN_Msk (1UL << CoreDebug_DAUTHCTRL_UIDAPEN_Pos) /*!< \deprecated CoreDebug DAUTHCTRL: UIDAPEN, Mask */
-
-#define CoreDebug_DAUTHCTRL_FSDMA_Pos 8U /*!< \deprecated CoreDebug DAUTHCTRL: FSDMA, Position */
-#define CoreDebug_DAUTHCTRL_FSDMA_Msk (1UL << CoreDebug_DAUTHCTRL_FSDMA_Pos) /*!< \deprecated CoreDebug DAUTHCTRL: FSDMA, Mask */
-
-#define CoreDebug_DAUTHCTRL_INTSPNIDEN_Pos 3U /*!< \deprecated CoreDebug DAUTHCTRL: INTSPNIDEN, Position */
-#define CoreDebug_DAUTHCTRL_INTSPNIDEN_Msk (1UL << CoreDebug_DAUTHCTRL_INTSPNIDEN_Pos) /*!< \deprecated CoreDebug DAUTHCTRL: INTSPNIDEN, Mask */
-
-#define CoreDebug_DAUTHCTRL_SPNIDENSEL_Pos 2U /*!< \deprecated CoreDebug DAUTHCTRL: SPNIDENSEL Position */
-#define CoreDebug_DAUTHCTRL_SPNIDENSEL_Msk (1UL << CoreDebug_DAUTHCTRL_SPNIDENSEL_Pos) /*!< \deprecated CoreDebug DAUTHCTRL: SPNIDENSEL Mask */
-
-#define CoreDebug_DAUTHCTRL_INTSPIDEN_Pos 1U /*!< \deprecated CoreDebug DAUTHCTRL: INTSPIDEN Position */
-#define CoreDebug_DAUTHCTRL_INTSPIDEN_Msk (1UL << CoreDebug_DAUTHCTRL_INTSPIDEN_Pos) /*!< \deprecated CoreDebug DAUTHCTRL: INTSPIDEN Mask */
-
-#define CoreDebug_DAUTHCTRL_SPIDENSEL_Pos 0U /*!< \deprecated CoreDebug DAUTHCTRL: SPIDENSEL Position */
-#define CoreDebug_DAUTHCTRL_SPIDENSEL_Msk (1UL /*<< CoreDebug_DAUTHCTRL_SPIDENSEL_Pos*/) /*!< \deprecated CoreDebug DAUTHCTRL: SPIDENSEL Mask */
-
-/* Debug Security Control and Status Register Definitions */
-#define CoreDebug_DSCSR_CDS_Pos 16U /*!< \deprecated CoreDebug DSCSR: CDS Position */
-#define CoreDebug_DSCSR_CDS_Msk (1UL << CoreDebug_DSCSR_CDS_Pos) /*!< \deprecated CoreDebug DSCSR: CDS Mask */
-
-#define CoreDebug_DSCSR_SBRSEL_Pos 1U /*!< \deprecated CoreDebug DSCSR: SBRSEL Position */
-#define CoreDebug_DSCSR_SBRSEL_Msk (1UL << CoreDebug_DSCSR_SBRSEL_Pos) /*!< \deprecated CoreDebug DSCSR: SBRSEL Mask */
-
-#define CoreDebug_DSCSR_SBRSELEN_Pos 0U /*!< \deprecated CoreDebug DSCSR: SBRSELEN Position */
-#define CoreDebug_DSCSR_SBRSELEN_Msk (1UL /*<< CoreDebug_DSCSR_SBRSELEN_Pos*/) /*!< \deprecated CoreDebug DSCSR: SBRSELEN Mask */
-
-/*@} end of group CMSIS_CoreDebug */
-
-
-/**
-  \ingroup CMSIS_core_register
-  \defgroup CMSIS_DCB Debug Control Block
-  \brief Type definitions for the Debug Control Block Registers
-  @{
- */
-
-/**
-  \brief Structure type to access the Debug Control Block Registers (DCB).
- */
-typedef struct
-{
-  __IOM uint32_t DHCSR; /*!< Offset: 0x000 (R/W) Debug Halting Control and Status Register */
-  __OM uint32_t DCRSR; /*!< Offset: 0x004 ( /W) Debug Core Register Selector Register */
-  __IOM uint32_t DCRDR; /*!< Offset: 0x008 (R/W) Debug Core Register Data Register */
-  __IOM uint32_t DEMCR; /*!< Offset: 0x00C (R/W) Debug Exception and Monitor Control Register */
-  __OM uint32_t DSCEMCR; /*!< Offset: 0x010 ( /W) Debug Set Clear Exception and Monitor Control Register */
-  __IOM uint32_t DAUTHCTRL; /*!< Offset: 0x014 (R/W) Debug Authentication Control Register */
-  __IOM uint32_t DSCSR; /*!< Offset: 0x018 (R/W) Debug Security Control and Status Register */
-} DCB_Type;
-
-/* DHCSR, Debug Halting Control and Status Register Definitions */
-#define DCB_DHCSR_DBGKEY_Pos 16U /*!< DCB DHCSR: Debug key Position */
-#define DCB_DHCSR_DBGKEY_Msk (0xFFFFUL << DCB_DHCSR_DBGKEY_Pos) /*!< DCB DHCSR: Debug key Mask */
-
-#define DCB_DHCSR_S_RESTART_ST_Pos 26U /*!< DCB DHCSR: Restart sticky status Position */
-#define DCB_DHCSR_S_RESTART_ST_Msk (0x1UL << DCB_DHCSR_S_RESTART_ST_Pos) /*!< DCB DHCSR: Restart sticky status Mask */
-
-#define DCB_DHCSR_S_RESET_ST_Pos 25U /*!< DCB DHCSR: Reset sticky status Position */
-#define DCB_DHCSR_S_RESET_ST_Msk (0x1UL << DCB_DHCSR_S_RESET_ST_Pos) /*!< DCB DHCSR: Reset sticky status Mask */
-
-#define DCB_DHCSR_S_RETIRE_ST_Pos 24U /*!< DCB DHCSR: Retire sticky status Position */
-#define DCB_DHCSR_S_RETIRE_ST_Msk (0x1UL << DCB_DHCSR_S_RETIRE_ST_Pos) /*!< DCB DHCSR: Retire sticky status Mask */
-
-#define DCB_DHCSR_S_FPD_Pos 23U /*!< DCB DHCSR: Floating-point registers Debuggable Position */
-#define DCB_DHCSR_S_FPD_Msk (0x1UL << DCB_DHCSR_S_FPD_Pos) /*!< DCB DHCSR: Floating-point registers Debuggable Mask */
-
-#define DCB_DHCSR_S_SUIDE_Pos 22U /*!< DCB DHCSR: Secure unprivileged halting debug enabled Position */
-#define DCB_DHCSR_S_SUIDE_Msk (0x1UL << DCB_DHCSR_S_SUIDE_Pos) /*!< DCB DHCSR: Secure unprivileged halting debug enabled Mask */
-
-#define DCB_DHCSR_S_NSUIDE_Pos 21U /*!< DCB DHCSR: Non-secure unprivileged halting debug enabled Position */
-#define DCB_DHCSR_S_NSUIDE_Msk (0x1UL << DCB_DHCSR_S_NSUIDE_Pos) /*!< DCB DHCSR: Non-secure unprivileged halting debug enabled Mask */
-
-#define DCB_DHCSR_S_SDE_Pos 20U /*!< DCB DHCSR: Secure debug enabled Position */
-#define DCB_DHCSR_S_SDE_Msk (0x1UL << DCB_DHCSR_S_SDE_Pos) /*!< DCB DHCSR: Secure debug enabled Mask */
-
-#define DCB_DHCSR_S_LOCKUP_Pos 19U /*!< DCB DHCSR: Lockup status Position */
-#define DCB_DHCSR_S_LOCKUP_Msk (0x1UL << DCB_DHCSR_S_LOCKUP_Pos) /*!< DCB DHCSR: Lockup status Mask */
-
-#define DCB_DHCSR_S_SLEEP_Pos 18U /*!< DCB DHCSR: Sleeping status Position */
-#define DCB_DHCSR_S_SLEEP_Msk (0x1UL << DCB_DHCSR_S_SLEEP_Pos) /*!< DCB DHCSR: Sleeping status Mask */
-
-#define DCB_DHCSR_S_HALT_Pos 17U /*!< DCB DHCSR: Halted status Position */
-#define DCB_DHCSR_S_HALT_Msk (0x1UL << DCB_DHCSR_S_HALT_Pos) /*!< DCB DHCSR: Halted status Mask */
-
-#define DCB_DHCSR_S_REGRDY_Pos 16U /*!< DCB DHCSR: Register ready status Position */
-#define DCB_DHCSR_S_REGRDY_Msk (0x1UL << DCB_DHCSR_S_REGRDY_Pos) /*!< DCB DHCSR: Register ready status Mask */
-
-#define DCB_DHCSR_C_PMOV_Pos 6U /*!< DCB DHCSR: Halt on PMU overflow control Position */
-#define DCB_DHCSR_C_PMOV_Msk (0x1UL << DCB_DHCSR_C_PMOV_Pos) /*!< DCB DHCSR: Halt on PMU overflow control Mask */
-
-#define DCB_DHCSR_C_SNAPSTALL_Pos 5U /*!< DCB DHCSR: Snap stall control Position */
-#define DCB_DHCSR_C_SNAPSTALL_Msk (0x1UL << DCB_DHCSR_C_SNAPSTALL_Pos) /*!< DCB DHCSR: Snap stall control Mask */
-
-#define DCB_DHCSR_C_MASKINTS_Pos 3U /*!< DCB DHCSR: Mask interrupts control Position */
-#define DCB_DHCSR_C_MASKINTS_Msk (0x1UL << DCB_DHCSR_C_MASKINTS_Pos) /*!< DCB DHCSR: Mask interrupts control Mask */
-
-#define DCB_DHCSR_C_STEP_Pos 2U /*!< DCB DHCSR: Step control Position */
-#define DCB_DHCSR_C_STEP_Msk (0x1UL << DCB_DHCSR_C_STEP_Pos) /*!< DCB DHCSR: Step control Mask */
-
-#define DCB_DHCSR_C_HALT_Pos 1U /*!< DCB DHCSR: Halt control Position */
-#define DCB_DHCSR_C_HALT_Msk (0x1UL << DCB_DHCSR_C_HALT_Pos) /*!< DCB DHCSR: Halt control Mask */
-
-#define DCB_DHCSR_C_DEBUGEN_Pos 0U /*!< DCB DHCSR: Debug enable control Position */
-#define DCB_DHCSR_C_DEBUGEN_Msk (0x1UL /*<< DCB_DHCSR_C_DEBUGEN_Pos*/) /*!< DCB DHCSR: Debug enable control Mask */
-
-/* DCRSR, Debug Core Register Select Register Definitions */
-#define DCB_DCRSR_REGWnR_Pos 16U /*!< DCB DCRSR: Register write/not-read Position */
-#define DCB_DCRSR_REGWnR_Msk (0x1UL << DCB_DCRSR_REGWnR_Pos) /*!< DCB DCRSR: Register write/not-read Mask */
-
-#define DCB_DCRSR_REGSEL_Pos 0U /*!< DCB DCRSR: Register selector Position */
-#define DCB_DCRSR_REGSEL_Msk (0x7FUL /*<< DCB_DCRSR_REGSEL_Pos*/) /*!< DCB DCRSR: Register selector Mask */
-
-/* DCRDR, Debug Core Register Data Register Definitions */
-#define DCB_DCRDR_DBGTMP_Pos 0U /*!< DCB DCRDR: Data temporary buffer Position */
-#define DCB_DCRDR_DBGTMP_Msk (0xFFFFFFFFUL /*<< DCB_DCRDR_DBGTMP_Pos*/) /*!< DCB DCRDR: Data temporary buffer Mask */
-
-/* DEMCR, Debug Exception and Monitor Control Register Definitions */
-#define DCB_DEMCR_TRCENA_Pos 24U /*!< DCB DEMCR: Trace enable Position */
-#define DCB_DEMCR_TRCENA_Msk (0x1UL << DCB_DEMCR_TRCENA_Pos) /*!< DCB DEMCR: Trace enable Mask */
-
-#define DCB_DEMCR_MONPRKEY_Pos 23U /*!< DCB DEMCR: Monitor pend req key Position */
-#define DCB_DEMCR_MONPRKEY_Msk (0x1UL << DCB_DEMCR_MONPRKEY_Pos) /*!< DCB DEMCR: Monitor pend req key Mask */
-
-#define DCB_DEMCR_UMON_EN_Pos 21U /*!< DCB DEMCR: Unprivileged monitor enable Position */
-#define DCB_DEMCR_UMON_EN_Msk (0x1UL << DCB_DEMCR_UMON_EN_Pos) /*!< DCB DEMCR: Unprivileged monitor enable Mask */
-
-#define DCB_DEMCR_SDME_Pos 20U /*!< DCB DEMCR: Secure DebugMonitor enable Position */
-#define DCB_DEMCR_SDME_Msk (0x1UL << DCB_DEMCR_SDME_Pos) /*!< DCB DEMCR: Secure DebugMonitor enable Mask */
-
-#define DCB_DEMCR_MON_REQ_Pos 19U /*!< DCB DEMCR: Monitor request Position */
-#define DCB_DEMCR_MON_REQ_Msk (0x1UL << DCB_DEMCR_MON_REQ_Pos) /*!< DCB DEMCR: Monitor request Mask */
-
-#define DCB_DEMCR_MON_STEP_Pos 18U /*!< DCB DEMCR: Monitor step Position */
-#define DCB_DEMCR_MON_STEP_Msk (0x1UL << DCB_DEMCR_MON_STEP_Pos) /*!< DCB DEMCR: Monitor step Mask */
-
-#define DCB_DEMCR_MON_PEND_Pos 17U /*!< DCB DEMCR: Monitor pend Position */
-#define DCB_DEMCR_MON_PEND_Msk (0x1UL << DCB_DEMCR_MON_PEND_Pos) /*!< DCB DEMCR: Monitor pend Mask */
-
-#define DCB_DEMCR_MON_EN_Pos 16U /*!< DCB DEMCR: Monitor enable Position */
-#define DCB_DEMCR_MON_EN_Msk (0x1UL << DCB_DEMCR_MON_EN_Pos) /*!< DCB DEMCR: Monitor enable Mask */
-
-#define DCB_DEMCR_VC_SFERR_Pos 11U /*!< DCB DEMCR: Vector Catch SecureFault Position */
-#define DCB_DEMCR_VC_SFERR_Msk (0x1UL << DCB_DEMCR_VC_SFERR_Pos) /*!< DCB DEMCR: Vector Catch SecureFault Mask */
-
-#define DCB_DEMCR_VC_HARDERR_Pos 10U /*!< DCB DEMCR: Vector Catch HardFault errors Position */
-#define DCB_DEMCR_VC_HARDERR_Msk (0x1UL << DCB_DEMCR_VC_HARDERR_Pos) /*!< DCB DEMCR: Vector Catch HardFault errors Mask */
-
-#define DCB_DEMCR_VC_INTERR_Pos 9U /*!< DCB DEMCR: Vector Catch interrupt errors Position */
-#define DCB_DEMCR_VC_INTERR_Msk (0x1UL << DCB_DEMCR_VC_INTERR_Pos) /*!< DCB DEMCR: Vector Catch interrupt errors Mask */
-
-#define DCB_DEMCR_VC_BUSERR_Pos 8U /*!< DCB DEMCR: Vector Catch BusFault errors Position */
-#define DCB_DEMCR_VC_BUSERR_Msk (0x1UL << DCB_DEMCR_VC_BUSERR_Pos) /*!< DCB DEMCR: Vector Catch BusFault errors Mask */
-
-#define DCB_DEMCR_VC_STATERR_Pos 7U /*!< DCB DEMCR: Vector Catch state errors Position */
-#define DCB_DEMCR_VC_STATERR_Msk (0x1UL << DCB_DEMCR_VC_STATERR_Pos) /*!< DCB DEMCR: Vector Catch state errors Mask */
- -#define DCB_DEMCR_VC_CHKERR_Pos 6U /*!< DCB DEMCR: Vector Catch check errors Position */ -#define DCB_DEMCR_VC_CHKERR_Msk (0x1UL << DCB_DEMCR_VC_CHKERR_Pos) /*!< DCB DEMCR: Vector Catch check errors Mask */ - -#define DCB_DEMCR_VC_NOCPERR_Pos 5U /*!< DCB DEMCR: Vector Catch NOCP errors Position */ -#define DCB_DEMCR_VC_NOCPERR_Msk (0x1UL << DCB_DEMCR_VC_NOCPERR_Pos) /*!< DCB DEMCR: Vector Catch NOCP errors Mask */ - -#define DCB_DEMCR_VC_MMERR_Pos 4U /*!< DCB DEMCR: Vector Catch MemManage errors Position */ -#define DCB_DEMCR_VC_MMERR_Msk (0x1UL << DCB_DEMCR_VC_MMERR_Pos) /*!< DCB DEMCR: Vector Catch MemManage errors Mask */ - -#define DCB_DEMCR_VC_CORERESET_Pos 0U /*!< DCB DEMCR: Vector Catch Core reset Position */ -#define DCB_DEMCR_VC_CORERESET_Msk (0x1UL /*<< DCB_DEMCR_VC_CORERESET_Pos*/) /*!< DCB DEMCR: Vector Catch Core reset Mask */ - -/* DSCEMCR, Debug Set Clear Exception and Monitor Control Register Definitions */ -#define DCB_DSCEMCR_CLR_MON_REQ_Pos 19U /*!< DCB DSCEMCR: Clear monitor request Position */ -#define DCB_DSCEMCR_CLR_MON_REQ_Msk (0x1UL << DCB_DSCEMCR_CLR_MON_REQ_Pos) /*!< DCB DSCEMCR: Clear monitor request Mask */ - -#define DCB_DSCEMCR_CLR_MON_PEND_Pos 17U /*!< DCB DSCEMCR: Clear monitor pend Position */ -#define DCB_DSCEMCR_CLR_MON_PEND_Msk (0x1UL << DCB_DSCEMCR_CLR_MON_PEND_Pos) /*!< DCB DSCEMCR: Clear monitor pend Mask */ - -#define DCB_DSCEMCR_SET_MON_REQ_Pos 3U /*!< DCB DSCEMCR: Set monitor request Position */ -#define DCB_DSCEMCR_SET_MON_REQ_Msk (0x1UL << DCB_DSCEMCR_SET_MON_REQ_Pos) /*!< DCB DSCEMCR: Set monitor request Mask */ - -#define DCB_DSCEMCR_SET_MON_PEND_Pos 1U /*!< DCB DSCEMCR: Set monitor pend Position */ -#define DCB_DSCEMCR_SET_MON_PEND_Msk (0x1UL << DCB_DSCEMCR_SET_MON_PEND_Pos) /*!< DCB DSCEMCR: Set monitor pend Mask */ - -/* DAUTHCTRL, Debug Authentication Control Register Definitions */ -#define DCB_DAUTHCTRL_UIDEN_Pos 10U /*!< DCB DAUTHCTRL: Unprivileged Invasive Debug Enable Position */ -#define 
DCB_DAUTHCTRL_UIDEN_Msk (0x1UL << DCB_DAUTHCTRL_UIDEN_Pos) /*!< DCB DAUTHCTRL: Unprivileged Invasive Debug Enable Mask */ - -#define DCB_DAUTHCTRL_UIDAPEN_Pos 9U /*!< DCB DAUTHCTRL: Unprivileged Invasive DAP Access Enable Position */ -#define DCB_DAUTHCTRL_UIDAPEN_Msk (0x1UL << DCB_DAUTHCTRL_UIDAPEN_Pos) /*!< DCB DAUTHCTRL: Unprivileged Invasive DAP Access Enable Mask */ - -#define DCB_DAUTHCTRL_FSDMA_Pos 8U /*!< DCB DAUTHCTRL: Force Secure DebugMonitor Allowed Position */ -#define DCB_DAUTHCTRL_FSDMA_Msk (0x1UL << DCB_DAUTHCTRL_FSDMA_Pos) /*!< DCB DAUTHCTRL: Force Secure DebugMonitor Allowed Mask */ - -#define DCB_DAUTHCTRL_INTSPNIDEN_Pos 3U /*!< DCB DAUTHCTRL: Internal Secure non-invasive debug enable Position */ -#define DCB_DAUTHCTRL_INTSPNIDEN_Msk (0x1UL << DCB_DAUTHCTRL_INTSPNIDEN_Pos) /*!< DCB DAUTHCTRL: Internal Secure non-invasive debug enable Mask */ - -#define DCB_DAUTHCTRL_SPNIDENSEL_Pos 2U /*!< DCB DAUTHCTRL: Secure non-invasive debug enable select Position */ -#define DCB_DAUTHCTRL_SPNIDENSEL_Msk (0x1UL << DCB_DAUTHCTRL_SPNIDENSEL_Pos) /*!< DCB DAUTHCTRL: Secure non-invasive debug enable select Mask */ - -#define DCB_DAUTHCTRL_INTSPIDEN_Pos 1U /*!< DCB DAUTHCTRL: Internal Secure invasive debug enable Position */ -#define DCB_DAUTHCTRL_INTSPIDEN_Msk (0x1UL << DCB_DAUTHCTRL_INTSPIDEN_Pos) /*!< DCB DAUTHCTRL: Internal Secure invasive debug enable Mask */ - -#define DCB_DAUTHCTRL_SPIDENSEL_Pos 0U /*!< DCB DAUTHCTRL: Secure invasive debug enable select Position */ -#define DCB_DAUTHCTRL_SPIDENSEL_Msk (0x1UL /*<< DCB_DAUTHCTRL_SPIDENSEL_Pos*/) /*!< DCB DAUTHCTRL: Secure invasive debug enable select Mask */ - -/* DSCSR, Debug Security Control and Status Register Definitions */ -#define DCB_DSCSR_CDSKEY_Pos 17U /*!< DCB DSCSR: CDS write-enable key Position */ -#define DCB_DSCSR_CDSKEY_Msk (0x1UL << DCB_DSCSR_CDSKEY_Pos) /*!< DCB DSCSR: CDS write-enable key Mask */ - -#define DCB_DSCSR_CDS_Pos 16U /*!< DCB DSCSR: Current domain Secure Position */ -#define 
DCB_DSCSR_CDS_Msk (0x1UL << DCB_DSCSR_CDS_Pos) /*!< DCB DSCSR: Current domain Secure Mask */ - -#define DCB_DSCSR_SBRSEL_Pos 1U /*!< DCB DSCSR: Secure banked register select Position */ -#define DCB_DSCSR_SBRSEL_Msk (0x1UL << DCB_DSCSR_SBRSEL_Pos) /*!< DCB DSCSR: Secure banked register select Mask */ - -#define DCB_DSCSR_SBRSELEN_Pos 0U /*!< DCB DSCSR: Secure banked register select enable Position */ -#define DCB_DSCSR_SBRSELEN_Msk (0x1UL /*<< DCB_DSCSR_SBRSELEN_Pos*/) /*!< DCB DSCSR: Secure banked register select enable Mask */ - -/*@} end of group CMSIS_DCB */ - - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_DIB Debug Identification Block - \brief Type definitions for the Debug Identification Block Registers - @{ - */ - -/** - \brief Structure type to access the Debug Identification Block Registers (DIB). - */ -typedef struct -{ - __OM uint32_t DLAR; /*!< Offset: 0x000 ( /W) SCS Software Lock Access Register */ - __IM uint32_t DLSR; /*!< Offset: 0x004 (R/ ) SCS Software Lock Status Register */ - __IM uint32_t DAUTHSTATUS; /*!< Offset: 0x008 (R/ ) Debug Authentication Status Register */ - __IM uint32_t DDEVARCH; /*!< Offset: 0x00C (R/ ) SCS Device Architecture Register */ - __IM uint32_t DDEVTYPE; /*!< Offset: 0x010 (R/ ) SCS Device Type Register */ -} DIB_Type; - -/* DLAR, SCS Software Lock Access Register Definitions */ -#define DIB_DLAR_KEY_Pos 0U /*!< DIB DLAR: KEY Position */ -#define DIB_DLAR_KEY_Msk (0xFFFFFFFFUL /*<< DIB_DLAR_KEY_Pos */) /*!< DIB DLAR: KEY Mask */ - -/* DLSR, SCS Software Lock Status Register Definitions */ -#define DIB_DLSR_nTT_Pos 2U /*!< DIB DLSR: Not thirty-two bit Position */ -#define DIB_DLSR_nTT_Msk (0x1UL << DIB_DLSR_nTT_Pos ) /*!< DIB DLSR: Not thirty-two bit Mask */ - -#define DIB_DLSR_SLK_Pos 1U /*!< DIB DLSR: Software Lock status Position */ -#define DIB_DLSR_SLK_Msk (0x1UL << DIB_DLSR_SLK_Pos ) /*!< DIB DLSR: Software Lock status Mask */ - -#define DIB_DLSR_SLI_Pos 0U /*!< DIB DLSR: Software Lock implemented 
Position */ -#define DIB_DLSR_SLI_Msk (0x1UL /*<< DIB_DLSR_SLI_Pos*/) /*!< DIB DLSR: Software Lock implemented Mask */ - -/* DAUTHSTATUS, Debug Authentication Status Register Definitions */ -#define DIB_DAUTHSTATUS_SUNID_Pos 22U /*!< DIB DAUTHSTATUS: Secure Unprivileged Non-invasive Debug Allowed Position */ -#define DIB_DAUTHSTATUS_SUNID_Msk (0x3UL << DIB_DAUTHSTATUS_SUNID_Pos ) /*!< DIB DAUTHSTATUS: Secure Unprivileged Non-invasive Debug Allowed Mask */ - -#define DIB_DAUTHSTATUS_SUID_Pos 20U /*!< DIB DAUTHSTATUS: Secure Unprivileged Invasive Debug Allowed Position */ -#define DIB_DAUTHSTATUS_SUID_Msk (0x3UL << DIB_DAUTHSTATUS_SUID_Pos ) /*!< DIB DAUTHSTATUS: Secure Unprivileged Invasive Debug Allowed Mask */ - -#define DIB_DAUTHSTATUS_NSUNID_Pos 18U /*!< DIB DAUTHSTATUS: Non-secure Unprivileged Non-invasive Debug Allo Position */ -#define DIB_DAUTHSTATUS_NSUNID_Msk (0x3UL << DIB_DAUTHSTATUS_NSUNID_Pos ) /*!< DIB DAUTHSTATUS: Non-secure Unprivileged Non-invasive Debug Allo Mask */ - -#define DIB_DAUTHSTATUS_NSUID_Pos 16U /*!< DIB DAUTHSTATUS: Non-secure Unprivileged Invasive Debug Allowed Position */ -#define DIB_DAUTHSTATUS_NSUID_Msk (0x3UL << DIB_DAUTHSTATUS_NSUID_Pos ) /*!< DIB DAUTHSTATUS: Non-secure Unprivileged Invasive Debug Allowed Mask */ - -#define DIB_DAUTHSTATUS_SNID_Pos 6U /*!< DIB DAUTHSTATUS: Secure Non-invasive Debug Position */ -#define DIB_DAUTHSTATUS_SNID_Msk (0x3UL << DIB_DAUTHSTATUS_SNID_Pos ) /*!< DIB DAUTHSTATUS: Secure Non-invasive Debug Mask */ - -#define DIB_DAUTHSTATUS_SID_Pos 4U /*!< DIB DAUTHSTATUS: Secure Invasive Debug Position */ -#define DIB_DAUTHSTATUS_SID_Msk (0x3UL << DIB_DAUTHSTATUS_SID_Pos ) /*!< DIB DAUTHSTATUS: Secure Invasive Debug Mask */ - -#define DIB_DAUTHSTATUS_NSNID_Pos 2U /*!< DIB DAUTHSTATUS: Non-secure Non-invasive Debug Position */ -#define DIB_DAUTHSTATUS_NSNID_Msk (0x3UL << DIB_DAUTHSTATUS_NSNID_Pos ) /*!< DIB DAUTHSTATUS: Non-secure Non-invasive Debug Mask */ - -#define DIB_DAUTHSTATUS_NSID_Pos 0U /*!< DIB 
DAUTHSTATUS: Non-secure Invasive Debug Position */ -#define DIB_DAUTHSTATUS_NSID_Msk (0x3UL /*<< DIB_DAUTHSTATUS_NSID_Pos*/) /*!< DIB DAUTHSTATUS: Non-secure Invasive Debug Mask */ - -/* DDEVARCH, SCS Device Architecture Register Definitions */ -#define DIB_DDEVARCH_ARCHITECT_Pos 21U /*!< DIB DDEVARCH: Architect Position */ -#define DIB_DDEVARCH_ARCHITECT_Msk (0x7FFUL << DIB_DDEVARCH_ARCHITECT_Pos ) /*!< DIB DDEVARCH: Architect Mask */ - -#define DIB_DDEVARCH_PRESENT_Pos 20U /*!< DIB DDEVARCH: DEVARCH Present Position */ -#define DIB_DDEVARCH_PRESENT_Msk (0x1FUL << DIB_DDEVARCH_PRESENT_Pos ) /*!< DIB DDEVARCH: DEVARCH Present Mask */ - -#define DIB_DDEVARCH_REVISION_Pos 16U /*!< DIB DDEVARCH: Revision Position */ -#define DIB_DDEVARCH_REVISION_Msk (0xFUL << DIB_DDEVARCH_REVISION_Pos ) /*!< DIB DDEVARCH: Revision Mask */ - -#define DIB_DDEVARCH_ARCHVER_Pos 12U /*!< DIB DDEVARCH: Architecture Version Position */ -#define DIB_DDEVARCH_ARCHVER_Msk (0xFUL << DIB_DDEVARCH_ARCHVER_Pos ) /*!< DIB DDEVARCH: Architecture Version Mask */ - -#define DIB_DDEVARCH_ARCHPART_Pos 0U /*!< DIB DDEVARCH: Architecture Part Position */ -#define DIB_DDEVARCH_ARCHPART_Msk (0xFFFUL /*<< DIB_DDEVARCH_ARCHPART_Pos*/) /*!< DIB DDEVARCH: Architecture Part Mask */ - -/* DDEVTYPE, SCS Device Type Register Definitions */ -#define DIB_DDEVTYPE_SUB_Pos 4U /*!< DIB DDEVTYPE: Sub-type Position */ -#define DIB_DDEVTYPE_SUB_Msk (0xFUL << DIB_DDEVTYPE_SUB_Pos ) /*!< DIB DDEVTYPE: Sub-type Mask */ - -#define DIB_DDEVTYPE_MAJOR_Pos 0U /*!< DIB DDEVTYPE: Major type Position */ -#define DIB_DDEVTYPE_MAJOR_Msk (0xFUL /*<< DIB_DDEVTYPE_MAJOR_Pos*/) /*!< DIB DDEVTYPE: Major type Mask */ - - -/*@} end of group CMSIS_DIB */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_core_bitfield Core register bit field macros - \brief Macros for use with bit field definitions (xxx_Pos, xxx_Msk). - @{ - */ - -/** - \brief Mask and shift a bit field value for use in a register bit range. 
- \param[in] field Name of the register bit field. - \param[in] value Value of the bit field. This parameter is interpreted as an uint32_t type. - \return Masked and shifted value. -*/ -#define _VAL2FLD(field, value) (((uint32_t)(value) << field ## _Pos) & field ## _Msk) - -/** - \brief Mask and shift a register value to extract a bit filed value. - \param[in] field Name of the register bit field. - \param[in] value Value of register. This parameter is interpreted as an uint32_t type. - \return Masked and shifted bit field value. -*/ -#define _FLD2VAL(field, value) (((uint32_t)(value) & field ## _Msk) >> field ## _Pos) - -/*@} end of group CMSIS_core_bitfield */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_core_base Core Definitions - \brief Definitions for base addresses, unions, and structures. - @{ - */ - -/* Memory mapping of Core Hardware */ - #define SCS_BASE (0xE000E000UL) /*!< System Control Space Base Address */ - #define ITM_BASE (0xE0000000UL) /*!< ITM Base Address */ - #define DWT_BASE (0xE0001000UL) /*!< DWT Base Address */ - #define PWRMODCTL_BASE (0xE001E300UL) /*!< Power Mode Control Base Address */ - #define TPI_BASE (0xE0040000UL) /*!< TPI Base Address */ - #define CoreDebug_BASE (0xE000EDF0UL) /*!< \deprecated Core Debug Base Address */ - #define DCB_BASE (0xE000EDF0UL) /*!< DCB Base Address */ - #define DIB_BASE (0xE000EFB0UL) /*!< DIB Base Address */ - #define SysTick_BASE (SCS_BASE + 0x0010UL) /*!< SysTick Base Address */ - #define NVIC_BASE (SCS_BASE + 0x0100UL) /*!< NVIC Base Address */ - #define SCB_BASE (SCS_BASE + 0x0D00UL) /*!< System Control Block Base Address */ - - #define SCnSCB ((SCnSCB_Type *) SCS_BASE ) /*!< System control Register not in SCB */ - #define SCB ((SCB_Type *) SCB_BASE ) /*!< SCB configuration struct */ - #define SysTick ((SysTick_Type *) SysTick_BASE ) /*!< SysTick configuration struct */ - #define NVIC ((NVIC_Type *) NVIC_BASE ) /*!< NVIC configuration struct */ - #define ITM ((ITM_Type *) ITM_BASE ) 
/*!< ITM configuration struct */ - #define DWT ((DWT_Type *) DWT_BASE ) /*!< DWT configuration struct */ - #define TPI ((TPI_Type *) TPI_BASE ) /*!< TPI configuration struct */ - #define PWRMODCTL ((PwrModCtl_Type *) PWRMODCTL_BASE ) /*!< Power Mode Control configuration struct */ - #define CoreDebug ((CoreDebug_Type *) CoreDebug_BASE ) /*!< \deprecated Core Debug configuration struct */ - #define DCB ((DCB_Type *) DCB_BASE ) /*!< DCB configuration struct */ - #define DIB ((DIB_Type *) DIB_BASE ) /*!< DIB configuration struct */ - - #if defined (__MPU_PRESENT) && (__MPU_PRESENT == 1U) - #define MPU_BASE (SCS_BASE + 0x0D90UL) /*!< Memory Protection Unit */ - #define MPU ((MPU_Type *) MPU_BASE ) /*!< Memory Protection Unit */ - #endif - - #if defined (__PMU_PRESENT) && (__PMU_PRESENT == 1U) - #define PMU_BASE (0xE0003000UL) /*!< PMU Base Address */ - #define PMU ((PMU_Type *) PMU_BASE ) /*!< PMU configuration struct */ - #endif - - #if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) - #define SAU_BASE (SCS_BASE + 0x0DD0UL) /*!< Security Attribution Unit */ - #define SAU ((SAU_Type *) SAU_BASE ) /*!< Security Attribution Unit */ - #endif - - #define FPU_BASE (SCS_BASE + 0x0F30UL) /*!< Floating Point Unit */ - #define FPU ((FPU_Type *) FPU_BASE ) /*!< Floating Point Unit */ - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) - #define SCS_BASE_NS (0xE002E000UL) /*!< System Control Space Base Address (non-secure address space) */ - #define CoreDebug_BASE_NS (0xE002EDF0UL) /*!< \deprecated Core Debug Base Address (non-secure address space) */ - #define DCB_BASE_NS (0xE002EDF0UL) /*!< DCB Base Address (non-secure address space) */ - #define DIB_BASE_NS (0xE002EFB0UL) /*!< DIB Base Address (non-secure address space) */ - #define SysTick_BASE_NS (SCS_BASE_NS + 0x0010UL) /*!< SysTick Base Address (non-secure address space) */ - #define NVIC_BASE_NS (SCS_BASE_NS + 0x0100UL) /*!< NVIC Base Address (non-secure address space) */ - #define SCB_BASE_NS 
(SCS_BASE_NS + 0x0D00UL) /*!< System Control Block Base Address (non-secure address space) */ - - #define SCnSCB_NS ((SCnSCB_Type *) SCS_BASE_NS ) /*!< System control Register not in SCB(non-secure address space) */ - #define SCB_NS ((SCB_Type *) SCB_BASE_NS ) /*!< SCB configuration struct (non-secure address space) */ - #define SysTick_NS ((SysTick_Type *) SysTick_BASE_NS ) /*!< SysTick configuration struct (non-secure address space) */ - #define NVIC_NS ((NVIC_Type *) NVIC_BASE_NS ) /*!< NVIC configuration struct (non-secure address space) */ - #define CoreDebug_NS ((CoreDebug_Type *) CoreDebug_BASE_NS) /*!< \deprecated Core Debug configuration struct (non-secure address space) */ - #define DCB_NS ((DCB_Type *) DCB_BASE_NS ) /*!< DCB configuration struct (non-secure address space) */ - #define DIB_NS ((DIB_Type *) DIB_BASE_NS ) /*!< DIB configuration struct (non-secure address space) */ - - #if defined (__MPU_PRESENT) && (__MPU_PRESENT == 1U) - #define MPU_BASE_NS (SCS_BASE_NS + 0x0D90UL) /*!< Memory Protection Unit (non-secure address space) */ - #define MPU_NS ((MPU_Type *) MPU_BASE_NS ) /*!< Memory Protection Unit (non-secure address space) */ - #endif - - #define FPU_BASE_NS (SCS_BASE_NS + 0x0F30UL) /*!< Floating Point Unit (non-secure address space) */ - #define FPU_NS ((FPU_Type *) FPU_BASE_NS ) /*!< Floating Point Unit (non-secure address space) */ - -#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */ -/*@} */ - - -/** - \ingroup CMSIS_core_register - \defgroup CMSIS_register_aliases Backwards Compatibility Aliases - \brief Register alias definitions for backwards compatibility. 
- @{ - */ -#define ID_ADR (ID_AFR) /*!< SCB Auxiliary Feature Register */ -/*@} */ - - -/******************************************************************************* - * Hardware Abstraction Layer - Core Function Interface contains: - - Core NVIC Functions - - Core SysTick Functions - - Core Debug Functions - - Core Register Access Functions - ******************************************************************************/ -/** - \defgroup CMSIS_Core_FunctionInterface Functions and Instructions Reference -*/ - - - -/* ########################## NVIC functions #################################### */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_NVICFunctions NVIC Functions - \brief Functions that manage interrupts and exceptions via the NVIC. - @{ - */ - -#ifdef CMSIS_NVIC_VIRTUAL - #ifndef CMSIS_NVIC_VIRTUAL_HEADER_FILE - #define CMSIS_NVIC_VIRTUAL_HEADER_FILE "cmsis_nvic_virtual.h" - #endif - #include CMSIS_NVIC_VIRTUAL_HEADER_FILE -#else - #define NVIC_SetPriorityGrouping __NVIC_SetPriorityGrouping - #define NVIC_GetPriorityGrouping __NVIC_GetPriorityGrouping - #define NVIC_EnableIRQ __NVIC_EnableIRQ - #define NVIC_GetEnableIRQ __NVIC_GetEnableIRQ - #define NVIC_DisableIRQ __NVIC_DisableIRQ - #define NVIC_GetPendingIRQ __NVIC_GetPendingIRQ - #define NVIC_SetPendingIRQ __NVIC_SetPendingIRQ - #define NVIC_ClearPendingIRQ __NVIC_ClearPendingIRQ - #define NVIC_GetActive __NVIC_GetActive - #define NVIC_SetPriority __NVIC_SetPriority - #define NVIC_GetPriority __NVIC_GetPriority - #define NVIC_SystemReset __NVIC_SystemReset -#endif /* CMSIS_NVIC_VIRTUAL */ - -#ifdef CMSIS_VECTAB_VIRTUAL - #ifndef CMSIS_VECTAB_VIRTUAL_HEADER_FILE - #define CMSIS_VECTAB_VIRTUAL_HEADER_FILE "cmsis_vectab_virtual.h" - #endif - #include CMSIS_VECTAB_VIRTUAL_HEADER_FILE -#else - #define NVIC_SetVector __NVIC_SetVector - #define NVIC_GetVector __NVIC_GetVector -#endif /* (CMSIS_VECTAB_VIRTUAL) */ - -#define NVIC_USER_IRQ_OFFSET 16 - - -/* Special LR values for 
Secure/Non-Secure call handling and exception handling */ - -/* Function Return Payload (from ARMv8-M Architecture Reference Manual) LR value on entry from Secure BLXNS */ -#define FNC_RETURN (0xFEFFFFFFUL) /* bit [0] ignored when processing a branch */ - -/* The following EXC_RETURN mask values are used to evaluate the LR on exception entry */ -#define EXC_RETURN_PREFIX (0xFF000000UL) /* bits [31:24] set to indicate an EXC_RETURN value */ -#define EXC_RETURN_S (0x00000040UL) /* bit [6] stack used to push registers: 0=Non-secure 1=Secure */ -#define EXC_RETURN_DCRS (0x00000020UL) /* bit [5] stacking rules for called registers: 0=skipped 1=saved */ -#define EXC_RETURN_FTYPE (0x00000010UL) /* bit [4] allocate stack for floating-point context: 0=done 1=skipped */ -#define EXC_RETURN_MODE (0x00000008UL) /* bit [3] processor mode for return: 0=Handler mode 1=Thread mode */ -#define EXC_RETURN_SPSEL (0x00000004UL) /* bit [2] stack pointer used to restore context: 0=MSP 1=PSP */ -#define EXC_RETURN_ES (0x00000001UL) /* bit [0] security state exception was taken to: 0=Non-secure 1=Secure */ - -/* Integrity Signature (from ARMv8-M Architecture Reference Manual) for exception context stacking */ -#if defined (__FPU_PRESENT) && (__FPU_PRESENT == 1U) /* Value for processors with floating-point extension: */ -#define EXC_INTEGRITY_SIGNATURE (0xFEFA125AUL) /* bit [0] SFTC must match LR bit[4] EXC_RETURN_FTYPE */ -#else -#define EXC_INTEGRITY_SIGNATURE (0xFEFA125BUL) /* Value for processors without floating-point extension */ -#endif - - -/** - \brief Set Priority Grouping - \details Sets the priority grouping field using the required unlock sequence. - The parameter PriorityGroup is assigned to the field SCB->AIRCR [10:8] PRIGROUP field. - Only values from 0..7 are used. - In case of a conflict between priority grouping and available - priority bits (__NVIC_PRIO_BITS), the smallest possible priority group is set. - \param [in] PriorityGroup Priority grouping field. 
- */ -__STATIC_INLINE void __NVIC_SetPriorityGrouping(uint32_t PriorityGroup) -{ - uint32_t reg_value; - uint32_t PriorityGroupTmp = (PriorityGroup & (uint32_t)0x07UL); /* only values 0..7 are used */ - - reg_value = SCB->AIRCR; /* read old register configuration */ - reg_value &= ~((uint32_t)(SCB_AIRCR_VECTKEY_Msk | SCB_AIRCR_PRIGROUP_Msk)); /* clear bits to change */ - reg_value = (reg_value | - ((uint32_t)0x5FAUL << SCB_AIRCR_VECTKEY_Pos) | - (PriorityGroupTmp << SCB_AIRCR_PRIGROUP_Pos) ); /* Insert write key and priority group */ - SCB->AIRCR = reg_value; -} - - -/** - \brief Get Priority Grouping - \details Reads the priority grouping field from the NVIC Interrupt Controller. - \return Priority grouping field (SCB->AIRCR [10:8] PRIGROUP field). - */ -__STATIC_INLINE uint32_t __NVIC_GetPriorityGrouping(void) -{ - return ((uint32_t)((SCB->AIRCR & SCB_AIRCR_PRIGROUP_Msk) >> SCB_AIRCR_PRIGROUP_Pos)); -} - - -/** - \brief Enable Interrupt - \details Enables a device specific interrupt in the NVIC interrupt controller. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void __NVIC_EnableIRQ(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - __COMPILER_BARRIER(); - NVIC->ISER[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - __COMPILER_BARRIER(); - } -} - - -/** - \brief Get Interrupt Enable status - \details Returns a device specific interrupt enable status from the NVIC interrupt controller. - \param [in] IRQn Device specific interrupt number. - \return 0 Interrupt is not enabled. - \return 1 Interrupt is enabled. - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t __NVIC_GetEnableIRQ(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC->ISER[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 
1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Disable Interrupt - \details Disables a device specific interrupt in the NVIC interrupt controller. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void __NVIC_DisableIRQ(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC->ICER[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - __DSB(); - __ISB(); - } -} - - -/** - \brief Get Pending Interrupt - \details Reads the NVIC pending register and returns the pending bit for the specified device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 Interrupt status is not pending. - \return 1 Interrupt status is pending. - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t __NVIC_GetPendingIRQ(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC->ISPR[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Set Pending Interrupt - \details Sets the pending bit of a device specific interrupt in the NVIC pending register. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void __NVIC_SetPendingIRQ(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC->ISPR[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - } -} - - -/** - \brief Clear Pending Interrupt - \details Clears the pending bit of a device specific interrupt in the NVIC pending register. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. 
- */ -__STATIC_INLINE void __NVIC_ClearPendingIRQ(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC->ICPR[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - } -} - - -/** - \brief Get Active Interrupt - \details Reads the active register in the NVIC and returns the active bit for the device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 Interrupt status is not active. - \return 1 Interrupt status is active. - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t __NVIC_GetActive(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC->IABR[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} - - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) -/** - \brief Get Interrupt Target State - \details Reads the interrupt target field in the NVIC and returns the interrupt target bit for the device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 if interrupt is assigned to Secure - \return 1 if interrupt is assigned to Non Secure - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t NVIC_GetTargetState(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC->ITNS[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Set Interrupt Target State - \details Sets the interrupt target field in the NVIC and returns the interrupt target bit for the device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 if interrupt is assigned to Secure - 1 if interrupt is assigned to Non Secure - \note IRQn must not be negative. 
- */ -__STATIC_INLINE uint32_t NVIC_SetTargetState(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC->ITNS[(((uint32_t)IRQn) >> 5UL)] |= ((uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL))); - return((uint32_t)(((NVIC->ITNS[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Clear Interrupt Target State - \details Clears the interrupt target field in the NVIC and returns the interrupt target bit for the device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 if interrupt is assigned to Secure - 1 if interrupt is assigned to Non Secure - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t NVIC_ClearTargetState(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC->ITNS[(((uint32_t)IRQn) >> 5UL)] &= ~((uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL))); - return((uint32_t)(((NVIC->ITNS[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} -#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */ - - -/** - \brief Set Interrupt Priority - \details Sets the priority of a device specific interrupt or a processor exception. - The interrupt number can be positive to specify a device specific interrupt, - or negative to specify a processor exception. - \param [in] IRQn Interrupt number. - \param [in] priority Priority to set. - \note The priority cannot be set for every processor exception. 
- */ -__STATIC_INLINE void __NVIC_SetPriority(IRQn_Type IRQn, uint32_t priority) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC->IPR[((uint32_t)IRQn)] = (uint8_t)((priority << (8U - __NVIC_PRIO_BITS)) & (uint32_t)0xFFUL); - } - else - { - SCB->SHPR[(((uint32_t)IRQn) & 0xFUL)-4UL] = (uint8_t)((priority << (8U - __NVIC_PRIO_BITS)) & (uint32_t)0xFFUL); - } -} - - -/** - \brief Get Interrupt Priority - \details Reads the priority of a device specific interrupt or a processor exception. - The interrupt number can be positive to specify a device specific interrupt, - or negative to specify a processor exception. - \param [in] IRQn Interrupt number. - \return Interrupt Priority. - Value is aligned automatically to the implemented priority bits of the microcontroller. - */ -__STATIC_INLINE uint32_t __NVIC_GetPriority(IRQn_Type IRQn) -{ - - if ((int32_t)(IRQn) >= 0) - { - return(((uint32_t)NVIC->IPR[((uint32_t)IRQn)] >> (8U - __NVIC_PRIO_BITS))); - } - else - { - return(((uint32_t)SCB->SHPR[(((uint32_t)IRQn) & 0xFUL)-4UL] >> (8U - __NVIC_PRIO_BITS))); - } -} - - -/** - \brief Encode Priority - \details Encodes the priority for an interrupt with the given priority group, - preemptive priority value, and subpriority value. - In case of a conflict between priority grouping and available - priority bits (__NVIC_PRIO_BITS), the smallest possible priority group is set. - \param [in] PriorityGroup Used priority group. - \param [in] PreemptPriority Preemptive priority value (starting from 0). - \param [in] SubPriority Subpriority value (starting from 0). - \return Encoded priority. Value can be used in the function \ref NVIC_SetPriority(). 
- */ -__STATIC_INLINE uint32_t NVIC_EncodePriority (uint32_t PriorityGroup, uint32_t PreemptPriority, uint32_t SubPriority) -{ - uint32_t PriorityGroupTmp = (PriorityGroup & (uint32_t)0x07UL); /* only values 0..7 are used */ - uint32_t PreemptPriorityBits; - uint32_t SubPriorityBits; - - PreemptPriorityBits = ((7UL - PriorityGroupTmp) > (uint32_t)(__NVIC_PRIO_BITS)) ? (uint32_t)(__NVIC_PRIO_BITS) : (uint32_t)(7UL - PriorityGroupTmp); - SubPriorityBits = ((PriorityGroupTmp + (uint32_t)(__NVIC_PRIO_BITS)) < (uint32_t)7UL) ? (uint32_t)0UL : (uint32_t)((PriorityGroupTmp - 7UL) + (uint32_t)(__NVIC_PRIO_BITS)); - - return ( - ((PreemptPriority & (uint32_t)((1UL << (PreemptPriorityBits)) - 1UL)) << SubPriorityBits) | - ((SubPriority & (uint32_t)((1UL << (SubPriorityBits )) - 1UL))) - ); -} - - -/** - \brief Decode Priority - \details Decodes an interrupt priority value with a given priority group to - preemptive priority value and subpriority value. - In case of a conflict between priority grouping and available - priority bits (__NVIC_PRIO_BITS) the smallest possible priority group is set. - \param [in] Priority Priority value, which can be retrieved with the function \ref NVIC_GetPriority(). - \param [in] PriorityGroup Used priority group. - \param [out] pPreemptPriority Preemptive priority value (starting from 0). - \param [out] pSubPriority Subpriority value (starting from 0). - */ -__STATIC_INLINE void NVIC_DecodePriority (uint32_t Priority, uint32_t PriorityGroup, uint32_t* const pPreemptPriority, uint32_t* const pSubPriority) -{ - uint32_t PriorityGroupTmp = (PriorityGroup & (uint32_t)0x07UL); /* only values 0..7 are used */ - uint32_t PreemptPriorityBits; - uint32_t SubPriorityBits; - - PreemptPriorityBits = ((7UL - PriorityGroupTmp) > (uint32_t)(__NVIC_PRIO_BITS)) ? (uint32_t)(__NVIC_PRIO_BITS) : (uint32_t)(7UL - PriorityGroupTmp); - SubPriorityBits = ((PriorityGroupTmp + (uint32_t)(__NVIC_PRIO_BITS)) < (uint32_t)7UL) ? 
(uint32_t)0UL : (uint32_t)((PriorityGroupTmp - 7UL) + (uint32_t)(__NVIC_PRIO_BITS)); - - *pPreemptPriority = (Priority >> SubPriorityBits) & (uint32_t)((1UL << (PreemptPriorityBits)) - 1UL); - *pSubPriority = (Priority ) & (uint32_t)((1UL << (SubPriorityBits )) - 1UL); -} - - -/** - \brief Set Interrupt Vector - \details Sets an interrupt vector in SRAM based interrupt vector table. - The interrupt number can be positive to specify a device specific interrupt, - or negative to specify a processor exception. - VTOR must been relocated to SRAM before. - \param [in] IRQn Interrupt number - \param [in] vector Address of interrupt handler function - */ -__STATIC_INLINE void __NVIC_SetVector(IRQn_Type IRQn, uint32_t vector) -{ - uint32_t *vectors = (uint32_t *)SCB->VTOR; - vectors[(int32_t)IRQn + NVIC_USER_IRQ_OFFSET] = vector; - __DSB(); -} - - -/** - \brief Get Interrupt Vector - \details Reads an interrupt vector from interrupt vector table. - The interrupt number can be positive to specify a device specific interrupt, - or negative to specify a processor exception. - \param [in] IRQn Interrupt number. - \return Address of interrupt handler function - */ -__STATIC_INLINE uint32_t __NVIC_GetVector(IRQn_Type IRQn) -{ - uint32_t *vectors = (uint32_t *)SCB->VTOR; - return vectors[(int32_t)IRQn + NVIC_USER_IRQ_OFFSET]; -} - - -/** - \brief System Reset - \details Initiates a system reset request to reset the MCU. 
- */ -__NO_RETURN __STATIC_INLINE void __NVIC_SystemReset(void) -{ - __DSB(); /* Ensure all outstanding memory accesses included - buffered write are completed before reset */ - SCB->AIRCR = (uint32_t)((0x5FAUL << SCB_AIRCR_VECTKEY_Pos) | - (SCB->AIRCR & SCB_AIRCR_PRIGROUP_Msk) | - SCB_AIRCR_SYSRESETREQ_Msk ); /* Keep priority group unchanged */ - __DSB(); /* Ensure completion of memory access */ - - for(;;) /* wait until reset */ - { - __NOP(); - } -} - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) -/** - \brief Set Priority Grouping (non-secure) - \details Sets the non-secure priority grouping field when in secure state using the required unlock sequence. - The parameter PriorityGroup is assigned to the field SCB->AIRCR [10:8] PRIGROUP field. - Only values from 0..7 are used. - In case of a conflict between priority grouping and available - priority bits (__NVIC_PRIO_BITS), the smallest possible priority group is set. - \param [in] PriorityGroup Priority grouping field. - */ -__STATIC_INLINE void TZ_NVIC_SetPriorityGrouping_NS(uint32_t PriorityGroup) -{ - uint32_t reg_value; - uint32_t PriorityGroupTmp = (PriorityGroup & (uint32_t)0x07UL); /* only values 0..7 are used */ - - reg_value = SCB_NS->AIRCR; /* read old register configuration */ - reg_value &= ~((uint32_t)(SCB_AIRCR_VECTKEY_Msk | SCB_AIRCR_PRIGROUP_Msk)); /* clear bits to change */ - reg_value = (reg_value | - ((uint32_t)0x5FAUL << SCB_AIRCR_VECTKEY_Pos) | - (PriorityGroupTmp << SCB_AIRCR_PRIGROUP_Pos) ); /* Insert write key and priority group */ - SCB_NS->AIRCR = reg_value; -} - - -/** - \brief Get Priority Grouping (non-secure) - \details Reads the priority grouping field from the non-secure NVIC when in secure state. - \return Priority grouping field (SCB->AIRCR [10:8] PRIGROUP field). 
- */ -__STATIC_INLINE uint32_t TZ_NVIC_GetPriorityGrouping_NS(void) -{ - return ((uint32_t)((SCB_NS->AIRCR & SCB_AIRCR_PRIGROUP_Msk) >> SCB_AIRCR_PRIGROUP_Pos)); -} - - -/** - \brief Enable Interrupt (non-secure) - \details Enables a device specific interrupt in the non-secure NVIC interrupt controller when in secure state. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void TZ_NVIC_EnableIRQ_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC_NS->ISER[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - } -} - - -/** - \brief Get Interrupt Enable status (non-secure) - \details Returns a device specific interrupt enable status from the non-secure NVIC interrupt controller when in secure state. - \param [in] IRQn Device specific interrupt number. - \return 0 Interrupt is not enabled. - \return 1 Interrupt is enabled. - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t TZ_NVIC_GetEnableIRQ_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC_NS->ISER[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Disable Interrupt (non-secure) - \details Disables a device specific interrupt in the non-secure NVIC interrupt controller when in secure state. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void TZ_NVIC_DisableIRQ_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC_NS->ICER[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - } -} - - -/** - \brief Get Pending Interrupt (non-secure) - \details Reads the NVIC pending register in the non-secure NVIC when in secure state and returns the pending bit for the specified device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 Interrupt status is not pending. 
- \return 1 Interrupt status is pending. - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t TZ_NVIC_GetPendingIRQ_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC_NS->ISPR[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Set Pending Interrupt (non-secure) - \details Sets the pending bit of a device specific interrupt in the non-secure NVIC pending register when in secure state. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void TZ_NVIC_SetPendingIRQ_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC_NS->ISPR[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - } -} - - -/** - \brief Clear Pending Interrupt (non-secure) - \details Clears the pending bit of a device specific interrupt in the non-secure NVIC pending register when in secure state. - \param [in] IRQn Device specific interrupt number. - \note IRQn must not be negative. - */ -__STATIC_INLINE void TZ_NVIC_ClearPendingIRQ_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC_NS->ICPR[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL)); - } -} - - -/** - \brief Get Active Interrupt (non-secure) - \details Reads the active register in non-secure NVIC when in secure state and returns the active bit for the device specific interrupt. - \param [in] IRQn Device specific interrupt number. - \return 0 Interrupt status is not active. - \return 1 Interrupt status is active. - \note IRQn must not be negative. - */ -__STATIC_INLINE uint32_t TZ_NVIC_GetActive_NS(IRQn_Type IRQn) -{ - if ((int32_t)(IRQn) >= 0) - { - return((uint32_t)(((NVIC_NS->IABR[(((uint32_t)IRQn) >> 5UL)] & (1UL << (((uint32_t)IRQn) & 0x1FUL))) != 0UL) ? 
1UL : 0UL)); - } - else - { - return(0U); - } -} - - -/** - \brief Set Interrupt Priority (non-secure) - \details Sets the priority of a non-secure device specific interrupt or a non-secure processor exception when in secure state. - The interrupt number can be positive to specify a device specific interrupt, - or negative to specify a processor exception. - \param [in] IRQn Interrupt number. - \param [in] priority Priority to set. - \note The priority cannot be set for every non-secure processor exception. - */ -__STATIC_INLINE void TZ_NVIC_SetPriority_NS(IRQn_Type IRQn, uint32_t priority) -{ - if ((int32_t)(IRQn) >= 0) - { - NVIC_NS->IPR[((uint32_t)IRQn)] = (uint8_t)((priority << (8U - __NVIC_PRIO_BITS)) & (uint32_t)0xFFUL); - } - else - { - SCB_NS->SHPR[(((uint32_t)IRQn) & 0xFUL)-4UL] = (uint8_t)((priority << (8U - __NVIC_PRIO_BITS)) & (uint32_t)0xFFUL); - } -} - - -/** - \brief Get Interrupt Priority (non-secure) - \details Reads the priority of a non-secure device specific interrupt or a non-secure processor exception when in secure state. - The interrupt number can be positive to specify a device specific interrupt, - or negative to specify a processor exception. - \param [in] IRQn Interrupt number. - \return Interrupt Priority. Value is aligned automatically to the implemented priority bits of the microcontroller. 
- */ -__STATIC_INLINE uint32_t TZ_NVIC_GetPriority_NS(IRQn_Type IRQn) -{ - - if ((int32_t)(IRQn) >= 0) - { - return(((uint32_t)NVIC_NS->IPR[((uint32_t)IRQn)] >> (8U - __NVIC_PRIO_BITS))); - } - else - { - return(((uint32_t)SCB_NS->SHPR[(((uint32_t)IRQn) & 0xFUL)-4UL] >> (8U - __NVIC_PRIO_BITS))); - } -} -#endif /* defined (__ARM_FEATURE_CMSE) &&(__ARM_FEATURE_CMSE == 3U) */ - -/*@} end of CMSIS_Core_NVICFunctions */ - -/* ########################## MPU functions #################################### */ - -#if defined (__MPU_PRESENT) && (__MPU_PRESENT == 1U) - -#include "mpu_armv8.h" - -#endif - -/* ########################## PMU functions and events #################################### */ - -#if defined (__PMU_PRESENT) && (__PMU_PRESENT == 1U) - -#include "pmu_armv8.h" - -/** - \brief Cortex-M55 PMU events - \note Architectural PMU events can be found in pmu_armv8.h -*/ - -#define ARMCM55_PMU_ECC_ERR 0xC000 /*!< Any ECC error */ -#define ARMCM55_PMU_ECC_ERR_FATAL 0xC001 /*!< Any fatal ECC error */ -#define ARMCM55_PMU_ECC_ERR_DCACHE 0xC010 /*!< Any ECC error in the data cache */ -#define ARMCM55_PMU_ECC_ERR_ICACHE 0xC011 /*!< Any ECC error in the instruction cache */ -#define ARMCM55_PMU_ECC_ERR_FATAL_DCACHE 0xC012 /*!< Any fatal ECC error in the data cache */ -#define ARMCM55_PMU_ECC_ERR_FATAL_ICACHE 0xC013 /*!< Any fatal ECC error in the instruction cache*/ -#define ARMCM55_PMU_ECC_ERR_DTCM 0xC020 /*!< Any ECC error in the DTCM */ -#define ARMCM55_PMU_ECC_ERR_ITCM 0xC021 /*!< Any ECC error in the ITCM */ -#define ARMCM55_PMU_ECC_ERR_FATAL_DTCM 0xC022 /*!< Any fatal ECC error in the DTCM */ -#define ARMCM55_PMU_ECC_ERR_FATAL_ITCM 0xC023 /*!< Any fatal ECC error in the ITCM */ -#define ARMCM55_PMU_PF_LINEFILL 0xC100 /*!< A prefetcher starts a line-fill */ -#define ARMCM55_PMU_PF_CANCEL 0xC101 /*!< A prefetcher stops prefetching */ -#define ARMCM55_PMU_PF_DROP_LINEFILL 0xC102 /*!< A linefill triggered by a prefetcher has been dropped because of lack of buffering */ 
-#define ARMCM55_PMU_NWAMODE_ENTER 0xC200 /*!< No write-allocate mode entry */ -#define ARMCM55_PMU_NWAMODE 0xC201 /*!< Write-allocate store is not allocated into the data cache due to no-write-allocate mode */ -#define ARMCM55_PMU_SAHB_ACCESS 0xC300 /*!< Read or write access on the S-AHB interface to the TCM */ -#define ARMCM55_PMU_DOSTIMEOUT_DOUBLE 0xC400 /*!< Denial of Service timeout has fired twice and caused buffers to drain to allow forward progress */ -#define ARMCM55_PMU_DOSTIMEOUT_TRIPLE 0xC401 /*!< Denial of Service timeout has fired three times and blocked the LSU to force forward progress */ - -#endif - -/* ########################## FPU functions #################################### */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_FpuFunctions FPU Functions - \brief Function that provides FPU type. - @{ - */ - -/** - \brief get FPU type - \details returns the FPU type - \returns - - \b 0: No FPU - - \b 1: Single precision FPU - - \b 2: Double + Single precision FPU - */ -__STATIC_INLINE uint32_t SCB_GetFPUType(void) -{ - uint32_t mvfr0; - - mvfr0 = FPU->MVFR0; - if ((mvfr0 & (FPU_MVFR0_FPSP_Msk | FPU_MVFR0_FPDP_Msk)) == 0x220U) - { - return 2U; /* Double + Single precision FPU */ - } - else if ((mvfr0 & (FPU_MVFR0_FPSP_Msk | FPU_MVFR0_FPDP_Msk)) == 0x020U) - { - return 1U; /* Single precision FPU */ - } - else - { - return 0U; /* No FPU */ - } -} - - -/*@} end of CMSIS_Core_FpuFunctions */ - -/* ########################## MVE functions #################################### */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_MveFunctions MVE Functions - \brief Function that provides MVE type. 
- @{ - */ - -/** - \brief get MVE type - \details returns the MVE type - \returns - - \b 0: No Vector Extension (MVE) - - \b 1: Integer Vector Extension (MVE-I) - - \b 2: Floating-point Vector Extension (MVE-F) - */ -__STATIC_INLINE uint32_t SCB_GetMVEType(void) -{ - const uint32_t mvfr1 = FPU->MVFR1; - if ((mvfr1 & FPU_MVFR1_MVE_Msk) == (0x2U << FPU_MVFR1_MVE_Pos)) - { - return 2U; - } - else if ((mvfr1 & FPU_MVFR1_MVE_Msk) == (0x1U << FPU_MVFR1_MVE_Pos)) - { - return 1U; - } - else - { - return 0U; - } -} - - -/*@} end of CMSIS_Core_MveFunctions */ - - -/* ########################## Cache functions #################################### */ - -#if ((defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U)) || \ - (defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U))) -#include "cachel1_armv7.h" -#endif - - -/* ########################## SAU functions #################################### */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_SAUFunctions SAU Functions - \brief Functions that configure the SAU. - @{ - */ - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) - -/** - \brief Enable SAU - \details Enables the Security Attribution Unit (SAU). - */ -__STATIC_INLINE void TZ_SAU_Enable(void) -{ - SAU->CTRL |= (SAU_CTRL_ENABLE_Msk); -} - - - -/** - \brief Disable SAU - \details Disables the Security Attribution Unit (SAU). - */ -__STATIC_INLINE void TZ_SAU_Disable(void) -{ - SAU->CTRL &= ~(SAU_CTRL_ENABLE_Msk); -} - -#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */ - -/*@} end of CMSIS_Core_SAUFunctions */ - - - - -/* ################################## Debug Control function ############################################ */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_DCBFunctions Debug Control Functions - \brief Functions that access the Debug Control Block. - @{ - */ - - -/** - \brief Set Debug Authentication Control Register - \details writes to Debug Authentication Control register. 
- \param [in] value value to be writen. - */ -__STATIC_INLINE void DCB_SetAuthCtrl(uint32_t value) -{ - __DSB(); - __ISB(); - DCB->DAUTHCTRL = value; - __DSB(); - __ISB(); -} - - -/** - \brief Get Debug Authentication Control Register - \details Reads Debug Authentication Control register. - \return Debug Authentication Control Register. - */ -__STATIC_INLINE uint32_t DCB_GetAuthCtrl(void) -{ - return (DCB->DAUTHCTRL); -} - - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) -/** - \brief Set Debug Authentication Control Register (non-secure) - \details writes to non-secure Debug Authentication Control register when in secure state. - \param [in] value value to be writen - */ -__STATIC_INLINE void TZ_DCB_SetAuthCtrl_NS(uint32_t value) -{ - __DSB(); - __ISB(); - DCB_NS->DAUTHCTRL = value; - __DSB(); - __ISB(); -} - - -/** - \brief Get Debug Authentication Control Register (non-secure) - \details Reads non-secure Debug Authentication Control register when in secure state. - \return Debug Authentication Control Register. - */ -__STATIC_INLINE uint32_t TZ_DCB_GetAuthCtrl_NS(void) -{ - return (DCB_NS->DAUTHCTRL); -} -#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */ - -/*@} end of CMSIS_Core_DCBFunctions */ - - - - -/* ################################## Debug Identification function ############################################ */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_DIBFunctions Debug Identification Functions - \brief Functions that access the Debug Identification Block. - @{ - */ - - -/** - \brief Get Debug Authentication Status Register - \details Reads Debug Authentication Status register. - \return Debug Authentication Status Register. 
- */ -__STATIC_INLINE uint32_t DIB_GetAuthStatus(void) -{ - return (DIB->DAUTHSTATUS); -} - - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) -/** - \brief Get Debug Authentication Status Register (non-secure) - \details Reads non-secure Debug Authentication Status register when in secure state. - \return Debug Authentication Status Register. - */ -__STATIC_INLINE uint32_t TZ_DIB_GetAuthStatus_NS(void) -{ - return (DIB_NS->DAUTHSTATUS); -} -#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */ - -/*@} end of CMSIS_Core_DCBFunctions */ - - - - -/* ################################## SysTick function ############################################ */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_Core_SysTickFunctions SysTick Functions - \brief Functions that configure the System. - @{ - */ - -#if defined (__Vendor_SysTickConfig) && (__Vendor_SysTickConfig == 0U) - -/** - \brief System Tick Configuration - \details Initializes the System Timer and its interrupt, and starts the System Tick Timer. - Counter is in free running mode to generate periodic interrupts. - \param [in] ticks Number of ticks between two interrupts. - \return 0 Function succeeded. - \return 1 Function failed. - \note When the variable __Vendor_SysTickConfig is set to 1, then the - function SysTick_Config is not included. In this case, the file device.h - must contain a vendor-specific implementation of this function. 
- */ -__STATIC_INLINE uint32_t SysTick_Config(uint32_t ticks) -{ - if ((ticks - 1UL) > SysTick_LOAD_RELOAD_Msk) - { - return (1UL); /* Reload value impossible */ - } - - SysTick->LOAD = (uint32_t)(ticks - 1UL); /* set reload register */ - NVIC_SetPriority (SysTick_IRQn, (1UL << __NVIC_PRIO_BITS) - 1UL); /* set Priority for Systick Interrupt */ - SysTick->VAL = 0UL; /* Load the SysTick Counter Value */ - SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | - SysTick_CTRL_TICKINT_Msk; /* Enable SysTick IRQ */ - SysTick->CTRL |= SysTick_CTRL_ENABLE_Msk; /* Enable SysTick IRQ and SysTick Timer */ - return (0UL); /* Function successful */ -} - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) -/** - \brief System Tick Configuration (non-secure) - \details Initializes the non-secure System Timer and its interrupt when in secure state, and starts the System Tick Timer. - Counter is in free running mode to generate periodic interrupts. - \param [in] ticks Number of ticks between two interrupts. - \return 0 Function succeeded. - \return 1 Function failed. - \note When the variable __Vendor_SysTickConfig is set to 1, then the - function TZ_SysTick_Config_NS is not included. In this case, the file device.h - must contain a vendor-specific implementation of this function. 
- - */ -__STATIC_INLINE uint32_t TZ_SysTick_Config_NS(uint32_t ticks) -{ - if ((ticks - 1UL) > SysTick_LOAD_RELOAD_Msk) - { - return (1UL); /* Reload value impossible */ - } - - SysTick_NS->LOAD = (uint32_t)(ticks - 1UL); /* set reload register */ - TZ_NVIC_SetPriority_NS (SysTick_IRQn, (1UL << __NVIC_PRIO_BITS) - 1UL); /* set Priority for Systick Interrupt */ - SysTick_NS->VAL = 0UL; /* Load the SysTick Counter Value */ - SysTick_NS->CTRL = SysTick_CTRL_CLKSOURCE_Msk | - SysTick_CTRL_TICKINT_Msk | - SysTick_CTRL_ENABLE_Msk; /* Enable SysTick IRQ and SysTick Timer */ - return (0UL); /* Function successful */ -} -#endif /* defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) */ - -#endif - -/*@} end of CMSIS_Core_SysTickFunctions */ - - - -/* ##################################### Debug In/Output function ########################################### */ -/** - \ingroup CMSIS_Core_FunctionInterface - \defgroup CMSIS_core_DebugFunctions ITM Functions - \brief Functions that access the ITM debug interface. - @{ - */ - -extern volatile int32_t ITM_RxBuffer; /*!< External variable to receive characters. */ -#define ITM_RXBUFFER_EMPTY ((int32_t)0x5AA55AA5U) /*!< Value identifying \ref ITM_RxBuffer is ready for next character. */ - - -/** - \brief ITM Send Character - \details Transmits a character via the ITM channel 0, and - \li Just returns when no debugger is connected that has booked the output. - \li Is blocking when a debugger is connected, but the previous character sent has not been transmitted. - \param [in] ch Character to transmit. - \returns Character to transmit. 
- */ -__STATIC_INLINE uint32_t ITM_SendChar (uint32_t ch) -{ - if (((ITM->TCR & ITM_TCR_ITMENA_Msk) != 0UL) && /* ITM enabled */ - ((ITM->TER & 1UL ) != 0UL) ) /* ITM Port #0 enabled */ - { - while (ITM->PORT[0U].u32 == 0UL) - { - __NOP(); - } - ITM->PORT[0U].u8 = (uint8_t)ch; - } - return (ch); -} - - -/** - \brief ITM Receive Character - \details Inputs a character via the external variable \ref ITM_RxBuffer. - \return Received character. - \return -1 No character pending. - */ -__STATIC_INLINE int32_t ITM_ReceiveChar (void) -{ - int32_t ch = -1; /* no character available */ - - if (ITM_RxBuffer != ITM_RXBUFFER_EMPTY) - { - ch = ITM_RxBuffer; - ITM_RxBuffer = ITM_RXBUFFER_EMPTY; /* ready for next character */ - } - - return (ch); -} - - -/** - \brief ITM Check Character - \details Checks whether a character is pending for reading in the variable \ref ITM_RxBuffer. - \return 0 No character available. - \return 1 Character available. - */ -__STATIC_INLINE int32_t ITM_CheckChar (void) -{ - - if (ITM_RxBuffer == ITM_RXBUFFER_EMPTY) - { - return (0); /* no character available */ - } - else - { - return (1); /* character available */ - } -} - -/*@} end of CMSIS_core_DebugFunctions */ - - - - -#ifdef __cplusplus -} -#endif - -#endif /* __CORE_CM55_H_DEPENDANT */ - -#endif /* __CMSIS_GENERIC */ diff --git a/envs/core/src/hal.c b/envs/core/src/hal.c deleted file mode 100644 index 84e667b..0000000 --- a/envs/core/src/hal.c +++ /dev/null @@ -1,52 +0,0 @@ - -#include - -/* Dependency on standard library: - * - rand() - * - printf() - * - fflush() - */ -#include -#include -#include - -uint8_t get_random_byte() -{ - return( rand() ); -} - -/* Stubs to enable/disable measurements. */ -void measure_end() -{ - asm volatile( "DBG #9" : : : "memory" ); -} - -void measure_start() -{ - asm volatile( "DBG #1" : : : "memory" ); -} - -/* Debugging stubs */ - -void debug_test_start( const char *testname ) -{ - printf( "%s ... 
", testname ); - fflush( stdout ); -} - -void debug_printf(const char * format, ... ) -{ - va_list argp; - va_start( argp, format ); - vprintf( format, argp ); - va_end( argp ); -} - -void debug_test_ok() { printf( "Ok\n" ); } -void debug_test_fail() { printf( "FAIL!\n" ); } - -void hal_pmu_enable() {} -void hal_pmu_disable() {} -void hal_pmu_start_pmu_stats( void *s ) {} -void hal_pmu_finish_pmu_stats( void *s ) {} -void hal_pmu_send_stats( void *s, void const *stats ) {} diff --git a/envs/core/src/mps3-an547.mk b/envs/core/src/mps3-an547.mk deleted file mode 100644 index 0f2c658..0000000 --- a/envs/core/src/mps3-an547.mk +++ /dev/null @@ -1,56 +0,0 @@ -ifndef _HAL -_HAL := - -CROSS_PREFIX ?= arm-none-eabi -RETAINED_VARS += CROSS_PREFIX - -CC := $(CROSS_PREFIX)-gcc -AR := $(CROSS_PREFIX)-gcc-ar -LD := $(CC) -OBJCOPY := $(CROSS_PREFIX)-objcopy -SIZE := $(CROSS_PREFIX)-size - -SYSROOT := $(shell $(CC) --print-sysroot) - -CPPFLAGS += \ - --sysroot=$(SYSROOT) \ - -DARMCM55 - -ARCH_FLAGS += \ - -mcpu=cortex-m55 \ - -mthumb \ - -mfloat-abi=hard -mfpu=fpv4-sp-d16 \ - -CPPFLAGS += \ - -Iplatform/ - -CFLAGS += \ - $(ARCH_FLAGS) \ - --specs=nosys.specs - -LDSCRIPT = platform/mps3.ld - -LDFLAGS += \ - --specs=nosys.specs \ - -Wl,--wrap=_write \ - -Wl,--wrap=_read \ - -ffreestanding \ - -T$(LDSCRIPT) \ - $(ARCH_FLAGS) - -HAL_SRC += \ - platform/startup_ARMCM55.c \ - platform/system_ARMCM55.c \ - platform/semihosting.c \ - platform/uart.c -HAL_OBJ = $(call objs,$(HAL_SRC)) - -OBJ += $(HAL_OBJ) - -libhal.a: $(HAL_OBJ) - -LDLIBS += -lhal -LIBDEPS += libhal.a -TARGETS += libhal.a - -endif diff --git a/envs/core/src/mps3.ld b/envs/core/src/mps3.ld deleted file mode 100644 index 6b95ba9..0000000 --- a/envs/core/src/mps3.ld +++ /dev/null @@ -1,312 +0,0 @@ -/* - * Copyright (c) 2009-2021 Arm Limited. All rights reserved. 
- * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* - *-------- <<< Use Configuration Wizard in Context Menu >>> ------------------- - */ - -/*---------------------- ITCM Configuration ---------------------------------- - Flash Configuration - Flash Base Address <0x0-0xFFFFFFFF:8> - Flash Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__ROM_BASE = 0x00000000; -__ROM_SIZE = 0x00080000; - -/*--------------------- DTCM RAM Configuration ---------------------------- - RAM Configuration - RAM Base Address <0x0-0xFFFFFFFF:8> - RAM Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__RAM_BASE = 0x20000000; -__RAM_SIZE = 0x00080000; - -/*--------------------- Embedded SRAM Configuration ---------------------------- - SRAM Configuration - SRAM Base Address <0x0-0xFFFFFFFF:8> - SRAM Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__SRAM_BASE = 0x21000000; -__SRAM_SIZE = 0x00200000; - -/*--------------------- Stack / Heap Configuration ---------------------------- - Stack / Heap Configuration - Stack Size (in Bytes) <0x0-0xFFFFFFFF:8> - Heap Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__STACK_SIZE = 0x00008000; -__HEAP_SIZE = 0x00008000; - 
-/*--------------------- Embedded RAM Configuration ---------------------------- - DDR Configuration - DDR Base Address <0x0-0xFFFFFFFF:8> - DDR Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__DDR_BASE = 0x60000000; -__DDR_SIZE = 0x02000000; - -/* - *-------------------- <<< end of configuration section >>> ------------------- - */ - -MEMORY -{ - ITCM (rx) : ORIGIN = __ROM_BASE, LENGTH = __ROM_SIZE - DTCM (rwx) : ORIGIN = __RAM_BASE, LENGTH = __RAM_SIZE - SRAM (rwx) : ORIGIN = __SRAM_BASE, LENGTH = __SRAM_SIZE - DDR (rwx) : ORIGIN = __DDR_BASE, LENGTH = __DDR_SIZE -} - -/* Linker script to place sections and symbol values. Should be used together - * with other linker script that defines memory regions ITCM and RAM. - * It references following symbols, which must be defined in code: - * Reset_Handler : Entry of reset handler - * - * It defines following symbols, which code can use without definition: - * __exidx_start - * __exidx_end - * __copy_table_start__ - * __copy_table_end__ - * __zero_table_start__ - * __zero_table_end__ - * __etext - * __data_start__ - * __preinit_array_start - * __preinit_array_end - * __init_array_start - * __init_array_end - * __fini_array_start - * __fini_array_end - * __data_end__ - * __bss_start__ - * __bss_end__ - * __end__ - * end - * __HeapLimit - * __StackLimit - * __StackTop - * __stack - */ -ENTRY(Reset_Handler) - -SECTIONS -{ - .text : - { - KEEP(*(.vectors)) - *(.text*) - - KEEP(*(.init)) - KEEP(*(.fini)) - - /* .ctors */ - *crtbegin.o(.ctors) - *crtbegin?.o(.ctors) - *(EXCLUDE_FILE(*crtend?.o *crtend.o) .ctors) - *(SORT(.ctors.*)) - *(.ctors) - - /* .dtors */ - *crtbegin.o(.dtors) - *crtbegin?.o(.dtors) - *(EXCLUDE_FILE(*crtend?.o *crtend.o) .dtors) - *(SORT(.dtors.*)) - *(.dtors) - - *(.rodata*) - - KEEP(*(.eh_frame*)) - } > ITCM - - /* - * SG veneers: - * All SG veneers are placed in the special output section .gnu.sgstubs. 
Its start address - * must be set, either with the command line option '--section-start' or in a linker script, - * to indicate where to place these veneers in memory. - */ -/* - .gnu.sgstubs : - { - . = ALIGN(32); - } > ITCM -*/ - .ARM.extab : - { - *(.ARM.extab* .gnu.linkonce.armextab.*) - } > ITCM - - __exidx_start = .; - .ARM.exidx : - { - *(.ARM.exidx* .gnu.linkonce.armexidx.*) - } > ITCM - __exidx_end = .; - - .copy.table : - { - . = ALIGN(4); - __copy_table_start__ = .; - LONG (__etext) - LONG (__data_start__) - LONG (__data_end__ - __data_start__) - /* Add each additional data section here */ - __copy_table_end__ = .; - } > ITCM - - .zero.table : - { - . = ALIGN(4); - __zero_table_start__ = .; - /* Add each additional bss section here */ -/* - LONG (__bss2_start__) - LONG (__bss2_end__ - __bss2_start__) -*/ - __zero_table_end__ = .; - } > ITCM - - /** - * Location counter can end up 2byte aligned with narrow Thumb code but - * __etext is assumed by startup code to be the LMA of a section in DTCM - * which must be 4byte aligned - */ - __etext = ALIGN (4); - - .data : AT (__etext) - { - __data_start__ = .; - *(vtable) - *(.data) - *(.data.*) - - . = ALIGN(4); - /* preinit data */ - PROVIDE_HIDDEN (__preinit_array_start = .); - KEEP(*(.preinit_array)) - PROVIDE_HIDDEN (__preinit_array_end = .); - - . = ALIGN(4); - /* init data */ - PROVIDE_HIDDEN (__init_array_start = .); - KEEP(*(SORT(.init_array.*))) - KEEP(*(.init_array)) - PROVIDE_HIDDEN (__init_array_end = .); - - - . = ALIGN(4); - /* finit data */ - PROVIDE_HIDDEN (__fini_array_start = .); - KEEP(*(SORT(.fini_array.*))) - KEEP(*(.fini_array)) - PROVIDE_HIDDEN (__fini_array_end = .); - - KEEP(*(.jcr*)) - . = ALIGN(4); - /* All data end */ - __data_end__ = .; - - } > DTCM - - /* - * Secondary data section, optional - * - * Remember to add each additional data section - * to the .copy.table above to asure proper - * initialization during startup. 
- */ -/* - __etext2 = ALIGN (4); - - .data2 : AT (__etext2) - { - . = ALIGN(4); - __data2_start__ = .; - *(.data2) - *(.data2.*) - . = ALIGN(4); - __data2_end__ = .; - - } > RAM2 -*/ - - .sram : - { - . = ALIGN(16); - *(.bss.NoInit) - . = ALIGN(16); - } > DTCM AT > DTCM - - .bss : - { - . = ALIGN(4); - __bss_start__ = .; - *(.bss) - *(.bss.*) - *(COMMON) - . = ALIGN(4); - __bss_end__ = .; - } > DTCM AT > DTCM - - - /* - * Secondary bss section, optional - * - * Remember to add each additional bss section - * to the .zero.table above to asure proper - * initialization during startup. - */ -/* - .bss2 : - { - . = ALIGN(4); - __bss2_start__ = .; - *(.bss2) - *(.bss2.*) - . = ALIGN(4); - __bss2_end__ = .; - } > RAM2 AT > RAM2 -*/ - - .heap (COPY) : - { - . = ALIGN(8); - __end__ = .; - PROVIDE(end = .); - . = . + __HEAP_SIZE; - . = ALIGN(8); - __HeapLimit = .; - } > DTCM - - .stack (ORIGIN(DTCM) + LENGTH(DTCM) - __STACK_SIZE) (COPY) : - { - . = ALIGN(8); - __StackLimit = .; - . = . + __STACK_SIZE; - . = ALIGN(8); - __StackTop = .; - } > DTCM - PROVIDE(__stack = __StackTop); - - /* Check if data + heap + stack exceeds DTCM limit */ - ASSERT(__StackLimit >= __HeapLimit, "region DTCM overflowed with stack") -} diff --git a/envs/core/src/mpu_armv8.h b/envs/core/src/mpu_armv8.h deleted file mode 100644 index 3de16ef..0000000 --- a/envs/core/src/mpu_armv8.h +++ /dev/null @@ -1,352 +0,0 @@ -/****************************************************************************** - * @file mpu_armv8.h - * @brief CMSIS MPU API for Armv8-M and Armv8.1-M MPU - * @version V5.1.3 - * @date 03. February 2021 - ******************************************************************************/ -/* - * Copyright (c) 2017-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#if defined ( __ICCARM__ ) - #pragma system_include /* treat file as system include file for MISRA check */ -#elif defined (__clang__) - #pragma clang system_header /* treat file as system include file */ -#endif - -#ifndef ARM_MPU_ARMV8_H -#define ARM_MPU_ARMV8_H - -/** \brief Attribute for device memory (outer only) */ -#define ARM_MPU_ATTR_DEVICE ( 0U ) - -/** \brief Attribute for non-cacheable, normal memory */ -#define ARM_MPU_ATTR_NON_CACHEABLE ( 4U ) - -/** \brief Attribute for normal memory (outer and inner) -* \param NT Non-Transient: Set to 1 for non-transient data. -* \param WB Write-Back: Set to 1 to use write-back update policy. -* \param RA Read Allocation: Set to 1 to use cache allocation on read miss. -* \param WA Write Allocation: Set to 1 to use cache allocation on write miss. 
-*/
-#define ARM_MPU_ATTR_MEMORY_(NT, WB, RA, WA) \
-  ((((NT) & 1U) << 3U) | (((WB) & 1U) << 2U) | (((RA) & 1U) << 1U) | ((WA) & 1U))
-
-/** \brief Device memory type non Gathering, non Re-ordering, non Early Write Acknowledgement */
-#define ARM_MPU_ATTR_DEVICE_nGnRnE (0U)
-
-/** \brief Device memory type non Gathering, non Re-ordering, Early Write Acknowledgement */
-#define ARM_MPU_ATTR_DEVICE_nGnRE (1U)
-
-/** \brief Device memory type non Gathering, Re-ordering, Early Write Acknowledgement */
-#define ARM_MPU_ATTR_DEVICE_nGRE (2U)
-
-/** \brief Device memory type Gathering, Re-ordering, Early Write Acknowledgement */
-#define ARM_MPU_ATTR_DEVICE_GRE (3U)
-
-/** \brief Memory Attribute
-* \param O Outer memory attributes
-* \param I O == ARM_MPU_ATTR_DEVICE: Device memory attributes, else: Inner memory attributes
-*/
-#define ARM_MPU_ATTR(O, I) ((((O) & 0xFU) << 4U) | ((((O) & 0xFU) != 0U) ? ((I) & 0xFU) : (((I) & 0x3U) << 2U)))
-
-/** \brief Normal memory non-shareable */
-#define ARM_MPU_SH_NON (0U)
-
-/** \brief Normal memory outer shareable */
-#define ARM_MPU_SH_OUTER (2U)
-
-/** \brief Normal memory inner shareable */
-#define ARM_MPU_SH_INNER (3U)
-
-/** \brief Memory access permissions
-* \param RO Read-Only: Set to 1 for read-only memory.
-* \param NP Non-Privileged: Set to 1 for non-privileged memory.
-*/
-#define ARM_MPU_AP_(RO, NP) ((((RO) & 1U) << 1U) | ((NP) & 1U))
-
-/** \brief Region Base Address Register value
-* \param BASE The base address bits [31:5] of a memory region. The value is zero extended. Effective address gets 32 byte aligned.
-* \param SH Defines the Shareability domain for this memory region.
-* \param RO Read-Only: Set to 1 for a read-only memory region.
-* \param NP Non-Privileged: Set to 1 for a non-privileged memory region.
-* \oaram XN eXecute Never: Set to 1 for a non-executable memory region.
-*/
-#define ARM_MPU_RBAR(BASE, SH, RO, NP, XN) \
-  (((BASE) & MPU_RBAR_BASE_Msk) | \
-  (((SH) << MPU_RBAR_SH_Pos) & MPU_RBAR_SH_Msk) | \
-  ((ARM_MPU_AP_(RO, NP) << MPU_RBAR_AP_Pos) & MPU_RBAR_AP_Msk) | \
-  (((XN) << MPU_RBAR_XN_Pos) & MPU_RBAR_XN_Msk))
-
-/** \brief Region Limit Address Register value
-* \param LIMIT The limit address bits [31:5] for this memory region. The value is one extended.
-* \param IDX The attribute index to be associated with this memory region.
-*/
-#define ARM_MPU_RLAR(LIMIT, IDX) \
-  (((LIMIT) & MPU_RLAR_LIMIT_Msk) | \
-  (((IDX) << MPU_RLAR_AttrIndx_Pos) & MPU_RLAR_AttrIndx_Msk) | \
-  (MPU_RLAR_EN_Msk))
-
-#if defined(MPU_RLAR_PXN_Pos)
-
-/** \brief Region Limit Address Register with PXN value
-* \param LIMIT The limit address bits [31:5] for this memory region. The value is one extended.
-* \param PXN Privileged execute never. Defines whether code can be executed from this privileged region.
-* \param IDX The attribute index to be associated with this memory region.
-*/
-#define ARM_MPU_RLAR_PXN(LIMIT, PXN, IDX) \
-  (((LIMIT) & MPU_RLAR_LIMIT_Msk) | \
-  (((PXN) << MPU_RLAR_PXN_Pos) & MPU_RLAR_PXN_Msk) | \
-  (((IDX) << MPU_RLAR_AttrIndx_Pos) & MPU_RLAR_AttrIndx_Msk) | \
-  (MPU_RLAR_EN_Msk))
-
-#endif
-
-/**
-* Struct for a single MPU Region
-*/
-typedef struct {
-  uint32_t RBAR; /*!< Region Base Address Register value */
-  uint32_t RLAR; /*!< Region Limit Address Register value */
-} ARM_MPU_Region_t;
-
-/** Enable the MPU.
-* \param MPU_Control Default access permissions for unconfigured regions.
-*/
-__STATIC_INLINE void ARM_MPU_Enable(uint32_t MPU_Control)
-{
-  __DMB();
-  MPU->CTRL = MPU_Control | MPU_CTRL_ENABLE_Msk;
-#ifdef SCB_SHCSR_MEMFAULTENA_Msk
-  SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;
-#endif
-  __DSB();
-  __ISB();
-}
-
-/** Disable the MPU.
-*/
-__STATIC_INLINE void ARM_MPU_Disable(void)
-{
-  __DMB();
-#ifdef SCB_SHCSR_MEMFAULTENA_Msk
-  SCB->SHCSR &= ~SCB_SHCSR_MEMFAULTENA_Msk;
-#endif
-  MPU->CTRL &= ~MPU_CTRL_ENABLE_Msk;
-  __DSB();
-  __ISB();
-}
-
-#ifdef MPU_NS
-/** Enable the Non-secure MPU.
-* \param MPU_Control Default access permissions for unconfigured regions.
-*/
-__STATIC_INLINE void ARM_MPU_Enable_NS(uint32_t MPU_Control)
-{
-  __DMB();
-  MPU_NS->CTRL = MPU_Control | MPU_CTRL_ENABLE_Msk;
-#ifdef SCB_SHCSR_MEMFAULTENA_Msk
-  SCB_NS->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;
-#endif
-  __DSB();
-  __ISB();
-}
-
-/** Disable the Non-secure MPU.
-*/
-__STATIC_INLINE void ARM_MPU_Disable_NS(void)
-{
-  __DMB();
-#ifdef SCB_SHCSR_MEMFAULTENA_Msk
-  SCB_NS->SHCSR &= ~SCB_SHCSR_MEMFAULTENA_Msk;
-#endif
-  MPU_NS->CTRL &= ~MPU_CTRL_ENABLE_Msk;
-  __DSB();
-  __ISB();
-}
-#endif
-
-/** Set the memory attribute encoding to the given MPU.
-* \param mpu Pointer to the MPU to be configured.
-* \param idx The attribute index to be set [0-7]
-* \param attr The attribute value to be set.
-*/
-__STATIC_INLINE void ARM_MPU_SetMemAttrEx(MPU_Type* mpu, uint8_t idx, uint8_t attr)
-{
-  const uint8_t reg = idx / 4U;
-  const uint32_t pos = ((idx % 4U) * 8U);
-  const uint32_t mask = 0xFFU << pos;
-
-  if (reg >= (sizeof(mpu->MAIR) / sizeof(mpu->MAIR[0]))) {
-    return; // invalid index
-  }
-
-  mpu->MAIR[reg] = ((mpu->MAIR[reg] & ~mask) | ((attr << pos) & mask));
-}
-
-/** Set the memory attribute encoding.
-* \param idx The attribute index to be set [0-7]
-* \param attr The attribute value to be set.
-*/
-__STATIC_INLINE void ARM_MPU_SetMemAttr(uint8_t idx, uint8_t attr)
-{
-  ARM_MPU_SetMemAttrEx(MPU, idx, attr);
-}
-
-#ifdef MPU_NS
-/** Set the memory attribute encoding to the Non-secure MPU.
-* \param idx The attribute index to be set [0-7]
-* \param attr The attribute value to be set.
-*/
-__STATIC_INLINE void ARM_MPU_SetMemAttr_NS(uint8_t idx, uint8_t attr)
-{
-  ARM_MPU_SetMemAttrEx(MPU_NS, idx, attr);
-}
-#endif
-
-/** Clear and disable the given MPU region of the given MPU.
-* \param mpu Pointer to MPU to be used.
-* \param rnr Region number to be cleared.
-*/
-__STATIC_INLINE void ARM_MPU_ClrRegionEx(MPU_Type* mpu, uint32_t rnr)
-{
-  mpu->RNR = rnr;
-  mpu->RLAR = 0U;
-}
-
-/** Clear and disable the given MPU region.
-* \param rnr Region number to be cleared.
-*/
-__STATIC_INLINE void ARM_MPU_ClrRegion(uint32_t rnr)
-{
-  ARM_MPU_ClrRegionEx(MPU, rnr);
-}
-
-#ifdef MPU_NS
-/** Clear and disable the given Non-secure MPU region.
-* \param rnr Region number to be cleared.
-*/
-__STATIC_INLINE void ARM_MPU_ClrRegion_NS(uint32_t rnr)
-{
-  ARM_MPU_ClrRegionEx(MPU_NS, rnr);
-}
-#endif
-
-/** Configure the given MPU region of the given MPU.
-* \param mpu Pointer to MPU to be used.
-* \param rnr Region number to be configured.
-* \param rbar Value for RBAR register.
-* \param rlar Value for RLAR register.
-*/
-__STATIC_INLINE void ARM_MPU_SetRegionEx(MPU_Type* mpu, uint32_t rnr, uint32_t rbar, uint32_t rlar)
-{
-  mpu->RNR = rnr;
-  mpu->RBAR = rbar;
-  mpu->RLAR = rlar;
-}
-
-/** Configure the given MPU region.
-* \param rnr Region number to be configured.
-* \param rbar Value for RBAR register.
-* \param rlar Value for RLAR register.
-*/
-__STATIC_INLINE void ARM_MPU_SetRegion(uint32_t rnr, uint32_t rbar, uint32_t rlar)
-{
-  ARM_MPU_SetRegionEx(MPU, rnr, rbar, rlar);
-}
-
-#ifdef MPU_NS
-/** Configure the given Non-secure MPU region.
-* \param rnr Region number to be configured.
-* \param rbar Value for RBAR register.
-* \param rlar Value for RLAR register.
-*/
-__STATIC_INLINE void ARM_MPU_SetRegion_NS(uint32_t rnr, uint32_t rbar, uint32_t rlar)
-{
-  ARM_MPU_SetRegionEx(MPU_NS, rnr, rbar, rlar);
-}
-#endif
-
-/** Memcpy with strictly ordered memory access, e.g. used by code in ARM_MPU_LoadEx()
-* \param dst Destination data is copied to.
-* \param src Source data is copied from.
-* \param len Amount of data words to be copied.
-*/
-__STATIC_INLINE void ARM_MPU_OrderedMemcpy(volatile uint32_t* dst, const uint32_t* __RESTRICT src, uint32_t len)
-{
-  uint32_t i;
-  for (i = 0U; i < len; ++i)
-  {
-    dst[i] = src[i];
-  }
-}
-
-/** Load the given number of MPU regions from a table to the given MPU.
-* \param mpu Pointer to the MPU registers to be used.
-* \param rnr First region number to be configured.
-* \param table Pointer to the MPU configuration table.
-* \param cnt Amount of regions to be configured.
-*/
-__STATIC_INLINE void ARM_MPU_LoadEx(MPU_Type* mpu, uint32_t rnr, ARM_MPU_Region_t const* table, uint32_t cnt)
-{
-  const uint32_t rowWordSize = sizeof(ARM_MPU_Region_t)/4U;
-  if (cnt == 1U) {
-    mpu->RNR = rnr;
-    ARM_MPU_OrderedMemcpy(&(mpu->RBAR), &(table->RBAR), rowWordSize);
-  } else {
-    uint32_t rnrBase = rnr & ~(MPU_TYPE_RALIASES-1U);
-    uint32_t rnrOffset = rnr % MPU_TYPE_RALIASES;
-
-    mpu->RNR = rnrBase;
-    while ((rnrOffset + cnt) > MPU_TYPE_RALIASES) {
-      uint32_t c = MPU_TYPE_RALIASES - rnrOffset;
-      ARM_MPU_OrderedMemcpy(&(mpu->RBAR)+(rnrOffset*2U), &(table->RBAR), c*rowWordSize);
-      table += c;
-      cnt -= c;
-      rnrOffset = 0U;
-      rnrBase += MPU_TYPE_RALIASES;
-      mpu->RNR = rnrBase;
-    }
-
-    ARM_MPU_OrderedMemcpy(&(mpu->RBAR)+(rnrOffset*2U), &(table->RBAR), cnt*rowWordSize);
-  }
-}
-
-/** Load the given number of MPU regions from a table.
-* \param rnr First region number to be configured.
-* \param table Pointer to the MPU configuration table.
-* \param cnt Amount of regions to be configured.
-*/
-__STATIC_INLINE void ARM_MPU_Load(uint32_t rnr, ARM_MPU_Region_t const* table, uint32_t cnt)
-{
-  ARM_MPU_LoadEx(MPU, rnr, table, cnt);
-}
-
-#ifdef MPU_NS
-/** Load the given number of MPU regions from a table to the Non-secure MPU.
-* \param rnr First region number to be configured.
-* \param table Pointer to the MPU configuration table.
-* \param cnt Amount of regions to be configured.
-*/
-__STATIC_INLINE void ARM_MPU_Load_NS(uint32_t rnr, ARM_MPU_Region_t const* table, uint32_t cnt)
-{
-  ARM_MPU_LoadEx(MPU_NS, rnr, table, cnt);
-}
-#endif
-
-#endif
-
diff --git a/envs/core/src/pmu_armv8.h b/envs/core/src/pmu_armv8.h
deleted file mode 100644
index f8f3d89..0000000
--- a/envs/core/src/pmu_armv8.h
+++ /dev/null
@@ -1,337 +0,0 @@
-/******************************************************************************
- * @file     pmu_armv8.h
- * @brief    CMSIS PMU API for Armv8.1-M PMU
- * @version  V1.0.1
- * @date     15. April 2020
- ******************************************************************************/
-/*
- * Copyright (c) 2020 Arm Limited. All rights reserved.
- *
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the License); you may
- * not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an AS IS BASIS, WITHOUT
- * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#if defined ( __ICCARM__ )
-  #pragma system_include       /* treat file as system include file for MISRA check */
-#elif defined (__clang__)
-  #pragma clang system_header  /* treat file as system include file */
-#endif
-
-#ifndef ARM_PMU_ARMV8_H
-#define ARM_PMU_ARMV8_H
-
-/**
- * \brief PMU Events
- * \note See the Armv8.1-M Architecture Reference Manual for full details on these PMU events.
- * */
-
-#define ARM_PMU_SW_INCR 0x0000 /*!< Software update to the PMU_SWINC register, architecturally executed and condition code check pass */
-#define ARM_PMU_L1I_CACHE_REFILL 0x0001 /*!< L1 I-Cache refill */
-#define ARM_PMU_L1D_CACHE_REFILL 0x0003 /*!< L1 D-Cache refill */
-#define ARM_PMU_L1D_CACHE 0x0004 /*!< L1 D-Cache access */
-#define ARM_PMU_LD_RETIRED 0x0006 /*!< Memory-reading instruction architecturally executed and condition code check pass */
-#define ARM_PMU_ST_RETIRED 0x0007 /*!< Memory-writing instruction architecturally executed and condition code check pass */
-#define ARM_PMU_INST_RETIRED 0x0008 /*!< Instruction architecturally executed */
-#define ARM_PMU_EXC_TAKEN 0x0009 /*!< Exception entry */
-#define ARM_PMU_EXC_RETURN 0x000A /*!< Exception return instruction architecturally executed and the condition code check pass */
-#define ARM_PMU_PC_WRITE_RETIRED 0x000C /*!< Software change to the Program Counter (PC). Instruction is architecturally executed and condition code check pass */
-#define ARM_PMU_BR_IMMED_RETIRED 0x000D /*!< Immediate branch architecturally executed */
-#define ARM_PMU_BR_RETURN_RETIRED 0x000E /*!< Function return instruction architecturally executed and the condition code check pass */
-#define ARM_PMU_UNALIGNED_LDST_RETIRED 0x000F /*!< Unaligned memory memory-reading or memory-writing instruction architecturally executed and condition code check pass */
-#define ARM_PMU_BR_MIS_PRED 0x0010 /*!< Mispredicted or not predicted branch speculatively executed */
-#define ARM_PMU_CPU_CYCLES 0x0011 /*!< Cycle */
-#define ARM_PMU_BR_PRED 0x0012 /*!< Predictable branch speculatively executed */
-#define ARM_PMU_MEM_ACCESS 0x0013 /*!< Data memory access */
-#define ARM_PMU_L1I_CACHE 0x0014 /*!< Level 1 instruction cache access */
-#define ARM_PMU_L1D_CACHE_WB 0x0015 /*!< Level 1 data cache write-back */
-#define ARM_PMU_L2D_CACHE 0x0016 /*!< Level 2 data cache access */
-#define ARM_PMU_L2D_CACHE_REFILL 0x0017 /*!< Level 2 data cache refill */
-#define ARM_PMU_L2D_CACHE_WB 0x0018 /*!< Level 2 data cache write-back */
-#define ARM_PMU_BUS_ACCESS 0x0019 /*!< Bus access */
-#define ARM_PMU_MEMORY_ERROR 0x001A /*!< Local memory error */
-#define ARM_PMU_INST_SPEC 0x001B /*!< Instruction speculatively executed */
-#define ARM_PMU_BUS_CYCLES 0x001D /*!< Bus cycles */
-#define ARM_PMU_CHAIN 0x001E /*!< For an odd numbered counter, increment when an overflow occurs on the preceding even-numbered counter on the same PE */
-#define ARM_PMU_L1D_CACHE_ALLOCATE 0x001F /*!< Level 1 data cache allocation without refill */
-#define ARM_PMU_L2D_CACHE_ALLOCATE 0x0020 /*!< Level 2 data cache allocation without refill */
-#define ARM_PMU_BR_RETIRED 0x0021 /*!< Branch instruction architecturally executed */
-#define ARM_PMU_BR_MIS_PRED_RETIRED 0x0022 /*!< Mispredicted branch instruction architecturally executed */
-#define ARM_PMU_STALL_FRONTEND 0x0023 /*!< No operation issued because of the frontend */
-#define ARM_PMU_STALL_BACKEND 0x0024 /*!< No operation issued because of the backend */
-#define ARM_PMU_L2I_CACHE 0x0027 /*!< Level 2 instruction cache access */
-#define ARM_PMU_L2I_CACHE_REFILL 0x0028 /*!< Level 2 instruction cache refill */
-#define ARM_PMU_L3D_CACHE_ALLOCATE 0x0029 /*!< Level 3 data cache allocation without refill */
-#define ARM_PMU_L3D_CACHE_REFILL 0x002A /*!< Level 3 data cache refill */
-#define ARM_PMU_L3D_CACHE 0x002B /*!< Level 3 data cache access */
-#define ARM_PMU_L3D_CACHE_WB 0x002C /*!< Level 3 data cache write-back */
-#define ARM_PMU_LL_CACHE_RD 0x0036 /*!< Last level data cache read */
-#define ARM_PMU_LL_CACHE_MISS_RD 0x0037 /*!< Last level data cache read miss */
-#define ARM_PMU_L1D_CACHE_MISS_RD 0x0039 /*!< Level 1 data cache read miss */
-#define ARM_PMU_OP_COMPLETE 0x003A /*!< Operation retired */
-#define ARM_PMU_OP_SPEC 0x003B /*!< Operation speculatively executed */
-#define ARM_PMU_STALL 0x003C /*!< Stall cycle for instruction or operation not sent for execution */
-#define ARM_PMU_STALL_OP_BACKEND 0x003D /*!< Stall cycle for instruction or operation not sent for execution due to pipeline backend */
-#define ARM_PMU_STALL_OP_FRONTEND 0x003E /*!< Stall cycle for instruction or operation not sent for execution due to pipeline frontend */
-#define ARM_PMU_STALL_OP 0x003F /*!< Instruction or operation slots not occupied each cycle */
-#define ARM_PMU_L1D_CACHE_RD 0x0040 /*!< Level 1 data cache read */
-#define ARM_PMU_LE_RETIRED 0x0100 /*!< Loop end instruction executed */
-#define ARM_PMU_LE_SPEC 0x0101 /*!< Loop end instruction speculatively executed */
-#define ARM_PMU_BF_RETIRED 0x0104 /*!< Branch future instruction architecturally executed and condition code check pass */
-#define ARM_PMU_BF_SPEC 0x0105 /*!< Branch future instruction speculatively executed and condition code check pass */
-#define ARM_PMU_LE_CANCEL 0x0108 /*!< Loop end instruction not taken */
-#define ARM_PMU_BF_CANCEL 0x0109 /*!< Branch future instruction not taken */
-#define ARM_PMU_SE_CALL_S 0x0114 /*!< Call to secure function, resulting in Security state change */
-#define ARM_PMU_SE_CALL_NS 0x0115 /*!< Call to non-secure function, resulting in Security state change */
-#define ARM_PMU_DWT_CMPMATCH0 0x0118 /*!< DWT comparator 0 match */
-#define ARM_PMU_DWT_CMPMATCH1 0x0119 /*!< DWT comparator 1 match */
-#define ARM_PMU_DWT_CMPMATCH2 0x011A /*!< DWT comparator 2 match */
-#define ARM_PMU_DWT_CMPMATCH3 0x011B /*!< DWT comparator 3 match */
-#define ARM_PMU_MVE_INST_RETIRED 0x0200 /*!< MVE instruction architecturally executed */
-#define ARM_PMU_MVE_INST_SPEC 0x0201 /*!< MVE instruction speculatively executed */
-#define ARM_PMU_MVE_FP_RETIRED 0x0204 /*!< MVE floating-point instruction architecturally executed */
-#define ARM_PMU_MVE_FP_SPEC 0x0205 /*!< MVE floating-point instruction speculatively executed */
-#define ARM_PMU_MVE_FP_HP_RETIRED 0x0208 /*!< MVE half-precision floating-point instruction architecturally executed */
-#define ARM_PMU_MVE_FP_HP_SPEC 0x0209 /*!< MVE half-precision floating-point instruction speculatively executed */
-#define ARM_PMU_MVE_FP_SP_RETIRED 0x020C /*!< MVE single-precision floating-point instruction architecturally executed */
-#define ARM_PMU_MVE_FP_SP_SPEC 0x020D /*!< MVE single-precision floating-point instruction speculatively executed */
-#define ARM_PMU_MVE_FP_MAC_RETIRED 0x0214 /*!< MVE floating-point multiply or multiply-accumulate instruction architecturally executed */
-#define ARM_PMU_MVE_FP_MAC_SPEC 0x0215 /*!< MVE floating-point multiply or multiply-accumulate instruction speculatively executed */
-#define ARM_PMU_MVE_INT_RETIRED 0x0224 /*!< MVE integer instruction architecturally executed */
-#define ARM_PMU_MVE_INT_SPEC 0x0225 /*!< MVE integer instruction speculatively executed */
-#define ARM_PMU_MVE_INT_MAC_RETIRED 0x0228 /*!< MVE multiply or multiply-accumulate instruction architecturally executed */
-#define ARM_PMU_MVE_INT_MAC_SPEC 0x0229 /*!< MVE multiply or multiply-accumulate instruction speculatively executed */
-#define ARM_PMU_MVE_LDST_RETIRED 0x0238 /*!< MVE load or store instruction architecturally executed */
-#define ARM_PMU_MVE_LDST_SPEC 0x0239 /*!< MVE load or store instruction speculatively executed */
-#define ARM_PMU_MVE_LD_RETIRED 0x023C /*!< MVE load instruction architecturally executed */
-#define ARM_PMU_MVE_LD_SPEC 0x023D /*!< MVE load instruction speculatively executed */
-#define ARM_PMU_MVE_ST_RETIRED 0x0240 /*!< MVE store instruction architecturally executed */
-#define ARM_PMU_MVE_ST_SPEC 0x0241 /*!< MVE store instruction speculatively executed */
-#define ARM_PMU_MVE_LDST_CONTIG_RETIRED 0x0244 /*!< MVE contiguous load or store instruction architecturally executed */
-#define ARM_PMU_MVE_LDST_CONTIG_SPEC 0x0245 /*!< MVE contiguous load or store instruction speculatively executed */
-#define ARM_PMU_MVE_LD_CONTIG_RETIRED 0x0248 /*!< MVE contiguous load instruction architecturally executed */
-#define ARM_PMU_MVE_LD_CONTIG_SPEC 0x0249 /*!< MVE contiguous load instruction speculatively executed */
-#define ARM_PMU_MVE_ST_CONTIG_RETIRED 0x024C /*!< MVE contiguous store instruction architecturally executed */
-#define ARM_PMU_MVE_ST_CONTIG_SPEC 0x024D /*!< MVE contiguous store instruction speculatively executed */
-#define ARM_PMU_MVE_LDST_NONCONTIG_RETIRED 0x0250 /*!< MVE non-contiguous load or store instruction architecturally executed */
-#define ARM_PMU_MVE_LDST_NONCONTIG_SPEC 0x0251 /*!< MVE non-contiguous load or store instruction speculatively executed */
-#define ARM_PMU_MVE_LD_NONCONTIG_RETIRED 0x0254 /*!< MVE non-contiguous load instruction architecturally executed */
-#define ARM_PMU_MVE_LD_NONCONTIG_SPEC 0x0255 /*!< MVE non-contiguous load instruction speculatively executed */
-#define ARM_PMU_MVE_ST_NONCONTIG_RETIRED 0x0258 /*!< MVE non-contiguous store instruction architecturally executed */
-#define ARM_PMU_MVE_ST_NONCONTIG_SPEC 0x0259 /*!< MVE non-contiguous store instruction speculatively executed */
-#define ARM_PMU_MVE_LDST_MULTI_RETIRED 0x025C /*!< MVE memory instruction targeting multiple registers architecturally executed */
-#define ARM_PMU_MVE_LDST_MULTI_SPEC 0x025D /*!< MVE memory instruction targeting multiple registers speculatively executed */
-#define ARM_PMU_MVE_LD_MULTI_RETIRED 0x0260 /*!< MVE memory load instruction targeting multiple registers architecturally executed */
-#define ARM_PMU_MVE_LD_MULTI_SPEC 0x0261 /*!< MVE memory load instruction targeting multiple registers speculatively executed */
-#define ARM_PMU_MVE_ST_MULTI_RETIRED 0x0261 /*!< MVE memory store instruction targeting multiple registers architecturally executed */
-#define ARM_PMU_MVE_ST_MULTI_SPEC 0x0265 /*!< MVE memory store instruction targeting multiple registers speculatively executed */
-#define ARM_PMU_MVE_LDST_UNALIGNED_RETIRED 0x028C /*!< MVE unaligned memory load or store instruction architecturally executed */
-#define ARM_PMU_MVE_LDST_UNALIGNED_SPEC 0x028D /*!< MVE unaligned memory load or store instruction speculatively executed */
-#define ARM_PMU_MVE_LD_UNALIGNED_RETIRED 0x0290 /*!< MVE unaligned load instruction architecturally executed */
-#define ARM_PMU_MVE_LD_UNALIGNED_SPEC 0x0291 /*!< MVE unaligned load instruction speculatively executed */
-#define ARM_PMU_MVE_ST_UNALIGNED_RETIRED 0x0294 /*!< MVE unaligned store instruction architecturally executed */
-#define ARM_PMU_MVE_ST_UNALIGNED_SPEC 0x0295 /*!< MVE unaligned store instruction speculatively executed */
-#define ARM_PMU_MVE_LDST_UNALIGNED_NONCONTIG_RETIRED 0x0298 /*!< MVE unaligned noncontiguous load or store instruction architecturally executed */
-#define ARM_PMU_MVE_LDST_UNALIGNED_NONCONTIG_SPEC 0x0299 /*!< MVE unaligned noncontiguous load or store instruction speculatively executed */
-#define ARM_PMU_MVE_VREDUCE_RETIRED 0x02A0 /*!< MVE vector reduction instruction architecturally executed */
-#define ARM_PMU_MVE_VREDUCE_SPEC 0x02A1 /*!< MVE vector reduction instruction speculatively executed */
-#define ARM_PMU_MVE_VREDUCE_FP_RETIRED 0x02A4 /*!< MVE floating-point vector reduction instruction architecturally executed */
-#define ARM_PMU_MVE_VREDUCE_FP_SPEC 0x02A5 /*!< MVE floating-point vector reduction instruction speculatively executed */
-#define ARM_PMU_MVE_VREDUCE_INT_RETIRED 0x02A8 /*!< MVE integer vector reduction instruction architecturally executed */
-#define ARM_PMU_MVE_VREDUCE_INT_SPEC 0x02A9 /*!< MVE integer vector reduction instruction speculatively executed */
-#define ARM_PMU_MVE_PRED 0x02B8 /*!< Cycles where one or more predicated beats architecturally executed */
-#define ARM_PMU_MVE_STALL 0x02CC /*!< Stall cycles caused by an MVE instruction */
-#define ARM_PMU_MVE_STALL_RESOURCE 0x02CD /*!< Stall cycles caused by an MVE instruction because of resource conflicts */
-#define ARM_PMU_MVE_STALL_RESOURCE_MEM 0x02CE /*!< Stall cycles caused by an MVE instruction because of memory resource conflicts */
-#define ARM_PMU_MVE_STALL_RESOURCE_FP 0x02CF /*!< Stall cycles caused by an MVE instruction because of floating-point resource conflicts */
-#define ARM_PMU_MVE_STALL_RESOURCE_INT 0x02D0 /*!< Stall cycles caused by an MVE instruction because of integer resource conflicts */
-#define ARM_PMU_MVE_STALL_BREAK 0x02D3 /*!< Stall cycles caused by an MVE chain break */
-#define ARM_PMU_MVE_STALL_DEPENDENCY 0x02D4 /*!< Stall cycles caused by MVE register dependency */
-#define ARM_PMU_ITCM_ACCESS 0x4007 /*!< Instruction TCM access */
-#define ARM_PMU_DTCM_ACCESS 0x4008 /*!< Data TCM access */
-#define ARM_PMU_TRCEXTOUT0 0x4010 /*!< ETM external output 0 */
-#define ARM_PMU_TRCEXTOUT1 0x4011 /*!< ETM external output 1 */
-#define ARM_PMU_TRCEXTOUT2 0x4012 /*!< ETM external output 2 */
-#define ARM_PMU_TRCEXTOUT3 0x4013 /*!< ETM external output 3 */
-#define ARM_PMU_CTI_TRIGOUT4 0x4018 /*!< Cross-trigger Interface output trigger 4 */
-#define ARM_PMU_CTI_TRIGOUT5 0x4019 /*!< Cross-trigger Interface output trigger 5 */
-#define ARM_PMU_CTI_TRIGOUT6 0x401A /*!< Cross-trigger Interface output trigger 6 */
-#define ARM_PMU_CTI_TRIGOUT7 0x401B /*!< Cross-trigger Interface output trigger 7 */
-
-/** \brief PMU Functions */
-
-__STATIC_INLINE void ARM_PMU_Enable(void);
-__STATIC_INLINE void ARM_PMU_Disable(void);
-
-__STATIC_INLINE void ARM_PMU_Set_EVTYPER(uint32_t num, uint32_t type);
-
-__STATIC_INLINE void ARM_PMU_CYCCNT_Reset(void);
-__STATIC_INLINE void ARM_PMU_EVCNTR_ALL_Reset(void);
-
-__STATIC_INLINE void ARM_PMU_CNTR_Enable(uint32_t mask);
-__STATIC_INLINE void ARM_PMU_CNTR_Disable(uint32_t mask);
-
-__STATIC_INLINE uint32_t ARM_PMU_Get_CCNTR(void);
-__STATIC_INLINE uint32_t ARM_PMU_Get_EVCNTR(uint32_t num);
-
-__STATIC_INLINE uint32_t ARM_PMU_Get_CNTR_OVS(void);
-__STATIC_INLINE void ARM_PMU_Set_CNTR_OVS(uint32_t mask);
-
-__STATIC_INLINE void ARM_PMU_Set_CNTR_IRQ_Enable(uint32_t mask);
-__STATIC_INLINE void ARM_PMU_Set_CNTR_IRQ_Disable(uint32_t mask);
-
-__STATIC_INLINE void ARM_PMU_CNTR_Increment(uint32_t mask);
-
-/**
-  \brief Enable the PMU
-*/
-__STATIC_INLINE void ARM_PMU_Enable(void)
-{
-  PMU->CTRL |= PMU_CTRL_ENABLE_Msk;
-}
-
-/**
-  \brief Disable the PMU
-*/
-__STATIC_INLINE void ARM_PMU_Disable(void)
-{
-  PMU->CTRL &= ~PMU_CTRL_ENABLE_Msk;
-}
-
-/**
-  \brief Set event to count for PMU eventer counter
-  \param [in] num Event counter (0-30) to configure
-  \param [in] type Event to count
-*/
-__STATIC_INLINE void ARM_PMU_Set_EVTYPER(uint32_t num, uint32_t type)
-{
-  PMU->EVTYPER[num] = type;
-}
-
-/**
-  \brief Reset cycle counter
-*/
-__STATIC_INLINE void ARM_PMU_CYCCNT_Reset(void)
-{
-  PMU->CTRL |= PMU_CTRL_CYCCNT_RESET_Msk;
-}
-
-/**
-  \brief Reset all event counters
-*/
-__STATIC_INLINE void ARM_PMU_EVCNTR_ALL_Reset(void)
-{
-  PMU->CTRL |= PMU_CTRL_EVENTCNT_RESET_Msk;
-}
-
-/**
-  \brief Enable counters
-  \param [in] mask Counters to enable
-  \note Enables one or more of the following:
-          - event counters (0-30)
-          - cycle counter
-*/
-__STATIC_INLINE void ARM_PMU_CNTR_Enable(uint32_t mask)
-{
-  PMU->CNTENSET = mask;
-}
-
-/**
-  \brief Disable counters
-  \param [in] mask Counters to enable
-  \note Disables one or more of the following:
-          - event counters (0-30)
-          - cycle counter
-*/
-__STATIC_INLINE void ARM_PMU_CNTR_Disable(uint32_t mask)
-{
-  PMU->CNTENCLR = mask;
-}
-
-/**
-  \brief Read cycle counter
-  \return Cycle count
-*/
-__STATIC_INLINE uint32_t ARM_PMU_Get_CCNTR(void)
-{
-  return PMU->CCNTR;
-}
-
-/**
-  \brief Read event counter
-  \param [in] num Event counter (0-30) to read
-  \return Event count
-*/
-__STATIC_INLINE uint32_t ARM_PMU_Get_EVCNTR(uint32_t num)
-{
-  return PMU_EVCNTR_CNT_Msk & PMU->EVCNTR[num];
-}
-
-/**
-  \brief Read counter overflow status
-  \return Counter overflow status bits for the following:
-          - event counters (0-30)
-          - cycle counter
-*/
-__STATIC_INLINE uint32_t ARM_PMU_Get_CNTR_OVS(void)
-{
-  return PMU->OVSSET;
-}
-
-/**
-  \brief Clear counter overflow status
-  \param [in] mask Counter overflow status bits to clear
-  \note Clears overflow status bits for one or more of the following:
-          - event counters (0-30)
-          - cycle counter
-*/
-__STATIC_INLINE void ARM_PMU_Set_CNTR_OVS(uint32_t mask)
-{
-  PMU->OVSCLR = mask;
-}
-
-/**
-  \brief Enable counter overflow interrupt request
-  \param [in] mask Counter overflow interrupt request bits to set
-  \note Sets overflow interrupt request bits for one or more of the following:
-          - event counters (0-30)
-          - cycle counter
-*/
-__STATIC_INLINE void ARM_PMU_Set_CNTR_IRQ_Enable(uint32_t mask)
-{
-  PMU->INTENSET = mask;
-}
-
-/**
-  \brief Disable counter overflow interrupt request
-  \param [in] mask Counter overflow interrupt request bits to clear
-  \note Clears overflow interrupt request bits for one or more of the following:
-          - event counters (0-30)
-          - cycle counter
-*/
-__STATIC_INLINE void ARM_PMU_Set_CNTR_IRQ_Disable(uint32_t mask)
-{
-  PMU->INTENCLR = mask;
-}
-
-/**
-  \brief Software increment event counter
-  \param [in] mask Counters to increment
-  \note Software increment bits for one or more event counters (0-30)
-*/
-__STATIC_INLINE void ARM_PMU_CNTR_Increment(uint32_t mask)
-{
-  PMU->SWINC = mask;
-}
-
-#endif
diff --git a/envs/core/src/semihosting.c b/envs/core/src/semihosting.c
deleted file mode 100644
index 221c5f2..0000000
--- a/envs/core/src/semihosting.c
+++ /dev/null
@@ -1,78 +0,0 @@
-#if !defined(NO_SEMIHOSTING_EXIT)
-
-#include <stdint.h>
-#include <stdio.h>
-
-// TODO(dsprenkels) Currently, we only exit the QEMU host when a the program
-// exists sucessfully. We should also populate some interrupts handlers that
-// occur on errors and/or other exception.
-
-// These two syscall values are used at the end of the program, when we want
-// to tell the QEMU host that we are done. I took them from
-// .
-static const uint32_t REPORT_EXCEPTION = 0x18;
-static const uint32_t ApplicationExit = 0x20026;
-
-// Do a system call towards QEMU or the debugger.
-uint32_t semihosting_syscall(uint32_t nr, const uint32_t arg) {
-  __asm__ volatile (
-    "mov r0, %[nr]\n"
-    "mov r1, %[arg]\n"
-    "bkpt 0xAB\n"
-    "mov %[nr], r0\n"
-    : [nr] "+r" (nr) : [arg] "r" (arg) : "0", "1");
-  return nr;
-}
-
-// Register a destructor that will call qemu telling them that the program
-// has exited successfully.
-static void __attribute__ ((destructor)) semihosting_exit(void) {
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void NMI_Handler(void) {
-  puts("NMI_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void HardFault_Handler(void) {
-  puts("HardFault_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void MemManage_Handler(void) {
-  puts("MemManage_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void BusFault_Handler(void) {
-  puts("BusFault_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void UsageFault_Handler(void) {
-  puts("UsageFault_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void SecureFault_Handler(void) {
-  puts("SecureFault_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void SVC_Handler(void) {
-  puts("SVC_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void DebugMon_Handler(void) {
-  puts("DebugMon_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-void PendSV_Handler(void) {
-  puts("PendSV_Handler");
-  semihosting_syscall(REPORT_EXCEPTION, ApplicationExit);
-}
-
-#endif /* !defined(NO_SEMIHOSTING_EXIT) */
diff --git a/envs/core/src/startup_ARMCM55.c b/envs/core/src/startup_ARMCM55.c
deleted file mode 100644
index 1d854d3..0000000
--- a/envs/core/src/startup_ARMCM55.c
+++ /dev/null
@@ -1,165 +0,0 @@
-/******************************************************************************
- * @file     startup_ARMCM55.c
- * @brief    CMSIS-Core Device Startup File for Cortex-M55 Device
- * @version  V1.1.0
- * @date     16. December 2020
- ******************************************************************************/
-/*
- * Copyright (c) 2020 Arm Limited. All rights reserved.
- *
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the License); you may
- * not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an AS IS BASIS, WITHOUT
- * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#if defined (ARMCM55)
-  #include "ARMCM55.h"
-#else
-  #error device not specified!
-#endif
-
-/*----------------------------------------------------------------------------
-  External References
- *----------------------------------------------------------------------------*/
-extern uint32_t __INITIAL_SP;
-extern uint32_t __STACK_LIMIT;
-#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U)
-extern uint32_t __STACK_SEAL;
-#endif
-
-extern __NO_RETURN void __PROGRAM_START(void);
-
-/*----------------------------------------------------------------------------
-  Internal References
- *----------------------------------------------------------------------------*/
-__NO_RETURN void Reset_Handler (void);
-            void Default_Handler(void);
-
-/*----------------------------------------------------------------------------
-  Exception / Interrupt Handler
- *----------------------------------------------------------------------------*/
-/* Exceptions */
-/* void NMI_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void HardFault_Handler (void) __attribute__ ((weak)); */
-/* void MemManage_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void BusFault_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void UsageFault_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void SecureFault_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void SVC_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void DebugMon_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-/* void PendSV_Handler (void) __attribute__ ((weak, alias("Default_Handler"))); */
-void NMI_Handler (void);
-void HardFault_Handler (void);
-void MemManage_Handler (void);
-void BusFault_Handler (void);
-void UsageFault_Handler (void);
-void SecureFault_Handler (void);
-void SVC_Handler (void);
-void DebugMon_Handler (void);
-void PendSV_Handler (void);
-void SysTick_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-
-void Interrupt0_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt1_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt2_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt3_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt4_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt5_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt6_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt7_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt8_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-void Interrupt9_Handler (void) __attribute__ ((weak, alias("Default_Handler")));
-
-
-/*----------------------------------------------------------------------------
-  Exception / Interrupt Vector table
- *----------------------------------------------------------------------------*/
-
-#if defined ( __GNUC__ )
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Wpedantic"
-#endif
-
-extern const VECTOR_TABLE_Type __VECTOR_TABLE[496];
-       const VECTOR_TABLE_Type __VECTOR_TABLE[496] __VECTOR_TABLE_ATTRIBUTE = {
-  (VECTOR_TABLE_Type)(&__INITIAL_SP),  /*     Initial Stack Pointer */
-  Reset_Handler,                       /*     Reset Handler */
-  NMI_Handler,                         /* -14 NMI Handler */
-  HardFault_Handler,                   /* -13 Hard Fault Handler */
-  MemManage_Handler,                   /* -12 MPU Fault Handler */
-  BusFault_Handler,                    /* -11 Bus Fault Handler */
-  UsageFault_Handler,                  /* -10 Usage Fault Handler */
-  SecureFault_Handler,                 /*  -9 Secure Fault Handler */
-  0,                                   /*     Reserved */
-  0,                                   /*     Reserved */
-  0,                                   /*     Reserved */
-  SVC_Handler,                         /*  -5 SVC Handler */
-  DebugMon_Handler,                    /*  -4 Debug Monitor Handler */
-  0,                                   /*     Reserved */
-  PendSV_Handler,                      /*  -2 PendSV Handler */
-  SysTick_Handler,                     /*  -1 SysTick Handler */
-
-  /* Interrupts */
-  Interrupt0_Handler,                  /*   0 Interrupt 0 */
-  Interrupt1_Handler,                  /*   1 Interrupt 1 */
-  Interrupt2_Handler,                  /*   2 Interrupt 2 */
-  Interrupt3_Handler,                  /*   3 Interrupt 3 */
-  Interrupt4_Handler,                  /*   4 Interrupt 4 */
-  Interrupt5_Handler,                  /*   5 Interrupt 5 */
-  Interrupt6_Handler,                  /*   6 Interrupt 6 */
-  Interrupt7_Handler,                  /*   7 Interrupt 7 */
-  Interrupt8_Handler,                  /*   8 Interrupt 8 */
-  Interrupt9_Handler                   /*   9 Interrupt 9 */
-                                       /* Interrupts 10 ..
480 are left out */ -}; - -#if defined ( __GNUC__ ) -#pragma GCC diagnostic pop -#endif - -/*---------------------------------------------------------------------------- - Reset Handler called on controller reset - *----------------------------------------------------------------------------*/ -__NO_RETURN void Reset_Handler(void) -{ - __set_PSP((uint32_t)(&__INITIAL_SP)); - - __set_MSPLIM(0); - __set_PSPLIM(0); - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) - __TZ_set_STACKSEAL_S((uint32_t *)(&__STACK_SEAL)); -#endif - - SystemInit(); /* CMSIS System Initialization */ - __PROGRAM_START(); /* Enter PreMain (C library entry point) */ -} - - -#if defined(__ARMCC_VERSION) && (__ARMCC_VERSION >= 6010050) - #pragma clang diagnostic push - #pragma clang diagnostic ignored "-Wmissing-noreturn" -#endif - -/*---------------------------------------------------------------------------- - Default Handler for Exceptions / Interrupts - *----------------------------------------------------------------------------*/ -void Default_Handler(void) -{ - while(1); -} - -#if defined(__ARMCC_VERSION) && (__ARMCC_VERSION >= 6010050) - #pragma clang diagnostic pop -#endif - diff --git a/envs/core/src/system_ARMCM55.c b/envs/core/src/system_ARMCM55.c deleted file mode 100644 index 3ecc8e1..0000000 --- a/envs/core/src/system_ARMCM55.c +++ /dev/null @@ -1,151 +0,0 @@ -/**************************************************************************//** - * @file system_ARMCM55.c - * @brief CMSIS Device System Source File for - * ARMCM55 Device - * @version V1.0.2 - * @date 13. Oct 2021 - ******************************************************************************/ -/* - * Copyright (c) 2009-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#if defined (ARMCM55) - #include "ARMCM55.h" -#else - #error device not specified! -#endif - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) - #include "partition_ARMCM55.h" -#endif - -#include "uart.h" - -/*---------------------------------------------------------------------------- - Define clocks - *----------------------------------------------------------------------------*/ -#define XTAL ( 5000000UL) /* Oscillator frequency */ - -#define SYSTEM_CLOCK (5U * XTAL) - - -/*---------------------------------------------------------------------------- - Exception / Interrupt Vector table - *----------------------------------------------------------------------------*/ -extern const VECTOR_TABLE_Type __VECTOR_TABLE[496]; - - -/*---------------------------------------------------------------------------- - System Core Clock Variable - *----------------------------------------------------------------------------*/ -uint32_t SystemCoreClock = SYSTEM_CLOCK; - - -/*---------------------------------------------------------------------------- - System Core Clock update function - *----------------------------------------------------------------------------*/ -void SystemCoreClockUpdate (void) -{ - SystemCoreClock = SYSTEM_CLOCK; -} - -/*---------------------------------------------------------------------------- - System initialization function - *----------------------------------------------------------------------------*/ -void SystemInit (void) -{ - -#if defined (__VTOR_PRESENT) && (__VTOR_PRESENT == 1U) - SCB->VTOR = 
(uint32_t)(&__VECTOR_TABLE[0]); -#endif - -/* #if (defined (__FPU_USED) && (__FPU_USED == 1U)) || \ */ -/* (defined (__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE > 0U)) */ - SCB->CPACR |= ((3U << 10U*2U) | /* enable CP10 Full Access */ - (3U << 11U*2U) ); /* enable CP11 Full Access */ - - /* Set low-power state for PDEPU */ - /* 0b00 | ON, PDEPU is not in low-power state */ - /* 0b01 | ON, but the clock is off */ - /* 0b10 | RET(ention) */ - /* 0b11 | OFF */ - - /* Clear ELPSTATE, value is 0b11 on Cold reset */ - PWRMODCTL->CPDLPSTATE &= ~(PWRMODCTL_CPDLPSTATE_ELPSTATE_Msk << PWRMODCTL_CPDLPSTATE_ELPSTATE_Pos); - - /* Favor best FP/MVE performance by default, avoid EPU switch-ON delays */ - /* PDEPU ON, Clock OFF */ - PWRMODCTL->CPDLPSTATE |= 0x1 << PWRMODCTL_CPDLPSTATE_ELPSTATE_Pos; -/* #endif */ - -#ifdef UNALIGNED_SUPPORT_DISABLE - SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk; -#endif - - /* Enable Loop and branch info cache */ - SCB->CCR |= SCB_CCR_LOB_Msk; - __DSB(); - __ISB(); - -#if defined (__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3U) - TZ_SAU_Setup(); -#endif - - SystemCoreClock = SYSTEM_CLOCK; - - uart_init(); -} - -#include -#undef errno -extern int errno; - -int __wrap__read(int file, char* ptr, int len) -{ - if (file == 0) { - int i; - for (i = 0; i < len; ++i) { - ptr[i] = uart_getc(); - if (ptr[i] == '\r') { - ptr[i] = '\n'; - } - if (ptr[i] == '\n') { - i += 1; - break; - } - } - errno = 0; - return i; - } else { - errno = ENOSYS; - } - return -1; -} - -int __wrap__write(int file, char* ptr, int len) -{ - if (file == 1 || file == 2) { - for (int i = 0; i < len; ++i) { - uart_putc(ptr[i]); - } - errno = 0; - return len; - } else { - errno = ENOSYS; - } - return -1; -} diff --git a/envs/core/src/system_ARMCM55.h b/envs/core/src/system_ARMCM55.h deleted file mode 100644 index ae5f25e..0000000 --- a/envs/core/src/system_ARMCM55.h +++ /dev/null @@ -1,64 +0,0 @@ -/**************************************************************************//** - * @file 
system_ARMCM55.h - * @brief CMSIS Device System Header File for - * ARMCM55 Device - * @version V1.0.0 - * @date 20. February 2020 - ******************************************************************************/ -/* - * Copyright (c) 2020 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#ifndef SYSTEM_ARMCM55_H -#define SYSTEM_ARMCM55_H - -#ifdef __cplusplus -extern "C" { -#endif - -#include - -/** - \brief Exception / Interrupt Handler Function Prototype -*/ -typedef void(*VECTOR_TABLE_Type)(void); - -/** - \brief System Clock Frequency (Core Clock) -*/ -extern uint32_t SystemCoreClock; - -/** - \brief Setup the microcontroller system. - - Initialize the System and update the SystemCoreClock variable. - */ -extern void SystemInit (void); - - -/** - \brief Update SystemCoreClock variable. - - Updates the SystemCoreClock with current core Clock retrieved from cpu registers. 
- */ -extern void SystemCoreClockUpdate (void); - -#ifdef __cplusplus -} -#endif - -#endif /* SYSTEM_ARMCM55_H */ diff --git a/envs/core/src/test_common b/envs/core/src/test_common deleted file mode 120000 index 7c5f7b1..0000000 --- a/envs/core/src/test_common +++ /dev/null @@ -1 +0,0 @@ -../../../tests/common \ No newline at end of file diff --git a/envs/core/src/test_src b/envs/core/src/test_src deleted file mode 120000 index 5844239..0000000 --- a/envs/core/src/test_src +++ /dev/null @@ -1 +0,0 @@ -../../../tests/helloworld \ No newline at end of file diff --git a/envs/core/src/uart.c b/envs/core/src/uart.c deleted file mode 100644 index 4302a6e..0000000 --- a/envs/core/src/uart.c +++ /dev/null @@ -1,101 +0,0 @@ -/* - * Copyright (c) 2019-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include "uart.h" -#include -#include - -#define UART0_BASE 0x49303000 -#define UART0_BAUDRATE 115200 -#define SYSTEM_CORE_CLOCK 25000000 - -/*------------- Universal Asynchronous Receiver Transmitter (UART) -----------*/ - -#define __IO volatile -#define __I volatile const -#define __O volatile - -typedef struct -{ - __IO uint32_t DATA; /* Offset: 0x000 (R/W) Data Register */ - __IO uint32_t STATE; /* Offset: 0x004 (R/W) Status Register */ - __IO uint32_t CTRL; /* Offset: 0x008 (R/W) Control Register */ - union - { - __I uint32_t INTSTATUS; /* Offset: 0x00C (R/ ) Interrupt Status Register */ - __O uint32_t INTCLEAR; /* Offset: 0x00C ( /W) Interrupt Clear Register */ - }; - __IO uint32_t BAUDDIV; /* Offset: 0x010 (R/W) Baudrate Divider Register */ -} CMSDK_UART_TypeDef; - -#define CMSDK_UART0_BASE UART0_BASE -#define CMSDK_UART0 ((CMSDK_UART_TypeDef *)CMSDK_UART0_BASE) -#define CMSDK_UART0_BAUDRATE UART0_BAUDRATE - -void uart_init(void) -{ - // SystemCoreClock / 9600 - CMSDK_UART0->BAUDDIV = SYSTEM_CORE_CLOCK / CMSDK_UART0_BAUDRATE; - - CMSDK_UART0->CTRL = ((1ul << 0) | /* TX enable */ - (1ul << 1)); /* RX enable */ -} - -// Output a character -unsigned char uart_putc(unsigned char my_ch) -{ - while ((CMSDK_UART0->STATE & 1)) - ; // Wait if Transmit Holding register is full - - if (my_ch == '\n') - { - CMSDK_UART0->DATA = '\r'; - while ((CMSDK_UART0->STATE & 1)) - ; // Wait if Transmit Holding register is full - } - - CMSDK_UART0->DATA = my_ch; // write to transmit holding register - - return (my_ch); -} - -// Get a character -unsigned char uart_getc(void) -{ - unsigned char my_ch; - // unsigned int cnt; - - while ((CMSDK_UART0->STATE & 2) == 0) // Wait if Receive Holding register is empty - { -#if 0 - cnt = MPS3_FPGAIO->CLK100HZ / 50; - if (cnt & 0x8) - MPS3_FPGAIO->LED = 0x01 << (cnt & 0x7); - else - MPS3_FPGAIO->LED = 0x80 >> (cnt & 0x7); -#endif - } - - my_ch = CMSDK_UART0->DATA; - - // Convert CR to LF - if (my_ch == '\r') - my_ch = '\n'; - - return 
(my_ch); -} diff --git a/envs/core/src/uart.h b/envs/core/src/uart.h deleted file mode 100644 index dbb1ed2..0000000 --- a/envs/core/src/uart.h +++ /dev/null @@ -1,34 +0,0 @@ -/* - * Copyright (c) 2019-2021 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#ifndef _UART_STDOUT_H_ -#define _UART_STDOUT_H_ - -#if __cplusplus -extern "C" { -#endif - -void uart_init(void); -unsigned char uart_putc(unsigned char my_ch); -unsigned char uart_getc(void); - -#if __cplusplus -} -#endif - -#endif diff --git a/envs/fvp-corstone300-mps2/.gitignore b/envs/fvp-corstone300-mps2/.gitignore deleted file mode 100644 index 8d1512e..0000000 --- a/envs/fvp-corstone300-mps2/.gitignore +++ /dev/null @@ -1,5 +0,0 @@ -image.axf -test_loaded_* -build/* -src/test_src -inc/test_inc \ No newline at end of file diff --git a/envs/fvp-corstone300-mps2/Makefile b/envs/fvp-corstone300-mps2/Makefile deleted file mode 100644 index 121a9c6..0000000 --- a/envs/fvp-corstone300-mps2/Makefile +++ /dev/null @@ -1,180 +0,0 @@ -# Test environment based on FVP for Corstone-300 MPS2 based platform -# -# Copyright (c) 2021 Arm Limited (or its affiliates). All rights reserved. 
-# Use, modification and redistribution of this file is subject to your possession of a -# valid End User License Agreement for the Arm Product of which these examples are part of -# and your compliance with all applicable terms and conditions of such licence agreement. - -################################################################################ -### ### -### USER CONFIGURATION ### -### ADAPT THIS ### -### ### -################################################################################ - -# -# See README.md for setup instructions -# - -# The base directory of the FVP for the Corstone-300 MPS2 based platform -# Adjust this if you haven't installed the FVP in the tools directory. -FVP_CORSTONE300_MPS2_DIR?=../../tools/FVP_Corstone300_MPS2 - -# The base directory for CMSIS 5 -# Adjust this if you haven't cloned CMSIS 5 into the tools directory -CMSIS_BASE?=../../tools/CMSIS_5 - -# The base directory of the CMSIS pack for the Corstone-300 MPS2 based platform -# Adjust this if you haven't installed the CMSIS pack in the tools directory. -CMSIS_PACK_CORSTONE300_MPS2_DIR?=../../tools/CMSIS_FVP_Corstone300_MPS2 - -# If Arm Compiler 6 is in your PATH, you may leave this undefined. Otherwise, -# set ARMC6_DIR to the location of the Arm Compiler 6 binaries. 
- -ifdef ARMC6_DIR -_ARMC6_DIR= $(ARMC6_DIR)/ -else -_ARMC6_DIR= -endif - -ifdef GCC10_DIR -_GCC10_DIR= $(GCC10_DIR)/ -else -_GCC10_DIR= -endif - -################################################################################ -### ### -### END OF USER CONFIGURATION ### -### ### -################################################################################ - -ifndef FVP_CORSTONE300_MPS2_DIR -$(error FVP_CORSTONE300_MPS2_DIR environment variable MUST be set) -endif - -ifndef CMSIS_PACK_CORSTONE300_MPS2_DIR -$(error CMSIS_PACK_CORSTONE300_MPS2_DIR environment variable MUST be set) -endif - -# Derived variables -FVP_EXECUTABLE_PATH=$(FVP_CORSTONE300_MPS2_DIR)/models/Linux64_GCC-6.4/FVP_MPS2_Corstone_SSE-300 -CMSIS_BOARD=$(CMSIS_PACK_CORSTONE300_MPS2_DIR)/Board -CMSIS_DEVICE=$(CMSIS_PACK_CORSTONE300_MPS2_DIR)/Device - -# Final image -TARGET=image.elf - -INC_DIR=./inc -INC_DIR_TEST=$(INC_DIR)/test_inc -I$(SRC_DIR)/test_src/manual -I$(SRC_DIR)/test_src/auto -BUILD_DIR=./build -SRC_DIR=./src - -ifndef USE_GCC - -# Arm Compiler Toolchain - -CC=$(_ARMC6_DIR)armclang -LD=$(_ARMC6_DIR)armlink - -CFLAGS = --target=arm-arm-none-eabi -mcpu=Cortex-M55 \ - -I$(CMSIS_DEVICE)/Include \ - -I$(CMSIS_BOARD)/Platform/ \ - -I$(CMSIS_BASE)/CMSIS/Core/Include \ - -I$(INC_DIR) -I$(INC_DIR_TEST) - -# Scatter files before/after preprocessing -SCATTER_SRC=$(SRC_DIR)/armclang.sct -SCATTER_SRC_TMP=$(SCATTER_SRC).tmp -LDFLAGS = --entry=Reset_Handler --scatter=$(SCATTER_SRC_TMP) - -else # AC6 / GCC - -# GCC toolchain - -# Toolchain -CC=$(_GCC10_DIR)arm-none-eabi-gcc-10.2.1 -LD=$(_GCC10_DIR)arm-none-eabi-gcc-10.2.1 - -CFLAGS = -mfloat-abi=softfp -march=armv8.1-m.main+mve -mcpu=cortex-m55 \ - -I$(CMSIS_DEVICE)/Include \ - -I$(CMSIS_BOARD)/Platform/ \ - -I$(CMSIS_BASE)/CMSIS/Core/Include \ - -I$(INC_DIR) -I$(INC_DIR_TEST) - -# Scatter files before/after preprocessing -SCATTER_SRC=$(SRC_DIR)/gcc_arm.ld -SCATTER_SRC_TMP=$(SCATTER_SRC).tmp - -LDFLAGS = -mfloat-abi=softfp -march=armv8.1-m.main+mve 
-mcpu=cortex-m55 --entry=Reset_Handler \ - -T $(SCATTER_SRC_TMP) -specs=rdimon.specs - -endif # AC6 / GCC - -CMSIS_SRC_DIR=$(CMSIS_DEVICE)/Source -C_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard $(SRC_DIR)/*/*/*.c) -C_SRC_FILES=$(patsubst $(SRC_DIR)/%.c, %.c, $(C_SRC_FILES_PRE)) -ASM_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*/*.s) $(wildcard $(SRC_DIR)/*.s) $(wildcard $(SRC_DIR)/*/*/*.s) -ASM_SRC_FILES=$(patsubst $(SRC_DIR)/%.s, %.s, $(ASM_SRC_FILES_PRE)) -CMSIS_SRC_FILES_PRE=$(wildcard $(CMSIS_SRC_DIR)/*.c) -CMSIS_SRC_FILES=$(patsubst $(CMSIS_SRC_DIR)/%.c, %.c, $(CMSIS_SRC_FILES_PRE)) - -HEADER_FILES_PRE=$(wildcard $(SRC_DIR)/*.h) $(wildcard $(SRC_DIR)/*/*.h) $(wildcard $(SRC_DIR)/*/*/*.h) - -ASM_OBJ_FILES=$(patsubst %.s, $(BUILD_DIR)/%.o, $(ASM_SRC_FILES)) -CMSIS_OBJ_FILES=$(patsubst %.c, $(BUILD_DIR)/%.o, $(CMSIS_SRC_FILES)) -C_OBJ_FILES=$(patsubst %.c, $(BUILD_DIR)/%.o, $(C_SRC_FILES)) -OBJ_FILES=$(ASM_OBJ_FILES) $(C_OBJ_FILES) $(CMSIS_OBJ_FILES) - -.phony: all clean debug run - -all: $(TARGET) - -ifndef USE_GCC -$(SCATTER_SRC_TMP): $(SCATTER_SRC) - mkdir -p $(@D) - # Preprocess linker script manually - $(CC) $(CFLAGS) -E -xc $< -o $@ -else -$(SCATTER_SRC_TMP): $(SCATTER_SRC) - mkdir -p $(@D) - cp $< $@ -endif - -# Compilation -$(CMSIS_OBJ_FILES): $(BUILD_DIR)/%.o: $(CMSIS_SRC_DIR)/%.c - mkdir -p $(@D) - $(CC) $(CFLAGS) -c -o $@ $< -$(C_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.c $(HEADER_FILES_PRE) - mkdir -p $(@D) - $(CC) $(CFLAGS) -c -o $@ $< -$(ASM_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.s $(HEADER_FILES_PRE) - mkdir -p $(@D) - $(CC) -x assembler-with-cpp $(CFLAGS) -c -o $@ $< - -# Linking -$(TARGET): $(OBJS_DIR) $(OBJ_FILES) $(SCATTER_SRC_TMP) - mkdir -p $(@D) - $(LD) $(LDFLAGS) $(OBJ_FILES) -o $(TARGET) - -# Running -run: $(TARGET) - "$(FVP_EXECUTABLE_PATH)" \ - -a cpu0=$(TARGET) \ - -C cpu0.semihosting-enable=1 \ - -C cpu0.semihosting-stack_base=0x0 \ - -C cpu0.semihosting-stack_limit=0x0 \ - -C cpu0.FPU=1 -C cpu0.MVE=2 
-debug: $(TARGET) - "$(FVP_EXECUTABLE_PATH)" \ - -a cpu0=$(TARGET) \ - -C cpu0.FPU=1 -C cpu0.MVE=2 \ - -C cpu0.semihosting-enable=1 \ - -C cpu0.semihosting-stack_base=0x0 \ - -C cpu0.semihosting-stack_limit=0x0 \ - -S - -clean: - rm -rf $(SCATTER_SRC_TMP) - rm -rf $(OBJ_FILES) - rm -rf $(TARGET) diff --git a/envs/fvp-corstone300-mps2/Makefile.rej b/envs/fvp-corstone300-mps2/Makefile.rej deleted file mode 100644 index 5133333..0000000 --- a/envs/fvp-corstone300-mps2/Makefile.rej +++ /dev/null @@ -1,18 +0,0 @@ -diff a/envs/fvp-corstone300-mps2/Makefile b/envs/fvp-corstone300-mps2/Makefile (rejected hunks) -@@ -102,15 +102,13 @@ CFLAGS = -march=armv8.1-m.main+mve -mcpu=cortex-m55 \ - -I$(INC_DIR) -I$(INC_DIR_TEST) - - # Scatter files before/after preprocessing --SCATTER_DEPENDENCIES=$(wildcard $(CMSIS_DEVICE)/Include/*.h) --SCATTER_SRC_PRE=$(CMSIS_BASE)/Device/ARM/ARMCM55/Source/GCC/gcc_arm.ld --SCATTER_SRC=$(BUILD_DIR)/gcc_arm.ld -+SCATTER_SRC=$(SRC_DIR)/gcc_arm.ld - - LDFLAGS = -march=armv8.1-m.main+mve -mcpu=cortex-m55 --entry=Reset_Handler \ - -T $(SCATTER_SRC) -specs=rdimon.specs - - endif # AC6 / GCC - - CMSIS_SRC_DIR=$(CMSIS_DEVICE)/Source - C_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard $(SRC_DIR)/*/*/*.c) - C_SRC_FILES=$(patsubst $(SRC_DIR)/%.c, %.c, $(C_SRC_FILES_PRE)) diff --git a/envs/fvp-corstone300-mps2/README.md b/envs/fvp-corstone300-mps2/README.md deleted file mode 100644 index afc970e..0000000 --- a/envs/fvp-corstone300-mps2/README.md +++ /dev/null @@ -1,100 +0,0 @@ -# Exploring Arm v8.1-M + MVE with Arm® Corstone™-300 MPS2-based platform - -## Overview - -This directory provides a test environment for Arm v8.1-M based code using the freely available Fixed Virtual Platform (FVP) for the Corstone SSE-300 -subsystem, which includes an Arm® Cortex™-M55 processor. - -__Important:__ This test environment is testing of functional behaviour only: No cycle count estimates are provided. 
- -## Usage - -### As part of pqmx - -This test environment is intended to be used from the top of the repository; see the -corresponding [README](../../README.md). In this case, the [`src`](src) directory will be populated with test files -through symlinks. - -### Standalone - -This directory can also be used in a standalone fashion using the provided makefile: - -- `make` will compile and link the target image -- `make run` will run the target image in the FVP -- `make debug` will run the target image in the FVP with debugging enabled - -When used in this way, all source files within the [`src`](src) directory will be compiled and linked, alongside the -necessary CMSIS bootcode. - -### Choice of Compiler - -By default, `make` will use Arm Compiler for compilation and linking. If you wish to use GCC instead, supply -`USE_GCC=1`, e.g. `make USE_GCC=1 run`. - -## Installation - -In order to use this test environment, the following software is needed: - -- An MVE-aware toolchain, such as ARM Compiler 6.14 or higher, or the GNU Arm Embedded Toolchain version - 10-2020-q4-major or higher -- FVP for Corstone-300 MPS2 based platform -- CMSIS 5 -- CMSIS pack for Corstone-300 MPS2 based platform - -The following sections describe the setup for each of these in detail. - -### Arm Compiler 6 - -- Download and unpack the latest version of Arm Compiler 6 from -[here](https://developer.arm.com/tools-and-software/embedded/arm-compiler/downloads/version-6). - -- Use installation script to install Arm Compiler 6 in a location of your choice, for example - a subdirectory of the [`tools`](../../../tools) directory in this repository. See the [Getting Started -> Installing - Arm Compiler](https://developer.arm.com/documentation/100748/0615/Getting-Started/Installing-Arm-Compiler?lang=en) - section in the Arm Compiler User Guide for more information. - -- Either set `ARMC6_DIR` to the directory containing the Arm Compiler 6 binaries, or add it to your `PATH`. 
- -### FVP for Corstone-300 MPS2 based platform - -- Download and unpack FVP for Corstone-300 MPS2 based platform from - [here](https://developer.arm.com/tools-and-software/open-source-software/arm-platforms-software/arm-ecosystem-fvps): - Select `Corstone-300 Ecosystem FVPs`, then `Download the FVP model for the Corstone-300 MPS2 based platform`. - -- Use installation script to install FVP in a location of your choice, for example a subdirectory of the - [`tools`](../../../tools) directory in this repository. - -- Set the environment variable `FVP_CORSTONE300_MPS2_DIR` to the to the base directory of the installed FVP. - -#### Arm Compiler via Arm Development Studio - -Instead of downloading and installing the Arm Compiler manually, you can also acquire it as part of the [Arm Development -Studio](https://developer.arm.com/tools-and-software/embedded/arm-development-studio). - -### GCC 10.1 - -- Download and install the [GNU Arm Embedded - Toolchain](https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm) in - a location of your choice, for example a subdirectory of the - [`tools`](../../../tools) directory in this repository. - -- Either set `GCC10_DIR` to the directory containing the Arm embedded GCC toolchain binaries, or add it to your `PATH`. - -- To use GCC for compilation and linking, supply `USE_GCC=1` to invocations of `make` -- see above. - -### CMSIS 5 - -- Clone CMSIS 5 from the [public GitHub repository](https://github.com/ARM-software/CMSIS_5) into a location of your - choice, for example a subdirectory of the [`tools`](../../../tools) directory in this repository. - -- Set the environment variable `CMSIS_BASE` to the base of the CMSIS repository. - -### CMSIS Pack for Corstone-300 MPS2 based platform - -- Download CMSIS pack from https://www.keil.com/dd2/arm/sse_300_mps2/ - -- Unpack CMSIS pack (e.g. 
using `unzip`, the pack is a zip archive) to a location of your choice, - for example a subdirectory of the [`tools`](../../../tools) directory in this repository. - -- Set the environment variable `CMSIS_PACK_CORSTONE300_MPS2_DIR` to the location of the unpacked - CMSIS pack. diff --git a/envs/fvp-corstone300-mps2/inc/hal_env.h b/envs/fvp-corstone300-mps2/inc/hal_env.h deleted file mode 100644 index 0194c7f..0000000 --- a/envs/fvp-corstone300-mps2/inc/hal_env.h +++ /dev/null @@ -1,6 +0,0 @@ -#if !defined(TEST_ENV_CORSTONE300_MPS3_HAL_ENV_HDR) -#define TEST_ENV_CORSTONE300_MPS3_HAL_ENV_HDR - -/* Nothing in the HAL is implemented through macros. */ - -#endif /* TEST_ENV_CORSTONE300_MPS3_HAL_ENV_HDR */ diff --git a/envs/fvp-corstone300-mps2/src/armclang.sct b/envs/fvp-corstone300-mps2/src/armclang.sct deleted file mode 100644 index 087f6a1..0000000 --- a/envs/fvp-corstone300-mps2/src/armclang.sct +++ /dev/null @@ -1,63 +0,0 @@ -;/* -; * Copyright (c) 2018-2020 Arm Limited -; * -; * Licensed under the Apache License, Version 2.0 (the "License"); -; * you may not use this file except in compliance with the License. -; * You may obtain a copy of the License at -; * -; * http://www.apache.org/licenses/LICENSE-2.0 -; * -; * Unless required by applicable law or agreed to in writing, software -; * distributed under the License is distributed on an "AS IS" BASIS, -; * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -; * See the License for the specific language governing permissions and -; * limitations under the License. 
-; * -; */ - -#include "region_defs.h" - -#undef STACK_SIZE -#define STACK_SIZE (0x10000) -#undef HEAP_SIZE -#define HEAP_SIZE (0xC00) - -#define __STACK_TOP (S_DATA_START + S_DATA_SIZE) - -LR_CODE S_CODE_START { - ER_CODE S_CODE_START { - *.o (RESET +First) - .ANY (+RO) - } - - /* - * Place the CMSE Veneers (containing the SG instruction) after the code, in - * a separate 32 bytes aligned region so that the SAU can programmed to just - * set this region as Non-Secure Callable. The maximum size of this - * executable region makes it only used the space left over by the ER_CODE - * region so that you can rely on code+veneer size combined will not exceed - * the S_CODE_SIZE value. We also substract from the available space the - * area used to align this section on 32 bytes boundary (for SAU conf). - */ - ER_CODE_CMSE_VENEER +0 ALIGN 32 { - *(Veneer$$CMSE) - } - /* - * This dummy region ensures that the next one will be aligned on a 32 bytes - * boundary, so that the following region will not be mistakenly configured - * as Non-Secure Callable by the SAU. - */ - ER_CODE_CMSE_VENEER_DUMMY +0 ALIGN 32 EMPTY 0 {} - - ER_DATA S_DATA_START S_DATA_SIZE { - .ANY (+ZI +RW) - } - - #if HEAP_SIZE > 0 - ARM_LIB_HEAP AlignExpr(+0, 8) EMPTY HEAP_SIZE { ; Reserve empty region for heap - } - #endif - - ARM_LIB_STACK __STACK_TOP EMPTY -STACK_SIZE { ; Reserve empty region for stack - } -} diff --git a/envs/fvp-corstone300-mps2/src/gcc_arm.ld b/envs/fvp-corstone300-mps2/src/gcc_arm.ld deleted file mode 100644 index 1c23aff..0000000 --- a/envs/fvp-corstone300-mps2/src/gcc_arm.ld +++ /dev/null @@ -1,316 +0,0 @@ -/****************************************************************************** - * @file gcc_arm.ld - * @brief GNU Linker Script for Cortex-M based device - * @version V2.2.0 - * @date 16. December 2020 - ******************************************************************************/ -/* - * Copyright (c) 2009-2020 Arm Limited. All rights reserved. 
- * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* - *-------- <<< Use Configuration Wizard in Context Menu >>> ------------------- - */ - -/*---------------------- Flash Configuration ---------------------------------- - Flash Configuration - Flash Base Address <0x0-0xFFFFFFFF:8> - Flash Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__ROM_BASE = 0x00000000; -__ROM_SIZE = 0x00040000; - -/*--------------------- Embedded RAM Configuration ---------------------------- - RAM Configuration - RAM Base Address <0x0-0xFFFFFFFF:8> - RAM Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__RAM_BASE = 0x20000000; -__RAM_SIZE = 0x00020000; - -/*--------------------- Stack / Heap Configuration ---------------------------- - Stack / Heap Configuration - Stack Size (in Bytes) <0x0-0xFFFFFFFF:8> - Heap Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__STACK_SIZE = 0x00010000; -__HEAP_SIZE = 0x00000C00; - -/* - *-------------------- <<< end of configuration section >>> ------------------- - */ - -/* ARMv8-M stack sealing: - to use ARMv8-M stack sealing set __STACKSEAL_SIZE to 8 otherwise keep 0 - */ -__STACKSEAL_SIZE = 0; - - -MEMORY -{ - FLASH (rx) : ORIGIN = __ROM_BASE, LENGTH = __ROM_SIZE - RAM (rwx) : ORIGIN = 
__RAM_BASE, LENGTH = __RAM_SIZE -} - -/* Linker script to place sections and symbol values. Should be used together - * with other linker script that defines memory regions FLASH and RAM. - * It references following symbols, which must be defined in code: - * Reset_Handler : Entry of reset handler - * - * It defines following symbols, which code can use without definition: - * __exidx_start - * __exidx_end - * __copy_table_start__ - * __copy_table_end__ - * __zero_table_start__ - * __zero_table_end__ - * __etext - * __data_start__ - * __preinit_array_start - * __preinit_array_end - * __init_array_start - * __init_array_end - * __fini_array_start - * __fini_array_end - * __data_end__ - * __bss_start__ - * __bss_end__ - * __end__ - * end - * __HeapLimit - * __StackLimit - * __StackTop - * __stack - * __StackSeal (only if ARMv8-M stack sealing is used) - */ -ENTRY(Reset_Handler) - -SECTIONS -{ - .text : - { - KEEP(*(.vectors)) - *(.text*) - - KEEP(*(.init)) - KEEP(*(.fini)) - - /* .ctors */ - *crtbegin.o(.ctors) - *crtbegin?.o(.ctors) - *(EXCLUDE_FILE(*crtend?.o *crtend.o) .ctors) - *(SORT(.ctors.*)) - *(.ctors) - - /* .dtors */ - *crtbegin.o(.dtors) - *crtbegin?.o(.dtors) - *(EXCLUDE_FILE(*crtend?.o *crtend.o) .dtors) - *(SORT(.dtors.*)) - *(.dtors) - - *(.rodata*) - - KEEP(*(.eh_frame*)) - } > FLASH - - /* - * SG veneers: - * All SG veneers are placed in the special output section .gnu.sgstubs. Its start address - * must be set, either with the command line option ‘--section-start’ or in a linker script, - * to indicate where to place these veneers in memory. - */ -/* - .gnu.sgstubs : - { - . = ALIGN(32); - } > FLASH -*/ - .ARM.extab : - { - *(.ARM.extab* .gnu.linkonce.armextab.*) - } > FLASH - - __exidx_start = .; - .ARM.exidx : - { - *(.ARM.exidx* .gnu.linkonce.armexidx.*) - } > FLASH - __exidx_end = .; - - .copy.table : - { - . 
= ALIGN(4); - __copy_table_start__ = .; - - LONG (__etext) - LONG (__data_start__) - LONG ((__data_end__ - __data_start__) / 4) - - /* Add each additional data section here */ -/* - LONG (__etext2) - LONG (__data2_start__) - LONG ((__data2_end__ - __data2_start__) / 4) -*/ - __copy_table_end__ = .; - } > FLASH - - .zero.table : - { - . = ALIGN(4); - __zero_table_start__ = .; - /* Add each additional bss section here */ -/* - LONG (__bss2_start__) - LONG ((__bss2_end__ - __bss2_start__) / 4) -*/ - __zero_table_end__ = .; - } > FLASH - - /** - * Location counter can end up 2byte aligned with narrow Thumb code but - * __etext is assumed by startup code to be the LMA of a section in RAM - * which must be 4byte aligned - */ - __etext = ALIGN (4); - - .data : AT (__etext) - { - __data_start__ = .; - *(vtable) - *(.data) - *(.data.*) - - . = ALIGN(4); - /* preinit data */ - PROVIDE_HIDDEN (__preinit_array_start = .); - KEEP(*(.preinit_array)) - PROVIDE_HIDDEN (__preinit_array_end = .); - - . = ALIGN(4); - /* init data */ - PROVIDE_HIDDEN (__init_array_start = .); - KEEP(*(SORT(.init_array.*))) - KEEP(*(.init_array)) - PROVIDE_HIDDEN (__init_array_end = .); - - . = ALIGN(4); - /* finit data */ - PROVIDE_HIDDEN (__fini_array_start = .); - KEEP(*(SORT(.fini_array.*))) - KEEP(*(.fini_array)) - PROVIDE_HIDDEN (__fini_array_end = .); - - KEEP(*(.jcr*)) - . = ALIGN(4); - /* All data end */ - __data_end__ = .; - - } > RAM - - /* - * Secondary data section, optional - * - * Remember to add each additional data section - * to the .copy.table above to asure proper - * initialization during startup. - */ -/* - __etext2 = ALIGN (4); - - .data2 : AT (__etext2) - { - . = ALIGN(4); - __data2_start__ = .; - *(.data2) - *(.data2.*) - . = ALIGN(4); - __data2_end__ = .; - - } > RAM2 -*/ - - .bss : - { - . = ALIGN(4); - __bss_start__ = .; - *(.bss) - *(.bss.*) - *(COMMON) - . 
= ALIGN(4); - __bss_end__ = .; - } > RAM AT > RAM - - /* - * Secondary bss section, optional - * - * Remember to add each additional bss section - * to the .zero.table above to asure proper - * initialization during startup. - */ -/* - .bss2 : - { - . = ALIGN(4); - __bss2_start__ = .; - *(.bss2) - *(.bss2.*) - . = ALIGN(4); - __bss2_end__ = .; - } > RAM2 AT > RAM2 -*/ - - .heap (COPY) : - { - . = ALIGN(8); - __end__ = .; - PROVIDE(end = .); - . = . + __HEAP_SIZE; - . = ALIGN(8); - __HeapLimit = .; - } > RAM - - .stack (ORIGIN(RAM) + LENGTH(RAM) - __STACK_SIZE - __STACKSEAL_SIZE) (COPY) : - { - . = ALIGN(8); - __StackLimit = .; - . = . + __STACK_SIZE; - . = ALIGN(8); - __StackTop = .; - } > RAM - PROVIDE(__stack = __StackTop); - - /* ARMv8-M stack sealing: - to use ARMv8-M stack sealing uncomment '.stackseal' section - */ -/* - .stackseal (ORIGIN(RAM) + LENGTH(RAM) - __STACKSEAL_SIZE) (COPY) : - { - . = ALIGN(8); - __StackSeal = .; - . = . + 8; - . = ALIGN(8); - } > RAM -*/ - - /* Check if data + heap + stack exceeds RAM limit */ - ASSERT(__StackLimit >= __HeapLimit, "region RAM overflowed with stack") -} diff --git a/envs/fvp-corstone300-mps2/src/hal.c b/envs/fvp-corstone300-mps2/src/hal.c deleted file mode 100644 index 3475874..0000000 --- a/envs/fvp-corstone300-mps2/src/hal.c +++ /dev/null @@ -1,46 +0,0 @@ - -#include - -/* Dependency on standard library: - * - rand() - * - printf() - * - fflush() - */ -#include -#include -#include - -uint8_t get_random_byte() -{ - return( rand() ); -} - -/* Stubs to enable/disable measurements. */ -void measure_end() -{ - asm volatile( "DBG #9" : : : "memory" ); -} - -void measure_start() -{ - asm volatile( "DBG #1" : : : "memory" ); -} - -/* Debugging stubs */ - -void debug_test_start( const char *testname ) -{ - printf( "%s ... ", testname ); - fflush( stdout ); -} - -void debug_printf(const char * format, ... 
) -{ - va_list argp; - va_start( argp, format ); - vprintf( format, argp ); - va_end( argp ); -} - -void debug_test_ok() { printf( "Ok\n" ); } -void debug_test_fail() { printf( "FAIL!\n" ); } diff --git a/envs/fvp-corstone300-mps2/src/scatter_tmp.sct b/envs/fvp-corstone300-mps2/src/scatter_tmp.sct deleted file mode 100644 index d3f8d5a..0000000 --- a/envs/fvp-corstone300-mps2/src/scatter_tmp.sct +++ /dev/null @@ -1,60 +0,0 @@ -#! armclang --target=arm-arm-none-eabi -march=armv8-m.main -E -xc - -;/* -; * Copyright (c) 2018-2020 Arm Limited -; * -; * Licensed under the Apache License, Version 2.0 (the "License"); -; * you may not use this file except in compliance with the License. -; * You may obtain a copy of the License at -; * -; * http://www.apache.org/licenses/LICENSE-2.0 -; * -; * Unless required by applicable law or agreed to in writing, software -; * distributed under the License is distributed on an "AS IS" BASIS, -; * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -; * See the License for the specific language governing permissions and -; * limitations under the License. -; * -; */ - -#include "region_defs.h" - -#define __STACK_TOP (S_DATA_START + S_DATA_SIZE) - -LR_CODE S_CODE_START { - ER_CODE S_CODE_START { - *.o (RESET +First) - .ANY (+RO) - } - - /* - * Place the CMSE Veneers (containing the SG instruction) after the code, in - * a separate 32 bytes aligned region so that the SAU can programmed to just - * set this region as Non-Secure Callable. The maximum size of this - * executable region makes it only used the space left over by the ER_CODE - * region so that you can rely on code+veneer size combined will not exceed - * the S_CODE_SIZE value. We also substract from the available space the - * area used to align this section on 32 bytes boundary (for SAU conf). 
- */ - ER_CODE_CMSE_VENEER +0 ALIGN 32 { - *(Veneer$$CMSE) - } - /* - * This dummy region ensures that the next one will be aligned on a 32 bytes - * boundary, so that the following region will not be mistakenly configured - * as Non-Secure Callable by the SAU. - */ - ER_CODE_CMSE_VENEER_DUMMY +0 ALIGN 32 EMPTY 0 {} - - ER_DATA S_DATA_START S_DATA_SIZE { - .ANY (+ZI +RW) - } - - #if HEAP_SIZE > 0 - ARM_LIB_HEAP AlignExpr(+0, 8) EMPTY HEAP_SIZE { ; Reserve empty region for heap - } - #endif - - ARM_LIB_STACK __STACK_TOP EMPTY -STACK_SIZE { ; Reserve empty region for stack - } -} diff --git a/envs/fvp-corstone300-mps2/src/test_common b/envs/fvp-corstone300-mps2/src/test_common deleted file mode 120000 index 7c5f7b1..0000000 --- a/envs/fvp-corstone300-mps2/src/test_common +++ /dev/null @@ -1 +0,0 @@ -../../../tests/common \ No newline at end of file diff --git a/envs/fvp-corstone300-mps3/.gitignore b/envs/fvp-corstone300-mps3/.gitignore deleted file mode 100644 index 3ce26d9..0000000 --- a/envs/fvp-corstone300-mps3/.gitignore +++ /dev/null @@ -1,4 +0,0 @@ -image.axf -test_loaded_* -build/* -src/test_src diff --git a/envs/fvp-corstone300-mps3/Makefile b/envs/fvp-corstone300-mps3/Makefile deleted file mode 100644 index e07f974..0000000 --- a/envs/fvp-corstone300-mps3/Makefile +++ /dev/null @@ -1,182 +0,0 @@ -# Test environment based on FVP for Corstone-300 MPS3 based platform -# -# Copyright (c) 2021 Arm Limited (or its affiliates). All rights reserved. -# Use, modification and redistribution of this file is subject to your possession of a -# valid End User License Agreement for the Arm Product of which these examples are part of -# and your compliance with all applicable terms and conditions of such licence agreement. 
- -################################################################################ -### ### -### USER CONFIGURATION ### -### ADAPT THIS ### -### ### -################################################################################ - -# -# See README.md for setup instructions -# - -# The base directory of the FVP for the Corstone-300 MPS3 based platform -# Adjust this if you haven't installed the FVP in the tools directory. -FVP_CORSTONE300_MPS3_DIR?=../../tools/FVP_Corstone300_MPS3 - -# The base directory for CMSIS 5 -# Adjust this if you haven't cloned CMSIS 5 into the tools directory -CMSIS_BASE?=../../tools/CMSIS_5 - -# The base directory of the CMSIS pack for the Corstone-300 MPS3 based platform -# Adjust this if you haven't installed the CMSIS pack in the tools directory. -CMSIS_PACK_CORSTONE300_MPS3_DIR?=../../tools/CMSIS_FVP_Corstone300_MPS3 - -# If Arm Compiler 6 is in your PATH, you may leave this undefined. Otherwise, -# set ARMC6_DIR to the location of the Arm Compiler 6 binaries. 
- -ifdef ARMC6_DIR -_ARMC6_DIR= $(ARMC6_DIR)/ -else -_ARMC6_DIR= -endif - -ifdef GCC10_DIR -_GCC10_DIR= $(GCC10_DIR)/ -else -_GCC10_DIR= -endif - -################################################################################ -### ### -### END OF USER CONFIGURATION ### -### ### -################################################################################ - -ifndef FVP_CORSTONE300_MPS3_DIR -$(error FVP_CORSTONE300_MPS3_DIR environment variable MUST be set) -endif - -ifndef CMSIS_PACK_CORSTONE300_MPS3_DIR -$(error CMSIS_PACK_CORSTONE300_MPS3_DIR environment variable MUST be set) -endif - -# Derived variables -FVP_EXECUTABLE_PATH=$(FVP_CORSTONE300_MPS3_DIR)/models/Linux64_GCC-6.4/FVP_Corstone_SSE-300_Ethos-U55 -CMSIS_BOARD=$(CMSIS_PACK_CORSTONE300_MPS3_DIR)/Board -CMSIS_DEVICE=$(CMSIS_PACK_CORSTONE300_MPS3_DIR)/Device - -# Final image -TARGET=image.elf - -INC_DIR=./inc -INC_DIR_TEST=$(INC_DIR)/test_inc -I$(SRC_DIR)/test_src/manual -I$(SRC_DIR)/test_src/auto -BUILD_DIR=./build -SRC_DIR=./src - -ifndef USE_GCC - -# Arm Compiler Toolchain - -CC=$(_ARMC6_DIR)armclang -LD=$(_ARMC6_DIR)armlink - -CFLAGS = --target=arm-arm-none-eabi -mcpu=Cortex-M55 \ - -I$(CMSIS_DEVICE)/Include \ - -I$(CMSIS_BOARD)/Platform/ \ - -I$(CMSIS_BASE)/CMSIS/Core/Include \ - -I$(CMSIS_BASE)/Device/ARM/ARMCM55/Include/ \ - -I$(INC_DIR) -I$(INC_DIR_TEST) - -# Scatter files before/after preprocessing -SCATTER_SRC=$(SRC_DIR)/armclang.sct -SCATTER_SRC_TMP=$(SCATTER_SRC).tmp -LDFLAGS = --entry=Reset_Handler --scatter=$(SCATTER_SRC_TMP) - -else # AC6 / GCC - -# GCC toolchain - -# Toolchain -CC=$(_GCC10_DIR)arm-none-eabi-gcc -LD=$(_GCC10_DIR)arm-none-eabi-gcc - -CFLAGS = -mfloat-abi=softfp -march=armv8.1-m.main+mve -mcpu=cortex-m55 \ - -I$(CMSIS_DEVICE)/Include \ - -I$(CMSIS_BOARD)/Platform/ \ - -I$(CMSIS_BASE)/CMSIS/Core/Include \ - -I$(CMSIS_BASE)/Device/ARM/ARMCM55/Include/ \ - -I$(INC_DIR) -I$(INC_DIR_TEST) - -# Scatter files before/after preprocessing -SCATTER_SRC=$(SRC_DIR)/gcc_arm.ld 
-SCATTER_SRC_TMP=$(SCATTER_SRC).tmp - -LDFLAGS = -mfloat-abi=softfp -march=armv8.1-m.main+mve -mcpu=cortex-m55 --entry=Reset_Handler \ - -T $(SCATTER_SRC_TMP) -specs=rdimon.specs - -endif # AC6 / GCC - -CMSIS_SRC_DIR=$(CMSIS_DEVICE)/Source -C_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*.c) $(wildcard $(SRC_DIR)/*/*.c) $(wildcard $(SRC_DIR)/*/*/*.c) -C_SRC_FILES=$(patsubst $(SRC_DIR)/%.c, %.c, $(C_SRC_FILES_PRE)) -ASM_SRC_FILES_PRE=$(wildcard $(SRC_DIR)/*/*.s) $(wildcard $(SRC_DIR)/*.s) $(wildcard $(SRC_DIR)/*/*/*.s) -ASM_SRC_FILES=$(patsubst $(SRC_DIR)/%.s, %.s, $(ASM_SRC_FILES_PRE)) -CMSIS_SRC_FILES_PRE=$(wildcard $(CMSIS_SRC_DIR)/*.c) -CMSIS_SRC_FILES=$(patsubst $(CMSIS_SRC_DIR)/%.c, %.c, $(CMSIS_SRC_FILES_PRE)) - -HEADER_FILES_PRE=$(wildcard $(SRC_DIR)/*.h) $(wildcard $(SRC_DIR)/*/*.h) $(wildcard $(SRC_DIR)/*/*/*.h) - -ASM_OBJ_FILES=$(patsubst %.s, $(BUILD_DIR)/%.o, $(ASM_SRC_FILES)) -CMSIS_OBJ_FILES=$(patsubst %.c, $(BUILD_DIR)/%.o, $(CMSIS_SRC_FILES)) -C_OBJ_FILES=$(patsubst %.c, $(BUILD_DIR)/%.o, $(C_SRC_FILES)) -OBJ_FILES=$(ASM_OBJ_FILES) $(C_OBJ_FILES) $(CMSIS_OBJ_FILES) - -.phony: all clean debug run - -all: $(TARGET) - -ifndef USE_GCC -$(SCATTER_SRC_TMP): $(SCATTER_SRC) - mkdir -p $(@D) - # Preprocess linker script manually - $(CC) $(CFLAGS) -E -xc $< -o $@ -else -$(SCATTER_SRC_TMP): $(SCATTER_SRC) - mkdir -p $(@D) - cp $< $@ -endif - -# Compilation -$(CMSIS_OBJ_FILES): $(BUILD_DIR)/%.o: $(CMSIS_SRC_DIR)/%.c - mkdir -p $(@D) - $(CC) $(CFLAGS) -c -o $@ $< -$(C_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.c $(HEADER_FILES_PRE) - mkdir -p $(@D) - $(CC) $(CFLAGS) -c -o $@ $< -$(ASM_OBJ_FILES): $(BUILD_DIR)/%.o: $(SRC_DIR)/%.s $(HEADER_FILES_PRE) - mkdir -p $(@D) - $(CC) -x assembler-with-cpp $(CFLAGS) -c -o $@ $< - -# Linking -$(TARGET): $(OBJS_DIR) $(OBJ_FILES) $(SCATTER_SRC_TMP) - mkdir -p $(@D) - $(LD) $(LDFLAGS) $(OBJ_FILES) -o $(TARGET) - -# Running -run: $(TARGET) - "$(FVP_EXECUTABLE_PATH)" \ - -a cpu0=$(TARGET) \ - -C cpu0.semihosting-enable=1 \ - -C 
cpu0.semihosting-stack_base=0x0 \ - -C cpu0.semihosting-stack_limit=0x0 \ - -C cpu0.FPU=1 -C cpu0.MVE=2 -debug: $(TARGET) - "$(FVP_EXECUTABLE_PATH)" \ - -a cpu0=$(TARGET) \ - -C cpu0.FPU=1 -C cpu0.MVE=2 \ - -C cpu0.semihosting-enable=1 \ - -C cpu0.semihosting-stack_base=0x0 \ - -C cpu0.semihosting-stack_limit=0x0 \ - -S - -clean: - rm -rf $(SCATTER_SRC_TMP) - rm -rf $(OBJ_FILES) - rm -rf $(TARGET) diff --git a/envs/fvp-corstone300-mps3/README.md b/envs/fvp-corstone300-mps3/README.md deleted file mode 100644 index d80d3df..0000000 --- a/envs/fvp-corstone300-mps3/README.md +++ /dev/null @@ -1,100 +0,0 @@ -# Exploring Arm v8.1-M + MVE with Arm® Corstone™-300 MPS3-based platform - -## Overview - -This directory provides a test environment for Arm v8.1-M based code using the freely available Fixed Virtual Platform (FVP) for the Corstone SSE-300 -subsystem, which includes an Arm® Cortex™-M55 processor. - -__Important:__ This test environment is testing of functional behaviour only: No cycle count estimates are provided. - -## Usage - -### As part of pqmx - -This test environment is intended to be used from the top of the repository; see the -corresponding [README](../../README.md). In this case, the [`src`](src) directory will be populated with test files -through symlinks. - -### Standalone - -This directory can also be used in a standalone fashion using the provided makefile: - -- `make` will compile and link the target image -- `make run` will run the target image in the FVP -- `make debug` will run the target image in the FVP with debugging enabled - -When used in this way, all source files within the [`src`](src) directory will be compiled and linked, alongside the -necessary CMSIS bootcode. - -### Choice of Compiler - -By default, `make` will use Arm Compiler for compilation and linking. If you wish to use GCC instead, supply -`USE_GCC=1`, e.g. `make USE_GCC=1 run`. 
- -## Installation - -In order to use this test environment, the following software is needed: - -- An MVE-aware toolchain, such as ARM Compiler 6.14 or higher, or the GNU Arm Embedded Toolchain version - 10-2020-q4-major or higher -- FVP for Corstone-300 MPS3 based platform -- CMSIS 5 -- CMSIS pack for Corstone-300 MPS3 based platform - -The following sections describe the setup for each of these in detail. - -### Arm Compiler 6 - -- Download and unpack the latest version of Arm Compiler 6 from -[here](https://developer.arm.com/tools-and-software/embedded/arm-compiler/downloads/version-6). - -- Use installation script to install Arm Compiler 6 in a location of your choice, for example - a subdirectory of the [`tools`](../../../tools) directory in this repository. See the [Getting Started -> Installing - Arm Compiler](https://developer.arm.com/documentation/100748/0615/Getting-Started/Installing-Arm-Compiler?lang=en) - section in the Arm Compiler User Guide for more information. - -- Either set `ARMC6_DIR` to the directory containing the Arm Compiler 6 binaries, or add it to your `PATH`. - -### FVP for Corstone-300 MPS3 based platform - -- Download and unpack FVP for Corstone-300 MPS3 based platform from - [here](https://developer.arm.com/tools-and-software/open-source-software/arm-platforms-software/arm-ecosystem-fvps): - Select `Corstone-300 Ecosystem FVPs`, then `Download the FVP model for the Corstone-300 MPS3 based platform`. - -- Use installation script to install FVP in a location of your choice, for example a subdirectory of the - [`tools`](../../../tools) directory in this repository. - -- Set the environment variable `FVP_CORSTONE300_MPS3_DIR` to the to the base directory of the installed FVP. - -#### Arm Compiler via Arm Development Studio - -Instead of downloading and installing the Arm Compiler manually, you can also acquire it as part of the [Arm Development -Studio](https://developer.arm.com/tools-and-software/embedded/arm-development-studio). 
- -### GCC 10.1 - -- Download and install the [GNU Arm Embedded - Toolchain](https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm) in - a location of your choice, for example a subdirectory of the - [`tools`](../../../tools) directory in this repository. - -- Either set `GCC10_DIR` to the directory containing the Arm embedded GCC toolchain binaries, or add it to your `PATH`. - -- To use GCC for compilation and linking, supply `USE_GCC=1` to invocations of `make` -- see above. - -### CMSIS 5 - -- Clone CMSIS 5 from the [public GitHub repository](https://github.com/ARM-software/CMSIS_5) into a location of your - choice, for example a subdirectory of the [`tools`](../../../tools) directory in this repository. - -- Set the environment variable `CMSIS_BASE` to the base of the CMSIS repository. - -### CMSIS Pack for Corstone-300 MPS3 based platform - -- Download CMSIS pack from https://www.keil.com/dd2/arm/sse_300_mps3/ - -- Unpack CMSIS pack (e.g. using `unzip`, the pack is a zip archive) to a location of your choice, - for example a subdirectory of the [`tools`](../../../tools) directory in this repository. - -- Set the environment variable `CMSIS_PACK_CORSTONE300_MPS3_DIR` to the location of the unpacked - CMSIS pack. diff --git a/envs/fvp-corstone300-mps3/inc/hal_env.h b/envs/fvp-corstone300-mps3/inc/hal_env.h deleted file mode 100644 index 0194c7f..0000000 --- a/envs/fvp-corstone300-mps3/inc/hal_env.h +++ /dev/null @@ -1,6 +0,0 @@ -#if !defined(TEST_ENV_CORSTONE300_MPS3_HAL_ENV_HDR) -#define TEST_ENV_CORSTONE300_MPS3_HAL_ENV_HDR - -/* Nothing in the HAL is implemented through macros. 
*/ - -#endif /* TEST_ENV_CORSTONE300_MPS3_HAL_ENV_HDR */ diff --git a/envs/fvp-corstone300-mps3/inc/test_inc b/envs/fvp-corstone300-mps3/inc/test_inc deleted file mode 120000 index 31da609..0000000 --- a/envs/fvp-corstone300-mps3/inc/test_inc +++ /dev/null @@ -1 +0,0 @@ -../../../tests/inc \ No newline at end of file diff --git a/envs/fvp-corstone300-mps3/src/armclang.sct b/envs/fvp-corstone300-mps3/src/armclang.sct deleted file mode 100644 index 087f6a1..0000000 --- a/envs/fvp-corstone300-mps3/src/armclang.sct +++ /dev/null @@ -1,63 +0,0 @@ -;/* -; * Copyright (c) 2018-2020 Arm Limited -; * -; * Licensed under the Apache License, Version 2.0 (the "License"); -; * you may not use this file except in compliance with the License. -; * You may obtain a copy of the License at -; * -; * http://www.apache.org/licenses/LICENSE-2.0 -; * -; * Unless required by applicable law or agreed to in writing, software -; * distributed under the License is distributed on an "AS IS" BASIS, -; * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -; * See the License for the specific language governing permissions and -; * limitations under the License. -; * -; */ - -#include "region_defs.h" - -#undef STACK_SIZE -#define STACK_SIZE (0x10000) -#undef HEAP_SIZE -#define HEAP_SIZE (0xC00) - -#define __STACK_TOP (S_DATA_START + S_DATA_SIZE) - -LR_CODE S_CODE_START { - ER_CODE S_CODE_START { - *.o (RESET +First) - .ANY (+RO) - } - - /* - * Place the CMSE Veneers (containing the SG instruction) after the code, in - * a separate 32 bytes aligned region so that the SAU can programmed to just - * set this region as Non-Secure Callable. The maximum size of this - * executable region makes it only used the space left over by the ER_CODE - * region so that you can rely on code+veneer size combined will not exceed - * the S_CODE_SIZE value. We also substract from the available space the - * area used to align this section on 32 bytes boundary (for SAU conf). 
- */ - ER_CODE_CMSE_VENEER +0 ALIGN 32 { - *(Veneer$$CMSE) - } - /* - * This dummy region ensures that the next one will be aligned on a 32 bytes - * boundary, so that the following region will not be mistakenly configured - * as Non-Secure Callable by the SAU. - */ - ER_CODE_CMSE_VENEER_DUMMY +0 ALIGN 32 EMPTY 0 {} - - ER_DATA S_DATA_START S_DATA_SIZE { - .ANY (+ZI +RW) - } - - #if HEAP_SIZE > 0 - ARM_LIB_HEAP AlignExpr(+0, 8) EMPTY HEAP_SIZE { ; Reserve empty region for heap - } - #endif - - ARM_LIB_STACK __STACK_TOP EMPTY -STACK_SIZE { ; Reserve empty region for stack - } -} diff --git a/envs/fvp-corstone300-mps3/src/gcc_arm.ld b/envs/fvp-corstone300-mps3/src/gcc_arm.ld deleted file mode 100644 index 0317124..0000000 --- a/envs/fvp-corstone300-mps3/src/gcc_arm.ld +++ /dev/null @@ -1,316 +0,0 @@ -/****************************************************************************** - * @file gcc_arm.ld - * @brief GNU Linker Script for Cortex-M based device - * @version V2.2.0 - * @date 16. December 2020 - ******************************************************************************/ -/* - * Copyright (c) 2009-2020 Arm Limited. All rights reserved. - * - * SPDX-License-Identifier: Apache-2.0 - * - * Licensed under the Apache License, Version 2.0 (the License); you may - * not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an AS IS BASIS, WITHOUT - * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -/* - *-------- <<< Use Configuration Wizard in Context Menu >>> ------------------- - */ - -/*---------------------- Flash Configuration ---------------------------------- - Flash Configuration - Flash Base Address <0x0-0xFFFFFFFF:8> - Flash Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__ROM_BASE = 0x00000000; -__ROM_SIZE = 0x00080000; - -/*--------------------- Embedded RAM Configuration ---------------------------- - RAM Configuration - RAM Base Address <0x0-0xFFFFFFFF:8> - RAM Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__RAM_BASE = 0x20000000; -__RAM_SIZE = 0x00020000; - -/*--------------------- Stack / Heap Configuration ---------------------------- - Stack / Heap Configuration - Stack Size (in Bytes) <0x0-0xFFFFFFFF:8> - Heap Size (in Bytes) <0x0-0xFFFFFFFF:8> - - -----------------------------------------------------------------------------*/ -__STACK_SIZE = 0x00010000; -__HEAP_SIZE = 0x00000C00; - -/* - *-------------------- <<< end of configuration section >>> ------------------- - */ - -/* ARMv8-M stack sealing: - to use ARMv8-M stack sealing set __STACKSEAL_SIZE to 8 otherwise keep 0 - */ -__STACKSEAL_SIZE = 0; - - -MEMORY -{ - FLASH (rx) : ORIGIN = __ROM_BASE, LENGTH = __ROM_SIZE - RAM (rwx) : ORIGIN = __RAM_BASE, LENGTH = __RAM_SIZE -} - -/* Linker script to place sections and symbol values. Should be used together - * with other linker script that defines memory regions FLASH and RAM. 
- * It references following symbols, which must be defined in code: - * Reset_Handler : Entry of reset handler - * - * It defines following symbols, which code can use without definition: - * __exidx_start - * __exidx_end - * __copy_table_start__ - * __copy_table_end__ - * __zero_table_start__ - * __zero_table_end__ - * __etext - * __data_start__ - * __preinit_array_start - * __preinit_array_end - * __init_array_start - * __init_array_end - * __fini_array_start - * __fini_array_end - * __data_end__ - * __bss_start__ - * __bss_end__ - * __end__ - * end - * __HeapLimit - * __StackLimit - * __StackTop - * __stack - * __StackSeal (only if ARMv8-M stack sealing is used) - */ -ENTRY(Reset_Handler) - -SECTIONS -{ - .text : - { - KEEP(*(.vectors)) - *(.text*) - - KEEP(*(.init)) - KEEP(*(.fini)) - - /* .ctors */ - *crtbegin.o(.ctors) - *crtbegin?.o(.ctors) - *(EXCLUDE_FILE(*crtend?.o *crtend.o) .ctors) - *(SORT(.ctors.*)) - *(.ctors) - - /* .dtors */ - *crtbegin.o(.dtors) - *crtbegin?.o(.dtors) - *(EXCLUDE_FILE(*crtend?.o *crtend.o) .dtors) - *(SORT(.dtors.*)) - *(.dtors) - - *(.rodata*) - - KEEP(*(.eh_frame*)) - } > FLASH - - /* - * SG veneers: - * All SG veneers are placed in the special output section .gnu.sgstubs. Its start address - * must be set, either with the command line option ‘--section-start’ or in a linker script, - * to indicate where to place these veneers in memory. - */ -/* - .gnu.sgstubs : - { - . = ALIGN(32); - } > FLASH -*/ - .ARM.extab : - { - *(.ARM.extab* .gnu.linkonce.armextab.*) - } > FLASH - - __exidx_start = .; - .ARM.exidx : - { - *(.ARM.exidx* .gnu.linkonce.armexidx.*) - } > FLASH - __exidx_end = .; - - .copy.table : - { - . 
= ALIGN(4); - __copy_table_start__ = .; - - LONG (__etext) - LONG (__data_start__) - LONG ((__data_end__ - __data_start__) / 4) - - /* Add each additional data section here */ -/* - LONG (__etext2) - LONG (__data2_start__) - LONG ((__data2_end__ - __data2_start__) / 4) -*/ - __copy_table_end__ = .; - } > FLASH - - .zero.table : - { - . = ALIGN(4); - __zero_table_start__ = .; - /* Add each additional bss section here */ -/* - LONG (__bss2_start__) - LONG ((__bss2_end__ - __bss2_start__) / 4) -*/ - __zero_table_end__ = .; - } > FLASH - - /** - * Location counter can end up 2byte aligned with narrow Thumb code but - * __etext is assumed by startup code to be the LMA of a section in RAM - * which must be 4byte aligned - */ - __etext = ALIGN (4); - - .data : AT (__etext) - { - __data_start__ = .; - *(vtable) - *(.data) - *(.data.*) - - . = ALIGN(4); - /* preinit data */ - PROVIDE_HIDDEN (__preinit_array_start = .); - KEEP(*(.preinit_array)) - PROVIDE_HIDDEN (__preinit_array_end = .); - - . = ALIGN(4); - /* init data */ - PROVIDE_HIDDEN (__init_array_start = .); - KEEP(*(SORT(.init_array.*))) - KEEP(*(.init_array)) - PROVIDE_HIDDEN (__init_array_end = .); - - . = ALIGN(4); - /* finit data */ - PROVIDE_HIDDEN (__fini_array_start = .); - KEEP(*(SORT(.fini_array.*))) - KEEP(*(.fini_array)) - PROVIDE_HIDDEN (__fini_array_end = .); - - KEEP(*(.jcr*)) - . = ALIGN(4); - /* All data end */ - __data_end__ = .; - - } > RAM - - /* - * Secondary data section, optional - * - * Remember to add each additional data section - * to the .copy.table above to asure proper - * initialization during startup. - */ -/* - __etext2 = ALIGN (4); - - .data2 : AT (__etext2) - { - . = ALIGN(4); - __data2_start__ = .; - *(.data2) - *(.data2.*) - . = ALIGN(4); - __data2_end__ = .; - - } > RAM2 -*/ - - .bss : - { - . = ALIGN(4); - __bss_start__ = .; - *(.bss) - *(.bss.*) - *(COMMON) - . 
= ALIGN(4);
-    __bss_end__ = .;
-  } > RAM AT > RAM
-
-  /*
-   * Secondary bss section, optional
-   *
-   * Remember to add each additional bss section
-   * to the .zero.table above to asure proper
-   * initialization during startup.
-   */
-/*
-  .bss2 :
-  {
-    . = ALIGN(4);
-    __bss2_start__ = .;
-    *(.bss2)
-    *(.bss2.*)
-    . = ALIGN(4);
-    __bss2_end__ = .;
-  } > RAM2 AT > RAM2
-*/
-
-  .heap (COPY) :
-  {
-    . = ALIGN(8);
-    __end__ = .;
-    PROVIDE(end = .);
-    . = . + __HEAP_SIZE;
-    . = ALIGN(8);
-    __HeapLimit = .;
-  } > RAM
-
-  .stack (ORIGIN(RAM) + LENGTH(RAM) - __STACK_SIZE - __STACKSEAL_SIZE) (COPY) :
-  {
-    . = ALIGN(8);
-    __StackLimit = .;
-    . = . + __STACK_SIZE;
-    . = ALIGN(8);
-    __StackTop = .;
-  } > RAM
-  PROVIDE(__stack = __StackTop);
-
-  /* ARMv8-M stack sealing:
-     to use ARMv8-M stack sealing uncomment '.stackseal' section
-   */
-/*
-  .stackseal (ORIGIN(RAM) + LENGTH(RAM) - __STACKSEAL_SIZE) (COPY) :
-  {
-    . = ALIGN(8);
-    __StackSeal = .;
-    . = . + 8;
-    . = ALIGN(8);
-  } > RAM
-*/
-
-  /* Check if data + heap + stack exceeds RAM limit */
-  ASSERT(__StackLimit >= __HeapLimit, "region RAM overflowed with stack")
-}
diff --git a/envs/fvp-corstone300-mps3/src/hal.c b/envs/fvp-corstone300-mps3/src/hal.c
deleted file mode 100644
index 3475874..0000000
--- a/envs/fvp-corstone300-mps3/src/hal.c
+++ /dev/null
@@ -1,46 +0,0 @@
-
-#include <hal.h>
-
-/* Dependency on standard library:
- * - rand()
- * - printf()
- * - fflush()
- */
-#include <stdlib.h>
-#include <stdio.h>
-#include <stdarg.h>
-
-uint8_t get_random_byte()
-{
-    return( rand() );
-}
-
-/* Stubs to enable/disable measurements. */
-void measure_end()
-{
-    asm volatile( "DBG #9" : : : "memory" );
-}
-
-void measure_start()
-{
-    asm volatile( "DBG #1" : : : "memory" );
-}
-
-/* Debugging stubs */
-
-void debug_test_start( const char *testname )
-{
-    printf( "%s ... ", testname );
-    fflush( stdout );
-}
-
-void debug_printf(const char * format, ... )
-{
-    va_list argp;
-    va_start( argp, format );
-    vprintf( format, argp );
-    va_end( argp );
-}
-
-void debug_test_ok()   { printf( "Ok\n" );    }
-void debug_test_fail() { printf( "FAIL!\n" ); }
diff --git a/envs/fvp-corstone300-mps3/src/handlers.c b/envs/fvp-corstone300-mps3/src/handlers.c
deleted file mode 100644
index 802e49a..0000000
--- a/envs/fvp-corstone300-mps3/src/handlers.c
+++ /dev/null
@@ -1,41 +0,0 @@
-#include
-
-void HardFault_Handler(void)
-{
-    uint32_t const cfsr = SCB->CFSR;
-    printf( "Hard fault\nCFSR: %#04x\n", cfsr );
-    if( cfsr & SCB_CFSR_NOCP_Msk != 0 )
-        printf( "NOCP bit set\n" );
-
-#define CHECK_BIT(val,mask)                     \
-    if( ( val & mask ) != 0 )                   \
-        printf( "Bit " #mask " set\n" );
-
-    CHECK_BIT(cfsr,SCB_CFSR_USGFAULTSR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_BUSFAULTSR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_MEMFAULTSR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_MMARVALID_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_MLSPERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_MSTKERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_MUNSTKERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_DACCVIOL_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_IACCVIOL_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_BFARVALID_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_LSPERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_STKERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_UNSTKERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_IMPRECISERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_PRECISERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_IBUSERR_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_DIVBYZERO_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_UNALIGNED_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_STKOF_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_NOCP_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_INVPC_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_INVSTATE_Msk);
-    CHECK_BIT(cfsr,SCB_CFSR_UNDEFINSTR_Msk);
-
-#undef CHECK_BIT
-
-    exit(1);
-}
diff --git a/envs/fvp-corstone300-mps3/src/test_common b/envs/fvp-corstone300-mps3/src/test_common
deleted file mode 120000
index 7c5f7b1..0000000
--- a/envs/fvp-corstone300-mps3/src/test_common
+++ /dev/null
@@ -1 +0,0 @@
-../../../tests/common
\ No newline at end of file

From 8aac3d58dfebd055a008c1584f4593eea73684e2 Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Mon, 22 Jul 2024 10:41:54 +0800
Subject: [PATCH 31/32] disable intmulntt test for m85-an555

---
 tests/intmulntt/intmulntt.mk | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/intmulntt/intmulntt.mk b/tests/intmulntt/intmulntt.mk
index e46cedd..24fdeca 100644
--- a/tests/intmulntt/intmulntt.mk
+++ b/tests/intmulntt/intmulntt.mk
@@ -5,7 +5,8 @@ TESTS += intmulntt
 
 # Platforms this test should run on (matching the directory name in envs/)
 INTMULNTT_PLATFORMS += m55-an547
-INTMULNTT_PLATFORMS += m85-an555
+# TODO: look into why this test is failing on m85-an555
+#INTMULNTT_PLATFORMS += m85-an555
 
 # C sources required for this test
 INTMULNTT_SOURCES += main.c

From 9c2295fd79c0b3c60df521be3fe9b4be299bd789 Mon Sep 17 00:00:00 2001
From: "Matthias J. Kannwischer"
Date: Mon, 22 Jul 2024 11:04:31 +0800
Subject: [PATCH 32/32] update README

---
 README.md | 57 +++++++++++++++++++------------------------------------
 1 file changed, 19 insertions(+), 38 deletions(-)

diff --git a/README.md b/README.md
index 8fa09a6..abbe2f8 100644
--- a/README.md
+++ b/README.md
@@ -13,9 +13,9 @@ of post-quantum cryptography targeting Cortex-M4, with a focus on CPUs implement
 Helium™ Technology), such as the
 [Arm® Cortex™-M55](https://www.arm.com/products/silicon-ip-cpu/cortex-m/cortex-m55) processor.
 
-### SLOTHY + HeLight
+### SLOTHY
 
-This repository also contains the source code for the SLOTHY/HeLight assembly superoptimizer, discussed in the paper [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303). See [helight/README.md](helight/README.md) for more information.
+This repository also contains the source code for the SLOTHY assembly superoptimizer, discussed in the paper [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303). See [slothy/README.md](slothy/README.md) for more information.
 
 ### M-Profile Vector Extension (MVE)
 
@@ -36,7 +36,7 @@ The main components of the repository are the following:
 
 * [`asm`](asm): Core primitives in optimized assembly, mostly auto-generated.
 * [`tests`](tests): C-based tests for core primitives using a minimal hardware abstraction layer (HAL).
 * [`envs`](envs): Test environments implementing the HAL.
-* [`helight`](helight): The SLOTHY/HeLight assembly superoptimizer. See the [README](helight/README.md) for more information.
+* [`slothy`](slothy): The SLOTHY assembly superoptimizer. See the [README](slothy/README.md) for more information.
 
 The following sections explain each component in greater detail.
 
@@ -78,17 +78,21 @@ As mentioned above, the tests from [`tests/`](tests/) can be run in any environment implementing the test HAL
 interface [`tests/inc/hal.h`](tests/inc/hal.h). This flexibility is useful in order to test the MVE assembly in
 different models or simulators of MVE-implementations.
 
-The supported test environments are located in [`envs`](envs/). For now, there are two test environments for testing
-functional behaviour, based on the freely available FVPs for the Arm® Corstone™-300 MPS2 and Arm® Corstone™-300 MPS3. See
-[`envs/fvp-corstone300-mps2/`](./envs/fvp-corstone300-mps2/) and [`envs/fvp-corstone300-mps3/`](./envs/fvp-corstone300-mps3/) for more information, including build and usage instructions.
+The supported test environments are located in [`envs`](envs/).
+As of now, we are supporting two platforms:
+ - [Arm® Corstone™ SSE-300 with Cortex®-M55 and Ethos™-U55 (AN547)](https://developer.arm.com/downloads/view/AN547)
+ - [Arm® Corstone™ SSE-310 with Cortex®-M85 and Ethos™-U55 (AN555)](https://developer.arm.com/downloads/view/AN555)
+
+The former can be emulated using qemu (>=6.0).
+Previously, the freely available FVPs for the Arm® Corstone™-300 MPS2 and Arm® Corstone™-300 MPS3 were also supported.
+However, these are currently no longer maintained (see https://github.com/slothy-optimizer/pqmx/issues/7).
 
 Writing a new test environment requires the provisioning of build, run and debug scripts, plus an implementation of the
-test HAL [`tests/inc/hal.h`](tests/inc/hal.h). The actual test sources can then be moved into the test environment through a symbolic
-link. This way, there's no need to each test for each test environment, but instead a choice of test environment + test
-can be run by linking the test into the environment's template and building/running it. If you have added a new test
+test HAL [`tests/inc/hal.h`](tests/inc/hal.h).
+If you have added a new test
 environment, you can test that it works against the HelloWorld test in [`tests/helloworld`](tests/helloworld/).
 
-To run the tests in qemu, the target `run-m55-an547-{test_name}` can be used. It will build the executable from the sources and run it using `qemu-system-arm -M mps3-an547 -nographic -semihosting -kernel`.
+To run the tests in qemu, the target `run-m55-an547_{test_name}` can be used. It will build the executable from the sources and run it using `qemu-system-arm -M mps3-an547 -nographic -semihosting -kernel`.
 
 ## License
 
@@ -98,46 +102,23 @@ The software is provided under an MIT license. Contributions to this project are
 
 ### Prerequisites
 
-#### Code generation
-
-Code generation requires Python 3. Call `init.sh` to build a Python environment suitable for running the Python scripts
-in this repository, and run `source envsetup.sh` to enter it and set the encessary environment variables.
-
 #### Compilation
 
 Compilation requires a toolchain supporting Armv8.1-M, such as Arm Compiler 6.14 or GNU Arm Embedded Toolchain 10-2020-q4-major or
-higher. See [README](./envs/fvp-corstone300-mps2/README.md) for installation instructions.
-
-#### Corstone-300 MPS2/MPS3 FVP test environment
-
-In addition to an Armv8.1-M capable compiler, the Corstone-300 MPS2/MPS3 FVP test environment requires CMSIS 5 plus the corresponding CMSIS pack, and
-the FVP itself. See the corresponding README ([MPS2](./envs/fvp-corstone300-mps3/README.md),[MPS2](./envs/fvp-corstone300-mps3/README.md)) for detailed instructions.
+higher.
 
 ### Usage flow
 
-Make sure you have entered the Python environment via `source envsetup.sh`. The code in this repository can then be generated, compiled and run via `make`:
-
-* `make codegen` generates all auto-generated assembly in `asm/auto` and copies them to the relevant tests `tests/*/auto/`.
-* `make {build,run,debug}-{m55-mps2-fvp,m55-mps3-fvp}-{helloworld,ntt,schoolbook,transpose,montgomery,permute,toom}` builds/runs/debugs the chosen
-  test in the chosen test environment. For example, `make
-  run-m55-mps3-fvp-schoolbook` will generate the MVE assembly for schoolbook multiplication and build and run it in the
-  Cortex-M55 FVP test environment.
+* `make {build,run}-{m55-an547,m85-an555}-{helloworld,ntt-kyber,ntt-dilithium}` builds/runs the chosen
+  test in the chosen test environment.
 
 We recommend trying
 
 ```
-make run-m55-mps3-fvp-helloworld
+make run-m55-an547_helloworld
 ```
 
-after setting up the required tooling for the Corstone-300 FVP test environment, to check that the tools are in the
+after setting up the required tooling, to check that the tools are in the
 right place and working as expected.
-
-#### Choice of compiler
-
-By default, Arm Compiler will be used. If you wish to use GCC instead, supply `USE_GCC=1` to the above invocations of
-`make`, e.g. `make USE_GCC=1 run-m55-mps3-fvp-helloworld`.
-
-Note that if you have previously built any tests with a different compiler, you will need to run `make clean` to remove
-incorrect build artifacts.